End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization



Introduction
Inferring the six degrees of freedom (6-DoF) pose (3D rotation + 3D translation) of an object given a single RGB image is extremely challenging. Indeed, this process presupposes a deep knowledge of the object itself and of the 3D world that is not easy to distill from a single frame; the kind of object, its 3D shape and the possible 3D transformations that lead to visually plausible outputs must be inferred jointly. In this work, we show how an approximate differentiable renderer can be exploited to refine the 6-DoF pose prediction using only 2D silhouette information. Keeping the object volume fixed, we can back-propagate to the second renderer input, namely the object pose (see Fig. 1). We demonstrate that this differentiable block can be stacked on a 6-DoF pose estimator to significantly refine the estimated pose using only the 2D alignment information between the input object mask and the rendered silhouette.
Fig. 1. The overall proposed framework. A deep convolutional encoder is fed with the object mask and predicts both the object's class and 6-DoF pose. By means of a differentiable renderer, the predicted cluster medoid can be projected back according to the predicted pose, adding a further online alignment supervision w.r.t. the input mask.
Leaving aside camera intrinsics, a renderer can generally be thought of as a black box with two inputs and one output. The renderer takes as input (i) a given representation of the 3D object (e.g. voxels, mesh etc.) and (ii) the 6-DoF pose of the object w.r.t. the camera, and produces the 2D image of the object or, as in our setting, solely its silhouette. Typically a rendering algorithm includes many non-differentiable operations (e.g. rounding, hard assignments etc.); it thus cannot be used in a deep learning architecture, as it would break the back-propagation chain. Nonetheless, in the context of 3D volume estimation, recent works [18,47,24,14,41] exploit approximate differentiable renderers to back-propagate the loss to the first renderer input, namely the 3D representation of the object, while keeping the set of possible camera poses fixed.
Since we rely on a fixed 3D model of the object, we can abandon the redundant and expensive voxel representation in favor of meshes, which are lightweight and better tailored to represent 3D models [34]. Also, in contrast to previous works, the rendering pipeline is implemented via a rasterization algorithm, which is significantly faster than the conventional ray-tracing approach. Finally, to address the issue that the true 3D model of the object is usually not known at test time, we propose to perform coarse-grained classification and use a representative 3D model of the object category (e.g. a cluster medoid) instead. We experimentally demonstrate that the proposed pipeline is able to correct the estimated pose effectively even when using surrogate models.
[33,16,30,6]. More recently, the large-scale database of synthetic models ShapeNet [5] is having an analogous impact on the 3D community, showing that, in the presence of enough data, 3D geometry and deep learning can be integrated successfully [37,46,7,36,22,2]. One of the areas in which this marriage has been most fruitful is estimating the 3D shape of an object given an image, or generating novel views of the same object. Indeed, pre-deep-learning methods [4,15,20,42,29,35] often need multiple views at test time and rely on the assumption that descriptors can be matched across views [13,1], and thus poorly handle self-occlusions, lack of texture [32] and large viewpoint changes [25]. Conversely, more recent works [37,46,7,36,22,2] are built upon powerful deep learning models trained on virtually infinite synthetic data rendered from ShapeNet [5]. From a high-level perspective, we can distinguish methods that learn an implicit representation of object pose and volume and then decode it by means of another deep network [38,11,7,48] from methods that infer from the image a valid 3D representation (e.g. voxel-based) that can be re-projected by means of a differentiable renderer [47,41,14,44] to measure its consistency w.r.t. the input image. Works leveraging the latter approach are closely related to our proposed method, in that they all found different ways to back-propagate through the renderer in order to correct the predicted object volume. Yan et al. [47], Gadelha et al. [14] and Wiles et al. [44] take inspiration from the spatial transformer network [18] in the way the predicted volume is sampled to produce the output silhouette, even though they differ in the way the contribution of each voxel is counted for each line of sight. The rendering process proposed in Rezende et al. [31] has to be trained via REINFORCE [45] since it is not differentiable. Tulsiani et al. [41] frame the rendering phase in a probabilistic setting and define ray potentials to enforce consistency. Our method differs substantially from all these works in several respects. First, we keep the volume fixed and back-propagate through the renderer to correct the object pose, while the aforementioned works project the predicted 3D volume from a pre-defined set of poses (e.g. 24 azimuthal angles 0°, 15°, ..., 345° around the y-axis) and back-propagate the alignment error to correct the volume. Furthermore, while all these works use a ray-tracing algorithm for rendering, our work is the first to propose a differentiable raster-based renderer. Finally, all the mentioned works represent the volume using voxels, which is inefficient and redundant since almost all valuable information is in the surface [34], while we use its natural parametrization by vertices and faces, i.e. the mesh.
Convolutional neural networks (CNNs) have demonstrated analogous effectiveness in the task of object pose estimation, traditionally framed as a Perspective-n-Point (PnP) correspondence problem between the 3D world points and their 2D projections in the image [21,26]. Compared with descriptor-based methods [8,9,25], modern methods relying on CNNs [23,37,40] can resolve ambiguities and handle occluded keypoints thanks to their high representational power and composite field of view, and have shown impressive results in specific tasks such as human pose estimation [27,43,39,49]. Building upon this success, recent methods [50,28] combine CNN-extracted keypoints and deformable shape models in a single optimization framework to jointly estimate the object pose and shape. Differently from all these works, here we propose a substantially new method to integrate object shape and pose estimation and model fitting in a single end-to-end differentiable framework. To the best of our knowledge, this is the first work in which a differentiable renderer is used to correct the 6-DoF object pose estimate just by back-propagating 2D information on the silhouette alignment error.

Model Description
Given a single RGB image in which one or more objects of interest have already been segmented, we train a deep convolutional encoder to predict the class and the 6-DoF pose (rotation and translation) of each object w.r.t. the camera. We then exploit an approximate renderer to re-project the silhouette of the object on the image according to the pose predicted by the encoder. As the true object models are not available at test time, a representative object (i.e. the medoid) of the predicted class is used for re-projection. Also, since the rendering phase is approximated with a differentiable function, we can not only measure the alignment error w.r.t. the input object mask, but also back-propagate it to the encoder weights. This allows us to fine-tune the encoder online by optimizing just the alignment error. Our overall architecture is depicted in Fig. 1. In what follows, both the encoder and the renderer models are detailed.

Encoder
The deep convolutional encoder network is schematized in Fig. 2. The first part of the network is dedicated to feature extraction and is shared by the classification and the pose estimation branches. The network design is inspired by [38], which showed favorable results in a related task. The extracted features are then used by two independent fully-connected branches to infer the object class and the camera pose respectively. All layers but the last are followed by a leaky ReLU activation with α = 0.2. Differently from most of the literature [47,14,44], we do not quantize the pose space into a discrete set of pre-defined poses to ease the task. Conversely, given a rotation matrix $R_{3\times3}$ and a translation vector $t_{3\times1}$, we regress the object pose by optimizing the mean square error between the predicted and the true pose:
$$\mathcal{L}_{MSE}(X, Y_p) = \frac{1}{|X|} \sum_{i=1}^{|X|} \left\| Y_p^{i} - f_p(x_i, \theta) \right\|_2^2 \quad (1)$$
where $X$ is the set of RGB images, $Y_p$ is the set of true $P_{3\times4}$ pose matrices and $f_p(x_i, \theta)$ is the pose predicted by the encoder for example $x_i$ according to its weights $\theta$. From a technical standpoint, for each of the X, Y, Z axes the encoder regresses the cosine of the Euler rotation angle and the respective translation. The output roto-translation matrix is then composed following the Euler ZYX convention: in this way, predicted matrices are guaranteed to be always geometrically consistent. For the classification branch we instead optimize the following categorical cross-entropy function:
$$\mathcal{L}_{CE}(X) = -\frac{1}{|X|} \sum_{i=1}^{|X|} y_i^{\top} \log f_c(x_i, \theta) \quad (2)$$
where $x_i \in X$ is an input RGB image, $f_c(x_i, \theta)$ is the distribution over possible clusters predicted by the encoder for example $x_i$ and $y_i$ is the true one-hot distribution for example $x_i$.
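As an illustration of the pose parametrization described above, the following minimal sketch composes a 3×4 roto-translation matrix from three Euler angles and a translation, and evaluates the mean squared error between two such matrices. It is not the authors' implementation: the function names are hypothetical, the angles are passed directly (rather than their cosines, as regressed by the encoder), and "ZYX" is read here as R = Rz · Ry · Rx, which is only one common interpretation of that convention.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def compose_pose(angles_xyz, t_xyz):
    """Return a 3x4 [R | t] matrix with R = Rz @ Ry @ Rx (assumed ZYX reading)."""
    ax, ay, az = angles_xyz
    r = matmul(rot_z(az), matmul(rot_y(ay), rot_x(ax)))
    return [row + [t] for row, t in zip(r, t_xyz)]

def pose_mse(pred, true):
    """Mean squared error over the 12 entries of two 3x4 pose matrices."""
    diffs = [(p - q) ** 2 for rp, rt in zip(pred, true) for p, q in zip(rp, rt)]
    return sum(diffs) / len(diffs)
```

Because the rotation block is always built as a product of elementary rotations, it is orthonormal by construction, which is the sense in which the composed matrices stay geometrically consistent whatever values are regressed.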

Differentiable Renderer
To measure the reliability of the predicted 6-DoF pose and to be able to correct it at test time, we design a fully differentiable renderer for re-projecting the silhouette of the 3D model on the image according to the predicted object pose. This allows us to refine the estimated pose by back-propagating the alignment error between the 2D silhouettes. To the best of our knowledge, this is the first time that a fully differentiable raster-based renderer has been used for this purpose. Differently from concurrent works such as [47], our rendering process starts from the raw mesh triangles and not from a 3D voxel representation. While the latter is easier for a neural network to predict, since it has a static shape, its footprint scales with the cube of the resolution and it forces the use of ray-tracing techniques to render the final image, which are known to be slow and harder to parallelize. Although rasterization does not allow for photo-realistic shaded images, as it does not trace rays from light sources, it is still well suited for all tasks that require the silhouette of the object from different points of view, as in our case.
Our renderer is composed of two main parts:
- A rasterization algorithm, which applies the predicted camera to the 3D mesh triangles to obtain the 2D projected floating-point coordinates of the corners;
- An in/out test to determine which projected points lie inside the triangles, i.e. which triangles must be filled.
While the first step is fully differentiable, a naive implementation of the latter exploits boolean masks to select the pixels to be filled, which breaks the back-propagation through the network. Inspired by [18], we employ a spatial transformation to assign a value to each pixel based on a relation between its coordinates and those of the triangle corners. While a boolean mask represents hard membership, this approach assigns each pixel a continuous value, thus applying a soft (differentiable) membership. From a more technical standpoint, given all the triangles T which compose the mesh of the current model, we project the 3D triangle vertices $V_{3D}$ as follows:
$$V_{2D} = K_{3\times3} \, P^{-1}_{3\times4} \, V_{3D} \quad (4)$$
where $K_{3\times3}$ is the camera calibration matrix and $P^{-1}_{3\times4}$ is the inverse of the pose matrix $P_{3\times4}$. Then, given the three edges of the i-th projected triangle, the renderer's output for the pixel at location (u, v) is computed as in Eq. 5, where H and W indicate the image height and width in pixels. We refer the reader to Fig. 3 for a better intuition of Eq. 5. It is worth noticing that the i-th triangle contributes to the output only if all three determinant products are positive, meaning that the point (u, v) lies on the left side of all three triangle edges, i.e. it is inside the triangle.
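The soft in/out test described above can be sketched as follows. This is a minimal illustration written under stated assumptions, not the paper's Eq. 5: the per-edge term is the 2D determinant of the edge vector and the vector from the edge origin to the pixel, each term is clipped with max(0, ·), the three terms are multiplied, and contributions are summed over triangles. The normalised pixel coordinates, the `gain` factor, the final clamp to 1 and all function names are hypothetical choices made for this sketch. Triangles are assumed to be wound so that their interior lies on the positive side of every edge.

```python
def edge_value(a, b, p):
    """2D determinant det[b - a, p - a]; positive when p is on the left of a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def soft_triangle(tri, p):
    """Product of the three clipped edge terms; zero unless p is inside tri."""
    v0, v1, v2 = tri
    prod = 1.0
    for a, b in ((v0, v1), (v1, v2), (v2, v0)):
        prod *= max(0.0, edge_value(a, b, p))
    return prod

def render_silhouette(triangles, height, width, gain=1e3):
    """Soft silhouette in [0, 1] for already-projected 2D triangles.

    Vertex coordinates are assumed to be normalised to [0, 1] x [0, 1];
    `gain` and the clamp are illustrative, not taken from the paper.
    """
    image = []
    for row in range(height):
        line = []
        for col in range(width):
            # Pixel centre in normalised image coordinates.
            p = ((col + 0.5) / width, (row + 0.5) / height)
            total = sum(soft_triangle(t, p) for t in triangles)
            line.append(min(1.0, gain * total))
        image.append(line)
    return image
```

Since max(0, ·), products and sums are all (sub-)differentiable with respect to the vertex coordinates, an automatic differentiation framework could propagate a silhouette loss back through such a function to the projected vertices and hence to the pose, whereas a boolean mask would block the gradient.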

Dataset
We train our model on the ShapeNetCore (v2) [5] dataset, which comprises more than 50K unique 3D models from 55 distinct man-made object categories. We focus in particular on the car synset, since it is one of the most populated categories, with 7497 different 3D CAD vehicle models. Each model is stored in .obj format along with its materials and textures; dimensions, number of vertices and level of detail vary greatly from one model to another.

Data collection
To collect the data, we first load a random model at the origin t = (0, 0, 0) of our reference system. We then create a camera at location t = (x, y, z). While on the xy plane the location is randomly sampled in a q_x × q_y grid, we keep z = k fixed under the assumption that the camera is mounted at height k on a moving agent (e.g. an unmanned vehicle). We then force the camera to point at an empty object e that is randomly placed at z = 0, with x, y sampled as above in an e_x × e_y grid: in this way the object appears translated in the camera image. Finally, the camera image is dumped along with the camera pose to constitute an example x_i. We refer the reader to Fig. 4 for a better insight into the procedure.
Data collection details. In our experiments we set q_x = q_y = 10 and k = 1.5, which is the average height of a European vehicle. For the empty object we set e_x = e_y = 3. Models are standardized such that the major dimension has length 6. For each cluster, the models are split with ratio 0.6-0.2-0.2 into train, validation and test sets respectively. Medoids are expected to be known at test time and do not belong to any of the splits. Models are rendered using the Blender Cycles engine [3] to maximize photo-realism.
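The camera placement procedure above can be sketched as follows. This is a hypothetical illustration, not the authors' Blender script: it assumes that both grids are centred on the origin and sampled uniformly, that the world z-axis points up, and that the camera orientation is obtained with a standard look-at construction. All function names are invented for the example.

```python
import math
import random

def sample_camera_and_target(qx=10.0, qy=10.0, k=1.5, ex=3.0, ey=3.0, rng=random):
    """Sample a camera position at height k and a look-at target on the ground."""
    cam = (rng.uniform(-qx / 2, qx / 2), rng.uniform(-qy / 2, qy / 2), k)
    target = (rng.uniform(-ex / 2, ex / 2), rng.uniform(-ey / 2, ey / 2), 0.0)
    return cam, target

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def look_at(cam, target, up=(0.0, 0.0, 1.0)):
    """Return orthonormal (right, up, forward) camera axes pointing at target.

    Degenerate when the viewing direction is parallel to `up`, i.e. when the
    camera sits exactly above the target.
    """
    forward = normalize(sub(target, cam))
    right = normalize(cross(forward, up))
    true_up = cross(right, forward)
    return right, true_up, forward
```

Pointing the camera at a target that is offset from the origin, rather than at the model itself, is what makes the object appear away from the image centre.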

Selecting the representative 3D model
Since the true 3D object model is rarely available at test time, we want to verify whether a surrogate 3D model can be successfully employed for the rendering process instead. Analogously to Du et al. [12], we distinguish three main vehicle clusters, namely i) sedan passenger cars, ii) sport-utility vehicles (SUVs, which are also passenger cars but have off-road features such as raised ground clearance) and iii) cargo vehicles such as trucks and ambulances. Aligned CAD models for the three clusters are depicted in Fig. 4(c). Following Tatarchenko et al. [38], we select the representative model for each cluster by extracting and comparing the HOG descriptors from two standard rendered views of each CAD model (i.e. frontal and side). We then compute the L2 distance between descriptors and, for each cluster, retain the cluster medoid, i.e. the model with the least average distance from all the others.
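The medoid selection step can be illustrated with the short sketch below. It assumes that one descriptor vector per CAD model has already been computed (for instance by concatenating the HOG descriptors of the two rendered views); the descriptor extraction itself is not shown, and the function names are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two equally sized descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid_index(descriptors):
    """Index of the descriptor with the least average L2 distance to the others."""
    n = len(descriptors)
    best, best_avg = None, float("inf")
    for i in range(n):
        total = sum(l2(descriptors[i], descriptors[j]) for j in range(n) if j != i)
        avg = total / max(n - 1, 1)
        if avg < best_avg:  # ties are resolved in favour of the first index
            best, best_avg = i, avg
    return best
```

Unlike a centroid, a medoid is always one of the actual models in the cluster, so it can be rendered directly as a surrogate mesh.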
Fig. 4. On the left, it is depicted how all camera poses predicted by the encoder independently for each object (a) can be roto-translated to a common origin to reconstruct the overall scene (b); see also Fig. 7. On the right, the average silhouette of vehicles belonging to the sedan, SUV and cargo clusters is shown (c). For each cluster, all 3D meshes are overlaid before taking the snapshot from the side view; the high overlap highlights the low intra-cluster variance.

Model Evaluation
Metrics. The encoder's ability to estimate the 3D pose of the object is measured by means of the geodesic distance between the predicted and the true rotation matrix [40,17]:
$$\Delta(R_{pred}, R_{true}) = \frac{\left\| \log\left( R_{pred}^{\top} R_{true} \right) \right\|_F}{\sqrt{2}}$$
where $\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$ indicates the Frobenius norm. In particular, we report the median value of the aforementioned distance over all predictions in the test set as Median Viewpoint Error (MVE). We also report the percentage of examples in which the pose rotation error is smaller than π/6 as $Acc_{\frac{\pi}{6}}$. To measure the re-projection alignment error we instead rely on the mean intersection over union (mIoU) metric, defined over the N test examples as
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_i \cap \hat{S}_i|}{|S_i \cup \hat{S}_i|}$$
where $S_i$ is the ground truth silhouette and $\hat{S}_i = g(f_p(x_i), f_c(x_i), K)$ is the renderer output given the predicted object pose, cluster and camera intrinsics $K$.
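The metrics above can be computed as in the following sketch. For rotation matrices, the geodesic distance based on the matrix logarithm equals the rotation angle of the relative rotation, which can be obtained from its trace; the code uses this equivalent closed form instead of an explicit matrix logarithm. The function names are hypothetical, masks are assumed to be nested lists of 0/1 values, and returning an IoU of 1.0 for two empty masks is a convention chosen here, not one stated in the text.

```python
import math
import statistics

def geodesic_distance(r_pred, r_true):
    """Rotation angle (radians) of r_pred^T r_true, for 3x3 rotation matrices."""
    # trace(A^T B) equals the sum of the element-wise products of A and B.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.acos(cos_angle)

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0

def mean_iou(true_masks, pred_masks):
    return sum(iou(t, p) for t, p in zip(true_masks, pred_masks)) / len(true_masks)

def median_viewpoint_error(errors):
    return statistics.median(errors)

def accuracy_pi_over_6(errors):
    """Fraction of examples whose rotation error is below pi / 6."""
    return sum(1 for e in errors if e < math.pi / 6) / len(errors)
```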
Model performance. To prove the effectiveness of the proposed method, we first train the 6-DoF pose estimation network alone to jointly estimate the object class and its 6-DoF pose. In this way, we obtain a baseline against which to measure the subsequent contribution of the prediction refinement through our differentiable rendering module. State-of-the-art results on the test set, reported in Table 1 (first row), indicate that this is already a strong baseline. The prediction refinement module is then plugged in, and the evaluation is repeated. For each example, the medoid of the predicted class is rendered according to the predicted pose, back-propagating the alignment error between the true and the rendered silhouette for 30 optimization steps. Results of this analysis are reported in Table 1 (second row) and indicate a large performance gain (20%) obtainable by maximizing the 2D alignment between object masks. The significant improvement in all the metrics, although none of them is optimized explicitly, suggests that the proposed differentiable rendering module is a viable solution for refining the predicted 6-DoF pose even at test time, requiring minimal information (i.e. only the object mask). The process of prediction refinement can be appreciated in Fig. 5.

Renderer ablation study
We first measure the impact of the rendering resolution on the optimization process by refining the estimated 6-DoF object pose using different rendering resolutions. Results reported in Table 2 show that working at higher resolution is clearly helpful, while very low resolutions are hardly beneficial, if not detrimental, for the optimization process. This supports the need to abandon the voxel-based representation, whose computational footprint increases with the cube of the resolution. We then compare our renderer with the publicly available implementation of the Perspective Transformer Network (PTN) by Yan et al. [47]. Results are shown in Fig. 6(a). Since PTN relies on a fixed 32×32×32 voxel representation, rendering at higher resolution hardly changes the output's fidelity w.r.t. the true silhouette. Conversely, our mesh-based renderer is able to effectively take advantage of the higher resolution. Comparing our rendering time with PTN [47] in Fig. 6(b), we see that PTN scores favorably only for very low voxel and image resolutions, while as the resolution increases the PTN rendering time grows steeply due to the voxel-based representation. Finally, in Fig. 6(c) we show that our average viewpoint error continues to decrease along with the number of refinement optimization steps.
Training details. The encoder is trained until convergence with batch size 64 and the Adam optimizer with learning rate $10^{-5}$ (other hyper-parameters as suggested in the original paper [19]). The batch size is decreased to 20 and the learning rate to $10^{-6}$ during renderer fine-tuning. We find it useful to apply dropout (p = 0.5) after all dense layers and L2 weight decay over the feature-extraction layers for regularization purposes.

Conclusions
In this work we introduce a 6-DoF pose estimation framework which allows an online refinement of the predicted pose from minimal 2D information (i.e. the object mask). A fully differentiable raster-based renderer is developed for re-projecting the object silhouette on the image according to the predicted 6-DoF pose: this allows us to correct the predicted pose by simply back-propagating the alignment error between the observed and the rendered silhouette. Experimental results indicate i) the overall effectiveness of the online optimization phase, ii) that proxy representative models can be profitably used in place of the true ones when these are not available and iii) the benefit of working at higher resolution, which is well handled by our raster-based renderer but hardly manageable by concurrent ray-tracing, voxel-based algorithms.

Fig. 2. Architecture of the encoder network. Visual features are extracted from the input image by means of 2D convolutions (the first three layers have 5×5 kernels, the last two have 3×3 kernels; all convolutional layers have stride 2 and are followed by leaky ReLU non-linearities). The flattened feature vector is fed to two fully connected branches, which estimate the object class and pose respectively.

Fig. 3. Exemplification of the approximated rasterization process. First, each triangle composing the mesh is projected onto the 2D image (a) using Eq. 4. The determinant product inside the max of Eq. 5 selects the points which lie on the left side of each edge of the triangle (b), (c), (d). The product of these three terms gives an approximated yet differentiable rendering of the triangle's silhouette (e).

Fig. 5. Online refinement of the estimated pose. We overlay in red the predicted silhouette for each optimization step. Although the initial estimate (t=0) is noticeably wrong, the 6-DoF object pose is gradually corrected using only 2D silhouette alignment information.

Fig. 6. (a) Intersection over union between the rendered silhouette and the ground truth one for both our renderer and Perspective Transformer Networks (PTN) [47], at different rendering resolutions. (b) Rendering time for different image (and PTN voxel) resolutions. (c) Average viewpoint error improvement for different numbers of optimization steps. See text for details.

Fig. 7. Qualitative results for multiple-object scenes. Since all predicted poses lie in the same reference system (see Fig. 4), different views of the scene can be produced by means of any rendering engine. It is worth noticing that each object has been substituted by the representative model for its predicted class.

Table 1. Table summarizing model performance. It is worth noticing that none of the metrics in the table are explicitly optimized during refinement. Results of concurrent works on the vehicle class are shown for reference, although the task of [40,37] is only viewpoint estimation (not 6-DoF pose) and all are trained on different datasets.

Table 2. Gains obtained in pose estimation using different rendering resolutions. Increasing the resolution used for rendering the silhouette is highly beneficial to the optimization process. Conversely, for very low resolutions this phase is hardly helpful.