DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation 02/27/2024

Abstract

In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high-quality shape representation, interpolation, and completion from partial and noisy 3D input data. DeepSDF represents a shape's surface by a continuous volumetric field: the magnitude of a point in the field represents the distance to the surface boundary, and the sign indicates whether the point is inside \((-)\) or outside \((+)\) the shape. Our representation therefore implicitly encodes a shape's boundary as the zero-level-set of the learned function; a point lying exactly on the surface has an SDF value of 0.

Introduction

In representing 3D shapes, we may need to deal with an unknown number of vertices and arbitrary topology. In this work, we present a novel representation and approach for generative 3D modeling that is efficient, expressive, and fully continuous. Contributions include: the formulation of generative shape-conditioned 3D modeling with a continuous implicit surface, and a learning method for 3D shapes based on a probabilistic auto-decoder.

Related Work

Representations for 3D Shape Learning

Representations for data-driven 3D learning approaches can be largely classified into three categories:

Point-based
Mesh-based
Voxel-based

Representation Learning Techniques

Modern representation learning techniques aim at automatically discovering a set of features that compactly but expressively describe data. For a more extensive review of the field, we refer to Bengio et al.

Generative Adversarial Networks
Auto-encoders
Optimizing Latent Vectors: simultaneously optimizes the latent vectors (an embedding per 3D object class) assigned to each data point and the decoder weights through back-propagation.
Throughout the paper we refer to this class of networks as auto-decoders, for they are trained with a reconstruction loss on decoder-only architectures.

Modeling SDFs with Neural Networks

A signed distance function is a continuous function that, for a given spatial point, outputs the point's distance to the closest surface, whose sign encodes whether the point is inside (negative) or outside (positive) the watertight surface:

\[ SDF(x) = s, \quad x \in \mathbb{R}^3, \; s \in \mathbb{R} \]

The surface is implicitly represented by the zero-level-set \(SDF(\cdot) = 0\). A view of this implicit surface can be rendered through raycasting, or through rasterization of a mesh obtained with Marching Cubes.

Our key idea is to directly regress the continuous SDF from point samples using deep neural networks. The resulting trained network is able to predict the SDF value of a given query position, from which we can extract the zero-level-set surface. As universal function approximators, deep feed-forward networks can in theory learn fully continuous shape functions with arbitrary precision. The most direct application of this approach is to train a single deep network for a given target shape.
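The sign convention can be made concrete with an analytic SDF before any learning is involved. A minimal sketch (the sphere SDF below is an illustrative example of mine, not something from the paper):

```python
import numpy as np

def sphere_sdf(x, center=np.zeros(3), radius=1.0):
    """Signed distance from point(s) x to a sphere's surface:
    negative inside the watertight surface, positive outside,
    and exactly zero on the surface (the zero-level-set)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return np.linalg.norm(x - center, axis=-1) - radius

inside  = sphere_sdf([0.0, 0.0, 0.0])[0]   # -1.0: inside the unit sphere
surface = sphere_sdf([1.0, 0.0, 0.0])[0]   #  0.0: on the zero-level-set
outside = sphere_sdf([2.0, 0.0, 0.0])[0]   # +1.0: outside
```

DeepSDF replaces such a hand-written analytic function with a neural network regressed from point samples.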
Difference in DeepSDF model input between a single shape and multiple shapes

Given a target shape, we prepare a set of pairs \(X\) composed of 3D point samples and their SDF values (xyz input coordinates and SDF label values):

\[ X := \{(x, s) : SDF(x) = s\} \]

We train the parameters \(\theta\) of a multi-layer fully-connected neural network \(f_\theta\) on the training set \(X\) to make \(f_\theta\) a good approximator of the given SDF in the target domain \(\Omega\):

\[ f_\theta(x) \approx SDF(x), \quad \forall x \in \Omega \]

The training is done by minimizing the sum over losses between the predicted and real SDF values of points in \(X\) under the following \(L_1\) (MAE) loss function:

\[ \mathcal{L}(f_\theta(x), s) = | \, \text{clamp}(f_\theta(x), \delta) - \text{clamp}(s, \delta) \, | \]

where \(\text{clamp}(a, \delta) := \min(\delta, \max(-\delta, a))\) (the clamp restricts each point's supervision to the band of distances \(-\delta\) to \(\delta\) around the target surface).

In this paper, \(\delta = 0.1\), and the model is a feed-forward network composed of eight fully connected layers, each with dropout applied. All internal layers are 512-dimensional and have \(\text{ReLU}\) non-linearities. The output non-linearity regressing the SDF value is \(\tanh\). We found training with batch normalization to be unstable and applied the weight-normalization technique instead. For training, we use the Adam optimizer.

DeepSDF decoder architecture

Learning the Latent Space of Shapes

We introduce a latent vector \(z\), which can be thought of as encoding the desired shape, as a second input to the neural network. The latent vector \(z\) is mapped to a 3D shape represented by a continuous SDF.
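The clamp and the clamped \(L_1\) loss above translate directly into code. A hedged NumPy sketch (the names `clamp` and `clamped_l1_loss` are my own, not from the paper's implementation):

```python
import numpy as np

def clamp(a, delta):
    # clamp(a, delta) := min(delta, max(-delta, a))
    return np.minimum(delta, np.maximum(-delta, a))

def clamped_l1_loss(pred, target, delta=0.1):
    """L1 (MAE) loss between clamped predicted and true SDF values.
    Clamping concentrates network capacity on the band of width delta
    around the surface, where SDF accuracy matters most."""
    return np.abs(clamp(pred, delta) - clamp(target, delta))

# Far from the surface, both values saturate at +/-delta and the loss vanishes:
far  = clamped_l1_loss(0.5, 0.9)     # both clamp to 0.1 -> loss 0.0
near = clamped_l1_loss(0.05, -0.02)  # inside the band -> ordinary L1, 0.07
```

This is why prediction errors well outside the \(\pm\delta\) band contribute no gradient: the training signal is spent near the zero-level-set that actually defines the surface.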
For some shape indexed by \(i\), \(f_\theta\) is a function of a latent code \(z_i\) and a query 3D location \(x\), and outputs the shape's SDF:

\[ f_\theta(z_i, x) \approx SDF^i(x) \]

By conditioning the network output on a latent vector (an embedding), this formulation allows modeling multiple SDFs with a single neural network (one 3D object has one embedding vector).

Motivating Encoder-less Learning

In the encoder-decoder architecture, since the trained encoder is unused at test time, it is unclear whether using the encoder is the most effective use of computational resources during training. (An auto-decoder is a decoder-only architecture for learning a shape embedding without an encoder.)

Auto-decoder-based DeepSDF Formulation

Given a dataset of \(N\) shapes represented with signed distance functions, we prepare for each a set of \(K\) point samples and their signed distance values:

\[ X_i = \{ (x_j, s_j) : s_j = SDF^i(x_j) \} \]

For an auto-decoder, as there is no encoder, each latent code \(z_i\) is paired with a training shape \(X_i\). The posterior over shape code \(z_i\) given the shape SDF samples \(X_i\) can be decomposed as:

\[ p_\theta(z_i \mid X_i) = p(z_i) \cdot \prod_{(x_j, s_j) \in X_i} p_\theta(s_j \mid z_i; x_j) \]

where \(\theta\) parameterizes the SDF likelihood. We learn the embedding vector \(z_i\) that best explains the 3D shape \(X_i\): finding the latent vector \(z_i\) means finding the \(z_i\) that simultaneously explains all the samples \((x_j, s_j) \in X_i\). The product \(\prod_{(x_j, s_j) \in X_i} p_\theta(s_j \mid z_i; x_j)\) is a maximum-likelihood term: we search for the \(z_i\) and parameters \(\theta\) that best explain \(X_i\).

In the latent shape-code space, we assume the prior distribution over codes \(p(z_i)\) to be a zero-mean Gaussian with spherical covariance \(\sigma^2 I\). (Why must \(z\) follow a Gaussian prior at all? For training stability, or for the probabilistic interpretation? Couldn't a plain unconstrained embedding work just as well?)
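Taking the negative log of this posterior decomposition turns the product of likelihoods into a sum of losses, since the paper models each sample's likelihood as the exponential of a negative loss. A minimal sketch of that bookkeeping (the function name is mine; I use the weight \(1/\sigma^2\) on \(\|z\|_2^2\) to match the paper's training objective, rather than the \(1/(2\sigma^2)\) a textbook Gaussian negative log prior would give):

```python
import numpy as np

def neg_log_posterior(sample_losses, z, sigma=1.0):
    """-log p_theta(z_i | X_i), up to an additive constant.

    Assumes a zero-mean Gaussian prior on z_i and a per-sample
    likelihood of the form exp(-loss), so -log of the product of
    likelihoods becomes the sum of the per-sample losses."""
    sample_losses = np.asarray(sample_losses, dtype=float)
    z = np.asarray(z, dtype=float)
    return sample_losses.sum() + (z @ z) / sigma**2

# A code whose decoded SDF explains the samples better (smaller losses)
# has a higher posterior, i.e. a lower negative log posterior:
good = neg_log_posterior([0.01, 0.02], z=[0.5, -0.5])  # 0.03 + 0.5 = 0.53
bad  = neg_log_posterior([0.40, 0.30], z=[0.5, -0.5])  # 0.70 + 0.5 = 1.20
```

Maximizing the posterior over \(z_i\) is thus the same as minimizing summed reconstruction loss plus a squared-norm penalty on the code.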
In the auto-decoder-based DeepSDF formulation we express the SDF likelihood via a deep feed-forward network \(f_\theta(z_i, x_j)\) and, without loss of generality, assume that the likelihood takes the form:

\[ p_\theta(s_j \mid z_i; x_j) = \exp(-\mathcal{L}(f_\theta(z_i, x_j), s_j)) \]

where the SDF prediction \(\tilde{s}_j = f_\theta(z_i, x_j)\) is represented using a fully-connected network, and \(\mathcal{L}(\tilde{s}_j, s_j)\) is a loss function penalizing the difference between the network prediction and the actual SDF value \(s_j\). Because the loss is fed through \(\exp(-\cdot)\), a smaller loss value corresponds to a higher likelihood. In the implementation, the clamped \(L_1\) cost is used for back-propagation.

At training time we maximize the joint log posterior over all training shapes with respect to the shape codes \(\{z_i\}_{i=1}^N\) and the network parameters \(\theta\), which is equivalent to the minimization:

\[ \underset{\theta, \{z_i\}_{i=1}^N}{\arg \min} \sum_{i=1}^N \left( \sum_{j=1}^K \mathcal{L}(f_{\theta}(z_i, x_j), s_j) + \frac{1}{\sigma^2} \|z_i\|_2^2 \right) \]

where the term \(\frac{1}{\sigma^2} \|z_i\|_2^2\) regularizes the latent codes (it is the negative log of the Gaussian prior). Minimizing the loss \(\mathcal{L}(f_{\theta}(z_i, x_j), s_j)\) is equivalent to maximizing the likelihood \(p_\theta(s_j \mid z_i; x_j)\).

Results

Latent Space Shape Interpolation

To show that our learned shape embedding is complete and continuous, we render the results of the decoder when the latent vectors of a pair of shapes are interpolated in the latent vector space. DeepSDF represents signed distance functions (SDFs) of shapes via latent-code-conditioned feed-forward decoder networks.
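The joint minimization over \(\theta\) and \(\{z_i\}\) can be demonstrated end-to-end on a toy problem. The sketch below is my own illustration under heavy simplifying assumptions, not the paper's implementation: the "shapes" are spheres of different radii, the "decoder" is a hypothetical one-parameter affine map \(f_\theta(z, x) = \|x\| - (a z + b)\) with \(\theta = (a, b)\) standing in for the 8-layer MLP, and plain squared error replaces the clamped \(L_1\) loss to keep the gradients simple:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: N "shapes" are spheres; the ground-truth SDF of shape i
# at point x_j is ||x_j|| - radius_i.
radii = np.array([0.3, 0.6, 0.9])        # N = 3 shapes
N, K = len(radii), 32                    # K point samples per shape
points = rng.normal(size=(K, 3))
norms = np.linalg.norm(points, axis=1)   # ||x_j||
sdf = norms[None, :] - radii[:, None]    # s_ij = SDF^i(x_j)

# Auto-decoder state: one latent code per training shape (no encoder),
# plus the shared decoder parameters theta = (a, b).
a, b = 1.0, 0.0
z = np.zeros(N)
sigma2 = 100.0                           # prior variance sigma^2
lr = 0.1

for _ in range(5000):
    # Prediction error of f_theta(z_i, x_j) = ||x_j|| - (a*z_i + b):
    resid = (norms[None, :] - (a * z[:, None] + b)) - sdf
    # Gradients of  sum_i ( mean_j resid_ij^2 + ||z_i||^2 / sigma^2 ),
    # taken jointly w.r.t. the codes z_i AND the decoder parameters:
    g_z = resid.mean(axis=1) * (-2 * a) + 2 * z / sigma2
    g_a = (-2 * resid.mean(axis=1) * z).sum()
    g_b = (-2 * resid.mean(axis=1)).sum()
    z -= lr * g_z
    a -= lr * g_a
    b -= lr * g_b

# Each learned code now decodes to (approximately) its shape's radius:
recovered = a * z + b
```

Despite its simplicity, the loop exercises the defining property of the auto-decoder: the per-shape codes \(z_i\) are free parameters optimized by back-propagation alongside the shared decoder weights, with the \(\|z_i\|^2/\sigma^2\) term keeping the codes near the Gaussian prior.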