Image to 3D Scene Pipeline

This project, funded by KOCCA, aims to develop an interior scene generation system that converts a single interior image into a structured 3D scene. By combining models for 2D object detection, automatic mask generation and semantic segmentation, and single-view image-to-3D reconstruction, the pipeline detects and segments interior elements in a single input photo, reconstructs each object as an aligned, textured 3D mesh, and merges the meshes into a single reconstructed scene. Applications include virtual walkthroughs, scene editing, AR/VR staging, and interior design automation.

SAM 3D Objects represents a new approach to 3D reconstruction and object pose estimation from a single natural image, reconstructing detailed 3D shapes, textures, and layouts of objects from everyday images. In these images, small objects, indirect views, and occlusion are common, but recognition and context can help the reconstruction when pixel information alone is not enough. Using SAM 3D Objects, users can start from an image, select objects, and quickly generate positioned 3D models. This allows users to manipulate individual objects in a reconstructed 3D scene or control the camera to view the scene from different perspectives.

From the left, original image · generated 3D scene

To measure the wall-clock time of SAM 3D inference on test images, the following setup was used:

Inference was performed on about 230 objects. For each object, the profiler schedule was wait = 0, warmup = 1, active = 3, giving four runs per object. Only the wall-clock times of the three active steps were recorded, and they were averaged to obtain the mean time per object. Each run was timed as the difference between two time.perf_counter() readings, with torch.cuda.synchronize() called before the second reading so that all pending GPU work had finished before the timer stopped.



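    (...)

        # torch_profile is assumed to be torch.profiler.profile imported under an alias.
        with torch_profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            schedule=torch.profiler.schedule(wait=args.wait, warmup=args.warmup, active=args.active),
        ) as profiler:

            elapsed_times = []
            for i in range(args.wait + args.warmup + args.active):
                start = time.perf_counter()

                generator(args, output_path, use_inference_cache=args.use_inference_cache)

                # Wait for all queued CUDA work to finish before stopping the timer.
                torch.cuda.synchronize()
                end = time.perf_counter()

                # Release cached memory between runs (outside the timed region).
                gc.collect()
                torch.cuda.empty_cache()

                profiler.step()

                # Record only the active steps; wait and warmup steps are discarded.
                if i >= args.wait + args.warmup:
                    active_step = i - args.wait - args.warmup + 1  # 1-based index of the active step
                    elapsed_times.append(end - start)

    (...)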

Profiling wall-clock time of SAM 3D inference

On an NVIDIA A5000 GPU (24 GB VRAM), the mean wall-clock time per single-object inference was about 37.0 seconds (37.004264873904155 s as measured). The setup was deliberately conservative: the model was reloaded for each run, so this figure includes model-loading overhead. If that overhead is excluded, the runtime is expected to decrease by about 20%.
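The averaging described above can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function names and the sample timings are hypothetical, and it assumes the reported figure is the mean of the per-object means.

```python
from statistics import mean


def mean_time_per_object(elapsed_by_object):
    """Average the active-step timings (in seconds) recorded for each object."""
    return {name: mean(times) for name, times in elapsed_by_object.items()}


def overall_mean(elapsed_by_object):
    """Mean of the per-object means (assumed aggregation, see lead-in)."""
    return mean(mean_time_per_object(elapsed_by_object).values())


# Hypothetical timings: three active steps per object, as in the schedule above.
sample = {
    "chair": [36.0, 37.0, 38.0],
    "table": [39.0, 40.0, 41.0],
}

per_object = mean_time_per_object(sample)
print(per_object)
print(overall_mean(sample))
```

Because every object contributes exactly three active steps, the mean of per-object means equals the mean over all recorded runs, so either aggregation gives the same number here.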

This pipeline uses Segment Anything (SAM) to generate object masks from a single interior image before the 3D reconstruction stage. This step is necessary because SAM 3D does not provide automatic mask generation and requires object masks as input. The generated masks are used as input for SAM 3D reconstruction, allowing the pipeline to process multiple objects in the scene automatically.
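A minimal sketch of this step is shown below. It assumes the original `segment_anything` package with its `SamAutomaticMaskGenerator` and `sam_model_registry`; the project may use a different SAM release or different settings. The helper `keep_object_masks` and its area thresholds are hypothetical additions for illustration, not part of the described pipeline.

```python
def generate_masks(image_rgb, checkpoint_path, model_type="vit_h"):
    """Run SAM's automatic mask generator on an HxWx3 uint8 RGB image."""
    # Imported lazily so the filtering helper below works without SAM installed.
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    # Each returned record is a dict with keys such as "segmentation"
    # (HxW boolean mask), "area" (pixel count), and "bbox" (XYWH).
    return SamAutomaticMaskGenerator(sam).generate(image_rgb)


def keep_object_masks(masks, image_shape, min_frac=0.005, max_frac=0.5):
    """Hypothetical filter: drop masks that are very small or very large.

    The thresholds are illustrative fractions of the image area.
    """
    total = image_shape[0] * image_shape[1]
    kept = [m for m in masks if min_frac <= m["area"] / total <= max_frac]
    # Largest masks first, so prominent objects are reconstructed first.
    return sorted(kept, key=lambda m: m["area"], reverse=True)


# Synthetic records standing in for SAM output on a 100x100 image.
demo_masks = [{"area": 10}, {"area": 2000}, {"area": 9000}, {"area": 600}]
selected = keep_object_masks(demo_masks, (100, 100))
print([m["area"] for m in selected])
```

Each surviving record's `"segmentation"` array could then be passed to the 3D reconstruction stage as that object's mask.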

Figure 3 below shows the 3D scene generated by the automatic masking pipeline and SAM 3D, using the image from Figure 2 as input. The main issue is that the orientation of each object in the generated scene does not always match its orientation in the input image, and objects are sometimes not placed properly on the floor plane. An additional post-processing step is therefore required to correct the orientation and alignment of each object so that it better matches the image.

From the left, generated scene · OBB of each object

The alignment correction logic relies on these assumptions:

The bottom face of each object's oriented bounding box (OBB) is the object's true bottom.
Object bottoms are nearly parallel to the global XY-plane.
There are no "floating" objects, that is, every object rests on the floor with no z-offset.

Based on these assumptions, the correction works as follows. For each object, the OBB is computed, and the inverse of its basis matrix is applied about the OBB center to remove the object's rotation. Because the rotation is undone about the center, the object stays at its original position. The object is then shifted along the z-axis so that its lowest vertex lies on the floor plane $z = 0$. The correction can be expressed as the following equation:

$$
\mathbf{O}'_i = \mathbf{B}_{\mathrm{OBB},i}^{-1} \left( \mathbf{O}_i - \mathbf{c}_i \right) + \mathbf{c}_i - \mathbf{z}_{\mathrm{snap},i} \qquad \text{for } i = 1, 2, \ldots, n
$$

where, for each object $i$, $\mathbf{O}_i$ is the matrix of vertices of object $i$ in world coordinates, $\mathbf{O}'_i$ is the matrix of vertices after correction, $\mathbf{B}_{\mathrm{OBB},i}$ is the $3 \times 3$ orthonormal OBB basis matrix, $\mathbf{c}_i$ is the center of the OBB, and $\mathbf{z}_{\mathrm{snap},i} = (0, 0, z_{\min,i})$ is the snap vector. Here $z_{\min,i}$ is the minimum $z$-coordinate of $\mathbf{B}_{\mathrm{OBB},i}^{-1} (\mathbf{O}_i - \mathbf{c}_i) + \mathbf{c}_i$ over all vertices of object $i$, which ensures that the object is snapped to the floor $z = 0$. The vectors $\mathbf{c}_i$ and $\mathbf{z}_{\mathrm{snap},i}$ are applied to every vertex. Because $\mathbf{B}_{\mathrm{OBB},i}$ is orthonormal, its inverse equals its transpose.
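The correction above can be sketched in NumPy as follows. This is an illustrative sketch, not the project's implementation: the function name `correct_alignment` is hypothetical, vertices are assumed to be stored one per row, and the OBB basis and center are taken as inputs rather than computed from the mesh.

```python
import numpy as np


def correct_alignment(vertices, basis, center):
    """Undo the OBB rotation about the OBB center, then snap the mesh to z = 0.

    vertices: (N, 3) world-space vertices, one per row (assumed layout).
    basis:    (3, 3) orthonormal OBB basis whose columns are the box axes.
    center:   (3,) OBB center.
    """
    vertices = np.asarray(vertices, dtype=float)
    basis = np.asarray(basis, dtype=float)
    center = np.asarray(center, dtype=float)

    # B^{-1} (O - c) + c. With one vertex per row, left-multiplying a column
    # by B^{-1} = B^T is the same as right-multiplying the row by B.
    unrotated = (vertices - center) @ basis + center

    # Subtract z_snap = (0, 0, z_min) so the lowest vertex lies on the floor.
    snapped = unrotated.copy()
    snapped[:, 2] -= unrotated[:, 2].min()
    return snapped


# Usage: a box with half-extents (1, 0.5, 0.25), tilted 20 degrees about the
# x-axis and centered at (2, 3, 1).
half = np.array([1.0, 0.5, 0.25])
signs = np.array([[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
local = signs * half

theta = np.deg2rad(20.0)
rot = np.array([
    [1.0, 0.0, 0.0],
    [0.0, np.cos(theta), -np.sin(theta)],
    [0.0, np.sin(theta), np.cos(theta)],
])
center = np.array([2.0, 3.0, 1.0])
world = local @ rot.T + center

fixed = correct_alignment(world, rot, center)
print(fixed[:, 2].min(), fixed[:, 2].max())
```

In this example the box ends up axis-aligned, keeps its x and y position, and rests on the floor with its bottom face at $z = 0$.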

This process is inspired by the local coordinate system transformation known as Plane to Plane.

From the left, before alignment correction · after alignment correction

The current alignment correction places all objects on the floor (z = 0), so the relative height between objects is lost. A better method to preserve or estimate z-values is needed for stacked or floating objects.
The choice of OBB basis (which axis is "up") is not always consistent across objects, so some meshes may have the wrong upright direction unless a consistent rule for selecting the basis is added.
An object that is split into several parts by occlusion should be recognized by Segment Anything as a single object.
