latentspace.

Voxels to Graph

Introduction ๐Ÿ”—

This project explores the key contributions of Building-GAN by AutodeskAILab: 1) a novel 3D representation called voxel graph that can encode irregular voxel grids with non-uniform space partitioning, overcoming limitations of traditional regular voxel grids, and 2) a graph-conditioned generative adversarial network (GAN) leveraging graph neural networks (GNNs). The implementation uses their dataset to study these approaches. First, let's examine the structure of the raw data in the next section.
Generated Buildings


Raw Data

In the Data Collection part of the paper, it says that the dataset was created as follows:

Since there is no publicly available dataset for volumetric designs from real buildings, we create a synthetic dataset with 120,000 volumetric designs for commercial buildings using parametric models. The heuristics behind the parametric models are based on the rules and knowledge provided by professional architects. Although these parametric models are able to explore possible volumetric designs, they are not capable of fitting the constraints. Therefore, we generate the designs first and then compute the voxel graph, program graph, FAR, and TPR for each design. (...) Here, we consider 6 program types: lobby/corridor, restroom, stairs, elevator, office, and mechanical room. Each program node feature includes the program type and the story level.

Each datum of the created dataset consists of three JSONs as follows:
  1. graph_global_*.json: contains the FAR (Floor Area Ratio), site area (Total land area to build a building), program ratios, and program types (6 program types described above + void type).
    
        {
            "far": 3.1144640998959425,
            "site_area": 961,
            "global_node": [
                {
                    "type": 3,
                    "proportion": 0.04811226194453724,
                
                    (...)
                },
                {
                    "type": 0,
                    "proportion": 0.25526227865018364,
                
                    (...)
                },
    
                (...)
            ]
        }
    
    graph_global_*.json

  2. graph_local_*.json: contains the nodes. Each node has the following attributes: floor \(\text{F}\) (the floor level on which the node is placed), program type \(\text{T}\), index \(\text{I}\), and neighbors that indicate connectivity between nodes. In the local graph, the unique index of each node is created as \(\text{[F, T, I]}\), and the neighbors contain these unique indices. In the example below, the first node's unique index is \(\text{[0, 3, 0]}\), and it has \(\text{[1, 3, 0]}\) as a neighbor. Similarly, the node with index \(\text{[1, 3, 0]}\) has \(\text{[0, 3, 0]}\) as a neighbor.
    
        {
            "node": [
                {
                    "floor": 0,
                    "type": 3,
                    "type_id": 0,
                    "neighbors": [[0, 0, 0], [1, 3, 0]]
                
                    (...)
                },
                {
                    "floor": 1,
                    "type": 3,
                    "type_id": 0,
                    "neighbors": [[1, 0, 0], [2, 3, 0], [0, 3, 0]]
                
                    (...)
                },
                
                (...)
            ]
        }
    
    graph_local_*.json

  3. voxel_*.json: contains the voxel voxel nodes. Each voxel node has the following attributes: location (it serves as a unique index in the voxel graph), program type, dimension (depth, height, width of an irregular voxel), coordinate (center of the voxel), and neighbors that indicate connectivity between nodes.
    
        {
            "voxel_node": [
                {
                    "location": [0, 2, 7],
                    "type": 3,
                    "coordinate": [0.0, 20.0, 37.0],
                    "dimension": [7.0, 3.0, 3.0],
                    "neighbors": [[1, 2, 7], [0, 1, 7], [0, 3, 7], [0, 2, 6]]
                
                    (...)
                },
                {
                    "location": [1, 2, 7],
                    "type": 3,
                    "coordinate": [7.0, 20.0, 37.0],
                    "dimension": [4.0, 3.0, 3.0],
                    "neighbors": [[0, 2, 7], [2, 2, 7], [1, 1, 7], [1, 3, 7], [1, 2, 6]]
                
                    (...)
                },
                
                (...)
            ]
        }
    
    voxel_*.json


Representing Voxels as a Graph

A Graph data structure consists of vertices and edges \(G(V, E)\). The vertices are sometimes also referred to as nodes and the edges are lines or arcs that connect any two nodes in the graph. The graph data structure can be represented as an adjacency list (or matrix) to indicate connectivity between nodes.

To convert voxel data into a graph structure, we need to assign an index for each voxel. The index of each voxel is \(\text{(x, y, z)} \)-shaped structure, and it is unique, so that the index can be converted to \(\text{(0, 1, 2,} \cdot \cdot \cdot \text{, n)} \)-shaped structure. As each voxel is a node, the last index of the node is the number of voxels-1.

Since the voxel is a cuboid-shaped geometry, connectivity between voxel nodes can be determined by the 3D 6-way connectivity. In the figure below, let's see the uppermost voxel with index \(319\) of the right diagram. Among the six faces of this voxel, only three (right, front, and bottom) are connected to adjacent voxels. The other faces do not touch any neighboring voxels. According to the 3D 6-way connectivity system, the node with index \(319\) has neighbor nodes with indices \(255\), \(311\), and \(318\). Therefore, using the adjacency list representation, it can be expressed as \(\text{319: [255, 311, 318]}\). Similarly, by the same logic, the node with index \(24\) has faces that are adjacent to four voxels located to the left, right, front, and top. Therefore, using the adjacency list representation, it can be expressed as \(\text{24: [16, 25, 32, 88]}\).

In this system, the number of neighboring voxel nodes is at least 3 and at most 6, since a voxel is a cuboid-shaped geometry with 6 faces.

Voxel to Graph
From the left, Unique Index ยท Adjacency List Representation


Each node in the graph data has the same shape features as other nodes. As I mentioned above, each voxel is converted into a node, so the node's features are the voxel's geometric attributes. In the raw data, each voxel node has the following features: coordinate \((3)\), dimension \((3)\), and location \((3)\) (voxel_graph_features is a flattened vector \((9)\) containing all three). Additionally, the FAR \((1)\) feature from graph_global_*.json and the floor level \((1)\) of each voxel are combined as part of the node features. The final feature vector of each node has a length of \((11)\); therefore, each graph is represented as an \((\text{N}, 11)\)-shaped matrix, where \(\text{N}\) is the number of nodes.

Here, I will skip the details of processing the local graph. You can find the code for processing the local graph at this link.


    class VoxelGraphData:
        def __init__(self, voxel_graph_data: dict):
            voxel_graph_types_onehot = voxel_graph_data["voxel_graph_types_onehot"]
            voxel_graph_features = voxel_graph_data["voxel_graph_features"]
            voxel_graph_far_per_node = torch.zeros(voxel_graph_types_onehot.shape[0], 1) + voxel_graph_data["far"]
            voxel_graph_floor_levels_normalized = voxel_graph_data["voxel_graph_floor_levels_normalized"].unsqueeze(0).t()

            self.x = torch.cat(
                [
                    voxel_graph_features,
                    voxel_graph_far_per_node,
                    voxel_graph_floor_levels_normalized,
                ],
                dim=1,
            )

            self.data_number = voxel_graph_data["data_number"]
            self.voxel_graph_types = voxel_graph_data["voxel_graph_types"]
            self.voxel_graph_types_onehot = voxel_graph_types_onehot
            self.edge_index = voxel_graph_data["voxel_graph_edge_indices"]
            self.voxel_graph_floor_levels = voxel_graph_data["voxel_graph_floor_levels"]
            self.voxel_graph_node_coordinate = voxel_graph_data["voxel_graph_node_coordinate"]
            self.voxel_graph_node_dimension = voxel_graph_data["voxel_graph_node_dimension"]
            self.voxel_graph_location = voxel_graph_data["voxel_graph_location"]
            self.voxel_graph_node_ratio = voxel_graph_types_onehot * voxel_graph_data["voxel_graph_node_ratio"]
            self.voxel_graph_node_ratio = self.voxel_graph_node_ratio.max(dim=1)[0].unsqueeze(1)
VoxelGraphData class


Here, since PyTorch Geometric is used for training models with graphs, node features and the connectivity between nodes are represented as x and edge_index, respectively. Each graph is instantiated by Data of PyTorch Geometric. The raw data, excluding x and edge_index, can also be passed as kwargs to be used during training or evaluation. The thing to note is that, in training with graphs, mini-batching is performed at the node level. Therefore, you must ensure that each data field matches the number of nodes; otherwise, problems may occur during batching.


    class GraphDataset(Dataset):
        def __init__(self, configuration: Configuration):
            super().__init__()

            self.configuration = configuration
            
            (...)

            self.voxel_graph_data = []
            for local_graph_file, voxel_graph_file in zip(self.local_graph_data_files, self.voxel_graph_data_files):
                assert local_graph_file.split("/")[-1].split("_")[0] == voxel_graph_file.split("/")[-1].split("_")[0]
                
                (...)
    
                voxel_graph = torch.load(voxel_graph_file)
                self.voxel_graph_data.append(
                    Data(
                        x=voxel_graph.x,
                        edge_index=voxel_graph.edge_index,
                        voxel_level=voxel_graph.voxel_graph_floor_levels,
                        type=voxel_graph.voxel_graph_types,
                        types_onehot=voxel_graph.voxel_graph_types_onehot,
                        coordinate=voxel_graph.voxel_graph_node_coordinate,
                        dimension=voxel_graph.voxel_graph_node_dimension,
                        location=voxel_graph.voxel_graph_location,
                        node_ratio=voxel_graph.voxel_graph_node_ratio,
                        data_number=[voxel_graph.data_number] * voxel_graph.x.shape[0],
                    )
                )
    
        def __getitem__(self, i):
            return self.local_graph_data[i], self.voxel_graph_data[i]
    
        def __len__(self):
            return len(self.local_graph_data)
GraphDataset class


If you are curious about how to convert voxels into a graph at the code level, you can check the detailed logic through the following link.


VoxelGNNs

The GNN model can be broadly divided into link prediction and node classification tasks. In the paper, the model generates a building consisting of voxels with labels given a local graph and a voxel graph, so the task corresponds to node prediction. The diagram below provides a brief illustration of the training process. The process follows a WGAN-GP architecture, in which both the generator and discriminator are based on Graph Neural Networks.

VoxelGNNGenerator and VoxelGNNDiscriminator both use the local graph and the voxel graph as inputs. Additionally, the generator takes noise z as an input, while the discriminator takes the labels generated by VoxelGNNGenerator as an additional input. The discriminator learns to determine whether the node labels are real or fake based on the two input graphs and z. The generator is then updated using both the discriminator's feedback and auxiliary losses.

Training Process


The detailed code of VoxelGNNGenerator is as follows. The very first step in the generator's forward pass is to compute matched_x, which is matched by combining features from the local graph and the voxel graph. The local graph is a bubble diagram that abstractly represents the building's spatial composition, so it and the voxel graph have different numbers of nodes and cannot be matched 1:1. Therefore, their features are combined by aggregating nodes of the same type, using averaging. In this way, the features of the local graph nodes are abstractly combined for each type of voxel node.

Then, the abstractly combined graph features are passed through the matched_features_encoder module for encoding. Next, these encoded matched features are concatenated with the original voxel graph features and the noise vector z to create combined_features. This concatenation serves multiple purposes: the voxel graph features preserve the geometric details, and the noise vector z enables the model to produce diverse outputs even with the same conditions.

The combined_features are then passed through the mlp_encoder to produce encoded graph features, which serve as the input to the encoder of VoxelGNNGenerator. The encoder performs message passing using the voxel graph's edge connections to generate node embeddings.


    class VoxelGNNGenerator(nn.Module):
        def __init__(self, configuration: Configuration):
            super().__init__()

            (...)

        def forward(self, local_graph, voxel_graph, z):
            device = local_graph.x.device
    
            matched_x = torch.zeros((voxel_graph.x.shape[0], local_graph.x.shape[1]), device=device)
    
            for type_idx in torch.unique(voxel_graph.type):
                voxel_mask = voxel_graph.type == type_idx
                local_mask = local_graph.type == type_idx
    
                if local_mask.sum() > 0:
                    matched_x[voxel_mask] = local_graph.x[local_mask].mean(dim=0)
    
            encoded_matched_features = self.matched_features_encoder(
                matched_x.to(self.matched_features_encoder[0].weight.device)
            )
    
            combined_features = torch.cat(
                [
                    encoded_matched_features, 
                    voxel_graph.x.to(encoded_matched_features.device), 
                    z.squeeze(0).to(encoded_matched_features.device)
                ], dim=-1
            )
    
            x = self.mlp_encoder(combined_features)
            encoded = self.encoder(x=x, edge_index=voxel_graph.edge_index)
    
            final_features = torch.cat([encoded, x, encoded_matched_features, voxel_graph.x, z.squeeze(0)], dim=-1)
    
            logits = self.decoder(final_features)

            # https://github.com/AutodeskAILab/Building-GAN/blob/master/util.py
            label_soft = torch.nn.functional.gumbel_softmax(logits, tau=1.0)
            label_hard = torch.zeros_like(label_soft)
            label_hard.scatter_(-1, label_soft.argmax(dim=1, keepdim=True), 1.0)
            label_hard = label_hard - label_soft.detach() + label_soft
    
            return logits, label_hard, label_soft
VoxelGNNGenerator


Finally, the model creates final_features by concatenating the encoded output with all previous feature representations: the encoder output, the initial combined features, the encoded matched features, the original voxel features, and the noise vector. This rich feature representation is passed through a decoder to produce logits for each voxel's class prediction.

In this project, auxiliary losses are used in addition to WGAN-GP: calculating the FAR of the building generated by the model and the ratio of voxel types. For this purpose, the logits need to be converted to hard labels, but using simple argmax would break the gradient, so Gumbel Softmax is used. The temperature parameter \(\tau\) controls how discrete the output is: when \(\tau\) is close to \(0\), the output becomes like a one-hot vector (hard), when \(\tau = 1.0\) the output is soft, and when \(\tau\) approaches \(\infty\), the output becomes almost uniform. \[ \,\\ y_i = \frac{\exp( ( \log(\pi_i) + g_i ) / \tau)}{\sum_{j=1}^{\tiny{K}} \exp( ( \log(\pi_j) + g_j ) / \tau)} \,\\ \] The discriminator's loss is defined as d_fake.mean() - d_real.mean(), which minimizes the distance between the real data distribution and the generated data distribution (maximizing the output difference of a function satisfying the 1-Lipschitz condition). The discriminator is trained to assign low values to fake data and high values to real data.

To minimize this distance, the generator must produce buildings with a distribution that resembles the real data. In other words, the generator aims to increase the discriminator's score for fake data, using adversarial loss as the main objective, with the above-mentioned auxiliary losses as additional objectives: \[ \,\\ \begin{aligned} \text{Loss}_{\text{G}} \, = &\quad\, \text{loss}_{\text{adv}} \cdot \lambda_{\text{adv}} \\ & + \text{loss}_{\text{ratio}} \cdot \lambda_{\text{ratio}} \\ & + \text{loss}_{\text{far}} \cdot \lambda_{\text{far}} \end{aligned} \,\\ \]

    def _compute_generator_loss(self, local_graph, voxel_graph, logits, label_hard):

        d_fake = self.discriminator(local_graph, voxel_graph, label_hard)
        if self.configuration.USE_WGANGP:
            g_loss_adv = -d_fake.mean()
        
        else:
            g_loss_adv = torch.nn.functional.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
            
        g_loss_adv *= self.configuration.LAMBDA_ADV
        
        label_ratio_g = label_hard.squeeze(0).sum(dim=0) / voxel_graph.num_nodes
        label_ratio = voxel_graph.types_onehot.sum(dim=0) / voxel_graph.num_nodes

        g_loss_ratio = torch.nn.functional.mse_loss(label_ratio_g[:-2], label_ratio[:-2])
        g_loss_ratio *= self.configuration.LAMBDA_RATIO

        g_loss_ratio_voids = torch.nn.functional.mse_loss(label_ratio_g[-2:], label_ratio[-2:])
        g_loss_ratio_voids *= self.configuration.LAMBDA_RATIO_VOID
        
        voxel_types_generated = label_hard.squeeze(0).argmax(dim=1)
            
        far_unique = []
        far_unique_generated = []

        si = 0
        for gi in range(voxel_graph.num_graphs):
            each_voxel_graph = voxel_graph[gi]
            each_far = each_voxel_graph.x[0][9]
            each_dimension = each_voxel_graph.x[:, 3:6] * self.configuration.NORMALIZATION_FACTOR_DIMENSION
            
            ei = si + each_voxel_graph.num_nodes
            each_voxel_types_generated = voxel_types_generated[si:ei]
            each_dimension_to_use = each_dimension[each_voxel_types_generated != self.configuration.VOID]
            
            each_gfa = (each_dimension_to_use[:, 1] * each_dimension_to_use[:, 2]).sum()
            each_far_generated = each_gfa / each_voxel_graph.site_area[0]
            
            far_unique.append(each_far)
            far_unique_generated.append(each_far_generated)
            
            si = ei
            
        g_loss_far = torch.nn.functional.mse_loss(torch.tensor(far_unique_generated), torch.tensor(far_unique))
        g_loss_far *= self.configuration.LAMBDA_FAR
        
        g_loss = g_loss_adv + g_loss_ratio + g_loss_ratio_voids + g_loss_far
        
        return g_loss
\( \text{Loss}_{\text{G}} \) computation


The lambdas for the generator are carefully balanced to ensure stable training. The adversarial loss takes priority with \(\lambda_{\text{adv}} = 1.0\), while the auxiliary losses for program type ratios and FAR are weighted at \(\lambda_{\text{ratio}} = 0.1\) and \(\lambda_{\text{far}} = 0.1\) respectively. This configuration allows the model to focus primarily on generating buildings' program distribution while maintaining architectural constraints.


Training VoxelGNNs

For this implementation, a subset of 10,000 graphs from the total 120,000 dataset is used. The monitoring results from training on 10,000 graph data for 1,000 epochs are as follows. Each metric commonly used for evaluating classification models was measured. All graphs show convergence.
Losses & Metrics
From the left, Generator Loss ยท Accuracy ยท Recall ยท Precision ยท F1


However, since the total number of nodes within the 10,000 data samples is approximately 4,000,000 nodes, due to the large number of nodes, even if misclassified labels are mixed in, the metrics cannot reflect the performance differences. Also, since VoxelGNNGenerator performs node-wise label prediction, there is a problem in that it is difficult to evaluate the generation quality at the graph level.
Graph-wise f1 scores
From the left, Minimum f1 score of training dataset
Minimum f1 score of validation dataset
Weighted sum of f1 scores


To address this problem, the validation method computes the graph-wise f1 scores on both training and validation datasets. By calculating f1 scores for node labels separately for each graph, we can evaluate the quality of each building, even though generating these metrics takes more time. Therefore, the model selection criterion \(\mathcal{C}\) uses a weighted sum of the worst-performing graphs from both sets. The model is saved when the weighted sum of f1 score improves, calculated as, where \(w_{\text{t}} = 0.05\) and \(w_{\text{v}} = 1.0\). \[ \,\\ \mathcal{C} = \min_{g \,\in\, G_{\text{t}}} \text{f1}(g) \cdot w_{\text{t}} + \min_{g \,\in\, G_{\text{v}}} \text{f1}(g) \cdot w_{\text{v}} \,\\ \]

    current_f1_score = (
        f1_score_min_train * self.configuration.F1_SCORE_TRAIN_WEIGHT
        + f1_score_min_validation * self.configuration.F1_SCORE_VALIDATION_WEIGHT
    )
    
    if best_f1_score < current_f1_score:
        best_f1_score = current_f1_score

        torch.save(
            {
                "epoch_start": epoch,
                "epoch_end": epoch_end,
                "best_f1_score": best_f1_score,
                "f1_score_train": f1_score_train,
                "f1_score_validation": f1_score_validation,
                "f1_score_min_train": f1_score_min_train,
                "f1_score_min_validation": f1_score_min_validation,
                "recall_score_train": recall_score_train,
                "recall_score_validation": recall_score_validation,
                "accuracy_score_train": accuracy_score_train,
                "accuracy_score_validation": accuracy_score_validation,
                "generator": self.generator.state_dict(),
                "discriminator": self.discriminator.state_dict(),
                "optimizer_generator": self.optimizer_generator.state_dict(),
                "optimizer_discriminator": self.optimizer_discriminator.state_dict(),
                "scheduler_generator": self.scheduler_generator.state_dict(),
            },
            os.path.join(self.log_dir, "states.pt"),
        )
Condition to save model status


Graph-based models operate with graphs as batch, but since computations are performed at the node level, even when the batch size remains constant, the number of nodes varies with each input. Therefore, when using BatchNorm, statistical instability can occur in graphs with fewer nodes. Particularly when using running mean and variance in eval() mode, performance may drop significantly. This is because graphs with fewer nodes tend to have larger differences from the overall training distribution. As a solution to this issue, LayerNorm can be used. It performs normalization independently for each node across the feature dimension, making it more suitable for graphs with varying numbers of nodes and providing more stable performance during evaluation.
BatchNorm vs. LayerNorm in validation
From the left, BatchNorm ยท LayerNorm


Building Generation

Finally, by passing the test dataset through the trained VoxelGNNGenerator model, we can obtain results like the following.
Generated Buildings w/ Test Dataset




Future Works

This project revealed several important findings to me. The most significant discovery was that layer normalization performs significantly better than batch normalization when working with graphs that have varying numbers of nodes. This insight will be helpful for building better graph neural networks in the future.

In the future, I will try the following things:


References