Polygon segmentations

In the early stages of architectural design, the axis defines the direction in which a building is to be placed. This axis plays an important role in determining the optimal layout with respect to the functionality, aesthetics, and environmental conditions of the building. The goal of this project is to develop a segmenter that determines how many axes a given 2D polygon should be segmented into and how those segmentations should be formed.

Segmented lands by a human architect

The figures above show imaginary dividing lines set arbitrarily by the architect. The apartments in the figures are placed based on the segmented polygons. Human architects intuitively know how many axes a given 2D polygon should be segmented into and how to create these segmentations, but expressing this intuition to a computer is difficult.

To achieve this, a combination of deep learning and graph theory is used. In the graph, each point of the polygon is treated as a node, and the connections between points are treated as edges. Based on this concept, a GNN-based model is implemented to learn how to optimally segment given polygons.

Before generating data, it is necessary to understand graphs and Graph Neural Networks. A graph is defined as \( G = (V, E) \). In this expression, \( V \) is the set of vertices (nodes) and \( E \) is the set of edges.

Graphs are mainly expressed as an adjacency matrix. When the number of points is \( n \), the size of the adjacency matrix \( A \) is \( n \times n \). When a graph is handled in machine learning, it is also expressed as a feature matrix that describes the characteristics of the points. When the number of features is \( f \), the dimension of the feature matrix \( X \) is \( n \times f \).

Understanding of the graph expression In this figure, , . — Understanding of the graph expression
In this figure, \( n = 4 \), \( f = 3 \)
\( A = n \times n = 4 \times 4 \)
\( X = n \times f = 4 \times 3 \).

Graphs are used in various fields to represent data, and they are useful when teaching geometry to deep learning models for the following reasons:

Representing Complex Structures Simply: Graphs effectively represent complex structures and relationships, making them suitable for modeling various forms of geometric data.
Flexibility: Graphs can accommodate varying numbers of nodes and edges, making them advantageous when dealing with objects of diverse shapes and sizes.

A Graph Neural Network (GNN) is a type of neural network designed to operate on graph structures. Unlike traditional neural networks, which work on fixed-size inputs such as vectors or matrices, GNNs can handle graph-structured data with variable size and connectivity. This makes GNNs particularly suitable for tasks in which relationships are important, such as social networks and geometries. A GNN primarily uses the connections and the states of neighboring nodes to update (learn) the state of each node (Message Passing). Predictions are then made based on the final states. This final state is generally referred to as the node embedding; the term encoding is also appropriate, since the raw features of the nodes are transformed into other representations.

There are various methods for Message Passing, but since this task deals with geometry, the focus is placed on models based on the Spatial Convolutionan Network. This method is known to be suitable for representing data with important geometric and spatial structures. A GNN using a Spatial Convolutional Network enables each node to integrate information from neighboring nodes, allowing for a more accurate understanding of local characteristics. Through this, the model can better comprehend the complex shapes and features of geometry.

Convolution operations From the left, 2D Convolution · Graph Convolution — Convolution operations
From the left, 2D Convolution · Graph Convolution

The idea of a Spatial Convolutional Network (SCN) is similar to that of a Convolutional Neural Network (CNN), since an image can be transformed into a grid-shaped graph.

A CNN processes images by using filters to combine the surrounding pixels of a central pixel. An SCN extends this idea to graph structures by combining the features of neighboring connected nodes instead of neighboring pixels. Specifically, CNNs are useful for processing images in a regular grid structure, where the filter considers the surrounding area of each pixel to extract features. In contrast, SCNs operate on general graph structures, combining the features of each node with those of its connected neighbors to generate embeddings.

Having briefly examined graphs and GNNs above, the next step is to prepare the data. Many raw polygons already exist around us in the form of land parcels. First, raw polygons are collected from vworld. Below is a portion of all the raw polygons collected from vworld.

Now that the raw polygons have been collected, the characteristics of the polygons to be included in the feature matrix are defined. They are as follows:

X coordinate (normalized)
Y coordinate (normalized)
Inner angle of each vertex (normalized)
Incoming edge length (normalized)
Outgoing edge length (normalized)
Concavity or convexity of each vertex
Minimum rotated rectangle ratio
Minimum rotated rectangle aspect ratio
Polygon area

These data then need to be converted into graph form. The following is an example of hand labeling:

Land polygon in Gangbukgu, Seoul From the left, raw polygon & labeled link · adjacency matrix · feature matrix — Land polygon in Gangbukgu, Seoul
From the left, raw polygon & labeled link · adjacency matrix · feature matrix

However, since it is impossible to label countless raw polygons by hand, this data must be generated automatically. Full automation by a single algorithm would be ideal, but if that were possible, there would be no need for deep learning. Therefore, a naive algorithm was created to reduce the manual work to some degree. This algorithm is inspired by the triangulations of the polygon. The process of the algorithm is as follows:

Triangulate a polygon: Retrieve the triangulation edges.
Generate Splitter Combinations: Generate all possible combinations of splitters for iterating over them.
Iterate Over Combinations: Iterate over all possible combinations of splitters to find the best segmented polygon.
Check Split Validity: Check if all segmentations are valid.
Compute Scores and evaluation: Compute scores to store the best segmentations.

The scores (even_area_score, ombr_ratio_score, slope_similarity_score) used in the fifth step are computed as follows, and each score is aggregated as a weighted sum to obtain the segmentations with the lowest total score.

\[ even\_area\_score = \frac{(A_1 - \sum_{i=2}^{n} A_i)}{A_{polygon}} \times {w_1} \]

\[ ombr\_ratio\_score = |(1 - \frac{A_{split1}}{A_{ombr1}}) - \sum_{i=2}^{n} (1 - \frac{A_{spliti}}{A_{ombri}})| \times {w_2} \]

\[ avg\_slope\_similarity_i = \frac{\sum_{j=1}^{k_i} |\text{slope}_j - \text{slope}_{\text{main}}|}{k_i} \]

\[ slope\_similarity\_score = \frac{\sum_{i=1}^{n} avg\_slope\_similarity_i}{n} \times {w_3} \]

\[ score = even\_area\_score + ombr\_ratio\_score + slope\_similarity\_score \]

The whole code for this algorithm can be found here, and the key part of the algorithm is shown below.


    def segment_polygon(
        self,
        polygon: Polygon,
        number_to_split: int,
        segment_threshold_length: float,
        even_area_weight: float,
        ombr_ratio_weight: float,
        slope_similarity_weight: float,
        return_splitters_only: bool = True,
    ):
        """Segment a given polygon

        Args:
            polygon (Polygon): polygon to segment
            number_to_split (int): number of splits to segment
            segment_threshold_length (float): segment threshold length
            even_area_weight (float): even area weight
            ombr_ratio_weight (float): ombr ratio weight
            slope_similarity_weight (float): slope similarity weight
            return_splitters_only (bool, optional): return splitters only. Defaults to True.

        Returns:
            Tuple[List[Polygon], List[LineString], List[LineString]]: splits, triangulations edges, splitters
        """

        _, triangulations_edges = self.triangulate_polygon(
            polygon=polygon,
            segment_threshold_length=segment_threshold_length,
        )

        splitters_selceted = None
        splits_selected = None
        splits_score = None

        for splitters in list(itertools.combinations(triangulations_edges, number_to_split - 1)):
            exterior_with_splitters = ops.unary_union(list(splitters) + self.explode_polygon(polygon))

            exterior_with_splitters = shapely.set_precision(
                exterior_with_splitters, self.TOLERANCE_LARGE, mode="valid_output"
            )

            exterior_with_splitters = ops.unary_union(exterior_with_splitters)

            splits = list(ops.polygonize(exterior_with_splitters))

            if len(splits) != number_to_split:
                continue

            if any(split.area < polygon.area * 0.25 for split in splits):
                continue

            is_acute_angle_in = False
            is_triangle_shape_in = False
            for split in splits:
                split_segments = self.explode_polygon(split)
                splitter_indices = []

                for ssi, split_segment in enumerate(split_segments):
                    if split_segment.length <= self.TOLERANCE_LARGE * 2:
                        continue

                    reduced_split_segment = DataCreatorHelper.extend_linestring(
                        split_segment, -self.TOLERANCE_LARGE, -self.TOLERANCE_LARGE
                    )
                    buffered_split_segment = reduced_split_segment.buffer(self.TOLERANCE, cap_style=CAP_STYLE.flat)

                    if buffered_split_segment.intersects(MultiLineString(splitters)):
                        splitter_indices.append(ssi)
                        splitter_indices.append(ssi + 1)

                degrees = self.compute_polyon_inner_degrees(split)
                degrees += [degrees[0]]

                if (np.array(degrees)[splitter_indices] < 20).sum():
                    is_acute_angle_in = True
                    break

                if len(self.explode_polygon(self.simplify_polygon(split))) == 3:
                    is_triangle_shape_in = True
                    break

            if is_acute_angle_in or is_triangle_shape_in:
                continue

            sorted_splits_area = sorted([split.area for split in splits], reverse=True)
            even_area_score = (sorted_splits_area[0] - sum(sorted_splits_area[1:])) / polygon.area * even_area_weight

            ombr_ratio_scores = []
            slope_similarity_scores = []

            for split in splits:
                ombr = split.minimum_rotated_rectangle
                each_ombr_ratio = split.area / ombr.area
                inverted_ombr_score = 1 - each_ombr_ratio
                ombr_ratio_scores.append(inverted_ombr_score)

                slopes = []
                for splitter in splitters:
                    if split.buffer(self.TOLERANCE_LARGE).intersects(splitter):
                        slopes.append(self.compute_slope(splitter.coords[0], splitter.coords[1]))

                splitter_main_slope = max(slopes, key=abs)

                split_slopes_similarity = []
                split_segments = self.explode_polygon(split)
                for split_seg in split_segments:
                    split_seg_slope = self.compute_slope(split_seg.coords[0], split_seg.coords[1])
                    split_slopes_similarity.append(abs(splitter_main_slope - split_seg_slope))

                avg_slope_similarity = sum(split_slopes_similarity) / len(split_slopes_similarity)
                slope_similarity_scores.append(avg_slope_similarity)

            ombr_ratio_score = abs(ombr_ratio_scores[0] - sum(ombr_ratio_scores[1:])) * ombr_ratio_weight
            slope_similarity_score = sum(slope_similarity_scores) / len(splits) * slope_similarity_weight

            score_sum = even_area_score + ombr_ratio_score + slope_similarity_score

            if splits_score is None or splits_score > score_sum:
                splits_score = score_sum
                splits_selected = splits
                splitters_selceted = splitters

        if return_splitters_only:
            return splitters_selceted

        return splits_selected, triangulations_edges, splitters_selceted

However, this algorithm is not perfect and has some problems. Because it uses weights for computing the scores, it can be sensitive to them. The figures below illustrate this. The first shows a successful case, and the second shows a failure case.

Results of the algorithm From the left, triangulations · segmentations · oriented bounding boxes for segmentations (1 of 4) — Results of the algorithm
From the left, triangulations · segmentations · oriented bounding boxes for segmentations

Results of the algorithm From the left, triangulations · segmentations · oriented bounding boxes for segmentations (3 of 4) — Results of the algorithm
From the left, triangulations · segmentations · oriented bounding boxes for segmentations

Since these problems cannot be handled by the naive algorithm, the algorithm was first used to process the raw polygon data, after which the data was labeled manually as follows. This produced approximately 40000 data points, including augmented originals. The full dataset can be found here.

The next step is to create a graph model using Pytorch Geometric and train it on the graph data. Pytorch Geometric is a library based on PyTorch that simplifies writing and training graph neural networks.

The role assigned to the model is to generate the lines that segment polygons. This can be framed as a task primarily used in GNNs, known as link prediction. Link prediction models usually use an encoder-decoder structure. The encoder creates node embeddings, which are vector representations of the nodes that extract their features. The decoder then uses these embeddings to predict the probability that a pair of nodes is connected.

During inference, the model feeds all possible node pairs into the decoder. It then calculates the probability that each pair is connected. Only pairs with probabilities above a certain threshold are kept, indicating likely connections.

Generally, a simple operation such as the dot product is used to predict links based on the similarity of node pairs. However, this approach was considered unsuitable for tasks using geometric data, so a decoder was additionally trained. Below are the encode and decode methods of the model. The complete model code can be found here.


    class PolygonSegmenter(nn.Module):
        def __init__(
            self,
            conv_type: str,
            in_channels: int,
            hidden_channels: int,
            out_channels: int,
            encoder_activation: nn.Module,
            decoder_activation: nn.Module,
            predictor_activation: nn.Module,
            use_skip_connection: bool = Configuration.USE_SKIP_CONNECTION,
        ):
                
            ( ... )

        def encode(self, data: Batch) -> torch.Tensor:
            """Encode the features of polygon graphs

            Args:
                data (Batch): graph batch

            Returns:
                torch.Tensor: Encoded features
            """

            encoded = self.encoder(data.x, data.edge_index, edge_weight=data.edge_weight)

            return encoded

        def decode(self, data: Batch, encoded: torch.Tensor, edge_label_index: torch.Tensor) -> torch.Tensor:
            """Decode the encoded features of the nodes to predict whether the edges are connected

            Args:
                data (Batch): graph batch
                encoded (torch.Tensor): Encoded features
                edge_label_index (torch.Tensor): indices labels

            Returns:
                torch.Tensor: whether the edges are connected
            """

            # Merge raw features and encoded features to inject geometric informations
            if self.use_skip_connection:
                encoded = torch.cat([data.x, encoded], dim=1)

            decoded = self.decoder(torch.cat([encoded[edge_label_index[0]], encoded[edge_label_index[1]]], dim=1)).squeeze()

            return decoded

The encoding process transforms the original feature matrix \( F \) into a new matrix \( E \) with potentially different dimensions. The following expresses the feature matrix before and after the encode function.

\( n \) as the number of nodes

\( m \) as the number of features

\( c \) as the number of channels

\( F \in \mathbb{R}^{n \times m} \) as the feature matrix before the encode function

\( E \in \mathbb{R}^{n \times c} \) as the feature matrix after the encode function

\[ F = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1m} \\ f_{21} & f_{22} & \cdots & f_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nm} \end{bmatrix} \]

\[ E = \begin{bmatrix} e_{11} & e_{12} & \cdots & e_{1c} \\ e_{21} & e_{22} & \cdots & e_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ e_{n1} & e_{n2} & \cdots & e_{nc} \end{bmatrix} \]

In the decode method, the encoded features of the nodes and the raw features are used to predict the connections (links) between them. This skip connection merges raw features and encoded features to inject geometric information. Using skip connections helps preserve the original features and enhances overall model performance by combining low-level and high-level information.

In the feedforward process, the decoder is trained with connected labels and labels that should not be connected. This is a technique called negative sampling, which is used to improve model performance in link prediction tasks. By providing examples of what should not be linked, negative sampling helps the model distinguish between actual links and non-links, leading to improved accuracy in predicting future or missing links.

In most networks, actual links are significantly fewer than non-links, which can bias the model toward predicting no link. Negative sampling allows for controlled selection of negative examples, balancing the training data and enhancing the learning process.


        def forward(self, data: Batch) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
            """Forward method of the models, segmenter and predictor

            Args:
                data (Batch): graph batch

            Returns:
                Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: whether the edges are connected, predicted k and target k
            """

            # Encode the features of polygon graphs
            encoded = self.encode(data)

            ( ... )

            # Sample negative edges
            negative_edge_index = negative_sampling(
                edge_index=data.edge_label_index_only,
                num_nodes=data.num_nodes,
                num_neg_samples=int(data.edge_label_index_only.shape[1] * Configuration.NEGATIVE_SAMPLE_MULTIPLIER),
                method="sparse",
            )

            # Decode the encoded features of the nodes to predict whether the edges are connected
            decoded = self.decode(data, encoded, torch.hstack([data.edge_label_index_only, negative_edge_index]).int())

            return decoded, k_predictions, k_targets

Up to this point, the encoder and decoder have been defined. Since the encoder and decoder process nodes in batches, they cannot recognize each graph separately. Because the model also needs to learn how many segmentations a graph should be divided into when a graph is input, a predictor was defined to train k separately within the segmenter class.

The segmenter generates the segmentations using topk through the encoder and decoder processes described above. It then sorts the generated links in order of connection strength, and the predictor determines how many links to use.

Inference process From the top, topk segmentations · segmentation selected by predictor — Inference process
From the top, topk segmentations · segmentation selected by predictor

The model is now ready to be trained. The model was trained for 500 epochs, during which both the training loss and the validation loss were recorded to monitor the training progress and convergence. As shown in the plots:

According to the top plots, both the training loss and validation loss decrease quickly initially and then generalize.
All evaluation metrics (accuracy, F1 score, recall, and AUROC) improve rapidly at first and then stabilize, indicating good generalization.

All the metrics appear favorable, but it is questionable whether they can be fully trusted. This concern may stem from the effect of negative sampling.

Based on the visualization of some of the test data, the model clearly predicts polygons that do not require segmenting with high accuracy. This suggests that the model may be overfitting on negative samples. High performance on negative samples may not accurately reflect the model's ability to identify positive cases correctly.

Taking inspiration from IoU loss, GeometricLoss was therefore defined to evaluate segmentation quality and provide a reliable evaluation metric. The GeometricLoss aims to quantify the discrepancy between the predicted and ground-truth geometric structures within a graph. The geometric loss is calculated as follows:

Polygon Creation from Node Features: Each graph's node features, representing coordinates, are converted into a polygon.
Connecting Predicted Edges: These edges represent the predicted segmentation of the polygon.
Connecting Ground-Truth Edges: These edges represent the correct segmentation of the polygon.
Creating Buffered Unions: The predicted and ground-truth edges are combined and then buffered.
Calculating Intersection: This intersection represents the common area between the predicted and true geometric structures.
Computing Geometric Loss: The intersection area is normalized by the ground-truth area and then negated.

An example for the geometric loss calculation From the left, loss: -0.000478528 · loss: -0.999999994 — An example for the geometric loss calculation
From the left, loss: -0.000478528 · loss: -0.999999994

The geometric loss serves solely as an evaluation metric for the model and is therefore not added to the BCE loss. This is because the model's training batches are based on nodes rather than graphs, so calculating this loss for every graph per epoch would significantly slow down the process. For this reason, the loss is calculated only for 4 samples per batch and used exclusively as a model evaluation metric. The results are as follows:

Geometric losses for 500 epochs From the left, train geometric loss · validation geometric loss — Geometric losses for 500 epochs
From the left, train geometric loss · validation geometric loss

The GeometricLoss class is defined as follows, and the code can be found here.

GNNs are node embedding-based models, so they tend to recognize the characteristics of individual nodes rather than the overall shape of the graph. While GNNs generalize shapes well during training, it has been challenging to overfit them accurately to the intended labels.

Comparison with CNNs: CNNs excel at capturing local patterns and combining them into more abstract representations, which can be advantageous for tasks involving polygons.
Imbalanced labels for predictor: The predictor is trained to select one of 0, 1, or 2, but the number of samples for 2 is very small.
Exploring reliable metrics: Metrics that are more suitable for geometric tasks should be explored.
Overfitting: How positive labels can be completely overfitted remains to be studied.

Objective ✂️

A brief overview of graphs and GNNs

Data preparation

Model for link prediction

Training and evaluating

Limitations and future works

References