U-Net: Convolutional Networks for Biomedical Image Segmentation

07/10/2024

In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently.
The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. (autoencoder, representation learning)

The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel.
In this paper, we build upon a more elegant architecture, the so-called "fully convolutional network". We modify and extend this architecture such that it works with very few training images and yields more precise segmentations.

U-net architecture

The main idea in Fully Convolutional Networks for Semantic Segmentation is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. (So that the output returns to the same size as the original image data; output channels = number of classes.)
In order to localize, high resolution features from the contracting path are combined with the upsampled output. (A skip connection that combines the two by concatenation.)
The upsampling layers also use the outputs produced in the downsampling (feature extraction) stage. Using these feature maps together produces a high-resolution output (features that see a wider area of the image are used as input, so context information is passed on to the upsampling layers).
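
To make the concatenation step concrete, here is a minimal sketch in plain Python. Feature maps are represented as nested lists indexed [channel][row][column]; the helper names `center_crop` and `skip_concat` are made up for this note, and center cropping is an assumption that matches the "correspondingly cropped feature map" wording in the paper.

```python
def center_crop(feature_map, target_h, target_w):
    """Crop a [channels][height][width] nested list around its centre."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return [[row[left:left + target_w] for row in channel[top:top + target_h]]
            for channel in feature_map]


def skip_concat(encoder_features, decoder_features):
    """Crop encoder features to the decoder size, then stack along channels."""
    target_h = len(decoder_features[0])
    target_w = len(decoder_features[0][0])
    cropped = center_crop(encoder_features, target_h, target_w)
    # Joining the outer lists is concatenation along the channel axis.
    return cropped + decoder_features


# Toy example: 2 encoder channels at 8x8, 3 decoder channels at 4x4.
encoder = [[[c * 100 + y * 8 + x for x in range(8)] for y in range(8)] for c in range(2)]
decoder = [[[0] * 4 for _ in range(4)] for _ in range(3)]
merged = skip_concat(encoder, decoder)
print(len(merged), len(merged[0]), len(merged[0][0]))
```

The merged map keeps the decoder's spatial size and has the channels of both inputs, so the convolutions that follow see the high-resolution encoder features and the upsampled context side by side.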

The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. (Is the assumption here that information near the borders is less reliable?)
Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.
At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes.
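
Because the convolutions are unpadded, every 3x3 convolution removes one pixel on each side, so the output map is smaller than the input tile. The sketch below traces the spatial size through the network, assuming four downsampling steps and that the up-convolution doubles the size; `unet_sizes` is a helper name made up for this note, and 572 is the input tile size shown in the paper's architecture figure.

```python
def unet_sizes(input_size, depth=4):
    """Trace the spatial size through a U-Net with unpadded 3x3 convolutions.

    Returns the output size and, per expansive step, a pair
    (encoder size before cropping, size it is cropped to).
    """
    size = input_size
    skips = []
    for _ in range(depth):
        size -= 4                      # two unpadded 3x3 convs, 2 pixels each
        skips.append(size)             # feature map kept for the skip connection
        if size % 2:
            raise ValueError("2x2 pooling needs an even size, got %d" % size)
        size //= 2                     # 2x2 max pooling with stride 2
    size -= 4                          # two convs at the bottom of the "U"
    crops = []
    for skip in reversed(skips):
        size *= 2                      # up-convolution doubles the resolution
        crops.append((skip, size))     # encoder map is cropped from skip to size
        size -= 4                      # two unpadded 3x3 convs
    return size, crops                 # the final 1x1 conv keeps the size


out_size, crops = unet_sizes(572)
print(out_size)   # 388
print(crops)      # [(64, 56), (136, 104), (280, 200), (568, 392)]
```

The check on even sizes also shows why the input tile size cannot be arbitrary: every pooling step has to receive an even height and width.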

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function.

Softmax dimension
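
On the softmax dimension: the pixel-wise soft-max normalizes over the class (channel) axis, separately at every pixel position, not over the spatial axes. A minimal plain-Python sketch follows, with scores stored as [class][row][column]. The function names are made up for this note, and the loss is taken as an unweighted mean over pixels, which is a simplifying assumption.

```python
import math


def pixelwise_softmax(scores):
    """scores[k][y][x] -> probabilities of the same shape, normalized over k."""
    n_classes, h, w = len(scores), len(scores[0]), len(scores[0][0])
    probs = [[[0.0] * w for _ in range(h)] for _ in range(n_classes)]
    for y in range(h):
        for x in range(w):
            column = [scores[k][y][x] for k in range(n_classes)]
            peak = max(column)                       # subtract max for stability
            exps = [math.exp(a - peak) for a in column]
            total = sum(exps)
            for k in range(n_classes):
                probs[k][y][x] = exps[k] / total
    return probs


def cross_entropy(probs, labels):
    """Mean of -log(probability of the true class) over all pixels."""
    h, w = len(labels), len(labels[0])
    loss = 0.0
    for y in range(h):
        for x in range(w):
            loss -= math.log(probs[labels[y][x]][y][x])
    return loss / (h * w)


# Two classes on a 1x2 image; equal scores give probability 0.5 per class.
scores = [[[0.0, 0.0]], [[0.0, 0.0]]]
probs = pixelwise_softmax(scores)
loss = cross_entropy(probs, [[0, 1]])
print(loss)   # log(2)
```

In a framework that stores tensors as (batch, channels, height, width), this corresponds to applying the softmax along the channel axis.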

Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance.
For a network with our architecture (alternating convolution and ReLU layers) this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of \(\sqrt{2/N}\), where N denotes the number of incoming nodes of one neuron. E.g. for a 3x3 convolution and 64 feature channels in the previous layer \(N = 3 \cdot 3 \cdot 64 = 576\). (He initialization) (It seems this should be applied layer by layer?)
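
Since N depends on each layer's kernel size and number of input channels, the standard deviation is computed separately for each layer. A minimal sketch of the arithmetic and the sampling, using only the standard library; the function names and the [out][in][row][column] weight layout are assumptions made for this note.

```python
import math
import random


def he_std(kernel_h, kernel_w, in_channels):
    """Standard deviation sqrt(2 / N), with N = incoming nodes of one neuron."""
    n_incoming = kernel_h * kernel_w * in_channels
    return math.sqrt(2.0 / n_incoming)


def init_conv_weights(out_channels, in_channels, kernel_h=3, kernel_w=3, rng=random):
    """Draw weights[out][in][row][col] from a zero-mean Gaussian with he_std."""
    std = he_std(kernel_h, kernel_w, in_channels)
    return [[[[rng.gauss(0.0, std) for _ in range(kernel_w)]
              for _ in range(kernel_h)]
             for _ in range(in_channels)]
            for _ in range(out_channels)]


# The example from the text: 3x3 convolution, 64 input channels, N = 576.
std = he_std(3, 3, 64)
weights = init_conv_weights(64, 64, rng=random.Random(0))
print(round(std, 4))   # 0.0589
```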

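Data augmentation is essential to teach the network the desired invariance and robustness properties, when only few training samples are available.
Elastic deformations: random displacements on a coarse 3x3 grid are sampled from \(N(0, 10)\) (a Gaussian with a standard deviation of 10 pixels), which shifts pixel positions slightly; per-pixel displacements are then computed with bicubic interpolation.
The Dropout layers in the encoder part of the U-Net can perform implicit data augmentation.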
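
The elastic deformation described above can be sketched in plain Python as follows. This is a rough illustration under several assumptions, not the authors' implementation: the coarse grid nodes are placed at the image corners, edge midpoints, and centre; the bicubic step uses a Catmull-Rom cubic with repeated end values; and the image itself is resampled with bilinear interpolation. All function names are made up for this note.

```python
import random


def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation between p1 and p2 for t in [0, 1]."""
    return 0.5 * (2 * p1
                  + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)


def cubic_1d(values, pos):
    """Interpolate a short list at fractional index pos, repeating end values."""
    n = len(values)
    i = min(int(pos), n - 2)
    clamp = lambda k: values[max(0, min(n - 1, k))]
    return catmull_rom(clamp(i - 1), clamp(i), clamp(i + 1), clamp(i + 2), pos - i)


def bicubic(grid, gy, gx):
    """Interpolate along each row first, then down the resulting column."""
    return cubic_1d([cubic_1d(row, gx) for row in grid], gy)


def displacement_field(height, width, sigma=10.0, grid_size=3, rng=random):
    """Return field[y][x] = (dy, dx), interpolated from a coarse random grid."""
    coarse_dy = [[rng.gauss(0.0, sigma) for _ in range(grid_size)]
                 for _ in range(grid_size)]
    coarse_dx = [[rng.gauss(0.0, sigma) for _ in range(grid_size)]
                 for _ in range(grid_size)]
    field = []
    for y in range(height):
        gy = y * (grid_size - 1) / (height - 1)    # image row -> grid coordinate
        row = []
        for x in range(width):
            gx = x * (grid_size - 1) / (width - 1)
            row.append((bicubic(coarse_dy, gy, gx), bicubic(coarse_dx, gy, gx)))
        field.append(row)
    return field


def sample_bilinear(image, y, x):
    """Read image at a fractional position, clamping to the image borders."""
    h, w = len(image), len(image[0])
    y = max(0.0, min(h - 1.0, y))
    x = max(0.0, min(w - 1.0, x))
    y0, x0 = min(int(y), h - 2), min(int(x), w - 2)
    ty, tx = y - y0, x - x0
    top = image[y0][x0] * (1 - tx) + image[y0][x0 + 1] * tx
    bottom = image[y0 + 1][x0] * (1 - tx) + image[y0 + 1][x0 + 1] * tx
    return top * (1 - ty) + bottom * ty


def warp(image, field):
    """Build the deformed image by sampling the source at displaced positions."""
    h, w = len(image), len(image[0])
    return [[sample_bilinear(image, y + field[y][x][0], x + field[y][x][1])
             for x in range(w)] for y in range(h)]


image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
field = displacement_field(8, 8, sigma=10.0, rng=random.Random(0))
deformed = warp(image, field)
```

The displacement field is built separately from the warp so that the same field can be applied to an image and its segmentation mask. For a label mask, nearest-neighbour sampling would be needed instead of bilinear sampling, since blending label values is not meaningful.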

Thanks to data augmentation with elastic deformations, the u-net only needs very few annotated images and has a very reasonable training time of only 10 hours on an NVidia Titan GPU (6 GB). We provide the full Caffe-based implementation and the trained networks. We are sure that the u-net architecture can be applied easily to many more tasks.