next up previous [pdf]

Next: Training Up: Pham et al.: DL Previous: Introduction

Encoder-Decoder Architecture

Convolutional Neural Networks (CNNs) are a specialization of neural networks for data that come in the form of multiple arrays (LeCun et al., 2015). CNNs replace the matrix multiplication of traditional neural networks with a convolution operator to exploit locality and spatial relationships. With non-linear activation functions, CNNs can learn highly complex non-linear relationships in the input data.

Image segmentation in computer vision understands an image at the pixel level by assigning each pixel to an object class. Various CNN-based methods have been used for semantic pixel-wise labeling, but their output images are coarse because max-pooling and subsampling reduce the feature-map size (Badrinarayanan et al., 2015). The SegNet architecture uses encoder layers to learn low-resolution features and decoder layers to map them back to the input resolution for pixel-wise classification (Badrinarayanan et al., 2015) (Figure 1). We define the channel-detection problem as an image segmentation task in which we assign a label of channel or non-channel to each pixel of the seismic image. The proposed architecture for automatic channel detection consists of four encoder layers and four corresponding decoder layers.

Figure 1.
SegNet architecture (image modified after Badrinarayanan et al. (2015)).

Each encoder layer consists of a convolutional layer, which learns useful features (Figure 2), and a pooling layer. In our architecture for automatic channel detection, each convolutional layer has trainable filters of size 3x3x3. Each filter is connected only to local patches in the feature maps of the previous layer (LeCun et al., 2015). Each convolutional layer is followed by a batch normalization layer, which normalizes the activations and helps control overfitting (Ioffe and Szegedy, 2015). A non-linear ReLU activation function is inserted after the batch normalization layer to learn non-linear relationships. Max-pooling layers with 2x2x2 kernels are added between consecutive convolutional layers to reduce the spatial size of the feature maps and further control overfitting.

Figure 2.
15 example feature maps generated by a convolutional layer.
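The effect of the ReLU activation and max-pooling described above can be sketched in NumPy. A 2D version with a 2x2 kernel is shown for brevity; the network itself operates on 3D volumes with 2x2x2 kernels, and the numbers below are purely illustrative:

```python
import numpy as np

def relu(x):
    # Elementwise non-linearity applied after batch normalization.
    return np.maximum(x, 0.0)

def max_pool(x, k=2):
    # Non-overlapping k x k max-pooling; halves each spatial
    # dimension when k = 2, which is why repeated pooling makes
    # the encoder's feature maps coarse.
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

fmap = np.array([[1., -2., 3., 0.],
                 [0., 5., -1., 2.],
                 [-3., 1., 4., -4.],
                 [2., 0., 1., 6.]])
pooled = max_pool(relu(fmap))
# A 4x4 map is reduced to 2x2: each output keeps the strongest
# activation in its 2x2 window.
```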

Each decoder layer upsamples its input feature maps and convolves the outputs with trainable decoder filters to produce dense feature maps. The upsampling layers use the transposed convolution algorithm (Dumoulin and Visin, 2016) with learnable 2x2x2 filters (Figure 3). The coarse outputs are convolved with learnable 3x3x3 filters to produce denser feature maps (Figure 4). The output of the last decoder layer is fed into a 1x1x1 convolutional layer that produces one feature map for each of the two labels, channel and non-channel. The final layer is a softmax layer that produces the probability of each label for each pixel of the seismic image.
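The final classification stage, softmax applied independently at every pixel over the two class feature maps, can be sketched as follows. This is a minimal 2D NumPy version with illustrative logit values, not the authors' code:

```python
import numpy as np

def pixelwise_softmax(logits):
    # logits: (n_classes, H, W) score maps from the final 1x1x1
    # convolution. Returns per-pixel class probabilities; the max
    # subtraction is the usual numerical-stability trick.
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Two 2x2 score maps: channel (index 0) vs. non-channel (index 1).
logits = np.array([[[2.0, -1.0], [0.5, 3.0]],
                   [[0.0, 1.0], [0.5, -3.0]]])
probs = pixelwise_softmax(logits)
# At every pixel, the two class probabilities sum to 1.
```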

Figure 3.
15 example feature maps generated by transposed convolution upsample filters.

Figure 4.
15 example feature maps generated by denser upsample filters.
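The stride-2 transposed convolution used for upsampling can be sketched in NumPy. Again a 2D version with a 2x2 filter is shown for brevity (the network uses 2x2x2 filters on 3D volumes), and the filter weights here are illustrative, not learned:

```python
import numpy as np

def transposed_conv_2x2(x, kernel):
    # Stride-2 transposed convolution with a 2x2 filter: each input
    # value "stamps" a scaled copy of the kernel onto the output
    # grid, doubling both spatial dimensions.
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w))
    for i in range(h):
        for j in range(w):
            out[2*i:2*i+2, 2*j:2*j+2] += x[i, j] * kernel
    return out

x = np.array([[1., 2.],
              [3., 4.]])
kernel = np.array([[1., 0.],
                   [0., 1.]])
up = transposed_conv_2x2(x, kernel)
# A 2x2 map is upsampled to 4x4; a trained kernel would learn how
# to spread each coarse value over its output window.
```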

Neural networks can be expressed in a Bayesian framework to quantify their uncertainty (Ghahramani, 2015). The training phase transforms the prior probability distribution $P(\theta\vert m)$, defined before observing the data, into the posterior distribution $P(\theta\vert D,m)$, defined after observing the data.

\begin{displaymath}
P(\theta\vert D,m) = \frac{P(D\vert\theta,m) P(\theta\vert m)}{P(D\vert m)}
\end{displaymath} (1)

where $D$ is the observed data, $m$ is the model, and $\theta$ denotes the network parameters. The prediction can also be expressed by Bayes rule.
\begin{displaymath}
P(x\vert D,m) = \int P(x\vert\theta,D,m) P(\theta\vert D,m) d\theta
\end{displaymath} (2)

where $x$ is new data. Different models can be compared by using Bayes rule.
\begin{displaymath}
P(m\vert D) = \frac{P(D\vert m) P(m)}{P(D)}
\end{displaymath} (3)
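For deep networks, the integral in equation (2) is intractable in practice. A standard workaround, and the one that dropout sampling approximates, is a Monte Carlo estimate over $T$ samples drawn from the posterior:

\begin{displaymath}
P(x\vert D,m) \approx \frac{1}{T} \sum_{t=1}^{T} P(x\vert\theta_t,D,m), \qquad \theta_t \sim P(\theta\vert D,m)
\end{displaymath}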

The uncertainty of a neural network can come from several sources, such as uncertainty in the parameters and uncertainty in the model structure. Bayesian SegNet extends the SegNet architecture into a probabilistic image segmentation framework that captures the uncertainty of the network parameters by using dropout layers (Kendall et al., 2015). The dropout method randomly removes units in a network, which provides samples from the posterior distribution of softmax class probabilities. Dropout is therefore an approximation of Bayesian inference over the network's weights (Gal and Ghahramani, 2015b), and it can be used at test time to create a Bernoulli distribution over the filter weights (Gal and Ghahramani, 2015a). Our model has a dropout layer between the last encoder layer and the first decoder layer, which removes 30% of the units. The Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 is used for backpropagation. At test time, we take 100 samples and compute the variance of the distribution over the channel probabilities to quantify the prediction uncertainty.
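The test-time sampling procedure can be sketched as follows. Here `toy_forward` is a hypothetical stand-in for a trained network with dropout left active (the 0.3 rate matches the text; the scores and sigmoid squashing are illustrative, not the authors' model):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(forward, x, n_samples=100):
    # Monte Carlo dropout: run n_samples stochastic forward passes
    # with dropout active, then use the mean as the prediction and
    # the variance as a per-pixel uncertainty estimate.
    samples = np.stack([forward(x) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)

def toy_forward(x):
    # Hypothetical stand-in for the network: a Bernoulli dropout
    # mask (rate 0.3) perturbs each score before a sigmoid squashes
    # it to a channel probability.
    mask = rng.random(x.shape) >= 0.3
    return 1.0 / (1.0 + np.exp(-(x * mask)))

scores = np.array([[4.0, 0.1], [-4.0, 2.0]])
mean_prob, var_prob = mc_dropout_predict(toy_forward, scores, 100)
# Pixels whose probability flips under different dropout masks get
# a large variance, i.e. a high prediction uncertainty.
```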



2022-04-29