
List of Semantic Segmentation Models for Autonomous Vehicles

Prerak Mody
March 6, 2018
State-of-the-art Semantic Segmentation models need to be tuned for efficient memory consumption and fps output to be used in time-sensitive domains like autonomous vehicles.

In a previous post, we studied various open datasets that could be used to train a model for pixel-wise semantic segmentation (one of the image annotation types) of urban scenes. Here, we take a look at various deep learning architectures that cater specifically to time-sensitive domains like autonomous vehicles. In recent years, deep learning has surpassed traditional computer vision algorithms for object detection by learning a hierarchy of features from the training dataset itself. This eliminates the need for hand-crafted features, and such techniques are therefore being explored extensively in academia and industry.

Deep Learning Architectures for Semantic Segmentation

Prior to deep learning architectures, semantic segmentation models relied on hand-crafted features fed into classifiers like Random Forests, SVMs, etc. Once deep learning architectures had proved their mettle in image classification tasks, researchers started using them as backbones for semantic segmentation. Their feature learning capabilities, along with further algorithmic and network design improvements, have since helped produce fine, dense pixel predictions. Below we introduce one such pioneering work, the Fully Convolutional Network (FCN), on which most later models are loosely based.

FCN

A skip connection based network for end-to-end semantic segmentation.

Figure 1 [Source]: VGG-16 architecture reinterpreted as FCN

Contribution: This work reinterpreted the final fully connected layers of various ILSVRC (ImageNet Large Scale Visual Recognition Challenge) networks such as AlexNet and VGG16 as convolutional layers, making the networks fully convolutional. Using skip-layer fusion to decode low-resolution feature maps into pixel-wise predictions allowed the network to be trained end to end.

Architecture: As seen in Figure 1, the upsampled outputs of a deeper layer are fused with the outputs of a shallower layer (by element-wise summation in the FCN paper) to improve the accuracy of the output. Thus, appearance information (edges) from the shallower layers is combined with coarse, semantic information from the deeper layers. The upsampling of the deeper layers' feature maps is also trainable, unlike conventional upsampling operations that use fixed mathematical interpolation.

Drawbacks: The authors did not add further decoder stages since they brought no additional accuracy gain, so high-resolution features were ignored. Also, using the encoder feature maps at inference time makes the process memory intensive.
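To make the skip-layer fusion concrete, here is a minimal sketch in plain Python. It assumes single-class score maps stored as nested lists, and it swaps FCN's learned (transposed-convolution) upsampling for fixed nearest-neighbour repetition to stay short. The function names and the toy values are ours, not from the paper; the fusion itself is the element-wise sum used in the FCN paper.

```python
def upsample2x_nearest(score_map):
    """Double height and width by repeating every value.

    This is a fixed stand-in for FCN's trainable upsampling layer.
    """
    out = []
    for row in score_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_skip(coarse, fine):
    """Upsample the coarse (deep) score map 2x and add the fine (shallow) one."""
    up = upsample2x_nearest(coarse)
    assert len(up) == len(fine) and len(up[0]) == len(fine[0])
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


# Hypothetical scores for one class: a 2x2 deep map and a 4x4 shallow map.
coarse = [[1.0, 0.0],
          [0.0, 2.0]]
fine = [[0.1, 0.2, 0.0, 0.0],
        [0.0, 0.1, 0.0, 0.0],
        [0.0, 0.0, 0.3, 0.1],
        [0.0, 0.0, 0.2, 0.4]]
fused = fuse_skip(coarse, fine)
for row in fused:
    print(row)
```

The deep map decides roughly where the class is, and the shallow map adjusts the scores pixel by pixel. Both maps must stay in memory until they are fused, which is the memory cost noted under Drawbacks.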

Real-Time Semantic Segmentation

After FCN, various other networks such as DeepLab (which introduced atrous convolutions) and UNet (which introduced the encoder-decoder structure) have made pioneering contributions to the field of semantic segmentation. Building on these networks, state-of-the-art models like RefineNet, PSPNet and DeepLabv3 have achieved an IoU (Intersection over Union) above 80% on benchmark datasets like Cityscapes and PASCAL VOC.

Figure 2 [Source]: Accuracy vs Time for various Semantic Segmentation architectures

But real-time domains like autonomous vehicles need to make decisions within milliseconds. As can be seen from Figure 2, the aforementioned networks are quite time-intensive. Table 1 also details the memory requirements of various models. This has encouraged researchers to explore novel designs that achieve output rates above 10 fps with fewer parameters.
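Since IoU is the accuracy measure quoted throughout this post, a small sketch of how it can be computed may help. It assumes the prediction and the ground truth are flattened lists of integer class ids; the function name and the toy labels are ours. Benchmarks such as Cityscapes accumulate intersections and unions over the whole test set before dividing, while this sketch handles a single image for brevity.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Hypothetical 6-pixel image with 3 classes.
pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(mean_iou(pred, target, num_classes=3))  # 0.5
```

Each class here has an intersection half the size of its union, so the mean IoU is 0.5 even though four of the six pixels are labelled correctly. IoU penalises both missed and spurious pixels, which is why it is stricter than pixel accuracy.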

SegNet

Using encoder pooling parameters at the decoder for efficient training

Figure 3 [Source]: A) VGG-16 architecture with max-pooling indices from encoder to decoder, B) Sparse Upsampling

Contribution: This network has far fewer trainable parameters, since the decoder layers use the max-pooling indices from the corresponding encoder layers to perform sparse upsampling. This reduces inference time at the decoder stage since, unlike in FCNs, the encoder feature maps are not involved in the upsampling. The technique also eliminates the need to learn upsampling parameters, unlike in FCNs.

Architecture: The SegNet architecture adopts the VGG16 network in an encoder-decoder framework and drops the network's fully connected layers. The decoder sub-network mirrors the encoder sub-network, and each contains 13 layers. Figure 3(B) shows how SegNet and FCN carry out their feature map decoding.

Drawbacks: Strong downsampling hurts accuracy.
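The index-based upsampling described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (a single-channel feature map as nested lists, 2x2 pooling with stride 2), and the function names are ours. In SegNet the sparse map is then passed through trainable decoder convolutions to make it dense, a step this sketch leaves out.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2 max pooling (stride 2) that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]), 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def sparse_unpool(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; everything else is 0."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [[1.0, 3.0, 2.0, 0.0],
        [4.0, 2.0, 1.0, 5.0],
        [0.0, 1.0, 7.0, 2.0],
        [6.0, 2.0, 3.0, 1.0]]
pooled, indices = max_pool_2x2_with_indices(fmap)
sparse = sparse_unpool(pooled, indices, 4, 4)
print(pooled)  # [[4.0, 5.0], [6.0, 7.0]]
for row in sparse:
    print(row)
```

Only the indices cross from encoder to decoder, not the full encoder feature map, which is where the memory and parameter savings relative to FCN come from.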

U-Net

An encoder-decoder architecture for segmenting biomedical images that relies heavily on data augmentation

Contribution:

Architecture:

UNet - For Image Semantic Segmentation

It is a contraction-expansion network with skip connections between the two paths that pass information from the earlier phase to the later one. The contraction path increases the number of feature maps with depth, while the expansion path increases the resolution through upsampling, followed by concatenation with feature maps from the contraction path.

Loss:

To handle boundaries between touching objects of the same class, the loss used was a weighted cross-entropy. Pixels on the thin borders separating adjacent cells were given more weight than pixels inside the cells, with the weight computed from each pixel's distance to the borders of the two nearest cells.

Data augmentation: Traditional augmentation methods such as rotation, shift and deformation were used, along with dropout. In particular, random elastic deformations were introduced, with displacements sampled from a Gaussian distribution.

Ref: http://arxiv.org/abs/1505.04597
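The border-weighted loss described above can be sketched as follows. In the U-Net paper the per-pixel weight is w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2)), where d1 and d2 are the distances to the borders of the nearest and second-nearest cell, and the authors report w0 = 10 and sigma of about 5 pixels. The function names below are ours, and the distances are passed in directly rather than computed from a label mask, so treat this as a sketch rather than the authors' implementation.

```python
import math


def border_weight(d1, d2, class_weight=1.0, w0=10.0, sigma=5.0):
    """Per-pixel loss weight: class-balancing term plus a border emphasis term.

    d1, d2: distances (in pixels) to the nearest and second-nearest cell border.
    """
    return class_weight + w0 * math.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))


def weighted_cross_entropy(true_class_probs, weights):
    """Sum of -w * log(p), where p is the predicted probability of the true class."""
    return -sum(w * math.log(p) for p, w in zip(true_class_probs, weights))


# A pixel squeezed between two cells versus a pixel far from any second cell.
w_gap = border_weight(d1=1.0, d2=1.0)
w_far = border_weight(d1=20.0, d2=20.0)
print(w_gap, w_far)

# The same prediction error costs far more on the gap pixel.
print(weighted_cross_entropy([0.5, 0.5], [w_gap, w_far]))
```

A pixel in a narrow gap between two cells receives a weight close to 1 + w0, while a pixel far from a second cell falls back to the class weight, so mistakes on separating borders dominate the loss.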

ENet

A light network with reduced inference time using asymmetric encoder-decoder architecture.

Figure 4 [Source]: ENet with A) Cityscapes output, B) elementary units, C) network layers with output dimensions

Contribution: The authors created a light network using an asymmetric encoder-decoder architecture. In addition, they made various other architectural decisions such as early downsampling, dilated and asymmetric convolutions, omitting bias terms, parametric ReLU activations and Spatial Dropout.

Architecture:

DeepLab

A baseline model for several works in semantic segmentation, DeepLab introduces atrous convolutions.

Contributions:

Atrous convolutions: The work proposes using atrous convolutions (convolution kernels with holes, i.e. zeros inserted between the weights) to preserve resolution. Although the effective kernel size increases, the amount of computation stays the same because only the non-zero elements are evaluated. Atrous convolution can thus enlarge the field of view (FoV) of each kernel to an arbitrary size without increasing the computational cost.

Atrous Spatial Pyramid Pooling: Scale variance is usually handled by resampling the original image to several scales, extracting features from a DCNN at each scale and fusing them at the original resolution. This is done at both train and test time, and it is computationally expensive (as much as 3x for 3 scales). ASPP instead handles scale through parallel kernels with different hole sizes (a.k.a. sampling rates).

Fully connected CRFs for boundary recovery: The CNN score maps show boundaries that are smooth and spread beyond the object. To achieve sharper boundary segmentation, DeepLab used fully connected CRFs. The CRF minimizes an energy that combines the negative log-likelihood from the CNN score maps with a pairwise potential, which encourages nearby pixels of similar color to take the same label and enforces smoothness between similar pixels.

Architectures:

The LargeFOV config uses the atrous conv layers at deeper layers. The ASPP config handles multiscale images through atrous conv layers as multiple parallel layers with different rates.
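To illustrate atrous convolution and the parallel-rate idea behind ASPP, here is a 1-D sketch in plain Python. Real networks work on 2-D feature maps, each ASPP branch has its own learned weights, and the rates are chosen per model; here one hypothetical kernel is shared across branches and the branch outputs are simply summed, purely to keep the example short. All names are ours.

```python
def effective_kernel_size(kernel_size, rate):
    """Field of view of a kernel whose taps are `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous convolution (cross-correlation, as in DL frameworks)."""
    span = effective_kernel_size(len(kernel), rate)
    return [sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
            for start in range(len(signal) - span + 1)]


def atrous_conv1d_same(signal, kernel, rate):
    """Zero-pad so the output has the same length as the input (odd kernels)."""
    pad = (len(kernel) - 1) * rate // 2
    padded = [0] * pad + list(signal) + [0] * pad
    return atrous_conv1d(padded, kernel, rate)


def aspp_1d(signal, kernel, rates):
    """Run one branch per rate in parallel and fuse the branches by summation."""
    branches = [atrous_conv1d_same(signal, kernel, r) for r in rates]
    return [sum(values) for values in zip(*branches)]


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1]  # hypothetical 3-tap kernel

print(atrous_conv1d(signal, kernel, rate=1))  # taps 1 apart, field of view 3
print(atrous_conv1d(signal, kernel, rate=2))  # taps 2 apart, field of view 5
print(effective_kernel_size(3, rate=12))      # 25
print(aspp_1d(signal, kernel, rates=[1, 2, 4]))
```

Every output value still costs three multiplications regardless of the rate; only the spacing of the taps changes, which is how the field of view grows without extra computation.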

DeepLab: For Image Semantic Segmentation

The ASPP-L configuration has the best performance, and the CRFs improve mIoU by 3-4%.

DeepLab: For Image Semantic Segmentation [2]

The need for CRFs is called into question by a comparison of the VGG-16 and ResNet-101 networks. The score maps before the CRF have imprecise boundaries, and this is much more noticeable with VGG-16 than with ResNet-101.

DeepLab: For Image Semantic Segmentation [3]

Ref: http://arxiv.org/abs/1606.00915

DeepLabv3

DeepLabv3 aims to eliminate CRFs and explores multi-grid configurations with atrous convolution.

Contributions:

Architecture:

DeepLab Architecture: For Image Semantic Segmentation [1]

The parallel atrous architecture (ASPP) employs multi-grid atrous convolutions, similar to the cascaded configuration:

DeepLab Architecture: For Image Semantic Segmentation [2]

Inference: Atrous convolutions with deeper networks (ResNet-101) have demonstrated significant performance improvements. Choosing the rates for the convolutions in multi-grid ASPP is tricky, and large rates can degrade performance. A constant atrous rate is not preferred.

Ref: http://arxiv.org/abs/1706.05587
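Two points from this section can be checked with a little arithmetic. First, in the DeepLabv3 paper the multi-grid setting multiplies a block's base rate by a tuple of unit rates, so a base rate of 2 with Multi_Grid = (1, 2, 4) gives rates (2, 4, 8). Second, the paper explains the drop at large rates by noting that, as the rate approaches the feature map size, most taps of a 3x3 kernel fall into the zero padding and the kernel degenerates towards a 1x1 filter. The sketch below counts valid taps; the function names are ours, and the 65x65 feature map size is the one used in the paper's illustration.

```python
def multigrid_rates(base_rate, multi_grid=(1, 2, 4)):
    """Atrous rates of the convolutions in a block: base rate times unit rates."""
    return tuple(base_rate * m for m in multi_grid)


def valid_tap_fraction(feature_size, rate, kernel_size=3):
    """Average fraction of kernel taps that land inside a square feature map.

    Taps that land outside read zero padding and contribute nothing.
    """
    offsets = [(k - kernel_size // 2) * rate for k in range(kernel_size)]
    valid = 0
    for i in range(feature_size):
        for j in range(feature_size):
            for dy in offsets:
                for dx in offsets:
                    if 0 <= i + dy < feature_size and 0 <= j + dx < feature_size:
                        valid += 1
    return valid / (feature_size ** 2 * kernel_size ** 2)


print(multigrid_rates(2))  # (2, 4, 8)
for rate in (1, 12, 36, 65):
    print(rate, round(valid_tap_fraction(65, rate), 3))
```

At a rate of 65 only the centre tap is ever valid, so the fraction is exactly 1/9 and the 3x3 kernel behaves like a 1x1 filter, capturing no long-range context at all.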

ICNet

Figure 5 [Source]: ICNet architecture with its three branches for multi-scale inputs

Contribution

Architecture

Final Thoughts

Various architectures have made novel improvements in the way 2-dimensional data is processed through data graphs. Although embedded platforms continue to improve with more memory and FLOPS capability, the above architectural and mathematical improvements have led to major leaps in semantic segmentation network outputs. With state-of-the-art networks, we can now achieve an output rate (in fps) that is close enough to image acquisition rates, and with acceptable quality (in mIoU) for autonomous vehicles. If you are looking to scale up your image labelling needs, try Playment!