Autonomous Vehicle Technology: Semantic Segmentation for Scene Understanding

A huge portion of the challenge in building a self-driving car is environment perception. Autonomous vehicles may use many different types of inputs to help them perceive their environment and make decisions about how to navigate. The field of computer vision includes techniques to allow a self-driving car to perceive its environment simply by looking at inputs from cameras. Cameras have a much higher spatial resolution than radar and lidar, and while raw camera images themselves are two-dimensional, their higher resolution often allows for inference of the depth of objects in a scene. Plus, cameras are much less expensive than radar and lidar sensors, giving them a huge advantage in current self-driving car perception systems. In the future, it is even possible that self-driving cars will be outfitted simply with a suite of cameras and intelligent software to interpret the images, much like a human does with its two eyes and a brain.

Semantic segmentation answers the question “where is an object in a given image?”, which makes it an important technique in the field of scene understanding. Standard convolutional neural networks (which start with convolutional layers, followed by fully connected layers, followed by a softmax or other activation function) are great for classifying objects in an image. However, to identify where in an image an object exists, a slightly different architecture is needed. Highlighting the road in a video stream is one example of such a task.

Road Identified 1

This repository contains a software pipeline which identifies the sections of an image which represent the road in images from a front-facing vehicle camera. The following techniques are used:

  • Start with a pre-trained VGG model, used for image classification
  • Remove the final fully connected layers
  • Add 1×1 convolutions, upsampling, and skip layers
  • Optimize the network with inference optimization techniques
  • Retrain the network on labeled images from the KITTI Road Detection dataset

Implementation

All of the code and resources used in this project are available in my Github repository. Enjoy!

Technologies used

  • Python
  • Tensorflow

Scene understanding

Scene understanding is important to an autonomous vehicle’s ability to perceive its environment.

One method in scene understanding is to train multiple decoders on the same encoder; for example, one decoder for semantic segmentation, one for depth perception, etc. In this way, the same network can be used for multiple purposes. This project focuses solely on semantic segmentation.

Techniques for semantic segmentation

Fully Convolutional Networks

Fully convolutional networks, or FCNs, are powerful tools in semantic segmentation tasks. (Several other techniques have since improved upon FCNs: SegNet, Dilated Convolutions, DeepLab, RefineNet, PSPNet, and Large Kernel Matters, to name a few.) FCNs incorporate three main features beyond those of standard convolutional networks:

  • Fully-connected layers are replaced by 1×1 convolutional layers, to preserve spatial information that would otherwise be lost
  • Upsampling through the use of transpose convolutional layers
  • Skip connections, which allow the network to use information from multiple resolutions to more precisely identify desired pixels
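To illustrate the first point, a 1×1 convolution can be viewed as the same small linear layer applied independently at every pixel, so the height and width of the feature map survive. The sketch below uses plain Python lists; the function name `conv1x1` and the toy weights are illustrative assumptions, not code from this project.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution to a feature map.

    feature_map: nested list with shape [height][width][in_channels]
    weights:     nested list with shape [in_channels][out_channels]
    Returns a nested list with shape [height][width][out_channels].
    """
    out_channels = len(weights[0])
    return [
        [
            [
                sum(pixel[c] * weights[c][k] for c in range(len(pixel)))
                for k in range(out_channels)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


# Toy 2x3 feature map with 2 channels, mapped to 1 output channel.
features = [
    [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]],
    [[2.0, 2.0], [1.0, 1.0], [0.0, 0.0]],
]
weights = [[0.5], [1.0]]  # hypothetical weights, chosen only for the example

scores = conv1x1(features, weights)
```

The output keeps the 2×3 spatial layout of the input, which is what a flattening fully connected layer would discard.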

Fully Convolutional Network Architecture

Structurally, a fully convolutional network consists of an encoder and a decoder.

The encoder is a series of standard convolutional layers, the goal of which is to extract features from an image, as in a traditional convolutional neural network. Often, encoders for fully convolutional networks are taken from VGG or ResNet, pre-trained on ImageNet (another example of the power of transfer learning, the subject of another project I worked on).

The decoder upscales the output of the encoder to the same resolution as the original input, resulting in a prediction or “segmentation” of each pixel in the original image. This happens through the use of transpose convolutional layers. However, even though the decoder returns the output in the original dimensions, some information about the “big picture” of the image (no pun intended) is lost due to the feature extraction in the encoder. To retain this information, skip connections are used, which add values from the pooling layers in the encoder to the output of the correspondingly sized decoder transpose convolutional layers.
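A minimal one-dimensional sketch of the decoder idea, assuming no padding: a transposed convolution spreads each coarse value over several output positions, and a skip connection then adds encoder features of the same size element-wise. The function names and the toy kernel are illustrative assumptions; the project itself operates on 2-D feature maps in TensorFlow.

```python
def transpose_conv1d(signal, kernel, stride):
    """1-D transposed convolution with no padding.

    Output length is (len(signal) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(signal) - 1) * stride + len(kernel))
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def add_skip(decoder_output, encoder_features):
    """Element-wise sum of two equally sized feature vectors."""
    if len(decoder_output) != len(encoder_features):
        raise ValueError("skip connection requires matching sizes")
    return [d + e for d, e in zip(decoder_output, encoder_features)]


coarse = [1.0, 2.0]                       # low-resolution encoder output
upsampled = transpose_conv1d(coarse, kernel=[1.0, 1.0], stride=2)
encoder_features = [0.5, 0.25, 0.5, 0.25]  # hypothetical earlier-layer features
fused = add_skip(upsampled, encoder_features)
```

With a stride of 2 and a kernel of length 2, the two coarse values become four, matching the size of the earlier encoder features so that the two can be summed.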

Performance enhancements

Because state-of-the-art autonomous vehicle hardware may not be able to run semantic segmentation on a video stream in real time, various techniques can be used to speed up inference by reducing processing and memory bandwidth requirements.

  • Freezing graphs – by converting variables in a Tensorflow graph into constants once trained, memory costs decrease and model deployment can be simplified
  • Fusion – by combining adjacent network nodes without forks, operations which would previously have used multiple tensors and processor executions can be reduced into one
  • Quantization – by reducing precision of floating point constants to integers, memory and processing time can be saved
  • Machine code optimization – by compiling the various system startup and load routines into a binary, overhead in inference is greatly reduced
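As a rough illustration of the quantization idea, the sketch below maps floating point weights onto 8-bit integer codes with a simple linear scheme and then reconstructs them. The scheme, function names, and sample weights are assumptions made for the example; this is not the procedure used by any particular TensorFlow tool.

```python
def quantize(values, num_bits=8):
    """Map floats linearly onto integer codes in [0, 2**num_bits - 1]."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate float values from integer codes."""
    return [lo + c * scale for c in codes]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical trained weights
codes, scale, offset = quantize(weights)
restored = dequantize(codes, scale, offset)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte instead of four, at the cost of a reconstruction error of at most about half of `scale`.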

Network architecture for semantic segmentation

A modified version of the impressive VGG16 neural network image classification pipeline is used as a starting point. The pipeline takes a pre-trained fully-convolutional network based on Berkeley’s FCN-8 network and adds skip layers.

From the loaded pre-trained model, the input, keep probability, and layers 3, 4, and 7 are extracted for further use.

Next, 1×1 convolutions are constructed from layers 3, 4, and 7 in an encoding step. Skip layers are inserted by adding the 1×1 convolutions from layers 3 and 4. Layers 3, 4, and 7 are deconvolved in reverse order to complete the final piece of the decoding step.

An Adam optimizer is used to minimize the softmax cross-entropy between the logits created by the network and the correct labels for image pixels.
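The quantity being minimized can be sketched in plain Python as the mean, over pixels, of the softmax cross-entropy between each pixel's logits and its label. The class ordering (index 0 for non-road, index 1 for road) and the toy values are assumptions for illustration; the project computes this with TensorFlow operations rather than Python loops.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_cross_entropy(logits, label):
    """Cross-entropy for one pixel given its integer class label."""
    return -math.log(softmax(logits)[label])


def mean_cross_entropy(pixel_logits, pixel_labels):
    """Average the per-pixel cross-entropy over all pixels."""
    losses = [
        pixel_cross_entropy(logits, label)
        for logits, label in zip(pixel_logits, pixel_labels)
    ]
    return sum(losses) / len(losses)


# Three pixels, two classes each (assumed order: non-road, road).
pixel_logits = [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
pixel_labels = [0, 1, 1]
loss = mean_cross_entropy(pixel_logits, pixel_labels)
```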

The neural network is trained using a sample of labeled images for a maximum of fifty epochs. A mini-batch size of ten images is used as a compromise between high memory footprint and smooth network convergence. Training stops early if the total training loss does not decrease for three consecutive epochs.
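The early termination rule could be sketched as follows. This is a minimal illustration that assumes "does not decrease" means "does not improve on the best loss seen so far"; the function name and the sample losses are hypothetical.

```python
def epochs_before_stop(epoch_losses, max_epochs=50, patience=3):
    """Return how many epochs run before training stops.

    Training ends after max_epochs, or earlier once the loss has failed
    to improve on the best value seen for `patience` epochs in a row.
    """
    best = float("inf")
    stale = 0
    epochs_run = 0
    for loss in epoch_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run


# Loss improves for three epochs, then stalls for three.
stopped_at = epochs_before_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.6])
```

In this example training stops after the sixth epoch, so the later improvement to 0.6 is never reached; a larger patience trades extra training time for a chance to see such late improvements.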

Finally, a separate held-out sample of test images is run through the final neural network classifier for evaluation.

Results

Overall, the semantic segmentation network works well. The road pixels are highlighted in the test images with close to a human level of accuracy, although an occasional windshield or sidewalk is highlighted as road and some road areas in shadow are missed.

Some example images segmented by the pipeline:

Road Identified 1

Road Identified 2

Road Identified 3

Road Identified 4

Future improvements

  • Use the Cityscapes dataset for more images to train a network that can classify more than simply road / non-road pixels
  • Augment input images by flipping on the horizontal axis to improve network generalization
  • Implement another segmentation approach such as SegNet, Dilated Convolutions, DeepLab, RefineNet, PSPNet, or Large Kernel Matters (see this page for a review)
  • Apply trained classifier to a video stream

Autonomous Vehicle Technology: Transfer Learning

In other autonomous vehicle software stacks, I have built, trained, and operated various deep neural networks from scratch for image classification tasks, using training data I have either obtained from others or generated myself (traffic sign classification, vehicle detection and tracking, etc). However, many deep learning tasks can use pre-existing trained neural networks from some other similar task, and with some tweaks to the network itself, can significantly reduce the effort and shorten the time to production. Transfer learning is the technique of modifying and re-purposing an existing network for a new task.

Transfer Learning

Some popular high-performance networks include VGG, GoogLeNet, and ResNet. Models for these networks were previously trained for days or weeks on the ImageNet dataset. The trained weights encapsulate higher-level features learned from training on thousands of classes, yet they can be adapted to be used for other datasets as well.

Implementation

I explored applying transfer learning with these networks to two different datasets. All of the code and resources used are available in my Github repository. Enjoy!

Technologies Used

  • Python
  • Keras
  • Tensorflow

Example pre-trained networks

Some existing networks which can be used for new tasks using transfer learning include:

  • VGG – A great starting point for new tasks due to its simplicity and flexibility.
  • GoogLeNet – Uses an inception module to shrink the number of parameters of the model, offering improved accuracy and inference speed over VGG.
  • ResNet – Order of magnitude more layers than other networks; even better (lower error rate) than normal humans at image classification.

Transfer learning details

Depending on the size of the new dataset, and the similarity of the new dataset to the old, different approaches are typical when applying transfer learning to repurpose a pre-existing network.

Small dataset, similar to existing

  • Remove last fully connected layer from network (most other layers encode good information)
  • Add a new fully connected layer with number of classes in new dataset
  • Randomize weights of new fully connected layer, keeping other weights frozen (don’t overfit new data)
  • Train network on new data

Small dataset, different from existing

  • Remove fully connected layers and most convolutional layers towards the end of the network (most layers encode different information)
  • Add a new fully connected layer with number of classes in new dataset
  • Randomize weights of new fully connected layer, keeping other weights frozen (don’t overfit new data)
  • Train network on new data

Large dataset, similar to existing

  • Remove last fully connected layer from network (most other layers encode good information)
  • Add a new fully connected layer with number of classes in new dataset
  • Randomize weights of new fully connected layer, and initialize other layers with previous weights (don’t freeze)
  • Train network on new data

Large dataset, different from existing

  • Remove last fully connected layer from network (most other layers encode good information)
  • Add a new fully connected layer with number of classes in new dataset
  • Randomize weights on all layers
  • Train network on new data
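The four cases above can be condensed into a small lookup, sketched here as a plain Python function. The function name, dictionary keys, and labels are illustrative assumptions; the function only restates the rules listed in this section and does not touch a real network.

```python
def transfer_learning_plan(dataset_is_large, dataset_is_similar):
    """Summarize which layers to replace and how to treat old weights."""
    if dataset_is_large:
        layers_removed = "last fully connected layer"
        # Large datasets can support updating (or re-learning) all weights.
        pretrained_weights = "fine-tuned" if dataset_is_similar else "re-initialized"
    else:
        # Small datasets: freeze what is kept to avoid overfitting.
        pretrained_weights = "frozen"
        if dataset_is_similar:
            layers_removed = "last fully connected layer"
        else:
            layers_removed = "fully connected layers and most later convolutional layers"
    return {
        "layers_removed": layers_removed,
        "layers_added": "new fully connected layer sized to the new classes",
        "pretrained_weights": pretrained_weights,
    }


plan = transfer_learning_plan(dataset_is_large=False, dataset_is_similar=True)
```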

Autonomous Vehicle Technology: Behavioral Cloning

Humans learn through observing behavior from others. They watch and emulate the behaviors they see, making adjustments to their own actions along the way, given feedback. The same technique can be used in autonomous vehicles to model driving behavior based on direct observation of human driving. This technique is known as behavioral cloning.

I created a software suite to implement behavioral cloning for generating autonomous vehicle steering control. Using a front-facing video stream of safe driving paired with steering angles as training data, I built a convolutional neural network and trained it (using Keras) to clone driving behavior. Given a set of three front-facing camera images (front, left, and right), the model outputs a target steering wheel command.

The following techniques are used in this system:

  • Use a vehicle simulator to generate and collect data of good driving behavior
  • Build and train a convolutional neural network in Keras that predicts steering angles from images
  • Train and validate the model with a training and validation set
  • Test that the model successfully drives around track one without leaving the road

Implementation

All of the code and resources used in this project are available in my Github repository. Enjoy!

Technologies used

  • Python
  • Keras
  • NumPy
  • OpenCV
  • Scikit-learn

Training a model

python model.py

This trains a model to drive the vehicle in the simulator.

Driving the simulated vehicle using the model

Once the model has been saved, it can be used with drive.py using this command:

python drive.py model.h5

Note: There is a known issue with the system locale setting, involving the decimal separator (“,” versus “.”), when using drive.py. When this happens, predicted steering values can be clipped to their max/min values. If this occurs, a known fix is to set the environment variable LANG to en_US.utf8.

Saving a video of the simulated vehicle using the model

python drive.py model.h5 run1

python video.py run1

These commands create a video of the simulated vehicle driving with the model. The output is a file called run1.mp4.

Optionally, one can specify the FPS (frames per second) of the video:

python video.py run1 --fps 48

This runs the video at 48 FPS. The default FPS is 60.

Model Architecture

The overall strategy for building the software’s neural network was to start with a well-known and high-performance network, and tune it for this particular steering angle prediction task.

This system includes a convolutional neural network model similar to the published NVidia architecture used for their self-driving car efforts, given that this system is attempting to solve the exact same problem (steering angle command prediction) and NVidia’s network is state of the art. This network inputs 160×320 RGB images from multiple camera angles at the front of a vehicle and outputs a single steering wheel angle command. One convolutional and one fully connected layer were removed from the NVidia architecture to reduce memory and processing costs during training.

Before the convolutional layers of the model, a cropping layer removes the top (including sky) and bottom (including car image), to reduce noise in training. An additional layer normalizes the data points to have zero mean and a low standard deviation.

In between the convolutional layers, RELU activations are included to introduce non-linearity, max pooling to reduce overfitting and computational complexity, and 50% dropout during training (also to reduce overfitting).

In between the fully-connected layers of the model, RELU activations are also introduced.

The input images are cropped to remove the top 50 and bottom 20 pixels to reduce noise in the image which are likely to be uncorrelated with steering commands. Each pixel color value in the image is then normalized to [-0.5,0.5].
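The cropping and normalization arithmetic can be sketched with plain Python lists. The helper names and the dummy image are assumptions for illustration; in the project these steps are layers inside the Keras model.

```python
def crop_rows(image, top=50, bottom=20):
    """Drop the top and bottom pixel rows of an image stored as rows."""
    return image[top:len(image) - bottom]


def normalize_channel(value):
    """Map a channel value from [0, 255] to [-0.5, 0.5]."""
    return value / 255.0 - 0.5


# Dummy 160x320 RGB image in which every pixel is (0, 128, 255).
image = [[(0, 128, 255)] * 320 for _ in range(160)]

cropped = crop_rows(image)
normalized = [
    [tuple(normalize_channel(c) for c in pixel) for pixel in row]
    for row in cropped
]
```

After cropping, 160 - 50 - 20 = 90 rows remain, and the extreme channel values 0 and 255 land exactly on -0.5 and 0.5.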

Neural Network Layers

The network includes:

  • input cropping and normalization layers
  • four convolutional layers: three with 5×5 filters (depths of 24, 36, and 48) and one with 3×3 filters (depth of 64)
  • a max pooling layer with 2×2 pooling after each convolutional layer
  • three fully-connected layers with 100, 50, and 10 outputs
  • a final steering angle output layer

| Layer | Description |
| --- | --- |
| Input | 160x320x3 RGB color image |
| Cropping | 50 pixel top, 20 pixel bottom crop |
| Normalization | [0,255] -> [-0.5,0.5] |
| Convolution 5×5 | 1×1 stride, valid padding, output depth 24 |
| RELU | |
| Max pooling | 2×2 stride |
| Convolution 5×5 | 1×1 stride, valid padding, output depth 36 |
| RELU | |
| Max pooling | 2×2 stride |
| Convolution 5×5 | 1×1 stride, valid padding, output depth 48 |
| RELU | |
| Max pooling | 2×2 stride |
| Convolution 3×3 | 1×1 stride, valid padding, output depth 64 |
| RELU | |
| Max pooling | 2×2 stride |
| Flattening | 2d image -> 1d pixel values |
| Fully connected | 100 output neurons |
| RELU | |
| Dropout | 50% keep fraction |
| Fully connected | 50 output neurons |
| RELU | |
| Dropout | 50% keep fraction |
| Fully connected | 10 output neurons |
| Output | 1 steering angle command |
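To see how the feature map shrinks through the convolution and pooling stack, the sizes can be traced with a few lines of arithmetic. This sketch assumes a 2×2 pooling window with a stride of 2 and valid padding throughout; those pooling details, and the helper names, are assumptions rather than values confirmed by the layer listing.

```python
def conv_valid(size, kernel, stride=1):
    """Output size of a convolution with valid padding."""
    return (size - kernel) // stride + 1


def pool_valid(size, window=2, stride=2):
    """Output size of max pooling with valid padding (assumed 2x2, stride 2)."""
    return (size - window) // stride + 1


# Start from the cropped input: 160 - 50 - 20 rows, 320 columns.
height, width = 160 - 50 - 20, 320
shapes = []
for kernel, depth in [(5, 24), (5, 36), (5, 48), (3, 64)]:
    height, width = conv_valid(height, kernel), conv_valid(width, kernel)
    height, width = pool_valid(height), pool_valid(width)
    shapes.append((height, width, depth))

# Number of values handed to the first fully connected layer.
flattened = height * width * depth
```

Under these assumptions the last pooled feature map is 2×17×64, so the flattening step produces 2176 values.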

Model training

Dataset

A vehicle simulator was used to collect a dataset of images to feed into the network. Training data was chosen to keep the vehicle driving on the road; each recorded frame provides center, left, and right images taken from different points on the front of the vehicle. This data includes multiple laps of center lane driving. Here is an example image of center lane driving:

Simulated center lane driving

I then recorded the vehicle recovering from the left and right sides of the road back to center so that the vehicle would learn to correct major driving errors when it is about to run off the road. These images show what a recovery looks like starting from the left side:

Left recovery 1

Left recovery 2

Left recovery 3

To augment the data set, I also flipped images and angles during training to further generalize the model. After the collection process, I had 8253 data frames, each including center, left, and right images, for a total of 24759 images.
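Flipping a sample means mirroring the image left to right and negating the steering angle, so that a left turn becomes an equally sharp right turn. The sketch below shows this on a tiny list-based image; the function name and sample values are illustrative assumptions.

```python
def flip_sample(image, steering_angle):
    """Mirror an image horizontally and negate its steering angle."""
    flipped = [list(reversed(row)) for row in image]
    return flipped, -steering_angle


# Tiny 2x3 "image" with one value per pixel, steering slightly one way.
image = [[1, 2, 3], [4, 5, 6]]
flipped_image, flipped_angle = flip_sample(image, 0.25)

# Dataset size check: three camera images per recorded frame.
frames = 8253
total_images = frames * 3
```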

Training

During training, the entire image data set is shuffled, with 80% of the images being used for training and 20% used for validation. I configured the Keras training to use an early stopping condition based on knee-finding using the validation loss, with a patience of 2 epochs. Also, an Adam optimizer is used so that manually tuning the learning rate is not necessary.
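A shuffle followed by an 80/20 split could look like the sketch below. The function name and the fixed seed are assumptions added to keep the example reproducible; the project may perform the split with a library helper instead.

```python
import random


def shuffle_and_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle samples and split them into training and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


training, validation = shuffle_and_split(range(100))
```

Shuffling before splitting matters here because consecutive simulator frames are highly correlated; without it, the validation set would cover only one stretch of the track.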

Video Result

The simulated vehicle drives around the entire track without any unsafe driving behavior; in only one spot did the simulated vehicle get close to running off the track on a curve (but did not leave the driving surface, pop up on ledges, or roll over any unsafe surfaces).

Autonomous Vehicle Technology: Traffic Sign Classification

A huge portion of the challenge in building a self-driving car is environment perception. Autonomous vehicles may use many different types of inputs to help them perceive their environment and make decisions about how to navigate. The field of computer vision includes techniques to allow a self-driving car to perceive its environment simply by looking at inputs from cameras. Cameras have a much higher spatial resolution than radar and lidar, and while raw camera images themselves are two-dimensional, their higher resolution often allows for inference of the depth of objects in a scene. Plus, cameras are much less expensive than radar and lidar sensors, giving them a huge advantage in current self-driving car perception systems. In the future, it is even possible that self-driving cars will be outfitted simply with a suite of cameras and intelligent software to interpret the images, much like a human does with its two eyes and a brain.

When operating on roadways, autonomous vehicles need to be able to identify traffic signs in order to determine what actions, if any, the vehicle must take. For example, a yield sign warns drivers that other vehicle traffic will soon enter the vehicle’s path, and that those other vehicles should be given the right of way. Without a robust mechanism to quickly and correctly identify the meaning of traffic signs, autonomous vehicles would get into trouble with hazardous road conditions and with other vehicles.

I created a software pipeline containing a convolutional neural network to classify traffic signs. The pipeline trains and validates a neural network model so it can classify traffic sign images using the German Traffic Sign Dataset. Additionally, a study of model performance on images of unseen German traffic signs from the internet is included.

Implementation

All of the code and resources used in this project are available in my Github repository. Enjoy!

Technologies used

  • Python
  • Jupyter
  • NumPy
  • OpenCV
  • Scikit-learn
  • Tensorflow

Data Set Summary & Exploration

The pandas library is used to calculate summary statistics of the traffic signs data set:

  • The size of training set is 34799
  • The size of the validation set is 4410
  • The size of test set is 12630
  • The shape of a traffic sign image is (32, 32, 3)
  • The number of unique classes/labels in the data set is 43

The following charts show the distribution of the instance classes in the training, validation, and test datasets.

Note how in all of the datasets, some of the classes (1-5,7-10,12-13,38) have a much higher representation in each dataset than others. This may cause bias in the predictions generated by the classifier itself; additional classification robustness could be added by adding extra instances of the classes which are under-represented.
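One way to quantify such an imbalance is to count instances per class and flag the classes that fall well below the average. The sketch below uses the standard library rather than pandas, and the threshold, function names, and toy labels are all assumptions for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Fraction of instances belonging to each class."""
    counts = Counter(labels)
    return {cls: counts[cls] / len(labels) for cls in sorted(counts)}


def under_represented(labels, threshold=0.5):
    """Classes whose count is below `threshold` times the mean class count."""
    counts = Counter(labels)
    mean_count = len(labels) / len(counts)
    return sorted(cls for cls, n in counts.items() if n < threshold * mean_count)


# Toy label list: classes 1 and 2 dominate, classes 0 and 3 are rare.
labels = [1] * 8 + [2] * 6 + [0] * 1 + [3] * 1
distribution = class_distribution(labels)
rare_classes = under_represented(labels)
```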

Training data instance class distribution

Traffic Sign Classifier Training Class Distribution

Validation data instance class distribution

Traffic Sign Classifier Validation Data Class Distribution

Test data instance class distribution

Traffic Sign Classifier Test Data Class Distribution

Design and Testing of model architecture

Image data preprocessing

As a first step, all images are converted to grayscale to reduce the dimensionality of the problem that the classifier needs to learn. Given the small number of training examples (< 100k), the extra dimensionality in representing colors might overwhelm the optimizer, and it would not fit a robust model. Grayscale was also used in my previous lane finding project, and given the success of finding detail in a low-resolution grayscale image there, similar success is expected here.

Here is an example of a traffic sign image before and after grayscaling.

Grayscale Traffic Signs

As a last step, image data is normalized to have mean zero and low standard deviation for each pixel value, to allow the learning optimizer to have an easier time converging on a lower overall classification loss.
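The two preprocessing steps might be sketched as below. Both formulas are assumptions chosen for illustration: grayscale is taken as the plain average of the three channels, and normalization as (value - 128) / 128, which only approximately centers the data around zero. The project's actual conversion may differ.

```python
def to_grayscale(pixel):
    """Average the R, G, and B channels (assumed conversion)."""
    r, g, b = pixel
    return (r + g + b) / 3.0


def normalize(value):
    """Map [0, 255] to roughly [-1, 1) (assumed normalization)."""
    return (value - 128.0) / 128.0


def preprocess(image):
    """Convert an RGB image (rows of pixels) to normalized grayscale."""
    return [[normalize(to_grayscale(pixel)) for pixel in row] for row in image]


# Toy 1x3 RGB image: black, mid-gray, and a colored pixel.
image = [[(0, 0, 0), (128, 128, 128), (30, 60, 90)]]
processed = preprocess(image)
```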

Even though a class imbalance exists in the training data set, the training set is deliberately not augmented with more data, so that the bias of the final classifier can be studied (it would be revealed during the validation step).

Final neural network architecture

The neural network model consists of the following layers:

| Layer | Description |
| --- | --- |
| Input | 32x32x1 grayscale image |
| Convolution 5×5 | 1×1 stride, valid padding, outputs 28x28x6 |
| RELU | |
| Max pooling | 2×2 stride, outputs 14x14x6 |
| Convolution 5×5 | 1×1 stride, valid padding, outputs 10x10x16 |
| RELU | |
| Max pooling | 2×2 stride, outputs 5x5x16 |
| Fully connected | 400 input neurons, 400 output neurons |
| RELU | |
| Dropout | 50% keep fraction |
| Fully connected | 400 input neurons, 400 output neurons |
| RELU | |
| Dropout | 50% keep fraction |
| Fully connected | 400 input neurons, 43 output neurons |
| Output | Softmax output |
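The output sizes listed for the convolution and pooling layers, and the 400 inputs of the first fully connected layer, follow from the valid-padding size formula. The short trace below checks them; the layer names and the assumption of a 2×2 pooling window are illustrative.

```python
def valid_output_size(size, window, stride):
    """Output size along one dimension with valid padding."""
    return (size - window) // stride + 1


size = 32  # input images are 32x32
trace = []
for name, window, stride, depth in [
    ("conv1", 5, 1, 6),
    ("pool1", 2, 2, 6),   # assumes a 2x2 pooling window
    ("conv2", 5, 1, 16),
    ("pool2", 2, 2, 16),
]:
    size = valid_output_size(size, window, stride)
    trace.append((name, size, size, depth))

# Flattening the last feature map gives the fully connected input count.
fully_connected_inputs = size * size * depth
```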

Model training

To train the model, the following techniques are used:

  • Use of a loss function which takes the mean of the softmax cross entropy between the network output and the correct labels
  • Penalization of the loss using L2 regularization for each of the five weight groups in the network (2x convolution weight groups and 3x fully connected layer weight groups), scaled to 1% of the L2 norm
  • Optimization of the weights and biases for each of the layers using the Adam algorithm, with an initial learning rate of 0.0005 (the Adam optimizer dynamically adjusts the effective learning rate over time)
  • Mini-batching of 128 training instances, looped for a maximum of 200 epochs of training and weight optimization or until the validation accuracy is above 93.5%. 93.7% is reached after 16 epochs, at which point the training loop completes.
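The regularized loss described in the first two bullets can be sketched as the mean cross entropy plus a scaled penalty over all weight groups. This example assumes the penalty is the sum of squared weights multiplied by 0.01; the exact form used in the project (for instance, whether a factor of one half is included) is not stated, and the names and values here are hypothetical.

```python
def l2_penalty(weight_groups, scale=0.01):
    """Scaled sum of squared weights across all groups (assumed form)."""
    return scale * sum(w * w for group in weight_groups for w in group)


def total_loss(mean_cross_entropy, weight_groups, scale=0.01):
    """Data loss plus the L2 regularization penalty."""
    return mean_cross_entropy + l2_penalty(weight_groups, scale)


# Five toy weight groups standing in for 2 convolution and 3 dense layers.
weight_groups = [[1.0, -2.0], [0.5], [3.0], [0.0], [1.0]]
loss = total_loss(0.5, weight_groups)
```

Because large weights raise the loss, the optimizer is nudged toward smaller weights, which tends to reduce overfitting.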

Improving validation set accuracy

The final model results are:

  • training set accuracy of 97.0%
  • validation set accuracy of 93.7%
  • test set accuracy of 90.5%

The neural network began with the classic LeNet-5 image classification architecture, being a canonical and well-understood image classification neural net architecture for grayscale image classification. This seemed like an obvious starting point to classify small (32×32 pixel) grayscale images with a limited set of output classes (43).

Modifications are added based on the AlexNet image processing architecture, as it is also well understood and makes significant performance improvements over LeNet. Further modifications include using 50% dropout layers after every fully connected layer to prevent overfitting on training data, making the fully connected layers not reduce dimensionality (input and output dimensions are the same until the final output layer), and starting with positive initial values for layer bias terms rather than zero (since ReLU is used as the activation function, this keeps more units from outputting zero, and thus effectively dropping out, than absolutely necessary).

Training set accuracy shows that the model is fitting the training data well; perhaps too well, as a 97% accuracy is quite high. Luckily, the difference in performance between the training and validation sets (delta of 3.3%) shows that the model is not overfitting too greatly. Test set accuracy of 90.5% indicates that on completely unseen data in the real world, this classifier would classify slightly better than nine out of ten traffic signs correctly, which is interesting academically but surely would be a problem for a true self-driving car (as even one incorrectly classified traffic sign could prove disastrous).

Validation with images from the internet

Sample images

Here are five German traffic signs that were pulled from the internet:

General Caution

General Caution Sign

This image may be harder to classify, as it has a changing background image due to the horizon.

Priority Road

Priority Road Sign

This is likely to be a simple image to classify; it is clear with an empty background.

Bumpy Road

Bumpy Road Sign

This image has a solid, though black, background which is likely to be easy to classify (even with a small fleck of black in the right side of the triangle).

Road Work

Road Work Sign

This image is likely to be difficult to classify, being captured at a non-perpendicular angle, as well as having a complicated background involving the ground, sky, and clouds of different colors and shapes.

Keep Right

Keep Right Sign

This image should be relatively easy to classify; it has some background noise but the image itself is clear except for some clipping at the bottom of the circle.

Comparison of predictions from original set and internet images

Prediction results:

| Image | Prediction |
| --- | --- |
| General Caution | General Caution |
| Priority Road | Priority Road |
| Bumpy Road | Bicycles Crossing |
| Road Work | Road Work |
| Keep Right | Keep Right |

The model correctly guesses 4 of the 5 traffic signs, an accuracy of 80%. This is lower than the original test set accuracy of 90.5%, but with only 5 test examples a single misclassification costs 20 percentage points, so the comparison is coarse.

Softmax probabilities

The code for making predictions on the final model is located in one of the last cells of the IPython notebook.
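As a rough idea of how top-five probabilities can be derived from a network's raw outputs, the sketch below applies a softmax to a list of logits and keeps the five largest entries. The logits are invented for the example and the class list is truncated to six names; the notebook itself relies on TensorFlow for this step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, class_names, k=5):
    """Return the k most probable (probability, class name) pairs."""
    ranked = sorted(zip(probabilities, class_names), reverse=True)
    return ranked[:k]


class_names = [
    "General Caution", "Pedestrians", "Traffic Signals",
    "Road Work", "Keep Right", "Yield",
]
logits = [6.0, 1.5, 0.5, 0.0, -1.0, -2.0]  # hypothetical network outputs

probabilities = softmax(logits)
top_five = top_k(probabilities, class_names)
```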

For the first image, the model is almost completely sure that this is a General Caution sign (probability of 0.99), and the image does contain a General Caution sign. The top five softmax probabilities are:

| Probability | Prediction |
| --- | --- |
| .99 | General Caution |
| .01 | Pedestrians |
| .00 | Traffic Signals |
| .00 | Road Narrows on the Right |
| .00 | Right-of-way at the next intersection |

For the second image, the model is almost sure that this is a Priority Road sign (probability of 0.95), and the image does contain a Priority Road sign. The top five softmax probabilities are:

| Probability | Prediction |
| --- | --- |
| .95 | Priority Road |
| .02 | Yield |
| .02 | No Vehicles |
| .01 | No Passing |
| .00 | Ahead Only |

For the third image, the model is almost sure that this is a Bicycles Crossing sign (probability of 0.95); however, the image contains a Bumpy Road sign. Note that Bumpy Road has the second highest softmax probability, but the model is much less confident about this prediction. The top five softmax probabilities are:

| Probability | Prediction |
| --- | --- |
| .95 | Bicycles Crossing |
| .03 | Bumpy Road |
| .02 | Dangerous Curve to the Right |
| .00 | Road Narrows on the Right |
| .00 | Road Work |

For the fourth image, the model is almost completely sure that this is a Road Work sign (probability of 0.99), and the image does contain a Road Work sign. The top five softmax probabilities are:

| Probability | Prediction |
| --- | --- |
| .99 | Road Work |
| .00 | Bumpy Road |
| .00 | Bicycles Crossing |
| .00 | Road Narrows on the Right |
| .00 | Slippery Road |

For the fifth image, the model is most confident that this is a Keep Right sign (probability of 0.46), and the image does contain a Keep Right sign. Note that the “second place” probability for Speed Limit (30km/h) is not far behind. The top five softmax probabilities are:

| Probability | Prediction |
| --- | --- |
| .46 | Keep Right |
| .35 | Speed Limit (30km/h) |
| .18 | Roundabout Mandatory |
| .00 | Speed Limit (50km/h) |
| .00 | Priority Road |
