19 April 2018

Review of Mask R-CNN


Mask R-CNN is a neural network that solves the instance segmentation task.

The architecture is based on the Faster R-CNN network, to which an additional branch is added. In this branch, a small fully convolutional network is applied to each region of interest to obtain a binary mask of the object, that is, its segmentation. The segmentation branch works in parallel with the rest of the network, which is responsible for classification. A binary mask is computed for each class, and the final choice is made based on the classification result. The network showed good results in detection and segmentation, as well as in human pose estimation.
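The "one mask per class, chosen by the classifier" step can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors, and the function name `select_mask` and the 0.5 threshold are assumptions made for this illustration, not part of any published implementation.

```python
def select_mask(class_scores, class_masks, threshold=0.5):
    """Pick the mask of the top-scoring class and binarize it.

    class_scores: list of K floats, one score per class.
    class_masks:  list of K masks, each an H x W nested list of
                  per-pixel probabilities in [0, 1].
    """
    # The classification branch decides which class wins.
    best_class = max(range(len(class_scores)), key=lambda k: class_scores[k])
    # Only the winning class's mask is kept; the others are discarded.
    probs = class_masks[best_class]
    binary = [[1 if p >= threshold else 0 for p in row] for row in probs]
    return best_class, binary


# Toy example: 3 classes, 2 x 2 masks (all values are made up).
scores = [0.1, 0.7, 0.2]
masks = [
    [[0.9, 0.9], [0.9, 0.9]],
    [[0.8, 0.3], [0.6, 0.1]],
    [[0.2, 0.2], [0.2, 0.2]],
]
best_class, mask = select_mask(scores, masks)
print(best_class, mask)
```

Note that class 0 has the "strongest" mask in this toy data, but it is ignored because the classifier prefers class 1. The mask branch never has to decide which class the object belongs to.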


R-CNN is a network that works with regions (bounding boxes). Such networks select a small number of regions where objects are likely to be located, and a convolutional neural network is applied to each of these regions. Fast R-CNN is a faster version of R-CNN. In it, the convolutional operations are first applied to the whole image to compute a set of features. These features are then projected onto the regions, which avoids recomputing them for every region. Faster R-CNN uses a Region Proposal Network to determine the most promising regions.

Mask R-CNN

In networks of this kind, the image is usually segmented first, and the detected regions are then classified. There are also systems in which classification precedes segmentation. A distinctive feature of Mask R-CNN is that the parts of the network responsible for classification and segmentation operate in parallel.

In its first stage, Faster R-CNN produces a number of bounding boxes where objects are likely to be located. The corresponding features, computed on the whole image, are then projected onto each region. Classification is performed on these features.
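To make "projecting features onto a region" more concrete, here is a rough sketch under simplifying assumptions: a single-channel feature map stored as a nested list, and a feature stride of 16 pixels (the stride value and the names `project_box` and `crop_features` are chosen for illustration only). The box is divided by the stride to move it from image coordinates to feature-map coordinates, and the covered cells are cut out.

```python
import math


def project_box(box, stride):
    """Map an image-space box (x1, y1, x2, y2) to feature-map coordinates."""
    x1, y1, x2, y2 = box
    return (x1 / stride, y1 / stride, x2 / stride, y2 / stride)


def crop_features(feature_map, box, stride):
    """Return the feature-map cells covered by an image-space box."""
    fx1, fy1, fx2, fy2 = project_box(box, stride)
    h, w = len(feature_map), len(feature_map[0])
    # Snap outward to whole cells and clamp to the feature map.
    c1, r1 = max(0, math.floor(fx1)), max(0, math.floor(fy1))
    c2, r2 = min(w, math.ceil(fx2)), min(h, math.ceil(fy2))
    return [row[c1:c2] for row in feature_map[r1:r2]]


# A 4 x 4 feature map standing in for a 64 x 64 image at stride 16.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
box = (16, 16, 48, 40)  # hypothetical proposal in image pixels
region = crop_features(feature_map, box, stride=16)
print(project_box(box, 16), region)
```

The features are computed once for the whole image. Each proposal only selects a window of them, which is what makes the approach cheap.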

The subnetwork responsible for segmentation receives a set of bounding boxes and a set of features as input, but the projection of features onto the region works somewhat differently. Usually a two-dimensional grid of features is projected onto the region by taking the nearest feature for each grid cell, and as a result the features are often shifted relative to the region. This shift has almost no effect on the quality of classification, but it introduces a significant error in segmentation. The segmentation branch therefore needs a more accurate projection. Sampling points are placed inside the region, and the feature values at these points are computed by bilinear interpolation of the original features. This approach improves mask accuracy by a relative 10-50%.
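The difference between the two projection methods can be sketched as follows. This is an illustrative toy rather than the authors' implementation: it assumes a single-channel feature map as a nested list and treats integer coordinates as the positions of the grid cells. The names `sample_nearest` and `sample_bilinear` are hypothetical.

```python
import math


def sample_nearest(fmap, x, y):
    """Read the feature at the nearest grid cell (coordinates get snapped)."""
    h, w = len(fmap), len(fmap[0])
    col = min(w - 1, max(0, math.floor(x + 0.5)))
    row = min(h - 1, max(0, math.floor(y + 0.5)))
    return fmap[row][col]


def sample_bilinear(fmap, x, y):
    """Interpolate the feature at fractional (x, y) from its 4 neighbours."""
    h, w = len(fmap), len(fmap[0])
    # Clamp the sampling point to the valid coordinate range.
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them.
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 10.0],
        [20.0, 30.0]]
nearest = sample_nearest(fmap, 0.25, 0.25)    # snaps to cell (0, 0)
bilinear = sample_bilinear(fmap, 0.25, 0.25)  # blends all four cells
print(nearest, bilinear)
```

With nearest-cell lookup, every sampling point between 0 and 0.5 returns the same value, so sub-cell offsets of the box are lost. Bilinear interpolation changes smoothly with the sampling position, so the features stay aligned with the region.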

The loss function of Mask R-CNN is made up of three terms: the classification loss, the bounding box regression loss and the segmentation loss. The segmentation loss and the function applied to the output of the segmentation branch are chosen so that classes do not compete with each other during segmentation.
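One way to get this "no competition between classes" property is to apply a sigmoid to every pixel independently and to compute a binary cross-entropy only on the mask of the ground-truth class. The sketch below follows that idea. The function names and the nested-list data layout are assumptions made for the example, and the classification and box terms are passed in as plain numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, gt_class, gt_mask):
    """Mean per-pixel binary cross-entropy on the ground-truth class's mask.

    mask_logits: K masks of raw scores, each H x W (nested lists).
    gt_class:    index of the ground-truth class.
    gt_mask:     H x W nested list of 0/1 target values.
    """
    logits = mask_logits[gt_class]  # masks of other classes do not contribute
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, gt_mask):
        for z, t in zip(logit_row, target_row):
            p = sigmoid(z)
            # Plain formula; a real implementation would use a numerically
            # stable variant for very large or very small logits.
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def total_loss(cls_loss, box_loss, msk_loss):
    """Combine the three terms by simple summation."""
    return cls_loss + box_loss + msk_loss


# Two classes, 1 x 2 masks. Only class 0's logits matter for this object.
logits_a = [[[0.0, 0.0]], [[5.0, -5.0]]]
logits_b = [[[0.0, 0.0]], [[-3.0, 2.0]]]
target = [[1, 0]]
loss_a = mask_loss(logits_a, gt_class=0, gt_mask=target)
loss_b = mask_loss(logits_b, gt_class=0, gt_mask=target)
print(loss_a, loss_b)
```

Both calls return the same value even though the class 1 logits differ. With a softmax across classes this would not hold, because raising the score of one class would lower the probability of the others.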

The authors of this architecture provide 4 networks pre-trained on the COCO dataset.



TensorFlow recently added Mask R-CNN to its repository. The paper on this architecture came out even earlier and caught our attention immediately, so we decided to test it. Because the authors made 4 pre-trained models publicly available, this turned out to be quite simple.

We took a real task we are currently working on, in which people’s feet need to be detected, and obtained very good results.

Because the models were trained on the COCO dataset, which consists of 123,287 images and 886,284 objects (66,808 people), the pre-trained models coped well with detecting people, and their legs in particular.

Mask R-CNN not only finds the boundaries of objects and classifies them, but also segments them. The network architecture is based on Faster R-CNN, with segmentation implemented as an additional branch of the network. Segmentation is computed in parallel with bounding box classification. The authors also introduced a new, more accurate mechanism for projecting features, which further increased accuracy. According to the authors, the additional segmentation branch is fast, and the whole network processes 5 frames per second.