Recent work in computer vision has demonstrated the success of convolutional neural networks in learning deep representations for scene classification, object recognition, and scene parsing. Deep network architectures have proven especially adept at encoding the input space despite the high complexity, variance, and noise inherent in natural images. These models have been aided by large-scale datasets and GPU-accelerated computing, which enable them to learn high-level semantics while dramatically reducing training time. One recent model uses a fully convolutional network (FCN) to perform pixel-wise prediction for semantic segmentation with state-of-the-art results. That model, however, is restricted to single-label prediction and assumes independence between labels. To address this shortcoming, this work introduces a multi-label FCN for hierarchical object parsing. The multi-label FCN specifies a compositional label space and assigns each image pixel to multiple coherent labels, e.g. eye, head, and upper body. This compositional model mirrors the human capacity to form associations between semantically related objects. Empirical evidence shows that the multi-label FCN improves accuracy, especially when classifying low-resolution objects. This work additionally explores semantically ambiguous, location-wise labels, which also aid prediction performance.
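The core change from single-label to multi-label prediction can be sketched as follows: instead of a softmax over mutually exclusive classes, each pixel receives an independent sigmoid score per label, so one pixel can simultaneously be "eye", "head", and "upper body". This is an illustrative NumPy sketch under assumed label names and thresholds, not the paper's implementation.

```python
import numpy as np

# Hypothetical fine-to-coarse label hierarchy (illustrative only).
LABELS = ["eye", "head", "upper_body"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_predict(logits, threshold=0.5):
    """logits: (H, W, L) per-pixel scores, one channel per label.
    Returns a boolean (H, W, L) mask; a pixel may activate several labels."""
    return sigmoid(logits) > threshold

def binary_cross_entropy(logits, targets):
    """Per-label binary cross-entropy averaged over all pixels and labels;
    targets are 0/1 indicators, one per label, so labels are not
    forced to be mutually exclusive as in a softmax loss."""
    p = sigmoid(logits)
    eps = 1e-7  # avoid log(0)
    return float(np.mean(-(targets * np.log(p + eps)
                           + (1.0 - targets) * np.log(1.0 - p + eps))))

# Toy 2x2 "image" with 3 label channels: the top-left pixel is an eye,
# so it also carries the coherent head and upper-body labels; its
# neighbor is head-only; the rest is background.
logits = np.full((2, 2, 3), -4.0)   # strongly negative = label absent
logits[0, 0, :] = 4.0               # eye pixel: all three labels active
logits[0, 1, 1:] = 4.0              # head pixel: head + upper body
mask = multilabel_predict(logits)
```

The per-label sigmoid is what lets the compositional label space express hierarchy: a pixel's ground truth simply sets every ancestor label to 1, and the loss treats each channel as its own binary decision.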