A Supplementary Material: CEREALS - Cost-Effective REgion-based Active Learning for Semantic Segmentation

Publication Type: Techreport
Year of Publication: 2018

A.1 Implementation Details

Instead of cropping the annotated regions out of the images while taking their receptive field in input space into account, we mask out all currently unlabeled data in output space, ensuring that no loss is computed on unlabeled data when learning either the semantic segmentation model or the cost model. We then train on whole images, mapping unprocessed input images to spatial label maps. Our practical implementation of CEREALS, which will be made publicly available, supports both options. For training the utilized models we use Adam as our optimizer, with the learning rate set to 0.0001 and the exponential decay rates β1 and β2 set to 0.99 and 0.999, respectively. Furthermore, we claim convergence whenever a model has not improved with respect to the application loss for at least 10 epochs. We train with a mini-batch size of 1, such that each gradient step is applied with respect to one full-resolution image of Cityscapes.

Semantic Segmentation Model

We do not train the employed model in stages, but directly optimize for FCN8s. When training on the full training set of Cityscapes, we report a mean intersection over union (mIoU) of 0.605, which, like all other results, is computed on the full validation set of Cityscapes. Note that the original model achieves an mIoU of 0.65 and that we are able to reproduce this result when the width multiplier is set to 1.0, despite all other changes. Although we utilized this particular model, CEREALS can use any model producing semantic segmentation masks, as long as it provides probability distributions over its posterior outcome. In such a case, however, the cost model would need to be adapted or made independent of the semantic segmentation model.

Cost Model

The only change we made to the original model's architecture is to replace its softmax activation with a linear activation layer.
We trained the model to minimize the mean squared error between predicted and ground truth clicks. Since we observed that some pixels have unrealistically many clicks in the ground truth data, we clipped the values to the range [0, 10], allowing a maximum of 10 ground truth clicks per pixel. Like the semantic segmentation model, the cost model does not have an upsampling layer at the end, in order to allow for faster training. Instead, we downscale the provided click data by a factor of 8.
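The output-space masking described in A.1 can be sketched as a per-pixel cross-entropy that is zeroed on unlabeled pixels and averaged over labeled pixels only. This is a minimal NumPy illustration under our own assumptions about tensor names and shapes, not the released CEREALS code:

```python
import numpy as np

def masked_cross_entropy(probs, labels, labeled_mask, eps=1e-12):
    """Per-pixel cross-entropy averaged over labeled pixels only.

    probs:        (H, W, C) softmax outputs of the segmentation model.
    labels:       (H, W) integer class ids (values at unlabeled pixels
                  are arbitrary and never enter the loss).
    labeled_mask: (H, W) binary mask, 1 where region annotations exist.
    """
    h, w, _ = probs.shape
    # Probability assigned to the ground-truth class at each pixel.
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    pixel_loss = -np.log(p_true + eps)
    # Mask out unlabeled pixels so they contribute no loss (and hence
    # no gradient) to either the segmentation or the cost model.
    masked = pixel_loss * labeled_mask
    return masked.sum() / max(labeled_mask.sum(), 1)
```

Because the mask multiplies the loss before averaging, changing the model's predictions on unlabeled pixels leaves the loss value untouched, which is the property the masking is meant to guarantee.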
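The cost-model target preparation (clipping click counts to [0, 10] and downscaling by a factor of 8) can be sketched as follows. The paper does not specify the downscaling scheme; summing clicks over non-overlapping 8×8 blocks is one plausible reading, chosen here because it preserves total click counts, and the function name and block-sum choice are our assumptions:

```python
import numpy as np

def prepare_click_target(clicks, factor=8, max_clicks=10):
    """Clip per-pixel click counts and downscale the click map.

    clicks: (H, W) array of ground-truth click counts; H and W are
    assumed divisible by `factor`. Each output cell holds the clipped
    click mass of one non-overlapping factor x factor input block.
    """
    # Cap unrealistically large per-pixel counts, as described above.
    clipped = np.clip(clicks, 0, max_clicks)
    h, w = clipped.shape
    # Group pixels into (factor x factor) blocks and sum within each.
    blocks = clipped.reshape(h // factor, factor, w // factor, factor)
    return blocks.sum(axis=(1, 3))
```

The resulting low-resolution map matches the cost model's output stride, so the mean squared error can be computed directly against the model's final feature map without any upsampling layer.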
