
U-Net based building footprint pre-annotation

In this article, I would like to present our new building pre-annotation feature for aerial images on the SuperAnnotate platform, share its code and algorithm, and explain the motivation for integrating it into our platform.


Fig. 1: Sample pre-annotation.



  • Motivation: why we decided to create a building pre-annotation algorithm
  • Description of our algorithm and code
  • Future Roadmap
  • Concluding remarks

Motivation: why did we create a building pre-annotation algorithm

Annotating aerial images is tedious work, and annotating hundreds of thousands of buildings takes a lot of effort and money. Here at SuperAnnotate we strive to use state-of-the-art computer vision technology to automate and accelerate the creation of pixel-perfect annotations. As part of that effort, several smart pre-annotation algorithms were integrated into the SuperAnnotate platform, allowing our users to “fix” the auto-generated annotations instead of starting from scratch. As a result, our users can produce annotations of the same quality with less effort.

Fig. 2 shows how to get the auto-generated annotations in SuperAnnotate vector projects.

Fig. 2: Steps to run smart predictions on vector projects in SuperAnnotate. Left: First, select the images, then click on the smart-prediction icon. Right: Choose “Building detection from aerial imagery” from the list of available models.

The algorithm and code description

Our algorithm is based on the winning solution of the SpaceNet building detection challenge. SpaceNet is a corpus of commercial satellite imagery and labeled training data for machine learning research. Its organizers host building and road detection challenges and open-source the best solutions.

The winner of the second building detection challenge uses a segmentation algorithm called U-Net, then cuts the segmentation mask into building footprints. The U-Net architecture is shown in Fig. 3. It can be summarized as an encoder-decoder network with skip connections between the corresponding layers of the encoder and decoder. U-Net is fast to train and performs well even on relatively small datasets. The winner of the SpaceNet challenge used only 4 layers instead of 5, removing the last layer with 1024 channels. We verified that adding the layer back does not result in any improvement.
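To make the "cut the segmentation mask into building footprints" step concrete, here is a minimal sketch that splits a binary mask into separate regions with 4-connected component labeling. This is a simple stand-in technique, not necessarily the post-processing used by the winning solution or by our code; the function name `split_mask_into_footprints` and the `min_area` parameter are hypothetical and chosen only for illustration.

```python
from collections import deque

import numpy as np


def split_mask_into_footprints(mask, min_area=1):
    """Label 4-connected foreground regions of a binary mask.

    Returns an int array of the same shape: 0 is background and
    1..K mark the K regions that contain at least `min_area` pixels.
    (Hypothetical helper, for illustration only.)
    """
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    visited = np.zeros(mask.shape, dtype=bool)
    labels = np.zeros(mask.shape, dtype=np.int32)
    next_label = 0

    for r in range(height):
        for c in range(width):
            if not mask[r, c] or visited[r, c]:
                continue
            # Breadth-first flood fill collects one connected region.
            visited[r, c] = True
            queue = deque([(r, c)])
            pixels = [(r, c)]
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
                        pixels.append((ny, nx))
            # Keep the region only if it is large enough to be a building.
            if len(pixels) >= min_area:
                next_label += 1
                rows, cols = zip(*pixels)
                labels[list(rows), list(cols)] = next_label
    return labels


# Toy mask: two "buildings" and one isolated pixel treated as noise.
toy_mask = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
])
footprints = split_mask_into_footprints(toy_mask, min_area=2)
```

Each labeled region can then be traced into a polygon so that it appears as an editable annotation. A pure-Python flood fill is slow on large tiles, so a library routine for connected components would be the practical choice there.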

We merged the datasets of Vegas, Paris and Shanghai and trained a single network on the combined data. We did not use the Khartoum annotations because of their lower quality. We also added some augmentation, which helped the network adapt to images from cities not present in the training data. Our PyTorch code is open-sourced here with all the necessary instructions. We were able to achieve an IoU of 0.545 on the test set.
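For readers less familiar with the metric, IoU (intersection over union) compares a predicted mask with the ground truth. The sketch below computes a pixel-wise IoU for a pair of binary masks. It is an illustration only: the article does not specify whether the reported score is pixel-wise or per building, and the function name `mask_iou` and the handling of two empty masks are assumptions.

```python
import numpy as np


def mask_iou(pred, target):
    """Pixel-wise intersection over union of two binary masks.

    (Hypothetical helper; the evaluation code in the repository may differ.)
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)


pred = np.array([[1, 1, 0],
                 [1, 1, 0],
                 [0, 0, 0]])
target = np.array([[0, 1, 1],
                   [0, 1, 1],
                   [0, 0, 0]])

# The masks overlap in 2 pixels and together cover 6, so IoU = 2 / 6.
iou = mask_iou(pred, target)
```

An IoU of 1.0 means the prediction matches the ground truth exactly, and 0.0 means the two masks do not overlap at all.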

Fig. 3: U-Net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.
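As a rough illustration of the architecture described above, here is a minimal PyTorch sketch of a U-Net with four resolution levels (64, 128, 256 and 512 channels), that is, without the 1024-channel level. It is not the code from our repository: the class name `SmallUNet`, the use of padded 3x3 convolutions (so that the output has the same size as the input) and the transposed-convolution upsampling are all assumptions made to keep the example short.

```python
import torch
from torch import nn


def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU; padding keeps the spatial size."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class SmallUNet(nn.Module):
    """Encoder-decoder with skip connections (hypothetical sketch)."""

    def __init__(self, in_channels=3, out_channels=1,
                 widths=(64, 128, 256, 512)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)

        # Encoder: one double-conv block per resolution level.
        self.encoders = nn.ModuleList()
        prev = in_channels
        for width in widths:
            self.encoders.append(double_conv(prev, width))
            prev = width

        # Decoder: upsample, concatenate the skip, then double-conv.
        self.upconvs = nn.ModuleList()
        self.decoders = nn.ModuleList()
        for width in reversed(widths[:-1]):
            self.upconvs.append(
                nn.ConvTranspose2d(prev, width, kernel_size=2, stride=2))
            self.decoders.append(double_conv(2 * width, width))
            prev = width

        # 1x1 convolution maps features to per-pixel building logits.
        self.head = nn.Conv2d(prev, out_channels, kernel_size=1)

    def forward(self, x):
        skips = []
        last = len(self.encoders) - 1
        for i, encoder in enumerate(self.encoders):
            x = encoder(x)
            if i < last:
                skips.append(x)  # saved for the matching decoder level
                x = self.pool(x)
        for upconv, decoder in zip(self.upconvs, self.decoders):
            x = upconv(x)
            x = torch.cat([skips.pop(), x], dim=1)
            x = decoder(x)
        return self.head(x)


model = SmallUNet()
with torch.no_grad():
    # Height and width must be divisible by 8 (three 2x poolings).
    logits = model(torch.zeros(1, 3, 64, 64))
```

Applying a sigmoid to `logits` and thresholding the result gives a binary building mask of the same height and width as the input tile.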

Future Roadmap

Currently, we have a model that works fairly well on most of the city images we have. Still, we believe that our model will benefit from more data from different cities. We also plan to implement a road detection algorithm to assist our users with road annotations.

Concluding remarks

We will keep posting updates on our progress with building and road pre-annotations on this Medium channel. Please follow the channel to be the first to get those updates!