1st Prize Approach of The 4th Tellus Satellite Challenge: Explanation of the 1st-Place Method
The 1st place winner of The 4th Tellus Satellite Challenge explains his approach.
Shoreline extraction in a single-polarized SAR image is a very challenging task by itself. One can approach this problem by solving a two-class (land/water) segmentation problem. The shoreline location is extracted as the separation line between predicted classes. Another approach is to segment the image directly into positive/negative pixels of the shoreline. Essentially, it is a binary image segmentation task.
In my solution, I used the second approach. This choice introduces another problem: a highly unbalanced number of positive and negative pixels in the training set. Another issue was the data annotation quality. After a manual review of the data, I discovered a few cases where the ground-truth coastline was wrong and a couple of imperfect annotations.
Instead of predicting hard labels of 0 and 1, I rendered a heatmap as my ground-truth mask. I used a Gaussian kernel with a diameter of 70 pixels to draw the coastline. Using a heatmap instead of hard labels helped make the class distribution a little more even. It also allowed the model to make more “relaxed” predictions of the coastline. Because of the soft labels in the heatmap, the loss function penalized the model less if the predicted coastline was a few pixels “off” the ground truth.
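To make the heatmap idea concrete, here is a minimal NumPy sketch of rendering a soft mask from coastline pixels. The function name, the choice of sigma = diameter / 6, and the use of a per-pixel maximum when stamps overlap are all my assumptions for illustration; the original pipeline may have rendered the heatmap differently.

```python
import numpy as np

def render_coastline_heatmap(points, height, width, diameter=70):
    """Render a soft ground-truth mask from (row, col) coastline pixels.

    Assumption: sigma = diameter / 6, so the Gaussian has almost fully
    decayed at the edge of the 70-pixel stamp.
    """
    sigma = diameter / 6.0
    radius = diameter // 2
    heatmap = np.zeros((height, width), dtype=np.float32)

    # Precompute a single Gaussian stamp centred on (radius, radius).
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    stamp = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)).astype(np.float32)

    for r, c in points:
        # Clip the stamp window to the image borders.
        r0, r1 = max(r - radius, 0), min(r + radius + 1, height)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, width)
        patch = stamp[r0 - r + radius:r1 - r + radius,
                      c0 - c + radius:c1 - c + radius]
        # Keep the maximum so overlapping stamps never exceed 1.
        window = heatmap[r0:r1, c0:c1]
        np.maximum(window, patch, out=window)
    return heatmap
```

Pixels on the coastline get a value of 1, and the value falls off smoothly with distance, so a prediction that is a few pixels off still overlaps a non-zero target.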
I normalized SAR images to [0…1] range by dividing by 65535 and then taking the square root. I’ve also tried using raw SAR data and log-normalized, but I found sqrt-normalized input performs the best.
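The normalization described above amounts to a couple of lines. This is a small sketch that assumes the SAR image is a 16-bit unsigned array; the function name is mine.

```python
import numpy as np

def normalize_sar(image):
    """Scale 16-bit SAR intensities to [0, 1], then take the square root."""
    scaled = image.astype(np.float32) / 65535.0
    return np.sqrt(scaled)
```

The square root stretches the dark end of the range, which is where most SAR backscatter values tend to sit, while keeping the output within [0, 1].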
For training, I used original-size images and didn’t apply downsampling to prevent the loss of any details from the source image.
My approach to segmenting the coastline was based on an Encoder-Decoder CNN segmentation model. After running many experiments, I found that EfficientNet encoders (I used EfficientNet B4 in my final solution) combined with a UNet decoder performed the best in this task. High-resolution networks, FPN, DeepLab, and other well-known approaches did not perform well enough, presumably due to the need for high accuracy of the predictions. The EfficientNet encoder does not have max-pooling operators in the stem block, which may explain why it performed better than ResNet / DenseNet encoders.
It was essential to monitor IoU and the competition metric simultaneously during training because the two were poorly correlated. For that reason, I saved the best checkpoints based on the competition metric score.
I used the following software stack to build a training pipeline:
– PyTorch 1.6
Given the small size of the training set, it was crucial to prevent the trained models from over-fitting. Dropout of 0.5 before the final layer and weight decay helped, but only partially. Therefore, I applied a broad spectrum of augmentations to artificially increase the diversity of training samples. For augmentations, I used the Albumentations library and the following set of augmentations:
– Random rotation by 90 degrees & random transpose (to cover the full range of D4 augmentations)
– Grid Shuffle
– Coarse dropout
– Affine shift, scale & rotation
– Elastic transformation
– Grid distortion
– Perspective distortion
– Random change of brightness & contrast
– Gaussian blur & noise
Applying such heavy image augmentation slowed down training (compared to training without augmentations). However, it helped prevent over-fitting, and the final validation metrics were higher than for models trained without augmentation.
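The first item in the list, rotation by multiples of 90 degrees combined with a random transpose, can be illustrated without any library. The sketch below uses plain NumPy rather than Albumentations, assumes single-channel 2D arrays, and uses function names of my own. It shows why those two operations together cover all eight D4 symmetries.

```python
import numpy as np

def random_d4(image, mask, rng):
    """Apply the same random D4 symmetry to a 2D image and its mask."""
    k = int(rng.integers(0, 4))          # number of 90-degree rotations
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:               # optional transpose
        image, mask = image.T, mask.T
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

def d4_variants(image):
    """Enumerate all 8 D4 variants: 4 rotations, each with and without transpose."""
    variants = []
    for k in range(4):
        rotated = np.rot90(image, k)
        variants.append(rotated.copy())
        variants.append(rotated.T.copy())
    return variants
```

Because the image and the mask go through the same transform, the coastline annotation stays aligned with the SAR pixels.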
I extracted tiles of 512×512 pixels during training, sampled around the coastline with 75% probability (the remaining 25% were sampled randomly from the entire image).
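A possible way to implement this sampling scheme is sketched below. The function name, the random offset of the tile around the chosen coastline pixel, and the assumption that the image is at least as large as the tile are mine; the original sampler may have worked differently.

```python
import numpy as np

def sample_tile_origin(coast_mask, tile=512, p_coast=0.75, rng=None):
    """Return the (row, col) of the top-left corner of a training tile.

    With probability p_coast the tile is guaranteed to contain a
    coastline pixel; otherwise it is drawn uniformly from the image.
    Assumes the image is at least tile x tile pixels.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = coast_mask.shape
    if rng.random() < p_coast and coast_mask.any():
        rows, cols = np.nonzero(coast_mask)
        i = rng.integers(len(rows))
        # Place the chosen coastline pixel at a random position in the tile.
        y = rows[i] - rng.integers(0, tile)
        x = cols[i] - rng.integers(0, tile)
    else:
        y = rng.integers(0, h - tile + 1)
        x = rng.integers(0, w - tile + 1)
    # Keep the tile inside the image.
    y = int(np.clip(y, 0, h - tile))
    x = int(np.clip(x, 0, w - tile))
    return y, x
```

Biasing the sampler toward the coastline means most training tiles contain positive pixels, which partly offsets the class imbalance mentioned earlier.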
Running inference on such huge images was also non-trivial. One cannot simply load an image of tens of megapixels onto the GPU. The memory-efficient approach is to split the original image into overlapping tiles, run inference on each tile separately, and merge the predictions from all tiles into a full-resolution mask. This algorithm is available through my open-source library pytorch-toolbelt.
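The tiling idea can be sketched in plain NumPy as follows. This is not the pytorch-toolbelt API. The function names, the pyramid-shaped blending weights, and the `predict_fn` callback (standing in for the model) are assumptions made for illustration.

```python
import numpy as np

def tile_starts(size, tile, step):
    """Start offsets along one axis so that the tiles cover [0, size)."""
    starts = list(range(0, max(size - tile, 0) + 1, step))
    if starts[-1] + tile < size:
        starts.append(size - tile)   # extra tile flush with the border
    return starts

def predict_tiled(image, predict_fn, tile=1024, overlap=512):
    """Run predict_fn on overlapping tiles and blend into a full-size map."""
    h, w = image.shape
    step = tile - overlap
    acc = np.zeros((h, w), dtype=np.float32)
    norm = np.zeros((h, w), dtype=np.float32)

    # Pyramid weights: tile centres count more than tile borders.
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1))
    weight = np.outer(ramp, ramp).astype(np.float32)

    for y in tile_starts(h, tile, step):
        for x in tile_starts(w, tile, step):
            patch = image[y:y + tile, x:x + tile]
            pred = predict_fn(patch)
            wgt = weight[:patch.shape[0], :patch.shape[1]]
            acc[y:y + tile, x:x + tile] += pred * wgt
            norm[y:y + tile, x:x + tile] += wgt
    return acc / norm
```

Only one tile needs to be on the GPU at a time, and the weighted average hides seams where neighbouring tiles disagree near their borders.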
In my final solution, I used 1024×1024 tiles with 512 pixels overlap between them. To get an additional accuracy boost, I used test-time augmentation. More specifically, I used D2 augmentation and multi-scale (+128, -128 pixels) averaging to get predictions for a single tile.
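Test-time augmentation with flips can be sketched as below. I assume here that D2 means the identity, horizontal flip, vertical flip, and 180-degree rotation; the multi-scale part is left out, and `predict_fn` is a hypothetical stand-in for the model.

```python
import numpy as np

def predict_d2_tta(tile, predict_fn):
    """Average predictions over the four D2 views of a 2D tile."""
    # Every D2 transform is its own inverse, so the same function
    # maps the prediction back to the original orientation.
    transforms = [
        lambda a: a,
        np.fliplr,
        np.flipud,
        lambda a: np.rot90(a, 2),
    ]
    preds = [t(predict_fn(t(tile))) for t in transforms]
    return np.mean(preds, axis=0)
```

If the model has a directional bias (for example, it systematically shifts the coastline to one side), averaging over mirrored views tends to cancel it.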
The goal in this competition was to predict the exact coastline. To go from the heatmap to exact coordinates, I used a purely computer-vision post-processing algorithm:
1. Heatmap smoothing. I applied median blur and pyramid mean-shift filtering to make the heatmap smoother.
2. Binary thresholding of the heatmap using Otsu’s algorithm. After this step, we end up with hard labels of the coastline, yet to be thinned.
3. Skeletonization of connected components and pruning of the skeleton to remove spurious branches.
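Step 2 can be illustrated with a small NumPy implementation of Otsu's method. This sketch assumes the heatmap is already in the [0, 1] range and covers thresholding only; the smoothing and skeletonization steps are left out. The function name and the bin count are my choices.

```python
import numpy as np

def otsu_threshold(heatmap, bins=256):
    """Pick the threshold that maximizes the between-class variance."""
    hist, edges = np.histogram(heatmap, bins=bins, range=(0.0, 1.0))
    centers = (edges[:-1] + edges[1:]) / 2

    w0 = np.cumsum(hist)                 # pixel count at or below each bin
    w1 = w0[-1] - w0                     # pixel count above each bin
    s0 = np.cumsum(hist * centers)       # intensity sum at or below each bin
    m0 = s0 / np.maximum(w0, 1)          # mean of the lower class
    m1 = (s0[-1] - s0) / np.maximum(w1, 1)  # mean of the upper class

    between = w0 * w1 * (m0 - m1) ** 2   # proportional to between-class variance
    k = int(np.argmax(between))
    return float(edges[k + 1])

# Usage: binary = heatmap > otsu_threshold(heatmap)
```

Otsu's method needs no hand-tuned threshold, which is convenient when the heatmap contrast differs from image to image.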
I also tried to smooth the coastline using curve approximation, but it produced much worse scores than this simple post-processing step.
I think there were a few essential ingredients to my solution:
1. Heavy data augmentation. Given the small amount of training data, it was necessary to generate more training samples artificially. I used a set of spatial augmentations, including rotation and flips, and several noise methods to make the model robust to speckle noise. I did not apply any noise-removal (Lee, etc.) filters.
2. Proper cross-validation. I used 5-fold CV to train my model and selected the submission based on the best CV score.
3. Good segmentation model. I used an Encoder-Decoder architecture with an EfficientNet B4 encoder and a UNet decoder, which is known to be a SOTA approach for binary image segmentation. I used ImageNet pre-trained weights and trained this model on the provided training data. It was essential to find the right balance between model capacity and overfitting. During my experiments, I discovered that the EfficientNet B4 encoder shows the best performance on cross-validation.
4. TTA and Ensembling. During inference, I used multi-scale inference with D2 augmentations. For the final prediction, I averaged the predictions of my five models. This averaging helped reduce spurious artifacts caused by speckle noise and made the coastline predictions smoother.
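The cross-validation and ensembling ingredients can be sketched together. The helper names, the random shuffling, and the seed are assumptions for illustration; how the folds were actually split is not stated here.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, folds[i]

def average_predictions(heatmaps):
    """Ensemble by averaging the heatmaps predicted by each fold's model."""
    return np.mean(np.stack(heatmaps), axis=0)
```

Each image is used for validation exactly once, and the five resulting models are combined by a plain mean of their heatmaps before post-processing.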
To see the solutions of other winners, please see this article.
The 4th Tellus Satellite Challenge! ~ Check out the Winners’ Models ~