**Machine Learning**

# 2nd Prize Approach of The 4th Tellus Satellite Challenge: Explanation of the 2nd-Place Method

The 2nd place winner of The 4th Tellus Satellite Challenge explains his approach.

## Preparation

#### Reading of images:

The provided images are in TIFF format with pixel values in the range [0,65535]. We pre-processed the images into the [0,1] range in two ways. The first is a linear mapping, f' = f/65535, where f is the original image and f' is the output. The second reflects the physical properties of the satellite sensor (noise ratio) as a non-linear transformation, f' = 10·log10(f^2) + c, where c is a noise reduction coefficient, namely c = -83 in our case. Our ensemble includes models trained on both the linear and the non-linear input.
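The two mappings can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes that c is added outside the logarithm, that zero pixels are clamped before taking the log, and that the decibel values are clipped to a hypothetical range `[db_min, db_max]` before rescaling to [0,1], none of which the article specifies.

```python
import numpy as np

def normalize_linear(img):
    """Linear mapping of 16-bit pixel values into [0, 1]: f' = f / 65535."""
    return img.astype(np.float32) / 65535.0

def normalize_db(img, c=-83.0, db_min=-40.0, db_max=0.0):
    """Non-linear mapping: convert to decibels, then clip and rescale to [0, 1].

    Assumptions (not from the article): c sits outside the logarithm,
    zero pixels are clamped to 1, and [db_min, db_max] is an arbitrary
    illustrative range.
    """
    f = np.maximum(img.astype(np.float32), 1.0)   # avoid log10(0)
    db = 10.0 * np.log10(f ** 2) + c              # f' = 10*log10(f^2) + c
    return np.clip((db - db_min) / (db_max - db_min), 0.0, 1.0)
```

Both functions return arrays of the same shape as the input, so either can feed the same model.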

#### Normalizing images into the model input resolution:

Further manipulation can involve rescaling and cropping the image to the model resolution. There are two main reasons to avoid rescaling. Firstly, the training dataset is small, only 25 images, so we need to use as much information as possible. If we rescaled the images to the model resolution, in some cases we would use less than one percent of the original data. Secondly, all the images have the same physical resolution, i.e., one pixel always represents the same size in meters, but they differ in pixel dimensions, i.e., they capture differently sized areas of the landscape. Rescaling them to the same resolution would introduce additional distortion. Therefore, we used cropping in our pipeline. Because cropping in advance brings no benefit, we postponed it to online augmentation.

#### Handling labels:

We used two kinds of labels. The original labels take values in {0,1}: a pixel has value 1 if it is a coastline and 0 otherwise. The classes are highly imbalanced, i.e., the total number of coastline pixels is tiny compared to the rest. To mitigate this imbalance, we applied a form of label smoothing and obtained labels in the interval [0,1]; this is the first kind of labels. We also manually re-annotated the original labels into {sea, land, no-data} classes (the no-data class marks image areas with missing information); this is the second kind. These classes are more balanced and carry more information. In our pipeline, some models were trained on the original labels (smoothed coastline) and some on the modified version (labels as classes) to ensure the diversity of the models in the ensemble.
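The article does not say which form of label smoothing was used. One plausible variant, shown here purely as an assumption, is to blur the binary coastline mask with a Gaussian kernel and rescale it so that the coastline itself keeps the value 1. The function name and the `sigma` parameter are hypothetical.

```python
import numpy as np

def smooth_labels(mask, sigma=2.0):
    """Blur a binary coastline mask into [0, 1] (assumed form of smoothing).

    Assumes both image sides are longer than the kernel (6 * sigma + 1).
    """
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = mask.astype(np.float64)
    # Separable Gaussian: filter along each row, then along each column.
    out = np.apply_along_axis(np.convolve, 1, out, kernel, mode="same")
    out = np.apply_along_axis(np.convolve, 0, out, kernel, mode="same")
    peak = out.max()
    return out / peak if peak > 0 else out
```

Pixels near the coastline then receive non-zero targets, which increases the share of informative pixels without changing where the maximum lies.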

#### Data augmentation:

Our intensity augmentation involves a multiplicative intensity shift and cutout. Our spatial augmentation includes flips, rotation, rescaling, random crops, and multi-sample mosaicing. In our pipeline, we took an original image, created a crop at a random position with a side length in the interval [1024,1536], and downscaled it to our model's side length, 512. Then we applied the other augmentations. Finally, we applied our own custom augmentation, multi-sample mosaicing: we split the data sample into n rectangular parts and replaced some of them with a same-sized data area from a different image from the training set. The main advantage is that such a composed image includes multiple characteristics, which simulates a bigger batch size and can therefore delay overfitting.
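The random crop and the mosaicing step might look roughly like the sketch below. Several details are assumptions made for illustration: nearest-neighbour downscaling, a regular `grid` x `grid` split, a replacement probability `p`, and donor tiles taken from the same position in another, already cropped sample. The function names are hypothetical.

```python
import numpy as np

def random_crop(img, rng, min_side=1024, max_side=1536, out_side=512):
    """Cut a random square crop and downscale it to out_side x out_side."""
    h, w = img.shape[:2]
    side = min(int(rng.integers(min_side, max_side + 1)), h, w)
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_side) * side // out_side  # nearest-neighbour indices
    return crop[idx][:, idx]

def mosaic(sample, others, rng, grid=2, p=0.5):
    """Replace each tile of `sample`, with probability p, by the tile at the
    same position in a randomly chosen sample from `others`."""
    out = sample.copy()
    h, w = sample.shape[:2]
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    for i in range(grid):
        for j in range(grid):
            if rng.random() < p:
                donor = others[int(rng.integers(len(others)))]
                out[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = \
                    donor[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
    return out
```

In a real pipeline the same crop window and the same tile swaps would also have to be applied to the label mask; the sketch handles a single array only.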

## Modeling

#### Main architecture:

Segmentation architectures are based on an encoder-decoder scheme, i.e., the network encodes the input representation into a latent one, which is then projected back (decoded) to the original resolution. The spatial dimension is reduced during encoding while the feature dimension is increased, and vice versa during decoding. The typical representative is U-Net, whose main advantages are simplicity of coding and low memory requirements compared to the other architectures. We selected U-Net for the competition because it allowed us to use a bigger batch size than the other architectures with the same setting. Compared to FPN, it also yields a lower error, namely 16.2 vs. 17.2.

#### Backbone:

In U-Net, the default encoder's ability to extract features is limited, so it is beneficial to replace it with one of the state-of-the-art networks known, e.g., from the classification problem. These networks are called backbones and can be pre-trained on ImageNet to converge faster. The most powerful include ResNet, SE-ResNet, ResNeXt, Inception-ResNet, Xception, and ResNeSt, to name a few. In our pipeline, we selected EfficientNet, or EffNet for short, a fully convolutional architecture based on compound scaling that allows easy control of the trade-off between network capacity and memory requirements.

#### Loss function:

Because we planned to create an ensemble of different models, we trained several models based on U-Net with an EffNet backbone. Common loss functions are based either on group characteristics, such as IoU, Tversky loss, or Sorensen-Dice loss, or on pixel characteristics, such as binary cross-entropy (BCE) or Focal loss. Each of them has pros and cons. Sorensen-Dice considers the spatial dimension but generally does not lead to the best performance; Focal loss can partially solve the class imbalance problem but may overfit; BCE is a good and universal baseline. In our pipeline, we combined Dice with Focal for two models and BCE for the other two models.
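For reference, the three losses named above can be written in a few lines of NumPy. This is a sketch of the standard definitions, not the authors' training code; the equal-weight sum in `dice_focal_loss`, the focusing parameter `gamma=2`, and the epsilon values are assumptions, since the article does not state how the terms were weighted.

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Sorensen-Dice loss: 1 - 2*sum(p*y) / (sum(p) + sum(y))."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy down-weighted by (1 - p_t)^gamma."""
    p = np.clip(p, eps, 1.0 - eps)
    p_t = y * p + (1.0 - y) * (1.0 - p)   # probability given to the true class
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def dice_focal_loss(p, y):
    """Equal-weight sum of Dice and Focal (the weighting is an assumption)."""
    return dice_loss(p, y) + focal_loss(p, y)
```

Here `p` holds predicted probabilities and `y` the targets, both as arrays of the same shape.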

#### Optimizer:

The first choice is generally Adam, a combination of AdaGrad and RMSProp, which has repeatedly been reported as one of the best optimizers. On the other hand, there are known problems (such as CIFAR-10 classification) where it yields sub-optimal performance, and our own experience confirms this behavior. Therefore, we used Adam for two models and AdaDelta for the other two.

#### Training:

The rest of the training setting is as follows. The model resolution is 512x512 px, and we use as large a batch size as possible, varying from 3 to 12. The models were trained for 100 epochs, reducing the learning rate on a plateau and saving the best model according to the validation dataset. The models with sea/land/no-data labels end with a softmax layer (a smooth approximation of one-hot argmax); the models with only the coastline class end with a sigmoid layer (a smooth logistic function). This means the former decides between the classes, while the latter produces the probability of being a coastline.
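The difference between the two output layers can be illustrated with the standard definitions of both functions. This is a generic NumPy sketch, not code from the pipeline.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Turn per-class scores into probabilities that sum to 1 along `axis`."""
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(logits):
    """Map a single score per pixel to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))
```

With softmax, the three class probabilities of a pixel compete with each other, so taking the argmax gives a class decision. With sigmoid, each pixel gets one standalone coastline probability.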

## Post-processing

#### Ensemble:

To produce predictions, each model first creates its own ensemble. We used a sliding-window technique, creating overlapping crops at multiple scales that match the conditions of the training phase. Because inference is significantly less memory-demanding than training, we were able to process hundreds of crops at once, so the process was fast. When the predictions were projected back into the original image, the overlapping parts were aggregated by summation, because the coastline extraction described below does not depend on absolute values. The resulting predictions were smoothed by a Gaussian filter to decrease the impact of noisy outliers.
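A single-scale version of this window-and-sum aggregation could be sketched as follows. The `model` argument is a hypothetical callable that maps a crop to a prediction of the same height and width; the window size, the stride, and the border handling are assumptions, and the multi-scale crops and the final Gaussian smoothing are left out for brevity.

```python
import numpy as np

def window_starts(length, window, stride):
    """Start offsets of overlapping windows that cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window, stride))
    starts.append(length - window)   # align the last window with the border
    return starts

def predict_full(img, model, window=512, stride=256):
    """Sum the predictions of overlapping crops into a full-size map."""
    h, w = img.shape[:2]
    acc = np.zeros((h, w), dtype=np.float32)
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            crop = img[top:top + window, left:left + window]
            pred = model(crop)       # same height and width as the crop
            acc[top:top + crop.shape[0], left:left + crop.shape[1]] += pred
    return acc
```

Because the overlaps are summed rather than averaged, pixels covered by more windows get larger values, which is acceptable only when the later steps compare values relative to their neighbours rather than against a fixed threshold.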

#### Extracting the coastline from predictions:

The process of extracting coastline differs for models with softmax in the last layer and for models with sigmoid in the last layer.

*Softmax models:* These models use the following label encoding: 0=sea, 1=no-data, 2=land. The models produce a structure f where each pixel (x,y) is a vector of three values, one probability per class. Firstly, we create f'(x,y) = argmax(f(x,y)), so it holds that f'(x,y) is in {0,1,2}. From it, we create the final 'coastline' image f'' as f''(x,y) = 1 if ma(x,y) - mi(x,y) = 2, and 0 otherwise. Here, ma and mi extract the maximum and minimum value of f' in the 3×3 neighborhood of (x,y). For f'', it holds that f''(x,y)=1 marks the presence of a coastline and f''(x,y)=0 its absence. In other words, we say that there is a coastline if an area contains both 'sea' and 'land' classes, regardless of the 'no-data' class.
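The rule for softmax models translates almost directly into NumPy. In this sketch the function name is hypothetical and the image border is handled by edge replication, which is an assumption; the article does not say how border pixels were treated.

```python
import numpy as np

def coastline_from_classes(probs):
    """probs: (H, W, 3) class probabilities with 0=sea, 1=no-data, 2=land."""
    cls = np.argmax(probs, axis=-1)              # f' with values in {0, 1, 2}
    padded = np.pad(cls, 1, mode="edge")         # assumed border handling
    h, w = cls.shape
    # Stack the nine shifted views that form each pixel's 3x3 neighbourhood.
    shifts = np.stack([padded[dy:dy + h, dx:dx + w]
                       for dy in range(3) for dx in range(3)])
    ma, mi = shifts.max(axis=0), shifts.min(axis=0)
    return (ma - mi == 2).astype(np.uint8)       # f'': 1 = coastline
```

The difference ma - mi equals 2 only when the neighbourhood contains both class 0 (sea) and class 2 (land), so a sea/no-data or land/no-data border never produces a coastline pixel.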

*Sigmoid models:* Firstly, we initialize f''(x,y) = 0 for all (x,y) and then check whether the image has a landscape or portrait orientation. For landscape, we browse all x coordinates and for each of them set f''(x, argmax(f(x,_))) = 1, where _ stands for all y coordinates. In other words, in each column we search all rows for the maximum probability of being a coastline. The process is analogous for portrait orientation, where we search each row for the maximum over all columns. The advantage of the process is the absence of a threshold, so we are able to extract even the most uncertain coastlines. The disadvantage is that we can miss coastline points if the coastline slope is steeper than the diagonal, or miss a whole coastline if there are two coastlines in one row/column. We suppress these disadvantages in the post-processing described later.
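A compact sketch of this threshold-free extraction is shown below. The function name is hypothetical, the array is indexed as `prob[y, x]`, and treating square images as landscape is an assumption.

```python
import numpy as np

def coastline_from_probability(prob):
    """prob: (H, W) coastline probabilities, indexed as prob[y, x]."""
    h, w = prob.shape
    out = np.zeros((h, w), dtype=np.uint8)
    if w >= h:   # landscape: mark the most probable row in every column
        out[np.argmax(prob, axis=0), np.arange(w)] = 1
    else:        # portrait: mark the most probable column in every row
        out[np.arange(h), np.argmax(prob, axis=1)] = 1
    return out
```

Exactly one pixel is marked per column (or per row), which makes both stated disadvantages visible: a steep coastline segment or a second coastline in the same column cannot be represented.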

#### Ensemble of coastlines:

The output images f'' of the individual models were processed in the following way. We browsed the images column-wise or row-wise, in the same way as when extracting predictions from the sigmoid models. If we browse rows, then for each row we find the column coordinate of the coastline in each model's output and create the final prediction as the weighted average of the four predictions. The models' weights were set according to each model's score on the competition's public leaderboard.
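For the row-wise case, the weighted averaging of coastline positions might be sketched as follows. The function name is hypothetical, and two details are assumptions: when a model marks several pixels in one row, their mean column is used, and models with no coastline pixel in a row are skipped for that row.

```python
import numpy as np

def ensemble_rows(masks, weights):
    """Row-wise weighted average of coastline columns across models.

    masks: list of (H, W) binary coastline images, one per model.
    weights: one weight per model.
    """
    h, w = masks[0].shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        cols, ws = [], []
        for mask, weight in zip(masks, weights):
            xs = np.flatnonzero(mask[y])
            if xs.size:                    # skip models without a hit here
                cols.append(xs.mean())     # assumption: average multiple hits
                ws.append(weight)
        if cols:
            x = int(round(float(np.average(cols, weights=ws))))
            out[y, x] = 1
    return out
```

The column-wise case is the same procedure applied to the transposed masks.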

## Afterword

Satellite imagery is one of many areas where artificial intelligence can be applied. It is interesting because it is connected with space hardware orbiting the Earth. The problem is the accessibility of the data: there are not as many publicly available datasets as in general object classification or detection, and the available datasets usually come in a special format connected with GIS. So, there is a big opportunity to work in a team consisting of geographers with GIS knowledge, image-processing specialists, and people focusing on artificial intelligence.

#### Books, websites, etc. that helped you to participate in the competition

The starting point for us was the organizers' website, namely https://sorabatake.jp/14130, from which we continued to the papers that the website referenced. During the competition, we read countless scientific papers about segmentation, consulted benchmarks on https://paperswithcode.com/task/semantic-segmentation, and examined old winning solutions on https://kaggle.com/c/

To see the solutions of other winners, please see this article.

**Machine Learning** 田上健太 (Tisch合同会社)