Paper reproduction. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Group 7 TU Delft
8 min read · Jun 17, 2021

Authors:

Group 7 TU Delft — CS4245 — Computer Vision by Deep Learning 2021 — Repository link.

  • Guru Deep Singh (5312558) — G.D.Singh@student.tudelft.nl
  • Kevin Luís Voogd (4682688) — K.L.Voogd@student.tudelft.nl

Introduction

In the paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [1], the authors propose a new general-purpose backbone structure for computer vision.

Modeling in computer vision has long been dominated by convolutional neural networks (CNNs). Currently, CNNs serve as backbone networks for a variety of vision tasks, such as object recognition and localization, instance segmentation, image classification, image generation, and style transfer. Architectural advances in CNNs, such as greater scale and denser connections, have led to performance improvements that have broadly lifted the entire field.

Are there alternatives to CNNs?

Yes. Let’s take a look at other fields. For example, in Natural Language Processing (NLP) the field has taken a different path: they use Transformers as their backbone architectures. The Transformer is notable for its use of attention to model long-range dependencies in the data.

The success of transformers on NLP tasks has inspired researchers to investigate the adaptation of Transformer structures to computer vision tasks.

Why isn’t it straightforward to adapt Transformers to computer vision tasks?

Problems with the usage of Transformers for vision tasks.

There are two main problems with the usage of Transformers for computer vision.

1. Existing Transformer-based models use tokens of a fixed scale. However, in contrast to word tokens, visual elements can differ substantially in scale (e.g. objects of varying sizes in the scene).

2. The computational complexity of self-attention is quadratic in the image size, which limits applications in computer vision where high resolution is necessary. For example, semantic segmentation requires dense prediction at the pixel level, and this would be intractable for a Transformer on high-resolution images.

If you don’t know what self-attention is, it can be summarized as computing the relationships between a token and all other tokens. Tokens are, for example, words or sub-words in a sentence, or image patches in vision Transformers.
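As a minimal illustration (our own sketch, not code from the paper or the repository we used), single-head scaled dot-product self-attention over a sequence of tokens can be written in a few lines of PyTorch. Note that the score matrix has one entry per pair of tokens, which is exactly where the quadratic cost mentioned above comes from.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention.

    x:             (num_tokens, dim) sequence of token features
    w_q, w_k, w_v: (dim, dim) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project tokens to queries/keys/values
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # pairwise token-to-token similarities
    weights = torch.softmax(scores, dim=-1)    # each token attends to all other tokens
    return weights @ v                         # weighted sum of value vectors

# Example: 49 tokens (a 7x7 window of patches) with 96-dimensional features
dim = 96
tokens = torch.randn(49, dim)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)    # shape: (49, 96)
```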

What are their contributions?

The authors propose a general-purpose Transformer backbone, called Swin Transformer, which constructs hierarchical feature maps and has computational complexity linear in the image size.

As shown in Figure 1, the Swin Transformer constructs a hierarchical representation by starting from small patches (outlined in gray) and gradually merging neighboring patches in deeper Transformer layers. The computational complexity is linear in the input image size because self-attention is computed only within each local window (shown in red).

Figure 1. Swin Transformer builds hierarchical feature maps by merging image patches (shown in gray) in deeper layers and has linear computation complexity to input image size due to computation of self-attention only within each local window (shown in red) [1].
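For reference, the paper [1] quantifies this difference. For a feature map of h × w patches with channel dimension C and window size M, the complexities of global multi-head self-attention (MSA) and window-based self-attention (W-MSA) are:

```latex
\Omega(\text{MSA})   = 4hwC^2 + 2(hw)^2 C
\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2 hw C
```

The first grows quadratically in the number of patches hw, while the second is linear in hw for a fixed window size M (the paper uses M = 7 by default).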

A key design element of Swin Transformer is its shift of the window partition between consecutive self-attention layers, as shown in Figure 2. The shifted windows bridge the windows of the preceding layer, providing connections among them that significantly enhance modeling power.

This strategy is also efficient with regard to real-world latency: all query patches within a window share the same key set, which facilitates memory access, in contrast to earlier sliding-window self-attention approaches, which use different key sets for different query pixels.

Figure 2. An illustration of the shifted window approach for computing self-attention in the proposed Swin Transformer architecture [1].
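To make the window-based attention concrete, here is a small sketch of our own (assuming a channels-last feature map whose height and width are divisible by the window size) of how a feature map can be split into non-overlapping windows before attention is applied within each one:

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    Returns a tensor of shape (B * num_windows, window_size * window_size, C),
    i.e. one token sequence per window, ready for self-attention.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size * window_size, C)

# Example: a 56x56 feature map with C = 96, split into 7x7 windows
feat = torch.randn(1, 56, 56, 96)
windows = window_partition(feat, window_size=7)
print(windows.shape)  # torch.Size([64, 49, 96]) -> 8x8 = 64 windows of 49 tokens each
```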

The architecture

In Figure 3, the tiny version of the Swin Transformer (Swin-T) is presented.

Figure 3. (a) The architecture of a Swin Transformer (Swin-T) [1].

1. The first step consists of splitting an input RGB image into non-overlapping patches. Each patch is treated as a “token” and its feature is the concatenation of the raw pixel RGB values. (In the paper, 4x4 patches are used, so the feature dimension is 4x4x3 = 48.) A shape sketch of this step and of patch merging is given after this walkthrough.

Stage 1

2. A linear embedding layer is applied to project these raw features to an arbitrary dimension C.

3. Several Swin Transformer blocks are applied to these patch tokens. The Transformer blocks maintain the number of tokens (H/4 x W/4).

In the next stages, the number of tokens is reduced by patch merging layers as the network gets deeper.

Stage 2

4. The first patch merging layer concatenates the features of each group of 2x2 neighboring patches and applies a linear layer to the 4C-dimensional concatenated features, reducing them to dimension 2C.

5. Swin Transformer blocks are then applied for feature transformation, with the resolution kept at H/8 x W/8.

Stage 3 and Stage 4

6. The procedure of Stage 2 is repeated twice, yielding resolutions of H/16 x W/16 in Stage 3 and H/32 x W/32 in Stage 4.

The stages jointly produce a hierarchical representation, with the same feature map resolution as conventional convolutional networks (e.g. VGG or ResNet).
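As a sanity check on the shape arithmetic in the walkthrough above, patch partition and patch merging can be expressed with simple reshapes and linear layers. This is a sketch of our own, not the repository code; the concatenation order of the 2x2 neighbors, in particular, is illustrative.

```python
import torch
import torch.nn as nn

H, W, C = 224, 224, 96

# Stage 1: split the RGB image into 4x4 patches and linearly embed them to dimension C.
image = torch.randn(1, 3, H, W)
patches = image.unfold(2, 4, 4).unfold(3, 4, 4)                 # (1, 3, H/4, W/4, 4, 4)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, (H // 4) * (W // 4), 48)  # 4*4*3 = 48
embed = nn.Linear(48, C)
x = embed(tokens)                                               # (1, 3136, 96): H/4 x W/4 tokens

# Stage 2: patch merging concatenates each 2x2 group of neighbors (4C features)
# and projects them with a linear layer (to 2C in the paper).
x = x.view(1, H // 4, W // 4, C)
merged = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                    x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (1, H/8, W/8, 4C)
reduce = nn.Linear(4 * C, 2 * C)
x = reduce(merged)                                              # (1, H/8, W/8, 2C)
print(x.shape)  # torch.Size([1, 28, 28, 192])
```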

As a result, the architecture can replace the backbone networks in existing methods for various vision tasks.

But… what’s in the Swin Transformer Block?

Figure 4. Two successive Swin Transformer Blocks [1].

The Swin Transformer replaces the standard multi-head self-attention (MSA) module of conventional Transformers with a module based on shifted windows, keeping the other layers the same: the attention module is followed by a 2-layer MLP with a GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and each MLP, and a residual connection is applied after each module. Two successive Swin Transformer blocks are illustrated in Figure 4.
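Schematically, the two residual sub-blocks look like the sketch below. This is our own pseudocode-level illustration: `nn.MultiheadAttention` over the tokens of a single window stands in for the (shifted-)window MSA module of the paper and is not the official implementation.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Structure of one Swin Transformer block:
    LN -> (S)W-MSA -> residual, then LN -> 2-layer MLP (GELU) -> residual."""

    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                  # x: (num_windows, tokens_per_window, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]      # attention within each window + residual
        x = x + self.mlp(self.norm2(x))    # MLP + residual
        return x

block = SwinBlockSketch(dim=96, num_heads=3)
out = block(torch.randn(64, 49, 96))       # 64 windows of 7x7 = 49 tokens each
print(out.shape)                           # torch.Size([64, 49, 96])
```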

Cyclic shift: a more efficient batch computation

The shifted window configuration can be computed efficiently in batch by cyclically shifting the feature map toward the top-left, as illustrated in Figure 5.

Figure 5. Cyclic shift [1].
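The cyclic shift itself can be implemented with a single `torch.roll` call. This is a sketch assuming a channels-last feature map; the paper additionally masks attention between regions that are not adjacent in the original image and rolls the result back afterwards.

```python
import torch

feat = torch.randn(1, 56, 56, 96)   # (B, H, W, C) feature map
shift = 3                           # typically window_size // 2, e.g. 7 // 2 = 3

# Cyclic shift toward the top-left before computing shifted-window attention
shifted = torch.roll(feat, shifts=(-shift, -shift), dims=(1, 2))

# ... window partition + masked attention on the shifted map ...

# Reverse the shift afterwards to restore the original spatial layout
restored = torch.roll(shifted, shifts=(shift, shift), dims=(1, 2))
assert torch.equal(restored, feat)
```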

What did we do?

That’s the big question, right?

We adopted the architecture implementation from the repository of @berniwal and adjusted it to our needs. The objective was to train the Swin-T Transformer (about 0.25x the size of the base Swin Transformer) on ImageNet-1K, as done in the official paper. In our case, the computing power was not sufficient to train models on such a large dataset, so we restricted the training set to 10 ImageNet classes with 1,000 images per class. In this experiment we evaluated how the Transformer's accuracy is affected by the number of training epochs, which ranged from 1 to 100.

The model uses the AdamW optimizer (initial learning rate: 0.001, weight decay: 0.05) with a cosine learning rate scheduler (warm-up: 20 epochs, decay: 30 epochs, learning rate decay rate: 0.001) and cross-entropy loss.
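One way to set this up in PyTorch is shown below. This is a sketch; the exact scheduler used in the repository may differ. Here we assume a linear warm-up of 20 epochs followed by cosine annealing over the remaining epochs, and the `model` is a hypothetical stand-in for the Swin-T network.

```python
import math
import torch

# Hypothetical placeholder model; any nn.Module with parameters works here.
model = torch.nn.Linear(48, 10)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

warmup_epochs, total_epochs = 20, 100

def lr_lambda(epoch):
    # Linear warm-up for the first 20 epochs, cosine decay afterwards.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop: call optimizer.step() per batch and scheduler.step() once per epoch.
```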

Before being fed to the model, the raw images are resized to 224x224 pixels. Our model has 27.49 million parameters, compared to the 29 million parameters of the official Swin-T Transformer.
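The corresponding preprocessing and a quick parameter count can be done with torchvision and a one-liner. Again a sketch: the dataset path is a hypothetical folder layout with one sub-directory per class.

```python
import torch
from torchvision import datasets, transforms

# Resize every image to 224x224 before feeding it to the model.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical folder layout: data/imagenet_subset/train/<class_name>/*.jpg
train_set = datasets.ImageFolder("data/imagenet_subset/train", transform=preprocess)

# Count trainable parameters (we obtain 27.49M for our Swin-T variant).
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```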

The trained models are available in the repository, along with the training and testing loss files.

Results

Below, we show the progression of the training and test losses with an increasing number of training epochs. We do not show all of the graphs we obtained, since many are repetitive; all of them are available in the repository.

Left: training loss for the model trained for 1 epoch. Right: testing loss of the model trained for 1 epoch.
Left: training loss for the model trained for 15 epochs. Right: testing loss of the model trained for 15 epochs.
Left: training loss for the model trained for 65 epochs. Right: testing loss of the model trained for 65 epochs.
Left: training loss for the model trained for 100 epochs. Right: testing loss of the model trained for 100 epochs.

There is a clear progression in the reduction of the training loss; however, it plateaus around the 30th epoch. Within the plateau, some spikes are visible. Our hypothesis is that, due to the stochasticity of the optimizer (AdamW), the model occasionally moves out of a local minimum to a worse point, which causes a higher loss, and then likely returns to the same or a similar local minimum.

The accuracies of the different models during the training and testing phases are shown in the next plot. There is a clear difference: the model overfits! After 20 epochs the gap becomes large, and afterwards the training and test accuracies differ by about 40 percentage points.

Example predictions:

Lastly, we show a visualization of the images used in this project, specifying the prediction made by one of our models and the ground truth. The categories are:

‘0’: Cassette Player, ‘1’: Chain Saw, ‘2’: Church, ‘3’: English Springer, ‘4’: French Horn, ‘5’: Garbage Truck, ‘6’: Gas Pump, ‘7’: Golf Ball, ‘8’: Parachute, ‘9’: Tench.

The title of each image shows the model's prediction together with the ground-truth label.

Discussion

First, we would like to point out that the Swin Transformer is a complex model with a large number of parameters (27.49 million), while we trained it on only a small subset of the ImageNet dataset (10 classes). With such a small dataset and such a complex model, we did not expect high performance.
The original ImageNet consists of 1,281,167 training images, of which we used only about 10,000, i.e. 0.78%. The authors of the paper reached a top-1 accuracy of 81% when training on the complete dataset, whereas we achieved an accuracy of about 63% using only 0.78% of the data.

However, it is also possible that some classes in ImageNet are hard to generalize to and bring down the accuracy of a model trained on the complete dataset. We may have been lucky with the selected classes, which might be easier for the model to learn, thus giving reasonable accuracy with such a small dataset.

We suggest further research into how the amount of training data relates to the observed model performance. This could be done by progressively training on more data and by applying data augmentation techniques.

Furthermore, we suggest that more experiments could be done by varying the number of attention heads and then performing the same analysis as above. A plethora of experiments could be set up, such as hyper-parameter optimization for the learning rate, batch size, etc. The scope of Transformers in vision is vast; however, further research is required for them to compete with state-of-the-art vision techniques.

References

[1] Z. Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021. Link: https://arxiv.org/abs/2103.14030.
