Vision Transformer & Swin Transformer

Vision Transformer (ViT)

From-scratch PyTorch implementation of “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (Dosovitskiy et al., 2020). Implements patch embedding, multi-head self-attention, and a classification head.
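
To make the patch embedding step concrete, here is a minimal, framework-free sketch of how an image becomes a token sequence. It is not the repository's code: the function names `patchify` and `vit_sequence_shape` are hypothetical, and the defaults (224x224 RGB input, 16x16 patches) are assumed from the ViT-Base setup described in the paper. The learned linear projection and position embeddings are left out.

```python
def patchify(image, patch_size):
    """Split an H x W grid (a list of rows) into flattened patches.

    Patches are returned in row-major order, matching the order in
    which ViT reads them as a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


def vit_sequence_shape(image_size=224, patch_size=16, channels=3):
    """Return (sequence_length, flattened_patch_dim) for a square image.

    Assumed defaults follow the paper's 224x224, 16x16-patch setting.
    """
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    # One extra token for the learnable class token prepended to the patches.
    return num_patches + 1, patch_dim


if __name__ == "__main__":
    toy = [[4 * r + c for c in range(4)] for r in range(4)]
    print(patchify(toy, 2))
    print(vit_sequence_shape())
```

With the assumed defaults, a 224x224 image yields 14 x 14 = 196 patches plus one class token, and each patch flattens to 16 x 16 x 3 = 768 values before the linear projection.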

Links: GitHub

Swin Transformer

From-scratch implementation of “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” (Liu et al., 2021). Implements shifted window attention, patch merging, and hierarchical feature maps.
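
The three pieces named above can be illustrated with a small, framework-free sketch. It is not the repository's code: `swin_stage_shapes`, `cyclic_shift`, and `window_partition` are hypothetical names, and the defaults (224x224 input, 4x4 patches, 96 channels, four stages) are assumed from the Swin-T configuration in the paper. The attention mask that the paper applies inside shifted windows is omitted here.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map at each stage.

    Each patch-merging step groups 2x2 neighbouring tokens, so the
    resolution halves per side and the channel count doubles.
    """
    side, channels = image_size // patch_size, embed_dim
    shapes = [(side, side, channels)]
    for _ in range(num_stages - 1):
        side, channels = side // 2, channels * 2
        shapes.append((side, side, channels))
    return shapes


def cyclic_shift(grid, shift):
    """Roll a 2D grid toward the top-left by `shift` rows and columns.

    Entries pushed off the top or left edge wrap around to the
    bottom or right edge.
    """
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % height][(c + shift) % width] for c in range(width)]
        for r in range(height)
    ]


def window_partition(grid, window_size):
    """Split a 2D grid into non-overlapping, flattened square windows."""
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid sides must be divisible by window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([
                grid[r][c]
                for r in range(top, top + window_size)
                for c in range(left, left + window_size)
            ])
    return windows


if __name__ == "__main__":
    print(swin_stage_shapes())
    tokens = [[4 * r + c for c in range(4)] for r in range(4)]
    window = 2
    # Regular windows, then windows after shifting by half the window size.
    print(window_partition(tokens, window))
    print(window_partition(cyclic_shift(tokens, window // 2), window))
```

In the toy 4x4 example, the shifted partition places tokens from different regular windows into the same window, which is how information crosses window boundaries between consecutive blocks. The paper uses a window size of 7 with a shift of 3; the 2x2 windows here are only for readability.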

Links: GitHub