Vision Transformer (ViT)

Vision Transformer (ViT) splits an image into fixed-size patches, embeds each patch as a token, and applies transformer self-attention over the resulting sequence for image classification and representation learning. It commonly serves as the image encoder in multimodal models such as CLIP and in segmentation systems such as the Segment Anything Model (SAM).
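
The patch step can be illustrated with a minimal sketch in plain Python. It only shows how an image becomes a sequence of flattened patch vectors; the learned linear projection, position embeddings, and the transformer encoder that a real ViT applies afterwards are left out. The function name `patchify`, the nested-list image layout, and the toy 4 x 4 input are assumptions made for illustration, not part of any library.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Returns a list of (H // P) * (W // P) vectors, each of length P * P * C,
    in row-major patch order. Hypothetical helper for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Flatten the pixels of one patch, row by row, channel values last.
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])
            patches.append(patch)
    return patches


# Toy 4 x 4 single-channel image; each pixel value encodes its position.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), len(tokens[0]))
```

With a patch size of 2, the 4 x 4 toy image yields four tokens of four values each. Under the same scheme, a 224 x 224 RGB image with 16 x 16 patches would yield 196 tokens of 768 values each.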

Related terms
