Vision Transformer (ViT)

Vision Transformer (ViT) splits an image into fixed-size patches, treats each patch as a token, and applies transformer self-attention across those tokens for classification and representation learning. It is widely used as the image encoder in multimodal stacks with CLIP and in segmentation systems like Segment Anything Model (SAM).
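
The patch step can be illustrated with a minimal sketch. It assumes a tiny single-channel image stored as a list of rows; the function name split_into_patches and the 4x4 toy image are hypothetical, chosen only for illustration. A real ViT would also project each patch to an embedding and add position information, which is omitted here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the grid patch by patch, left to right, then top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" (values are arbitrary, for illustration only).
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

# Four 2x2 patches, each flattened into a sequence element ("token").
patches = split_into_patches(image, patch_size=2)
```

Each flattened patch plays the role of one token in the sequence that the attention layers process.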

Related terms

CLIP

CLIP (Contrastive Language-Image Pretraining) maps text and images into a shared representation space, so that similarity between a caption and an image, or between two images, can be scored directly and used for retrieval. It powers capabilities such as Find Similar Designs and is often paired with Vision Transformer (ViT) style image encoders.
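
Retrieval in a shared embedding space can be sketched as a cosine-similarity ranking. The vectors below are made-up three-dimensional stand-ins, not real CLIP outputs, and the names rank_by_similarity, poster_a, poster_b, and poster_c are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumed toy embeddings; a real system would obtain these from text and image encoders.
text_embedding = [0.9, 0.1, 0.0]
image_embeddings = {
    "poster_a": [0.8, 0.2, 0.1],
    "poster_b": [0.0, 1.0, 0.0],
    "poster_c": [0.0, 0.0, 1.0],
}

ranking = rank_by_similarity(text_embedding, image_embeddings)
best_match = ranking[0][0]
```

Because text and images live in the same space, the same ranking function works whether the query is a text embedding or another image embedding.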

Image Segmentation

Image Segmentation partitions an image into labeled regions to isolate objects or areas for editing. It is core to Segment Anything Model (SAM) workflows and precision operations like Generative Fill.
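
As a rough illustration of how a labeled region isolates an area for editing, the sketch below turns a label map into a binary mask and overwrites only the masked pixels. The label map, pixel values, and the helper names mask_for_label and fill_masked are hypothetical; this is a simple stand-in for an edit, not how SAM or Generative Fill work internally.

```python
def mask_for_label(label_map, label):
    """Binary mask that is True wherever the label map equals the given label."""
    return [[cell == label for cell in row] for row in label_map]


def fill_masked(image, mask, fill_value):
    """Return a copy of the image with masked pixels replaced by fill_value."""
    return [
        [fill_value if masked else pixel for pixel, masked in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Assumed toy segmentation: each cell holds the region label of that pixel.
label_map = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 1],
]

# Assumed toy pixel values for the same 3x3 image.
image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]

mask = mask_for_label(label_map, 1)      # select region 1
region_area = sum(sum(row) for row in mask)  # number of pixels in the region
edited = fill_masked(image, mask, 0)     # edit only the selected region
```

Pixels outside the selected region are left untouched, which is what makes a segmentation mask useful for precise, localized edits.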
