
vision transformer (ViT)

A neural network architecture that splits an image into fixed-size patches and treats them like tokens in a sentence, using self-attention to model relationships between patches anywhere in the image.
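The "patches as tokens" idea can be illustrated with a minimal sketch. It assumes the image is a nested list of shape height x width x channels and that both dimensions divide evenly by the patch size. The function name `image_to_patch_tokens` is hypothetical and chosen for illustration. Practical implementations typically use a tensor library and follow this step with a learned linear projection and position embeddings, which are omitted here.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an image into non-overlapping square patches, one flat token each.

    `image` is a nested list indexed as image[row][col][channel].
    Returns a list of tokens in row-major patch order; each token is a flat
    list of patch_size * patch_size * channels values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows first, then columns, then channels.
            token = [
                value
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
                for value in image[r][c]
            ]
            tokens.append(token)
    return tokens


# A 4x4 single-channel image whose pixel values are 0..15 in reading order.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, patch_size=2)
print(len(tokens))  # 4 patches
print(tokens[0])    # [0, 1, 4, 5], the top-left 2x2 patch
```

The resulting list plays the role that a sequence of word tokens plays in a language model: each entry is one position in the sequence that the transformer attends over.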
