vision transformer (ViT)
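A neural network architecture that splits an image into fixed-size patches and treats each patch like a token in a sentence, so that a transformer's self-attention can relate any patch to any other and capture spatial relationships across the whole image.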
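To make the "patches as tokens" idea concrete, here is a minimal NumPy sketch of the first step only: cutting an image into non-overlapping patches and flattening each into a vector. The function name `image_to_patch_tokens`, the 224x224 input, and the patch size of 16 are illustrative assumptions, not part of the definition; a full ViT would also apply a learned linear projection, add position embeddings, and feed the result to a transformer encoder.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per "token".
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Separate the patch-grid axes from the within-patch axes.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into a vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed example input: a 224x224 RGB image with 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_tokens(image, 16)
print(tokens.shape)
```

Under these assumed sizes the image becomes a sequence of 14 x 14 = 196 tokens, each of length 16 x 16 x 3 = 768, which is the kind of sequence a transformer would then process much as it would word embeddings.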