vision-language encoders

Encoders that fuse visual data with textual or symbolic context into a shared representation. For example, the instruction “pick up the red mug” and the robot's camera image are encoded into a joint embedding that aligns what the robot sees (the red mug) with what it understands (the words “red” and “mug”).
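
As a rough illustration of what "aligning" means in a joint embedding space, the sketch below scores a text vector against several image-region vectors by cosine similarity and picks the closest one. Everything here is an assumption made for illustration: the three-dimensional vectors are hand-written rather than produced by a real encoder, and the names `region_embeddings`, `text_embedding`, and `best_match` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings in a shared 3-d space. The axes are assumed to
# mean [redness, mug-likeness, blueness]; a real encoder learns its own
# dimensions and they are not interpretable like this.
region_embeddings = {
    "red mug": [0.9, 0.8, 0.0],
    "blue mug": [0.0, 0.8, 0.9],
    "red apple": [0.9, 0.1, 0.0],
}

# Stand-in for the encoded instruction "pick up the red mug".
text_embedding = [1.0, 1.0, 0.0]


def best_match(text_vec, regions):
    """Return the name of the region whose embedding is closest to the text."""
    return max(regions, key=lambda name: cosine(text_vec, regions[name]))


match = best_match(text_embedding, region_embeddings)
print(match)
```

With these toy vectors the "red mug" region scores highest, because it is the only one that is strong on both the "red" and the "mug" directions of the text vector. A trained vision-language encoder would produce such vectors from pixels and tokens rather than by hand.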