vision-language encoders
Encoders that fuse visual data with textual or symbolic context into a shared representation. For example, given the instruction “pick up the red mug” and a camera image, the encoder produces a joint embedding that aligns what the robot sees (the red mug) with what it understands (the words “red” and “mug”).
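The alignment idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 4-dimensional space, the hand-picked word vectors, the region features, and the function names (`encode_text`, `ground`) are hypothetical. A real encoder would learn these representations from data rather than use a lookup table.

```python
import math

# Hypothetical shared embedding space with axes [red, blue, mug, plate].
# Real encoders learn this space; these vectors are hand-picked.
WORD_VECTORS = {
    "red": [1.0, 0.0, 0.0, 0.0],
    "blue": [0.0, 1.0, 0.0, 0.0],
    "mug": [0.0, 0.0, 1.0, 0.0],
    "plate": [0.0, 0.0, 0.0, 1.0],
}

# Hypothetical visual features for three image regions, assumed to be
# already projected into the same space as the word vectors.
REGION_FEATURES = {
    "red mug": [0.9, 0.1, 0.8, 0.0],
    "blue mug": [0.1, 0.9, 0.8, 0.0],
    "red plate": [0.9, 0.1, 0.1, 0.9],
}


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def encode_text(instruction):
    """Sum the vectors of known words, ignoring the rest, then normalize."""
    total = [0.0, 0.0, 0.0, 0.0]
    for word in instruction.lower().split():
        vec = WORD_VECTORS.get(word)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return normalize(total)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def ground(instruction, regions):
    """Return the region whose features best match the instruction."""
    text = encode_text(instruction)
    scores = {name: cosine(text, feats) for name, feats in regions.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = ground("pick up the red mug", REGION_FEATURES)
print(best)
```

In this sketch, the region labeled "red mug" scores highest because it matches the instruction on both the color and the object axes, while "blue mug" and "red plate" each match on only one.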