Back

multimodal

A model that processes multiple types of input data — usually vision (images/video from cameras) and language (task descriptions, goals, or human instructions).


Share: