Vision–language model

A vision–language model (VLM) is a type of artificial intelligence system that jointly processes images and text: it can interpret visual and textual inputs together and generate output that draws on both. VLMs extend the capabilities of large language models (LLMs), which are limited to text, and are an application of multimodal learning.