Vision–language–action model

In robot learning, vision–language–action models (VLAs) are a class of multimodal foundation models that integrate vision, language, and action. Given an input image (or video) of the robot's surroundings and a text instruction, a VLA directly outputs low-level robot actions that can be executed to accomplish the requested task.
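
The input/output contract described above can be illustrated with a minimal Python sketch. It is not taken from any particular VLA implementation: the names `Action`, `VLAPolicy`, `DummyPolicy`, `run_episode` and `blank_image` are hypothetical, the action format (an end-effector displacement plus a gripper flag) is an assumption, and the "model" is a keyword rule standing in for a trained network.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple

# Assumed image representation: height x width x 3 nested lists of RGB ints.
Image = List[List[List[int]]]


@dataclass(frozen=True)
class Action:
    """Hypothetical low-level command for a robot arm."""
    delta_xyz: Tuple[float, float, float]  # end-effector displacement in metres
    gripper_closed: bool


class VLAPolicy(Protocol):
    """Interface assumed here: (image, instruction) -> one low-level action."""
    def predict(self, image: Image, instruction: str) -> Action: ...


class DummyPolicy:
    """Stand-in for a trained model; it ignores the pixels entirely."""

    def predict(self, image: Image, instruction: str) -> Action:
        # Toy rule: close the gripper and move down if asked to pick something.
        pick = "pick" in instruction.lower()
        return Action(delta_xyz=(0.0, 0.0, -0.01 if pick else 0.0),
                      gripper_closed=pick)


def run_episode(policy: VLAPolicy, get_image: Callable[[], Image],
                instruction: str, steps: int) -> List[Action]:
    """Query the policy once per control step with a fresh observation."""
    actions = []
    for _ in range(steps):
        image = get_image()  # newest camera frame
        actions.append(policy.predict(image, instruction))
    return actions


def blank_image(height: int = 2, width: int = 2) -> Image:
    """Placeholder camera frame filled with black pixels."""
    return [[[0, 0, 0] for _ in range(width)] for _ in range(height)]


if __name__ == "__main__":
    for action in run_episode(DummyPolicy(), blank_image, "Pick up the cup", 3):
        print(action)
```

In a real system the `predict` call would run a learned model on the camera frame and the instruction, and each returned action would be sent to the robot controller before the next frame is captured.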

Source: Wikipedia — Vision–language–action model (CC BY-SA 4.0)
