Definition

A Vision-Language Model (VLM) is a multimodal neural network that processes both images and text. Models like GPT-4V, LLaVA, and Gemini can describe scenes, answer questions about images, and ground language concepts in visual observations. VLMs understand the world but cannot act in it — they produce text, not motor commands.

A Vision-Language-Action model (VLA) extends this capability to physical action. A VLA takes camera images and a natural language instruction (e.g., "pick up the red cup and place it on the tray") and directly outputs robot actions — joint positions, end-effector velocities, or gripper commands. This closes the loop from perception and language understanding to physical execution, enabling robots to follow open-ended instructions without task-specific programming.

The distinction matters because VLMs are useful as planners, scene describers, and reward labelers, but they cannot control a robot in real time. VLAs can. The two are complementary: a VLM might decompose a complex instruction into subtasks, while a VLA executes each subtask. Together, they represent the foundation model approach to general-purpose robot intelligence.

How VLAs Work

A typical VLA architecture has three components: a vision encoder (ViT or SigLIP) that converts camera images into visual tokens, a language model backbone (LLaMA, PaLM, Gemma) that processes language instructions and visual tokens jointly, and an action head that decodes the model's hidden states into robot action vectors.
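As a minimal sketch, the three components can be wired together as follows. All module sizes, the patch-embedding stand-in for a ViT/SigLIP encoder, and the 7-DoF action space are illustrative assumptions, not taken from any published model:

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Minimal sketch of the three-component VLA layout (sizes illustrative)."""

    def __init__(self, d_model=512, vocab_size=32_000, n_action_dims=7):
        super().__init__()
        # 1. Vision encoder: a 16x16 patch embedding stands in for a ViT/SigLIP.
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # 2. Language-model backbone: stands in for LLaMA/PaLM/Gemma layers,
        #    processing visual and text tokens jointly.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # 3. Action head: decodes hidden states into a robot action vector.
        self.action_head = nn.Linear(d_model, n_action_dims)

    def forward(self, image, instruction_ids):
        # Patchify the image into visual tokens: (B, d, H/16, W/16) -> (B, N, d).
        vis_tokens = self.vision_encoder(image).flatten(2).transpose(1, 2)
        txt_tokens = self.text_embed(instruction_ids)
        # Joint processing of visual + language tokens.
        h = self.backbone(torch.cat([vis_tokens, txt_tokens], dim=1))
        # Decode the final hidden state into an action
        # (e.g. 6-DoF end-effector delta + gripper command).
        return self.action_head(h[:, -1])

policy = TinyVLA()
action = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 32_000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

Real VLAs differ mainly in scale (billions of parameters, pre-trained weights) and in the action head, but the image-tokens-plus-text-tokens-into-one-backbone layout is the common pattern.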

During pre-training, the vision encoder and language model are trained on internet-scale image-text data, giving the model broad visual and semantic understanding. During robot fine-tuning, the action head is added and the entire model is trained on robot demonstration data: (image, instruction, action) triplets collected via teleoperation. The language model's weights are updated to produce representations that are useful for action prediction, not just text generation.
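A single robot fine-tuning step on (image, instruction, action) triplets is essentially behavior cloning. A minimal sketch, with a stand-in linear policy in place of the full VLA and illustrative feature and batch sizes:

```python
import torch
import torch.nn as nn

# Stand-in policy: in practice this is the full VLA (vision encoder + LM backbone
# + action head); a single linear layer keeps the sketch self-contained.
policy = nn.Linear(128, 7)            # fused observation feature -> 7-DoF action
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

# A batch of teleoperated triplets, with image + instruction already encoded
# into a fused feature vector (shapes are illustrative).
obs_features = torch.randn(32, 128)
expert_actions = torch.randn(32, 7)   # action labels recorded during teleoperation

# Behavior-cloning step: regress the expert action. With a continuous action
# head this is an MSE (or diffusion/flow) objective; with tokenized actions it
# becomes cross-entropy over action tokens instead.
pred = policy(obs_features)
loss = nn.functional.mse_loss(pred, expert_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item() >= 0.0)  # True
```

Because the loss is backpropagated through the whole model rather than just the action head, the language model's representations shift toward action-relevant features, as described above.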

Action tokenization is a key design choice. RT-2 discretizes each action dimension into 256 bins and treats the bin indices as text tokens, so actions are generated the same way as language. OpenVLA uses a similar approach. More recent models like π0 instead use continuous action heads with diffusion- or flow-based decoders, which better capture the multi-peaked distribution of valid manipulation actions (many distinct grasps can solve the same task, and averaging between them fails).
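The 256-bin discretization can be sketched as follows. The per-dimension bounds, the 7-DoF action, and the function names are illustrative assumptions; published models pick bounds from dataset statistics:

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Discretize each continuous action dimension into one of n_bins tokens."""
    clipped = np.clip(action, low, high)
    # Normalize [low, high] -> [0, 1], then map to a bin index in [0, n_bins - 1].
    norm = (clipped - low) / (high - low)
    return np.minimum((norm * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=256):
    """Map each token back to the center of its bin."""
    return low + (tokens + 0.5) / n_bins * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)   # per-dimension bounds (illustrative)
a = np.array([0.3, -0.7, 0.0, 0.1, -0.1, 0.9, 1.0])
tok = tokenize_action(a, low, high)             # 7 integer tokens in [0, 255]
recovered = detokenize_action(tok, low, high)
# Round-trip quantization error is at most half a bin width: (2 / 256) / 2 ≈ 0.004.
print(np.abs(recovered - a).max() <= 0.004)  # True
```

The cost of this scheme is exactly that quantization error, and the benefit is that action prediction reduces to ordinary next-token prediction in the language model's existing vocabulary.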

Key Models

  • RT-2 (Google DeepMind, 2023) — VLA fine-tuned from large VLMs, released in a 55B-parameter PaLI-X variant and a 12B-parameter PaLM-E variant. Demonstrated that large VLMs can be fine-tuned to output robot actions. Showed emergent generalization to novel objects and instructions not seen during robot training. Not open-source.
  • OpenVLA (Stanford/Berkeley, 2024) — 7B-parameter open-source VLA based on Prismatic VLM + LLaMA 2. Trained on 970K demonstrations from the Open X-Embodiment dataset. The first widely accessible VLA that researchers can fine-tune on their own data.
  • π0 (Physical Intelligence, 2024) — 3B-parameter VLA with a flow-matching action decoder. Designed for dexterous manipulation with high-frequency control. Demonstrates strong performance on bimanual and contact-rich tasks.
  • Octo (Berkeley, 2024) — A generalist policy trained on 800K episodes from Open X-Embodiment. Uses a transformer backbone with a diffusion action head. Supports both language and goal-image conditioning. Designed as a base model for fine-tuning to specific robots.

VLA vs Task-Specific Policies

VLAs accept language instructions and can generalize across many tasks without retraining. They require large-scale pre-training (internet data + cross-embodiment robot data) and significant compute. Inference is slower (5–15 Hz) due to model size. Best for multi-task deployments where flexibility matters more than reaction speed.

Task-specific policies like ACT or Diffusion Policy are trained on data from a single task. They are smaller, faster (50–200 Hz), and often achieve higher success rates on their target task. But they cannot generalize to new instructions or objects without retraining.

The practical choice depends on your deployment: if your robot performs one or two tasks in a structured environment, a task-specific policy trained on 50–200 demonstrations is faster to develop and more reliable. If your robot must handle diverse, language-described tasks, a VLA fine-tuned on your specific embodiment is the better investment.

Data Requirements

Fine-tuning an existing VLA (e.g., OpenVLA or Octo) to a new robot embodiment typically requires 100–1,000 teleoperation demonstrations covering the target tasks. Language annotations can be added post-hoc. Fine-tuning takes 12–48 hours on 4–8 A100 GPUs.

Training a VLA from scratch requires 100K+ robot demonstrations across multiple embodiments and tasks, plus internet-scale image-text pre-training data. This is currently only feasible for well-funded labs (Google, Physical Intelligence, Berkeley). The Open X-Embodiment dataset (970K episodes, 22 robot types) was created specifically to enable this.

Language annotations must describe the task at an appropriate level of detail. Overly vague labels ("do the task") hurt generalization, while overly specific labels ("move joint 3 by 0.2 radians") defeat the purpose of language conditioning. Task-level descriptions ("pick up the blue block and stack it on the red block") work best.
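One way to enforce this granularity at data-collection time is a simple screening filter over annotations. The heuristics, thresholds, and function name below are assumptions for illustration, not part of any VLA toolkit:

```python
VAGUE_LABELS = {"do the task", "complete the task", "do it"}

def annotation_ok(instruction: str) -> bool:
    """Reject labels that are too vague or too low-level for language conditioning."""
    text = instruction.lower().strip()
    # Too vague: no semantic signal for the model to generalize from.
    if text in VAGUE_LABELS or len(text.split()) < 3:
        return False
    # Too low-level: joint-space commands bypass language conditioning entirely.
    if "joint" in text or "radian" in text:
        return False
    return True

print(annotation_ok("pick up the blue block and stack it on the red block"))  # True
print(annotation_ok("do the task"))                                           # False
print(annotation_ok("move joint 3 by 0.2 radians"))                           # False
```

In practice such filters only catch the obvious failure modes; a human pass over a sample of annotations is still the reliable check.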

Key Papers

  • Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. Demonstrated that a 55B VLM fine-tuned on robot data can follow novel language instructions and generalize to unseen objects.
  • Kim, M. J. et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL 2024. The first open-source 7B VLA, enabling researchers to fine-tune and evaluate VLAs on their own hardware and tasks.
  • Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. Aggregated 970K robot episodes across 22 embodiments, establishing the data foundation for cross-embodiment VLA training.

Related Terms

  • Foundation Model — The broader category of large pre-trained models adapted to downstream tasks
  • Policy Learning — The general framework for training observation-to-action mappings
  • Action Chunking (ACT) — A task-specific policy alternative to VLAs
  • Diffusion Policy — Denoising-based action generation, used as action heads in some VLAs
  • Teleoperation — How demonstration data for VLA training is collected

Fine-Tune VLAs at SVRC

Silicon Valley Robotics Center provides GPU clusters for VLA fine-tuning, teleoperation rigs for collecting language-annotated demonstrations, and expert guidance on choosing between VLA and task-specific approaches for your application. Our data platform manages datasets in Open X-Embodiment and LeRobot formats.

Explore Data Services   Contact Us