Vision-Language-Action Models for Embodied Intelligence: A Structured Taxonomy, Critical Analysis, and Future Research Directions
DOI: https://doi.org/10.66279/292sm294
Keywords: Vision-language-action models, Embodied AI, Transformer policies, Diffusion models, Lifelong learning
Abstract
Vision-Language-Action (VLA) models have emerged as a transformative paradigm in Embodied Artificial Intelligence by unifying visual perception, linguistic reasoning, and physical control within a single cohesive computational framework. By leveraging the semantic reasoning capabilities of large pre-trained Vision-Language Models (VLMs), VLA architectures promise to transition robotic systems from specialized, single-task agents to generalist robots capable of following natural language instructions in unstructured environments. This work provides a comprehensive review of the rapidly evolving VLA landscape, offering a structured taxonomy of state-of-the-art architectures ranging from unified transformer-based policies such as RT-2 and OpenVLA to emerging diffusion-based action generation methods. Key technical innovations driving the field are critically analyzed, including the integration of autoregressive world models for predictive planning, the adoption of discrete diffusion for high-fidelity action tokenization, and the development of efficient training-free acceleration techniques for edge deployment. Furthermore, this work synthesizes critical challenges hindering widespread adoption, such as open-world generalization, long-horizon task decomposition, and the assurance of safety in neuro-symbolic control loops, while presenting concrete solution strategies for each. By outlining promising future research directions, including hierarchical planning, multi-embodiment fusion, and self-supervised lifelong learning, this review aims to serve as a roadmap toward general-purpose embodied intelligence.
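To make the action-tokenization idea referenced in the abstract concrete, the following minimal sketch illustrates how RT-2- and OpenVLA-style policies discretize continuous robot actions into per-dimension bins so that a language model can emit them as ordinary tokens. The 256-bin convention follows those papers' descriptions; the action ranges, dimension layout, and function names below are illustrative assumptions, not taken from any released implementation.

    # Minimal sketch (illustrative, not from the surveyed papers):
    # per-dimension uniform binning of continuous robot actions,
    # as used by RT-2 / OpenVLA-style autoregressive VLA policies.
    import numpy as np

    NUM_BINS = 256  # assumption: 256 bins per action dimension

    def tokenize_action(action, low, high, num_bins=NUM_BINS):
        """Map a continuous action vector to one discrete token id per dimension."""
        action = np.clip(action, low, high)
        norm = (action - low) / (high - low)          # normalize each dim to [0, 1]
        return np.round(norm * (num_bins - 1)).astype(int)

    def detokenize_action(tokens, low, high, num_bins=NUM_BINS):
        """Recover a continuous action (bin centers) from token ids."""
        norm = tokens.astype(float) / (num_bins - 1)
        return low + norm * (high - low)

    if __name__ == "__main__":
        # Hypothetical 7-DoF action: xyz delta, rpy delta, gripper open/close.
        low = np.array([-0.05] * 6 + [0.0])
        high = np.array([0.05] * 6 + [1.0])
        action = np.array([0.01, -0.02, 0.03, 0.0, 0.005, -0.04, 1.0])
        tokens = tokenize_action(action, low, high)
        print(tokens, detokenize_action(tokens, low, high))

In this scheme the policy's output vocabulary simply gains a small block of action tokens, so the same autoregressive decoder that produces text can also produce low-level control commands; diffusion-based approaches discussed in the review replace this discretization with iterative denoising over continuous or discrete action representations.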
References
[1] Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv preprint arXiv:2405.14093, 2024.
[2] R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee, “Vision-language-action models: Concepts, progress, applications and challenges,” arXiv preprint arXiv:2505.04769, 2025.
[3] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.
[4] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., “Openvla: An open source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.
[5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022. DOI: https://doi.org/10.15607/RSS.2023.XIX.025
[6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
[7] C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, et al., “Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” arXiv preprint arXiv:2410.06158, 2024.
[8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[9] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[10] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.01100
[11] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang, “Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation,” Advances in Neural Information Processing Systems, vol. 37, pp. 40085–40110, 2024. DOI: https://doi.org/10.52202/079017-1266
[12] O. Mees, D. Ghosh, K. Pertsch, K. Black, H. R. Walke, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, et al., “Octo: An open-source generalist robot policy,” in First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
[13] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al., “Gr00t n1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025.
[14] W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, F. Lu, H. Wang, et al., “Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge,” arXiv preprint arXiv:2507.04447, 2025.
[15] L. Li, J. Fan, X. Ni, S. Qin, W. Li, and F. Gao, “Sva: Towards speech-enabled vision-language-action model,” Pattern Recognition, p. 112915, 2025. DOI: https://doi.org/10.1016/j.patcog.2025.112915
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186, 2019. DOI: https://doi.org/10.18653/v1/N19-1423
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
[19] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892–6903, IEEE, 2024.
[20] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al., “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning, pp. 1723–1736, PMLR, 2023.
[21] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630–2640, 2019. DOI: https://doi.org/10.1109/ICCV.2019.00272
[22] N. Lin and M. Cai, “Epic-kitchens-100 unsupervised domain adaptation challenge for action recognition 2022: Team hnu-fpv technical report,” arXiv preprint arXiv:2207.03095, 2022.
[23] Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al., “Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,” arXiv preprint arXiv:2411.19650, 2024.
[24] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” arXiv preprint arXiv:2403.09631, 2024.
[25] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025. DOI: https://doi.org/10.1177/02783649241273668
[26] S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto, “Behavior generation with latent actions,” arXiv preprint arXiv:2403.03181, 2024.
[27] K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauzá, T. Davchev, Y. Zhou, A. Gupta, A. Raju, et al., “Robocat: A self-improving generalist agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023.
[28] Z. Dong, Y. Liu, S. Zhang, B. Ye, Y. Yuan, F. Ni, J. Gong, X. Qiu, H. Zhao, Y. Li, et al., “Actioncodec: What makes for good action tokenizers,” arXiv preprint arXiv:2602.15397, 2026.
[29] Z. Liang, Y. Li, T. Yang, C. Wu, S. Mao, T. Nian, L. Pei, S. Zhou, X. Yang, J. Pang, et al., “Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies,” arXiv preprint arXiv:2508.20072, 2025.
[30] Y. Yang, Y. Wang, Z. Wen, L. Zhongwei, C. Zou, Z. Zhang, C. Wen, and L. Zhang, “Efficientvla: Training-free acceleration and compression for vision-language-action models,” arXiv preprint arXiv:2506.10100, 2025.
[31] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024. DOI: https://doi.org/10.15607/RSS.2025.XXI.010
[32] Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma, “Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning,” arXiv preprint arXiv:2506.13757, 2025.
[33] P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y. Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al., “Humanoid-vla: Towards universal humanoid control with visual integration,” arXiv preprint arXiv:2502.14795, 2025.
[34] S. Poria, N. Majumder, C.-Y. Hung, A. A. Bagherzadeh, C. Li, K. Kwok, Z. Wang, C. Tan, J. Wu, and D. Hsu, “10 open challenges steering the future of vision-language-action models,” arXiv preprint arXiv:2511.05936, 2025. DOI: https://doi.org/10.1609/aaai.v40i46.41333
[35] Y. Fan, P. Ding, S. Bai, X. Tong, Y. Zhu, H. Lu, F. Dai, W. Zhao, Y. Liu, S. Huang, et al., “Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation,” arXiv preprint arXiv:2508.19958, 2025.
[36] B. Zhang, Y. Zhang, J. Ji, Y. Lei, J. Dai, Y. Chen, and Y. Yang, “Safevla: Towards safety alignment of vision-language-action model via constrained learning,” arXiv preprint arXiv:2503.03480, 2025.
[37] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025. DOI: https://doi.org/10.15607/RSS.2025.XXI.012
[38] Y. Wang, H. Zhu, M. Liu, J. Yang, H.-S. Fang, and T. He, “Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers,” arXiv preprint arXiv:2507.01016, 2025.
[39] L. F. Moreno Fuentes, M. Haris Khan, M. Altamirano Cabrera, V. Serpiva, D. Iarchuk, Y. Mahmoud, I. Tokmurziyev, and D. Tsetserukou, “Vlh: Vision-language-haptics foundation model,” arXiv e-prints, pp. arXiv–2508, 2025.
[40] R. Fan, M. Sun, and G. Giakos, “Toward the next frontier of embodied ai,” 2025. DOI: https://doi.org/10.20517/ir.2025.44
[41] X. Han, S. Chen, Z. Fu, Z. Feng, L. Fan, D. An, C. Wang, L. Guo, W. Meng, X. Zhang, et al., “Multimodal fusion and vision-language models: A survey for robot vision,” Information Fusion, p. 103652, 2025. DOI: https://doi.org/10.2139/ssrn.5206098
[42] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al., “X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model,” arXiv preprint arXiv:2510.10274, 2025.