Vision-Language-Action Models for Embodied Intelligence: A Structured Taxonomy, Critical Analysis, and Future Research Directions
DOI: https://doi.org/10.66279/292sm294
Keywords: Vision-language-action models, Embodied AI, Transformer policies, Diffusion models, Lifelong learning
Abstract
Vision-Language-Action (VLA) models have emerged as a transformative paradigm in Embodied Artificial Intelligence by unifying visual perception, linguistic reasoning, and physical control within a single cohesive computational framework. By leveraging the semantic reasoning capabilities of large pre-trained Vision-Language Models (VLMs), VLA architectures promise to transition robotic systems from specialized, single-task agents to generalist robots capable of following natural language instructions in unstructured environments. This work provides a comprehensive review of the rapidly evolving VLA landscape, offering a structured taxonomy of state-of-the-art architectures ranging from unified transformer-based policies such as RT-2 and OpenVLA to emerging diffusion-based action generation methods. Key technical innovations driving the field are critically analyzed, including the integration of autoregressive world models for predictive planning, the adoption of discrete diffusion for high-fidelity action tokenization, and the development of efficient training-free acceleration techniques for edge deployment. Furthermore, this work synthesizes critical challenges hindering widespread adoption, such as open-world generalization, long-horizon task decomposition, and the assurance of safety in neuro-symbolic control loops, while presenting concrete solution strategies for each. By outlining promising future research directions, including hierarchical planning, multi-embodiment fusion, and self-supervised lifelong learning, this review aims to serve as a roadmap toward general-purpose embodied intelligence.
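To make the action-tokenization idea referenced in the abstract concrete, the following minimal sketch illustrates how RT-2- and OpenVLA-style policies discretize continuous robot actions into per-dimension bins so that a language model can emit them as ordinary tokens. The 256-bin convention follows those papers' descriptions; the action ranges, dimension layout, and function names below are illustrative assumptions, not taken from any released implementation.

    # Minimal sketch (illustrative, not from the surveyed papers):
    # per-dimension uniform binning of continuous robot actions,
    # as used by RT-2 / OpenVLA-style autoregressive VLA policies.
    import numpy as np

    NUM_BINS = 256  # assumption: 256 bins per action dimension

    def tokenize_action(action, low, high, num_bins=NUM_BINS):
        """Map a continuous action vector to one discrete token id per dimension."""
        action = np.clip(action, low, high)
        norm = (action - low) / (high - low)          # normalize each dim to [0, 1]
        return np.round(norm * (num_bins - 1)).astype(int)

    def detokenize_action(tokens, low, high, num_bins=NUM_BINS):
        """Recover a continuous action (bin centers) from token ids."""
        norm = tokens.astype(float) / (num_bins - 1)
        return low + norm * (high - low)

    if __name__ == "__main__":
        # Hypothetical 7-DoF action: xyz delta, rpy delta, gripper open/close.
        low = np.array([-0.05] * 6 + [0.0])
        high = np.array([0.05] * 6 + [1.0])
        action = np.array([0.01, -0.02, 0.03, 0.0, 0.005, -0.04, 1.0])
        tokens = tokenize_action(action, low, high)
        print(tokens, detokenize_action(tokens, low, high))

In this scheme the policy's output vocabulary simply gains a small block of action tokens, so the same autoregressive decoder that produces text can also produce low-level control commands; diffusion-based approaches discussed in the review replace this discretization with iterative denoising over continuous or discrete action representations.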
References
[1] Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv preprint arXiv:2405.14093, 2024.
[2] R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee, “Vision-language-action models: Concepts, progress, applications and challenges,” arXiv preprint arXiv:2505.04769, 2025.
[3] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.
[4] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al., “Openvla: An open source vision-language-action model,” arXiv preprint arXiv:2406.09246, 2024.
[5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022. DOI: https://doi.org/10.15607/RSS.2023.XIX.025
[6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
[7] C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, et al., “Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” arXiv preprint arXiv:2410.06158, 2024.
[8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[9] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
[10] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986, 2023. DOI: https://doi.org/10.1109/ICCV51070.2023.01100
[11] J. Liu, M. Liu, Z. Wang, P. An, X. Li, K. Zhou, S. Yang, R. Zhang, Y. Guo, and S. Zhang, “Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation,” Advances in Neural Information Processing Systems, vol. 37, pp. 40085–40110, 2024. DOI: https://doi.org/10.52202/079017-1266
[12] O. Mees, D. Ghosh, K. Pertsch, K. Black, H. R. Walke, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, et al., “Octo: An open-source generalist robot policy,” in First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
[13] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al., “Gr00t n1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025.
[14] W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, F. Lu, H. Wang, et al., “Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge,” arXiv preprint arXiv:2507.04447, 2025.
[15] L. Li, J. Fan, X. Ni, S. Qin, W. Li, and F. Gao, “Sva: Towards speech-enabled vision-language-action model,” Pattern Recognition, p. 112915, 2025. DOI: https://doi.org/10.1016/j.patcog.2025.112915
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186, 2019. DOI: https://doi.org/10.18653/v1/N19-1423
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
[19] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 6892–6903, IEEE, 2024.
[20] H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen-Estruch, A. W. He, V. Myers, M. J. Kim, M. Du, et al., “Bridgedata v2: A dataset for robot learning at scale,” in Conference on Robot Learning, pp. 1723–1736, PMLR, 2023.
[21] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 2630–2640, 2019. DOI: https://doi.org/10.1109/ICCV.2019.00272
[22] N. Lin and M. Cai, “Epic-kitchens-100 unsupervised domain adaptation challenge for action recognition 2022: Team hnu-fpv technical report,” arXiv preprint arXiv:2207.03095, 2022.
[23] Q. Li, Y. Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y. Deng, S. Xu, Y. Zhang, et al., “Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation,” arXiv preprint arXiv:2411.19650, 2024.
[24] H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” arXiv preprint arXiv:2403.09631, 2024.
[25] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025. DOI: https://doi.org/10.1177/02783649241273668
[26] S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto, “Behavior generation with latent actions,” arXiv preprint arXiv:2403.03181, 2024.
[27] K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauzá, T. Davchev, Y. Zhou, A. Gupta, A. Raju, et al., “Robocat: A self-improving generalist agent for robotic manipulation,” arXiv preprint arXiv:2306.11706, 2023.
[28] Z. Dong, Y. Liu, S. Zhang, B. Ye, Y. Yuan, F. Ni, J. Gong, X. Qiu, H. Zhao, Y. Li, et al., “Actioncodec: What makes for good action tokenizers,” arXiv preprint arXiv:2602.15397, 2026.
[29] Z. Liang, Y. Li, T. Yang, C. Wu, S. Mao, T. Nian, L. Pei, S. Zhou, X. Yang, J. Pang, et al., “Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies,” arXiv preprint arXiv:2508.20072, 2025.
[30] Y. Yang, Y. Wang, Z. Wen, L. Zhongwei, C. Zou, Z. Zhang, C. Wen, and L. Zhang, “Efficientvla: Training-free acceleration and compression for vision-language-action models,” arXiv preprint arXiv:2506.10100, 2025.
[31] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024. DOI: https://doi.org/10.15607/RSS.2025.XXI.010
[32] Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma, “Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning,” arXiv preprint arXiv:2506.13757, 2025.
[33] P. Ding, J. Ma, X. Tong, B. Zou, X. Luo, Y. Fan, T. Wang, H. Lu, P. Mo, J. Liu, et al., “Humanoid-vla: Towards universal humanoid control with visual integration,” arXiv preprint arXiv:2502.14795, 2025.
[34] S. Poria, N. Majumder, C.-Y. Hung, A. A. Bagherzadeh, C. Li, K. Kwok, Z. Wang, C. Tan, J. Wu, and D. Hsu, “10 open challenges steering the future of vision-language-action models,” arXiv preprint arXiv:2511.05936, 2025. DOI: https://doi.org/10.1609/aaai.v40i46.41333
[35] Y. Fan, P. Ding, S. Bai, X. Tong, Y. Zhu, H. Lu, F. Dai, W. Zhao, Y. Liu, S. Huang, et al., “Long-vla: Unleashing long-horizon capability of vision language action model for robot manipulation,” arXiv preprint arXiv:2508.19958, 2025.
[36] B. Zhang, Y. Zhang, J. Ji, Y. Lei, J. Dai, Y. Chen, and Y. Yang, “Safevla: Towards safety alignment of vision-language-action model via constrained learning,” arXiv preprint arXiv:2503.03480, 2025.
[37] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025. DOI: https://doi.org/10.15607/RSS.2025.XXI.012
[38] Y. Wang, H. Zhu, M. Liu, J. Yang, H.-S. Fang, and T. He, “Vq-vla: Improving vision-language-action models via scaling vector-quantized action tokenizers,” arXiv preprint arXiv:2507.01016, 2025.
[39] L. F. Moreno Fuentes, M. Haris Khan, M. Altamirano Cabrera, V. Serpiva, D. Iarchuk, Y. Mahmoud, I. Tokmurziyev, and D. Tsetserukou, “Vlh: Vision-language-haptics foundation model,” arXiv e-prints, pp. arXiv–2508, 2025.
[40] R. Fan, M. Sun, and G. Giakos, “Toward the next frontier of embodied ai,” 2025. DOI: https://doi.org/10.20517/ir.2025.44
[41] X. Han, S. Chen, Z. Fu, Z. Feng, L. Fan, D. An, C. Wang, L. Guo, W. Meng, X. Zhang, et al., “Multimodal fusion and vision-language models: A survey for robot vision,” Information Fusion, p. 103652, 2025. DOI: https://doi.org/10.2139/ssrn.5206098
[42] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al., “X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model,” arXiv preprint arXiv:2510.10274, 2025.