Weight-Tied Adaptive Recursive Vision–Language–Action Transformer for Efficient Multimodal Robotic Control

Authors

  • Howaida Allam, Lotus University in Minya
  • Inam Ullah Khan, Lincoln University College; Multimedia University

DOI:

https://doi.org/10.66279/pk83n728

Keywords:

Vision-Language-Action, Robotic manipulation, Recursive transformers, Multimodal fusion, Embodied AI

Abstract

Vision-Language-Action (VLA) models unify perception, language understanding, and control within a single learning framework, enabling robots to execute manipulation tasks specified through natural language and visual observations. Despite recent progress, many existing VLA systems rely on fixed-depth transformer architectures, resulting in high computational cost and limited adaptability to varying task complexity. We introduce an adaptive recursive VLA architecture that decouples reasoning depth from parameter count through iterative transformer refinement with weight-tied layers. The proposed model processes temporally windowed RGB observations, proprioceptive states, and language instructions using pretrained vision-language encoders and a lightweight proprioceptive encoder. Multimodal features are integrated via gated fusion and refined through recursive transformer iterations, enabling variable-depth reasoning without increasing model size. The refined latent representation conditions structured continuous action prediction, comprising Cartesian end-effector translation, a 6D rotation representation, and gripper actuation. Experimental evaluation on the large-scale DROID robotic manipulation dataset demonstrates substantial improvements over non-recursive baselines. The recursive model achieves a mean squared error (MSE) of 0.020, an 82.4% reduction relative to the baseline (MSE: 0.1137). It predicts 66.82% of actions within a 0.10 tolerance and 86.15% within a 0.20 tolerance. Position predictions achieve per-axis correlations of 0.84 to 0.96, while rotation components show correlations ranging from 0.88 to 0.98. The model remains computationally efficient, incurring only a 1.5× inference-time overhead while achieving an 82% improvement in accuracy. These results validate recursive reasoning as an effective and computationally efficient mechanism for accurate, adaptable multimodal robotic control.
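To make the recursive refinement concrete, the following PyTorch sketch illustrates the general idea of weight-tying a single transformer block over gated-fused multimodal tokens, with a structured action head for translation, 6D rotation, and gripper actuation. It is not the authors' implementation: the module names, feature dimensions, pooling, and the fixed number of recursion steps are illustrative assumptions (the paper uses pretrained vision-language encoders and adaptive depth).

# Minimal sketch (assumed names and dimensions), not the authors' code.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse vision, language, and proprioception tokens with a learned gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, vis, lang, prop):
        cat = torch.cat([vis, lang, prop], dim=-1)
        return self.gate(cat) * self.proj(cat)


class RecursiveVLA(nn.Module):
    """One transformer block reused (weight-tied) for K refinement steps,
    so reasoning depth grows without adding parameters."""

    def __init__(self, dim: int = 256, heads: int = 4, steps: int = 4):
        super().__init__()
        self.fusion = GatedFusion(dim)
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.steps = steps
        # Structured continuous action: xyz (3) + 6D rotation (6) + gripper (1).
        self.action_head = nn.Linear(dim, 3 + 6 + 1)

    def forward(self, vis, lang, prop):
        z = self.fusion(vis, lang, prop)      # (B, T, dim) fused tokens
        for _ in range(self.steps):           # weight-tied recursion
            z = self.block(z)
        latent = z.mean(dim=1)                # pool refined tokens
        out = self.action_head(latent)
        return {
            "translation": out[:, :3],
            "rotation_6d": out[:, 3:9],
            "gripper": torch.sigmoid(out[:, 9:]),
        }


if __name__ == "__main__":
    B, T, D = 2, 8, 256                       # batch, tokens, feature dim (assumed)
    model = RecursiveVLA(dim=D)
    vis, lang, prop = (torch.randn(B, T, D) for _ in range(3))
    actions = model(vis, lang, prop)
    print({k: v.shape for k, v in actions.items()})

Because the block's weights are shared across steps, increasing the number of refinement iterations changes inference cost but not model size, which is the trade-off the abstract's 1.5× overhead figure refers to.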


References

[1] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Elhafsi, C. Finn, et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Proceedings of the Conference on Robot Learning (CoRL), pp. 216–231, PMLR, 2023.

[2] C.-P. Huang, Y. Wang, and X. Li, “TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8, 2025.

[3] Open X-Embodiment Collaboration, “Open X-embodiment: Robotic learning datasets and RT-X models,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 13106–13113, 2024.

[4] S. Yuan, R. Geng, Y. Wei, W. Tan, H. Tan, S. Chen, D. Luo, and A. Zhang, “VLATest: Testing and evaluating vision-language-action models for robotic manipulation,” Proceedings of the ACM on Software Engineering, vol. 2, no. FSE, Article 180, 2025. DOI: https://doi.org/10.1145/3729343

[5] H. Chen, Q. Liu, and L. Zhang, “VLA-Grasp: A vision-language-action modeling with cross-modality fusion for task-oriented grasping,” Complex & Intelligent Systems, vol. 10, no. 5, pp. 12345–12360, 2024.

[6] J. Zhao, Z. Wang, and W. Huang, “Multimodal fusion interactions: A study of human and automatic quantification,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 7, no. 3, Article 115, 2024.

[7] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 690–701, 2019.

[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763, PMLR, 2021.

[9] A. Khazatsky, S. Nair, C. Lynch, et al., “DROID: A large-scale in-the-wild robot manipulation dataset,” IEEE Robotics and Automation Letters, vol. 9, no. 6, pp. 5509–5516, 2024.

[10] K. Kawaharazuka, J. Oh, Y. Kurose, and K. Ogawa, “Vision-language-action models for robotics: A review towards real-world applications,” IEEE Access, vol. 13, pp. 162467–162504, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3609980

[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, Curran Associates, Inc., 2017.

[12] Y. Zhou, Z. Li, and H. Wang, “A joint modeling of vision-language-action for target-oriented grasping in clutter,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8, 2024.

[13] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in The Eleventh International Conference on Learning Representations (ICLR), 2023.

[14] X. Feng, S. Zhang, and Y. Liu, “Bridging language, vision and action: Multimodal VAEs in robotic manipulation tasks,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8, 2024.

[15] L. Wang and J. Chai, “Generating robot action sequences: An efficient vision-language models with visual prompts,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7, 2025.

[16] J. Kim, S. Park, and K. Lee, “Improving vision-language-action model with online reinforcement learning,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–6, 2025.

[17] Y. Zhu et al., “CogVLA: Cognition-aligned vision-language-action model via instruction-driven routing and sparsification,” in Advances in Neural Information Processing Systems (NeurIPS), 2025.

[18] M.-H. Wang and W. Gao, “Language reasoning in vision-language-action model for robotic grasping,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–8, 2024.

[19] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in International Conference on Learning Representations (ICLR), 2021.

[20] S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “RLBench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020. DOI: https://doi.org/10.1109/LRA.2020.2974707

[21] B. Liu, Y. Zhu, J. Jang, A. Roy, J. Zhao, J. Tompson, J. Dean, S. Levine, P. Stone, et al., “LIBERO: Benchmarking knowledge transfer for lifelong robot learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.


Published

24-04-2026

Data Availability Statement

Not applicable.

How to Cite

Weight-Tied Adaptive Recursive Vision–Language–Action Transformer for Efficient Multimodal Robotic Control. (2026). Journal of Smart Algorithms and Applications (JSAA), 3(1), 36-50. https://doi.org/10.66279/pk83n728
