A Robust Two-Stage Retrieval-Augmented Vision-Language Framework for Knowledge-Intensive Multimodal Reasoning and Alignment

Authors

DOI:

https://doi.org/10.66279/2da0zk02

Keywords:

Vision-Language Models (VLMs), Retrieval-Augmented Generation (RAG), Multimodal Reasoning, Knowledge-Intensive Tasks

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in visual perception and linguistic understanding. However, they often struggle with knowledge-intensive tasks that require linking visual scenes to external background knowledge. To address these limitations, this paper proposes the RoRA-VLM (Robust Retrieval-Augmented Vision Language Model) framework. RoRA-VLM introduces a novel two-stage retrieval mechanism, Image-anchored Textual-query Expansion, to bridge the modality discrepancy between visual and textual inputs. It further incorporates a Query-oriented Visual Token Refinement strategy for better alignment and Adversarial Noise Injection to enhance reasoning robustness against irrelevant retrieved information. Experimental results on the InfoSeek and OVEN datasets demonstrate that RoRA-VLM significantly outperforms baseline models, achieving 62.5% accuracy on InfoSeek, a 17.3% improvement over the base LLaVA-v1.5 model. These findings highlight the effectiveness of the proposed alignment and reasoning mechanisms in developing more intelligent and robust vision-language systems.

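As an illustration of the workflow summarized in the abstract, the following Python sketch outlines how a two-stage retrieval pipeline, query-oriented visual token refinement, and adversarial noise injection could fit together. It is a minimal, hypothetical sketch rather than the authors' implementation: toy numpy pseudo-embeddings stand in for real image and text encoders, and every function name (embed, stage1_image_anchored_retrieval, stage2_expanded_text_retrieval, refine_visual_tokens, inject_adversarial_noise) and corpus is an assumption made for illustration only.

import numpy as np

rng = np.random.default_rng(0)

def embed(item):
    # Toy stand-in for a CLIP-style encoder: map any string to a fixed
    # 64-dimensional pseudo-embedding (deterministic per input).
    seed = int.from_bytes(str(item).encode("utf-8")[:8].ljust(8, b"\0"), "big")
    return np.random.default_rng(seed).standard_normal(64)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def stage1_image_anchored_retrieval(image_id, entity_corpus, k=2):
    # Stage 1: retrieve entity descriptions anchored on the input image.
    q = embed(image_id)
    return sorted(entity_corpus, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]

def stage2_expanded_text_retrieval(question, anchors, passage_corpus, k=2):
    # Stage 2: expand the textual query with the stage-1 entities, then
    # retrieve knowledge passages for the expanded query.
    q = embed(question + " " + " ".join(anchors))
    return sorted(passage_corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def refine_visual_tokens(visual_tokens, question, keep=8):
    # Query-oriented refinement: keep only the visual tokens most similar
    # to the question embedding, preserving their original order.
    q = embed(question)
    scores = np.array([cosine(q, t) for t in visual_tokens])
    kept = sorted(np.argsort(scores)[::-1][:keep])
    return [visual_tokens[i] for i in kept]

def inject_adversarial_noise(passages, distractor_corpus, n_noise=1):
    # Training-time noise injection: mix irrelevant passages into the
    # retrieved context so the model learns to ignore them.
    noise = [str(d) for d in rng.choice(distractor_corpus, size=n_noise, replace=False)]
    mixed = passages + noise
    order = rng.permutation(len(mixed))
    return [mixed[i] for i in order]

if __name__ == "__main__":
    question = "How tall is this landmark?"
    anchors = stage1_image_anchored_retrieval(
        "image_001",
        ["Eiffel Tower, Paris", "Tokyo Tower, Japan", "Golden Gate Bridge"])
    passages = stage2_expanded_text_retrieval(
        question, anchors,
        ["The Eiffel Tower is about 330 m tall.",
         "Tokyo Tower stands 333 m high.",
         "The Louvre is the most-visited museum in the world."])
    context = inject_adversarial_noise(passages, ["Bananas are rich in potassium."])
    visual_tokens = [rng.standard_normal(64) for _ in range(32)]
    kept_tokens = refine_visual_tokens(visual_tokens, question)
    print("Stage-1 anchors:", anchors)
    print("Retrieved context (with injected noise):", context)
    print("Visual tokens kept:", len(kept_tokens))

In a full system, embed would be a shared multimodal encoder and the corpora would be large external knowledge bases; the sketch only mirrors the control flow described in the abstract.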

Published

05-02-2026

How to Cite

A Robust Two-Stage Retrieval-Augmented Vision-Language Framework for Knowledge-Intensive Multimodal Reasoning and Alignment. (2026). Computational Discovery and Intelligent Systems (CDIS), 2(2), 42-52. https://doi.org/10.66279/2da0zk02
