A Robust Two-Stage Retrieval-Augmented Vision-Language Framework for Knowledge-Intensive Multimodal Reasoning and Alignment

Authors

Keywords:

Vision-Language Models (VLMs), Retrieval-Augmented Generation (RAG), Multimodal Reasoning, Knowledge-Intensive Tasks

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in visual perception and linguistic understanding. However, they often struggle with knowledge-intensive tasks that require linking visual scenes to external background knowledge. To address these limitations, this paper proposes the RoRA-VLM (Robust Retrieval-Augmented Vision-Language Model) framework. RoRA-VLM introduces a novel two-stage retrieval mechanism, Image-anchored Textual-query Expansion, to bridge the modality discrepancy between visual and textual inputs. It further incorporates a Query-oriented Visual Token Refinement strategy for better alignment and Adversarial Noise Injection to strengthen reasoning robustness against irrelevant retrieved information. Experimental results on the InfoSeek and OVEN datasets demonstrate that RoRA-VLM significantly outperforms baseline models, achieving 62.5% accuracy on InfoSeek, a 17.3% improvement over the base LLaVA-v1.5 model. These findings highlight the effectiveness of the proposed alignment and reasoning mechanisms in developing more intelligent and robust vision-language systems.
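To give a concrete picture of the pipeline described above, the minimal Python sketch below walks through the components named in the abstract: image-anchored retrieval of candidate entities, textual-query expansion, query-oriented visual token refinement, and adversarial noise injection during training. All function names, embedding shapes, and the cosine-similarity scoring are illustrative assumptions made for this sketch; they are not the authors' implementation.

import numpy as np

def cosine_sim(query, matrix):
    # Cosine similarity between one query vector and each row of a matrix.
    q = query / (np.linalg.norm(query) + 1e-8)
    m = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-8)
    return m @ q

def stage1_image_anchored_retrieval(image_emb, entity_embs, entity_names, k=2):
    # Stage 1: anchor retrieval on the image to identify candidate entities.
    top = np.argsort(cosine_sim(image_emb, entity_embs))[::-1][:k]
    return [entity_names[i] for i in top]

def stage2_query_expansion(question, entities):
    # Stage 2: expand the textual query with the retrieved entity names
    # before retrieving background passages with a text retriever.
    return f"{question} [entities: {', '.join(entities)}]"

def refine_visual_tokens(visual_tokens, query_emb, keep=4):
    # Query-oriented refinement: keep only the visual tokens most similar
    # to the query embedding, discarding the rest.
    idx = np.sort(np.argsort(cosine_sim(query_emb, visual_tokens))[::-1][:keep])
    return visual_tokens[idx]

def inject_adversarial_noise(passages, distractors, rng, n_noise=1):
    # Training-time noise injection: mix irrelevant passages into the
    # retrieved context so the model learns to ignore them.
    noisy = passages + list(rng.choice(distractors, size=n_noise, replace=False))
    rng.shuffle(noisy)
    return noisy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    names = ["Eiffel Tower", "Tokyo Tower", "Space Needle"]
    entity_embs = rng.normal(size=(3, 8))
    image_emb = entity_embs[0] + 0.1 * rng.normal(size=8)  # image resembles entity 0
    entities = stage1_image_anchored_retrieval(image_emb, entity_embs, names)
    print(stage2_query_expansion("When was this landmark completed?", entities))
    print(inject_adversarial_noise([f"{entities[0]} background passage"],
                                   ["unrelated passage A", "unrelated passage B"], rng))

In practice, each step would operate on learned multimodal embeddings and a real knowledge corpus rather than the random vectors used here for illustration.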


Published

05-02-2026

How to Cite

A Robust Two-Stage Retrieval-Augmented Vision-Language Framework for Knowledge-Intensive Multimodal Reasoning and Alignment. (2026). Computational Discovery and Intelligent Systems (CDIS), 2(2), 42-52. https://pub.scientificirg.com/index.php/CDIS/article/view/40
