A Robust Two-Stage Retrieval-Augmented Vision-Language Framework for Knowledge-Intensive Multimodal Reasoning and Alignment
Keywords:
Vision-Language Models (VLMs), Retrieval-Augmented Generation (RAG), Multimodal Reasoning, Knowledge-Intensive Tasks
Abstract
Vision-Language Models (VLMs) have demonstrated significant potential in visual perception and linguistic understanding. However, they often struggle with knowledge-intensive tasks that require linking visual scenes to external background knowledge. To address this limitation, this paper proposes RoRA-VLM (Robust Retrieval-Augmented Vision Language Model). RoRA-VLM introduces a novel two-stage retrieval mechanism, Image-anchored Textual-query Expansion, to bridge the modality discrepancy between visual and textual inputs. It also incorporates a Query-oriented Visual Token Refinement strategy to improve alignment, and Adversarial Noise Injection to make reasoning more robust to irrelevant retrieved information. Experimental results on the InfoSeek and OVEN datasets show that RoRA-VLM significantly outperforms baseline models, achieving 62.5% accuracy on InfoSeek, a 17.3% improvement over the base LLaVA-v1.5 model. These findings highlight the effectiveness of the proposed alignment and reasoning mechanisms for building more robust vision-language systems.
License
Copyright (c) 2026 Computational Discovery and Intelligent Systems (CDIS)

This work is licensed under a Creative Commons Attribution 4.0 International License.



