HYDRA

Abstract

Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.

🔥Contribution

A multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning🔥🔥🔥.

Cognitive Agent. Integrating a cognitive reinforcement learning-based agent as a controller into a framework to foster hyper decision-making and behavior across diverse environments, enhancing system cohesion, performance, and reasoning capabilities.
Iterative Processing. Employing LLM as a natural language planner that enables the dynamic generation of valid instruction samples for iterative processing. The samples are vary in both the complexity and scope of perception tasks assigned with validity probabilities.
Incremental Reasoning. Applying incremental reasoning, storing information from previous states aids both the LLM and RL agent in acquiring fine-grained visual information through VFMs and the visual-perception-to-text module, thereby refining their reasoning processes.
Open-source. Our model and code base are publicly available in July 2024. Thanks you for staying tuned

HYDRA: Framework

HYDRA Framework Overview

State Memory Bank & Meta Information. All data, including code, instruction, and the output of the reasoner from former iteration, are stored in State Memory Bank. Meta information encompasses crucial data (i.e., skills, task).
Planner Module. This module generates N natural language instruction samples with confidence probabilities.
Controller Module. Dynamically interacting with other modules to facilitate hyper decision-making and functioning as a cognitive controller.
Reasoner Module. Consists of an LLM as code generator and a code executor sub-module.
Textualizer Module. It converts the perceptual output from the reasoner module into textual format.

Gradio Demo

Quantitative Results

Performance on RefCOCO and RefCOCO+.

Type	Method	IoU(%)
Type	Method	Ref	Ref+
E2E	OWL-ViT	30.3	29.4
	OWLv2	33.5	31.7
	GLIP	55.0	52.2
	ReCLIP	58.6	60.5
	KOSMOS-2	57.4	50.7
Compositional	Code-bison	44.4	38.2
	ViperGPT	59.8	60.0
	HYDRA	61.7	61.1

Performance on OK-VQA.

Type	Method	ACC(%)
E2E	PNP-VQA	35.9
	PICa	43.3
	BLIP-2	45.9
	Flamingo (9B)	44.7
	MiniGPT-4 (13B)	37.5
	LLaVA (13B)	42.5
	InstructBLIP (13B)	47.9
Compositional	IdealGPT	19.4
	ViperGPT	40.7
	HYDRA	48.6

Performance on GQA.

Type	Method	ACC(%)
E2E	BLIP-2	45.5
	MiniGPT-4 (13B)	30.8
	LLaVA (13B)	41.3
	PandaGPT (13B)	41.6
	ImageBind-LLM (7B)	41.2
Compositional	IdealGPT	41.7
	ViperGPT	37.9
	Ours	47.9

Generalization performance for the RL-Agent.

Method	Train	Test	ACC(%)
ViLT	GQA	OK-VQA	32.13
ViperGPT	--	OK-VQA	40.74
HYDRA	GQA	OK-VQA	48.17
HYDRA	OK-VQA	OK-VQA	48.63
HYDRA	OKVQA	A-OKVQA	55.94
	A-OKVQA	A-OKVQA	56.35

Qualitative Results

BibTeX



@inproceedings{ke2024hydra,
  title={HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning},
  author={Ke, Fucai and Cai, Zhixi and Jahangard, Simindokht and Wang, Weiqing and Haghighi, Pari Delir and Rezatofighi, Hamid},
  booktitle={European Conference on Computer Vision},
  year={2024},
  organization={Springer},
  doi={10.1007/978-3-031-72661-3_8},
  isbn={978-3-031-72661-3},
  pages={132--149},
}

HYDRA: a HYper agent
for Dynamic compositional visual ReAsoning

Abstract

🔥Contribution

HYDRA: Framework

Gradio Demo

Quantitative Results

Qualitative Results

BibTeX

Acknowledgement

HYDRA: a HYper agent for Dynamic compositional visual ReAsoning

Abstract

🔥Contribution

HYDRA: Framework

Gradio Demo

Quantitative Results

Qualitative Results

BibTeX

Acknowledgement

HYDRA: a HYper agent
for Dynamic compositional visual ReAsoning