Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Today's paper introduces EAGLE, a new family of multimodal large language models (MLLMs) that use a mixture of vision encoders to improve visual perception capabilities. The authors conduct a systematic exploration of the design space for MLLMs with multiple vision encoders, identifying key principles and optimizations. Their approach leads to state-of-the-art performance across various multimodal benchmarks.
Method Overview
The EAGLE approach involves integrating multiple vision encoders pre-trained on different tasks into a single MLLM architecture. The overall pipeline consists of several key steps:
First, they adapt existing vision encoders such as CLIP to handle higher-resolution inputs, which improves performance on tasks requiring fine-grained visual understanding. They find that simply interpolating the position embeddings to the larger input size works well, provided the vision encoder is unfrozen and fine-tuned rather than kept frozen.
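To make this step concrete, here is a minimal sketch of how a ViT-style position-embedding table can be interpolated to a larger patch grid. The function name, tensor layout, and use of bicubic interpolation are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a ViT position-embedding table to a larger patch grid.

    pos_embed: (1, 1 + old_grid**2, dim) tensor with a leading [CLS] slot (assumed layout).
    new_grid:  number of patches per side at the higher resolution.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)

    # Reshape to a 2D grid, interpolate spatially, then flatten back to a token sequence.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

    return torch.cat([cls_pos, patch_pos], dim=1)
```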
Next, they explore different strategies for fusing multiple vision encoders. After comparing various fusion methods, they find that straightforward channel-wise concatenation of visual tokens from different encoders performs best in terms of both efficiency and effectiveness.
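The sketch below illustrates what channel-wise concatenation fusion can look like in practice: per-patch features from each encoder are concatenated along the channel dimension and projected to the language model's embedding width. The module name, dimensions, and the assumption that all encoders produce the same number of tokens are mine, not the paper's:

```python
import torch
import torch.nn as nn

class ChannelConcatFusion(nn.Module):
    """Fuse visual tokens from several encoders by channel concatenation,
    then project to the LLM embedding width (illustrative sketch)."""

    def __init__(self, encoder_dims: list[int], llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(sum(encoder_dims), llm_dim)

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # Each element: (batch, num_tokens, dim_i). Token counts are assumed to match,
        # e.g. by choosing input resolutions so every encoder yields the same grid.
        fused = torch.cat(features, dim=-1)   # (batch, num_tokens, sum(encoder_dims))
        return self.proj(fused)               # (batch, num_tokens, llm_dim)
```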
The authors then conduct a systematic search to identify the optimal combination of vision encoders to include. Starting with CLIP, they progressively add encoders pre-trained on tasks like object detection, OCR, and segmentation, evaluating the performance gain at each step.
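A simple way to picture this search is a greedy loop that adds the encoder giving the largest benchmark gain at each round. This is only a sketch of the general idea, not the paper's exact protocol; the `evaluate` callable (which would train a lightweight MLLM with a given encoder set and return an average score) is a hypothetical placeholder:

```python
def greedy_encoder_search(candidates, base, evaluate):
    """Greedily grow an encoder set, keeping each addition only if it improves
    the evaluation score. `evaluate(encoders) -> float` is assumed."""
    selected, best_score = list(base), evaluate(list(base))
    remaining = [c for c in candidates if c not in base]
    while remaining:
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        score, choice = max(scored, key=lambda t: t[0])
        if score <= best_score:   # stop once no candidate still helps
            break
        selected.append(choice)
        remaining.remove(choice)
        best_score = score
    return selected, best_score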
To address potential inconsistencies between different encoders, they introduce a "pre-alignment" stage where each encoder is individually fine-tuned with a frozen language model before joint training. This helps bridge the gap between vision-focused encoders and language tokens.
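As a rough illustration of the pre-alignment idea, the loop below keeps the language model frozen while tuning one vision encoder and its projection layer on image-text pairs. The training interface (`visual_embeds`, `.loss`) and helper names are assumptions for the sketch, not the authors' implementation:

```python
import torch

def pre_align(encoder, projector, llm, dataloader, lr=1e-4, steps=1000):
    """Pre-alignment sketch: freeze the LLM, tune one encoder + projector so its
    features land closer to the language token space before joint training."""
    for p in llm.parameters():
        p.requires_grad_(False)  # language model stays frozen in this stage

    params = list(encoder.parameters()) + list(projector.parameters())
    opt = torch.optim.AdamW(params, lr=lr)

    for step, (images, text_ids, labels) in zip(range(steps), dataloader):
        vis_tokens = projector(encoder(images))        # map visual features to LLM width
        # Assumed interface: the LLM consumes visual embeddings as a prefix
        # and returns a next-token prediction loss over the text.
        loss = llm(visual_embeds=vis_tokens, input_ids=text_ids, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```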
Finally, they combine all of these optimized components into the EAGLE model family, using a mixture of 4-5 complementary vision encoders fused via channel concatenation, with the pre-alignment training strategy.
Results
The EAGLE models achieve state-of-the-art performance across a wide range of multimodal benchmarks, from general visual question answering to OCR and document understanding.
Notably, EAGLE shows significant improvements on OCR and document-related tasks compared to previous models. The authors also demonstrate that adding more vision encoders consistently improves performance, especially when combined with their optimized training strategies.
Conclusion
This paper presents a systematic exploration of the design space for multimodal large language models that use multiple vision encoders. For more information, please consult the full paper.
Congrats to the authors for their work!
Shi, Min, et al. "EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders." arXiv preprint arXiv:2408.15998 (2024).