Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Today's paper introduces EAGLE, a new family of multimodal large language models (MLLMs) that use a mixture of vision encoders to improve visual perception capabilities. The authors conduct a systematic exploration of the design space for MLLMs with multiple vision encoders, identifying key principles and optimizations. Their approach leads to state-of-the-art performance across various multimodal benchmarks.
Method Overview
The EAGLE approach involves integrating multiple vision encoders pre-trained on different tasks into a single MLLM architecture. The overall pipeline consists of several key steps:
First, they adapt existing vision encoders such as CLIP to handle higher-resolution inputs, which improves performance on tasks requiring fine-grained visual understanding. They find that simply interpolating the position embeddings to the larger input size works well, provided the vision encoder is unfrozen and fine-tuned rather than kept frozen.
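To make this step concrete, here is a minimal sketch of how a ViT-style position-embedding table can be interpolated to a larger patch grid. The function name, tensor layout, and use of bicubic interpolation are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a ViT position-embedding table to a larger patch grid.

    pos_embed: (1, 1 + old_grid**2, dim) tensor with a leading [CLS] slot (assumed layout).
    new_grid:  number of patches per side at the higher resolution.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)

    # Reshape to a 2D grid, interpolate spatially, then flatten back to a token sequence.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

    return torch.cat([cls_pos, patch_pos], dim=1)
```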
Next, they explore different strategies for fusing multiple vision encoders. After comparing various fusion methods, they find that straightforward channel-wise concatenation of visual tokens from different encoders performs best in terms of both efficiency and effectiveness.
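The sketch below illustrates what channel-wise concatenation fusion can look like in practice: per-patch features from each encoder are concatenated along the channel dimension and projected to the language model's embedding width. The module name, dimensions, and the assumption that all encoders produce the same number of tokens are mine, not the paper's:

```python
import torch
import torch.nn as nn

class ChannelConcatFusion(nn.Module):
    """Fuse visual tokens from several encoders by channel concatenation,
    then project to the LLM embedding width (illustrative sketch)."""

    def __init__(self, encoder_dims: list[int], llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(sum(encoder_dims), llm_dim)

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # Each element: (batch, num_tokens, dim_i). Token counts are assumed to match,
        # e.g. by choosing input resolutions so every encoder yields the same grid.
        fused = torch.cat(features, dim=-1)   # (batch, num_tokens, sum(encoder_dims))
        return self.proj(fused)               # (batch, num_tokens, llm_dim)
```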
The authors then conduct a systematic search to identify the optimal combination of vision encoders to include. Starting with CLIP, they progressively add encoders pre-trained on tasks like object detection, OCR, and segmentation, evaluating the performance gain at each step.
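A simple way to picture this search is a greedy loop that adds the encoder giving the largest benchmark gain at each round. This is only a sketch of the general idea, not the paper's exact protocol; the `evaluate` callable (which would train a lightweight MLLM with a given encoder set and return an average score) is a hypothetical placeholder:

```python
def greedy_encoder_search(candidates, base, evaluate):
    """Greedily grow an encoder set, keeping each addition only if it improves
    the evaluation score. `evaluate(encoders) -> float` is assumed."""
    selected, best_score = list(base), evaluate(list(base))
    remaining = [c for c in candidates if c not in base]
    while remaining:
        scored = [(evaluate(selected + [c]), c) for c in remaining]
        score, choice = max(scored, key=lambda t: t[0])
        if score <= best_score:   # stop once no candidate still helps
            break
        selected.append(choice)
        remaining.remove(choice)
        best_score = score
    return selected, best_score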
To address potential inconsistencies between different encoders, they introduce a "pre-alignment" stage where each encoder is individually fine-tuned with a frozen language model before joint training. This helps bridge the gap between vision-focused encoders and language tokens.
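As a rough illustration of the pre-alignment idea, the loop below keeps the language model frozen while tuning one vision encoder and its projection layer on image-text pairs. The training interface (`visual_embeds`, `.loss`) and helper names are assumptions for the sketch, not the authors' implementation:

```python
import torch

def pre_align(encoder, projector, llm, dataloader, lr=1e-4, steps=1000):
    """Pre-alignment sketch: freeze the LLM, tune one encoder + projector so its
    features land closer to the language token space before joint training."""
    for p in llm.parameters():
        p.requires_grad_(False)  # language model stays frozen in this stage

    params = list(encoder.parameters()) + list(projector.parameters())
    opt = torch.optim.AdamW(params, lr=lr)

    for step, (images, text_ids, labels) in zip(range(steps), dataloader):
        vis_tokens = projector(encoder(images))        # map visual features to LLM width
        # Assumed interface: the LLM consumes visual embeddings as a prefix
        # and returns a next-token prediction loss over the text.
        loss = llm(visual_embeds=vis_tokens, input_ids=text_ids, labels=labels).loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```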
Finally, they combine all of these optimized components into the EAGLE model family, using a mixture of 4-5 complementary vision encoders fused via channel concatenation, with the pre-alignment training strategy.
Results
The EAGLE models achieve state-of-the-art performance across a wide range of multimodal benchmarks, from general visual question answering to OCR and document understanding.
Notably, EAGLE shows significant improvements on OCR and document-related tasks compared to previous models. The authors also demonstrate that adding more vision encoders consistently improves performance, especially when combined with their optimized training strategies.
Conclusion
This paper presents a systematic exploration of the design space for multimodal large language models that use multiple vision encoders. For more information, please consult the full paper.
Congrats to the authors for their work!
Shi, Min, et al. "EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders." arXiv preprint arXiv:2408.15998 (2024).