γ-MOD: Mixture-of-Depth Adaptation for Multimodal Large Language Models

Yaxin Luo1, Gen Luo2,*, Jiayi Ji3,4, Yiyi Zhou3, Xiaoshuai Sun3, Zhiqiang Shen5, Rongrong Ji3
1Technical University of Denmark, 2Shanghai AI Laboratory, 3Xiamen University, 4National University of Singapore, 5MBZUAI
*Corresponding author

Abstract

Despite rapid progress, the high computational demands of Multimodal Large Language Models (MLLMs) still limit their practical use, especially for real-time inference. We present γ-MOD, a novel approach that improves MLLM efficiency by incorporating Mixture-of-Depth (MoD) layers. This plug-and-play strategy replaces redundant dense layers with MoD layers, significantly reducing computational cost while maintaining performance. γ-MOD tackles the efficiency challenge through a new paradigm that focuses on reducing the number of activated tokens, offering superior efficiency compared to existing methods.

Method

Key Features:

  • ARank Metric: Guides the replacement of redundant dense layers with MoD layers (a rough sketch of such a rank estimate follows the architecture figure below).
  • Shared Vision-Language Router: Facilitates cross-modality token routing.
  • Masked Routing Learning: Prevents critical tokens from being skipped during model adaptation.
Figure: the γ-MOD architecture.
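
The ARank metric measures how redundant a layer is before deciding whether to convert it into an MoD layer. Below is a minimal, hypothetical PyTorch sketch of how a rank-of-attention-maps estimate could be computed from a few calibration samples; the exact definition and thresholds follow the paper, and `estimate_arank` is an illustrative name, not the repository's API.

```python
import torch


@torch.no_grad()
def estimate_arank(attn_maps: torch.Tensor, tol: float = 1e-3) -> float:
    """Estimate the average rank of one layer's attention maps.

    attn_maps: (num_samples, num_heads, seq_len, seq_len) attention
    probabilities collected from a handful of calibration batches.
    A low average rank suggests the layer is redundant and is a
    candidate for replacement with an MoD layer.
    """
    n, h, _, _ = attn_maps.shape
    ranks = []
    for i in range(n):
        for j in range(h):
            # singular values of one head's attention map
            sv = torch.linalg.svdvals(attn_maps[i, j].float())
            # count singular values above a relative tolerance
            ranks.append(int((sv > tol * sv.max()).sum()))
    return sum(ranks) / len(ranks)


# Layers whose estimated ARank falls below a chosen threshold would be
# replaced with MoD layers; the remaining layers stay dense.
```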

Results

γ-MOD was tested on three popular MLLMs across 9 benchmark datasets.

  • Mini-Gemini-HD: Training time reduced by 41% and inference time by 58.1%, with an accuracy drop of only 1.0%.
  • LLaVA-HR: Training time reduced by 31% and inference time by 53.2%, with an accuracy drop of only 1.5%.
  • Generalization: The adaptation carries over to different MLLM architectures and model sizes.
Figures: comparison with other models; scalability results.
| Model | Training Time Reduction | Inference Time Reduction | Accuracy Change |
| --- | --- | --- | --- |
| γ-MoD-LLaVA-HR-7B | 31.0% | 53.2% | -1.5% |
| γ-MoD-Mini-Gemini-HD-7B | 41.0% | 58.1% | -1.0% |
| γ-MoD-LLaVA-HR-13B | 18.8% | 50.4% | -0.3% |
| γ-MoD-LLaVA-HR-X-13B | 17.4% | 58.6% | +0.4% |

Visualization

Our γ-MOD approach demonstrates impressive efficiency in routing tokens and focusing on critical information.

Figure: visualization of routing and skipped content.

Key Observations:

  1. Consistent Routing Patterns: Question tokens are mostly retained, image tokens show the highest redundancy, and response tokens fall between these two extremes.
  2. Efficient Content Skipping: Gray areas in images represent skipped tokens, while white areas highlight regions the model focuses on more intensely.
  3. Improved Focus on Critical Information: By routing out redundant tokens, the model can allocate more computation to important regions, leading to more accurate responses (a minimal routing sketch follows this list).
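
To make these routing patterns concrete, here is a minimal PyTorch sketch of an MoD layer with a shared vision-language router and masked routing. The class and argument names (`MoDLayer`, `capacity`, `keep_mask`) are illustrative assumptions rather than the authors' implementation: the router scores every token regardless of modality, only the top-scoring fraction passes through the dense block, and a mask keeps critical tokens (e.g., question tokens) from being skipped during training.

```python
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    """Sketch of a Mixture-of-Depth layer with a shared vision-language router.

    `block` is an existing dense transformer layer; `capacity` is the fraction
    of tokens that still pass through it. All other tokens skip the layer via
    the residual (identity) path.
    """

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.5):
        super().__init__()
        self.block = block
        self.router = nn.Linear(hidden_size, 1)  # shared across image and text tokens
        self.capacity = capacity

    def forward(self, x: torch.Tensor, keep_mask: torch.Tensor = None) -> torch.Tensor:
        # x: (batch, seq, hidden); keep_mask: (batch, seq) bool, True for
        # critical tokens (e.g., question tokens) that must never be skipped.
        scores = self.router(x).squeeze(-1)                  # (batch, seq)
        if keep_mask is not None:
            # masked routing learning: critical tokens always win the top-k
            scores = scores.masked_fill(keep_mask, float("inf"))
        k = max(1, int(self.capacity * x.size(1)))
        topk = scores.topk(k, dim=1).indices                 # tokens to process
        out = x.clone()
        for b in range(x.size(0)):
            sel = topk[b]
            processed = self.block(x[b, sel].unsqueeze(0)).squeeze(0)
            # weight the output by the router score so the router receives gradients
            out[b, sel] = processed * torch.sigmoid(scores[b, sel]).unsqueeze(-1)
        return out
```

In this sketch, `keep_mask` would mark the question tokens during training; at inference the mask can be dropped, and the capacity ratio alone controls how many tokens each MoD layer still processes.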

Download

We also provide the checkpoints for your convenience.

| Version | Download |
| --- | --- |
| γ-MOD-llava-hr-7b-0.34 | model |
| γ-MOD-llava-hr-13b-0.34 | model |
| γ-MOD-llava-hr-13b-0.5 | model |
| γ-MOD-Mini-Gemini-HD-7b-0.34 | model |
| γ-MOD-Mini-Gemini-HD-7b-0.5 | model |

Citation

If you use γ-MOD in your work, please cite:

@misc{luo2024gammamodexploringmixtureofdepthadaptation,
    title={$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models}, 
    author={Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji},
    year={2024},
    eprint={2410.13859},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2410.13859}, 
}