We present γ-MoD, a novel approach to improving the computational efficiency of Multimodal Large Language Models (MLLMs) by incorporating Mixture-of-Depth (MoD) layers. This plug-and-play strategy replaces redundant dense layers with MoD layers, significantly reducing computational cost while maintaining performance. Despite recent advances in MLLMs, their high computational demands limit practical deployment, especially for real-time inference. γ-MoD tackles this challenge with a new paradigm that reduces the number of activated tokens, offering better efficiency than existing methods.
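At a high level, each MoD layer pairs a lightweight router with an existing transformer layer: the router scores tokens, only the top-scoring fraction is processed by the layer, and the rest skip it. Below is a minimal, illustrative PyTorch sketch of this idea, not the implementation in this repository; `MoDLayer`, its `router`, and the `capacity` parameter are hypothetical names, and the sketch assumes the wrapped layer maps a `(batch, seq_len, hidden_dim)` tensor to one of the same shape.

```python
import torch
import torch.nn as nn

class MoDLayer(nn.Module):
    """Illustrative Mixture-of-Depth wrapper (hypothetical, not the repo's code):
    a lightweight router scores every token, only the top-scoring fraction is
    processed by the wrapped dense layer, and the remaining tokens skip it."""

    def __init__(self, layer: nn.Module, hidden_dim: int, capacity: float = 0.5):
        super().__init__()
        self.layer = layer                       # original dense transformer layer
        self.router = nn.Linear(hidden_dim, 1)   # scores token importance
        self.capacity = capacity                 # fraction of tokens kept active

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        batch, seq_len, _ = hidden_states.shape
        scores = self.router(hidden_states).squeeze(-1)     # (batch, seq_len)
        k = max(1, int(seq_len * self.capacity))
        top_idx = scores.topk(k, dim=-1).indices            # indices of routed tokens

        output = hidden_states.clone()                       # skipped tokens pass through unchanged
        for b in range(batch):
            selected = hidden_states[b, top_idx[b]].unsqueeze(0)   # (1, k, hidden_dim)
            updated = self.layer(selected)                         # process only the routed tokens
            gate = torch.sigmoid(scores[b, top_idx[b]]).unsqueeze(-1)
            # Blend the processed output with the identity path, weighted by the
            # router score, so the routing decision stays differentiable.
            output[b, top_idx[b]] = gate * updated.squeeze(0) + (1 - gate) * selected.squeeze(0)
        return output
```

Because skipped tokens are simply copied through, the compute spent in the wrapped layer scales with `capacity` rather than with the full sequence length, which is where the training- and inference-time savings reported below come from.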
γ-MoD was evaluated on three popular MLLMs across 9 benchmark datasets; the table below summarizes the efficiency gains and accuracy changes.
Model | Training Time Reduction | Inference Time Reduction | Accuracy Change |
---|---|---|---|
γ-MoD-LLaVA-HR-7B | 31.0% | 53.2% | -1.5% |
γ-MoD-Mini-Gemini-HD-7B | 41.0% | 58.1% | -1.0% |
γ-MoD-LLaVA-HR-13B | 18.8% | 50.4% | -0.3% |
γ-MoD-LLaVA-HR-X-13B | 17.4% | 58.6% | +0.4% |
These results show that γ-MoD routes tokens efficiently, focusing computation on the tokens that carry critical information while skipping redundant ones.
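To illustrate the plug-and-play aspect, the hypothetical sketch below wraps a subset of a decoder's dense layers with the `MoDLayer` defined above; the `model.layers` attribute, the every-other-layer selection, and the 50% capacity are placeholder assumptions rather than the configuration used in the paper.

```python
# Hypothetical conversion: wrap alternating dense layers with MoDLayer.
# Which layers to convert and the token capacity are placeholder choices.
def convert_to_mod(model, hidden_dim=4096, capacity=0.5):
    for i, layer in enumerate(model.layers):
        if i % 2 == 1:                      # keep some layers fully dense
            model.layers[i] = MoDLayer(layer, hidden_dim, capacity)
    return model
```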
If you use γ-MoD in your work, please cite:
@misc{luo2024gammamodexploringmixtureofdepthadaptation,
      title={$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models},
      author={Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji},
      year={2024},
      eprint={2410.13859},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.13859},
}