Hi there! 👋 I am Yaxin Luo.
About Me
Hello! I am a first-year Machine Learning PhD student at MBZUAI, advised by Prof. Zhiqiang Shen and Prof. Mohsen Guizani. I also work closely with my friend Xiaofu Chen. My research vision centers on advancing Native Multimodal Foundation Models that can understand, reason, generate, and take agentic action across diverse modalities. I am also interested in bridging the gap between high-performance unified intelligence and computational efficiency. To this end, I focus on developing Unified and Efficient Foundation Models capable of seamless understanding and generation, while ensuring they remain scalable and deployable through efficient architectural designs.
Previously, I earned my Bachelor's degree from the Technical University of Denmark, where I was fortunate to be supervised by Prof. Dim P. Papadopoulos. During my bachelor's studies, I was also lucky to collaborate with Dr. Gen Luo and Prof. Rongrong Ji on efficient deep learning research. Earlier, I spent an intense and rewarding year at the University of Edinburgh studying pure mathematics and physics. That experience sparked my passion for science and technology and deepened my curiosity about the unknown; at the time I wanted to explore String Theory, and that year ultimately shaped who I am today. Before Edinburgh, I was enrolled in a Bio-Medicine program at the University of Queensland and preparing for the UCAT to be admitted into the university's medical school, but I did not succeed in the end: I was busy managing a high-street multi-brand boutique in Brisbane's Southbank near the casino, and was far more focused on business than on study and research. The Edinburgh year changed my priorities and set me on a research path, thanks to the advice, encouragement, and support of my academic personal tutor at Edinburgh, Prof. Ana Rita Pires. All of those past experiences have made me who I am today.
My research interests focus on:
- Unified Multimodal Foundation Models: Developing native multimodal foundation models that perform unified understanding, reasoning, and generation across video, language, and speech. I aim to construct a universal interface where diverse modalities converge, enabling models to perceive complex real-world dynamics and generate coherent, high-fidelity multimodal content.
- Efficient Foundation Models: Tackling the efficiency challenges in scaling unified models. I explore novel architectures and mechanisms, such as dynamic computation allocation, efficient attention mechanisms, and token compression, to maximize performance-per-compute. The goal is to build sustainable AI systems that support long-context understanding and high-resolution generation without prohibitive computational costs (see the illustrative sketch after this list).
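To give a flavor of what "dynamic computation allocation" means in practice, here is a minimal, self-contained PyTorch sketch of per-token depth routing: a small router scores tokens and only a fixed fraction of them pass through the attention/MLP block, while the rest skip it via the residual path. This is a toy illustration only, not code from any of the papers listed here; names such as `RoutedBlock` and `capacity` are assumptions of mine for the example.

```python
# Illustrative sketch (assumed names, not from any paper above): per-token
# dynamic depth routing. A router selects the top-k tokens for full compute;
# unselected tokens pass through unchanged via the residual stream.
import torch
import torch.nn as nn


class RoutedBlock(nn.Module):
    def __init__(self, dim: int, capacity: float = 0.5):
        super().__init__()
        self.capacity = capacity          # fraction of tokens that get full compute
        self.router = nn.Linear(dim, 1)   # per-token routing score
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        k = max(1, int(self.capacity * n))
        scores = self.router(x).squeeze(-1)                 # (b, n) routing scores
        topk = scores.topk(k, dim=-1).indices               # tokens chosen for compute
        idx = topk.unsqueeze(-1).expand(-1, -1, d)
        selected = x.gather(1, idx)                         # (b, k, d)

        # standard pre-norm attention + MLP on the selected tokens only
        h = selected + self.attn(self.norm1(selected), self.norm1(selected), self.norm1(selected))[0]
        h = h + self.mlp(self.norm2(h))

        # gate the update by the routing score so the router receives gradients
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, selected + gate * (h - selected))


if __name__ == "__main__":
    block = RoutedBlock(dim=64, capacity=0.5)
    tokens = torch.randn(2, 16, 64)        # (batch, sequence, hidden)
    print(block(tokens).shape)             # torch.Size([2, 16, 64])
```

With `capacity=0.5`, roughly half the tokens receive attention and MLP compute at this block, which is the basic intuition behind mixture-of-depth-style efficiency methods.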
Recently, I have been focusing on two Unified Multimodal Foundation Model projects: one on analysis and one on post-training.
News ======
[2026-02-10] 🚀 Next-Gen CAPTCHAs is now available on arXiv! A defense framework leveraging cognitive gaps against MLLM-based GUI agents.
[2025-09-18] 🚀 OpenCaptchaWorld has been accepted by NeurIPS 2025.
Selected Publications ====== *(\* indicates equal contribution)* For a full and up-to-date publication list, please refer to my [Google Scholar](https://scholar.google.com/citations?user=tEaSCzYAAAAJ&hl=en) page.
**Next-Gen CAPTCHAs: Leveraging the Cognitive Gap for Scalable and Diverse GUI-Agent Defense** * arXiv 2026 * Jiacheng Liu *, **Yaxin Luo** *, Jiacheng Cui, Xinyi Shang, Xiaohan Zhao, Zhiqiang Shen * 📄 Paper 💻 Code 🚀 Demo 🌐 Project *
**OpenCaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents** * NeurIPS 2025 * **Yaxin Luo** *, Zhaoyi Li *, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen * 📄 Paper 💻 Code 🚀 Demo *
**APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension** * ECCV 2024 * **Yaxin Luo**, Jiayi Ji, Xiaofu Chen, Yuxin Zhang, Tianhe Ren, Gen Luo * 📄 Paper 💻 Code *
**γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models** * ICLR 2025 * **Yaxin Luo**, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji * 📄 Paper 💻 Code *
**DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension** * CVPR 2025 * Xiaofu Chen, **Yaxin Luo**, Gen Luo, Jiayi Ji, Henghui Ding, Yiyi Zhou * 📄 Paper 💻 Code