Hi there! 👋 I am Yaxin Luo.

About Me
Hello! I am a first-year Machine Learning PhD student at MBZUAI, advised by Prof. Zhiqiang Shen, Prof. Ivan Laptev, and Dr. Fabio Pizzati, and collaborating with Prof. Raoul de Charette at Inria. I am also working closely with my friend Xiaofu Chen. I believe that the next generation of advanced machine intelligence will be a vision-centric, multimodal foundation model that understands the physics of the real world and can correctly interact with it and predict its changes.
Previously, I earned my Bachelor's degree from the Technical University of Denmark, where I was fortunate to be supervised by Prof. Dim P. Papadopoulos, who instilled in me good research training and habits. Earlier, I spent an intense and rewarding year at the University of Edinburgh studying pure mathematics and physics, an experience that sparked my passion for science and technology and deepened my curiosity about the unknown; at the time I was eager to explore String Theory, and that year ultimately shaped who I am today. Before Edinburgh, I was enrolled in a Bio-Medicine program at the University of Queensland and preparing for the UCAT to be admitted into the university's medical school, but I failed in the end: I was preoccupied with managing a high-street multi-brand boutique in Brisbane's Southbank near the casino, and was far more focused on business than on study and research. The Edinburgh year changed my priorities and set me on a research path, thanks to the advice, encouragement, and support of my academic personal tutor at Edinburgh, Prof. Ana Rita Pires. All those past experiences have made me who I am today.
Currently, I am focusing on physics-aware learning for vision models and working on a project on the data anatomy of LLM pretraining.

My research interests span:

  • Multimodal Foundation Models: Developing native multimodal foundation models that can perform understanding, reasoning, generation, and further action tasks across video, language, and speech. These models will serve as the core intelligence, the "brain," for Embodied AI, Robotics, and many other applications.
  • Physics-Grounded Video Understanding & Generation: Vision-centric video models that learn causal structure and explicit physics knowledge from large-scale videos, supporting both understanding and generation in the physical world and further enabling action-conditioned prediction for embodied agents.
  • Data-centric Machine Learning: Beyond models and algorithms, I also enjoy analyzing and understanding data, improving data quality, compressing data for training efficiency, and building efficient, scalable pipelines for synthesizing high-quality data for foundation models.

News

🚀 OpenCaptchaWorld has been released and expanded to double its dataset size!

Selected Publications

(* indicates equal contribution)

For a full and up-to-date publication list, please refer to my Google Scholar page.

  • OpenCaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
  • APL: Anchor-Based Prompt Learning for One-Stage Weakly Supervised Referring Expression Comprehension
  • γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
    • ICLR 2025
    • Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
    • 📄 Paper 💻 Code
  • DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension
    • CVPR 2025
    • Xiaofu Chen, Yaxin Luo, Gen Luo, Jiayi Ji, Henghui Ding, Yiyi Zhou
    • 📄 Paper 💻 Code