MARS2 2025 challenge on multimodal reasoning, ICCVW, 2025.

This paper reviews the MARS2 2025 Challenge on Multi-modal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark, and we hope it helps researchers follow the state of the art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year’s MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets, Lens and AdsQA, as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VGRS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from renowned academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) were included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants’ methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where updates and announcements of upcoming events will be continuously provided.

[paper]