MOAT (Multimodal model Of All Trades) is a challenging benchmark for large multimodal models (LMMs). It consists of vision language (VL) tasks that require the LMM to integrate several VL capabilities and engage in human-like generalist visual problem solving. Moreover, many tasks in MOAT focus on LMMs' capability to ground complex text and visual instructions, which is crucial for deploying LMMs in the wild. Building on the VL capability taxonomies proposed in previous benchmark papers, we define 9 fundamental VL capabilities in MOAT.
Notably, we purposefully insulated MOAT from the influence of domain knowledge, text generation style, and other external factors by making the questions closed-ended (i.e., each has a single short answer) and solvable with the information and hints provided in the question itself. This allows MOAT to focus on fundamental generalist VL capabilities. We also excluded VL capabilities like general object recognition and attribute recognition from our taxonomy, since these are required by all MOAT tasks, and performance on these fronts is reflected in the overall accuracy on MOAT. A minimal scoring sketch for this closed-ended format is given below.
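Because every question has a single short answer, scoring reduces to a normalized exact-match comparison. The snippet below is a minimal sketch of this idea, not the official MOAT evaluation code; the normalization rules and function names are illustrative assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace so '42.' matches '42'.
    (Illustrative normalization; not the official MOAT rules.)"""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)

def accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    assert len(predictions) == len(ground_truths)
    correct = sum(exact_match(p, g) for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)
```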
MOAT tasks require LMMs to integrate up to 6 fundamental VL capabilities. We report the proportion of questions requiring each VL capability, the distribution of the number of VL capabilities required, and the 15 most common capability combinations required in MOAT.
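For readers who want to reproduce these statistics on their own, the sketch below tallies per-capability frequency, the distribution of the number of capabilities per question, and the most common capability combinations. The annotation format (a JSON list of dicts with a "capabilities" field) and the file name are assumptions; adapt them to the released question files.

```python
import json
from collections import Counter

with open("moat_questions.json") as f:   # hypothetical file name
    questions = json.load(f)

per_capability = Counter()      # how often each capability is required
per_count = Counter()           # distribution of #capabilities per question
per_combination = Counter()     # required capability combinations

for q in questions:
    caps = sorted(q["capabilities"])
    per_capability.update(caps)
    per_count[len(caps)] += 1
    per_combination[tuple(caps)] += 1

total = len(questions)
for cap, n in per_capability.most_common():
    print(f"{cap}: {100 * n / total:.1f}% of questions")
print("Capability count distribution:", dict(per_count))
print("Top 15 combinations:", per_combination.most_common(15))
```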
All existing LMMs, both proprietary and open source, perform very poorly on MOAT, with the best performing model (Gemini 2.5 Pro) achieving an accuracy (44.0%) roughly half that of humans (82.7%) on a 189-question subset with a similar capability distribution. Among individual VL capabilities, CNT and RLA saw consistently poor performance across LMMs. In addition, GNDT performance did not scale well with model size. Please refer to our paper for a more detailed analysis of the results, as well as a discussion of the implications of LMM architecture choices such as tiling and built-in CoT reasoning (or 'thinking') capability.
MOAT's fine-grained VL capability taxonomy enables deep analysis of cross-capability interactions, as sketched below. An interesting observation is that, while larger and more advanced models are better at tasks requiring the integration of recognition and spatial capabilities, they struggle to combine instruction grounding with either recognition or spatial understanding.
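One way to carry out this kind of cross-capability analysis is to restrict accuracy to questions that require a given pair of capabilities. The function below is a hedged sketch under assumed field names ("capabilities", "correct"); it is not the analysis code used in the paper.

```python
from itertools import combinations
from collections import defaultdict

def pairwise_accuracy(results: list[dict]) -> dict[tuple[str, str], float]:
    """results: one dict per question with a 'capabilities' list and a boolean 'correct'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        for pair in combinations(sorted(r["capabilities"]), 2):
            totals[pair] += 1
            hits[pair] += int(r["correct"])
    return {pair: hits[pair] / totals[pair] for pair in totals}
```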
We intend to further increase the diversity of the tasks in MOAT, involving more capability combinations and encompassing more domains and scenarios. Stay tuned!
@article{ye2025moat,
  title={MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding},
  author={Zhoutong Ye and Mingze Sun and Huan-ang Gao and Xutong Wang and Xiangyang Wang and Yu Mei and Chang Liu and Qinwei Li and Chengwen Zhang and Qinghuan Lan and Chun Yu and Yuanchun Shi},
  journal={arXiv preprint arXiv:2503.09348},
  year={2025}
}