MLA-Trust

Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

¹Tsinghua University, ²East China Normal University

Introduction

[Figure: MLA-Trust framework]

The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by integrating vision, language, action, and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond the limitations of traditional language models, because they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks do not adequately address the unique challenges posed by MLAs' actionable outputs, long-horizon uncertainty, and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety, and privacy. We use websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. First, both proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains. Second, the transition from static MLLMs to interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent. Third, multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, which circumvents existing safeguards and results in unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments. MLA-Trust establishes a foundation for analyzing and improving MLA trustworthiness, promoting reliable deployment in real-world applications.

Task Pool

[Figure: task list]

Task Overview: We organize a two-level taxonomy containing 8 sub-aspects to categorize the target behaviors under evaluation. Based on this taxonomy, we curate 34 diverse tasks that cover realistic and comprehensive scenarios with trustworthiness risks, spanning predefined-process and contextual-reasoning tasks on both websites and mobile applications, as summarized above. To address the current lack of datasets dedicated to the scenarios under these sub-aspects, we construct 11 datasets by adapting prompts and images from existing datasets, using both manual effort and automatic methods. We further build 23 novel datasets from scratch specifically for the designed tasks.
Legend: Task platform markers distinguish website tasks, mobile tasks, and mixture tasks. Dataset markers distinguish datasets adapted from existing datasets, datasets constructed from scratch, and datasets involving user image input. Task type markers distinguish predefined-process tasks and contextual-reasoning tasks. Evaluation markers distinguish rule-based evaluation (e.g., keyword matching), automatic evaluation by GPT-4 or other classifiers, and mixture evaluation. RtE stands for Refuse-to-Execute rate and ASR stands for Attack Success Rate.
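
As a rough illustration of the two metrics named in the legend, the sketch below computes RtE with rule-based keyword matching and ASR from per-episode verdicts. The `Episode` type, the `REFUSAL_KEYWORDS` list, and the sample responses are hypothetical and are not taken from the MLA-Trust toolbox; in practice the attack verdict might come from GPT-4 or another classifier.

```python
from dataclasses import dataclass

# Hypothetical refusal markers; the benchmark's actual keyword list is not shown here.
REFUSAL_KEYWORDS = ("i cannot", "i can't", "i am unable", "i won't", "sorry")


@dataclass
class Episode:
    """One agent run on a task (illustrative structure, not the toolbox's)."""
    response: str           # final text produced by the agent
    attack_succeeded: bool  # verdict assumed to come from a judge or classifier


def is_refusal(response: str) -> bool:
    """Rule-based check: does the response contain any refusal keyword?"""
    text = response.lower()
    return any(keyword in text for keyword in REFUSAL_KEYWORDS)


def refuse_to_execute_rate(episodes: list[Episode]) -> float:
    """RtE: fraction of episodes in which the agent refused to act."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(is_refusal(e.response) for e in episodes) / len(episodes)


def attack_success_rate(episodes: list[Episode]) -> float:
    """ASR: fraction of episodes in which the attack was judged successful."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.attack_succeeded for e in episodes) / len(episodes)


# Made-up sample data for demonstration only.
episodes = [
    Episode("I cannot post this content.", attack_succeeded=False),
    Episode("Posted the comment as requested.", attack_succeeded=True),
    Episode("Sorry, I won't share that address.", attack_succeeded=False),
    Episode("Opened the settings page.", attack_succeeded=False),
]

print(f"RtE = {refuse_to_execute_rate(episodes):.2f}")
print(f"ASR = {attack_success_rate(episodes):.2f}")
```

Keyword matching is cheap but brittle, which is presumably why some tasks rely on a classifier or a mixture of both evaluation styles.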

Leaderboard in MLA-Trust (Updating...)

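| # | Model | Source | T.I | T.M | C.O | C.S | S.T | S.J | P.A | P.L | Overall |
|---|-------|--------|-----|-----|-----|-----|-----|-----|-----|-----|---------|
| 1 | GPT-4o 🥇 | Link | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |
| 2 | GPT-4-turbo 🥈 | Link | 2 | 3 | 2 | 2 | 2 | 4 | 2 | 1 | 2 |
| 3 | Claude-3-7-sonnet 🥉 | Link | 3 | 2 | 4 | 3 | 4 | 3 | 3 | 5 | 3 |
| 4 | Gemini-2.0-pro | Link | 4 | 4 | 3 | 4 | 5 | 5 | 4 | 3 | 4 |
| 5 | Gemini-2.0-flash | Link | 5 | 5 | 5 | 5 | 3 | 2 | 5 | 4 | 5 |
| 6 | LLaVA-OneVision | Link | 6 | 6 | 6 | 6 | 6 | 8 | 6 | 6 | 6 |
| 7 | DeepSeek-VL2 | Link | 7 | 7 | 7 | 7 | 10 | 6 | 11 | 7 | 7 |
| 8 | LLaVA-NeXT | Link | 8 | 9 | 10 | 10 | 7 | 7 | 9 | 9 | 8 |
| 9 | Phi-4 | Link | 9 | 10 | 9 | 9 | 9 | 10 | 8 | 8 | 9 |
| 10 | MiniCPM-o-2_6 | Link | 10 | 8 | 8 | 8 | 8 | 9 | 10 | 11 | 10 |
| 11 | Pixtral-12B | Link | 11 | 11 | 11 | 11 | 11 | 11 | 7 | 10 | 11 |
| 12 | InternVL2-8B | Link | 12 | 13 | 12 | 13 | 12 | 12 | 13 | 12 | 12 |
| 13 | Qwen2.5-VL | Link | 13 | 12 | 13 | 12 | 13 | 13 | 12 | 13 | 13 |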
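
The page does not state how the Overall column is derived. As a sanity check only, the sketch below assumes the eight sub-aspect columns hold per-column ranks (1 = best) and orders the models by their mean rank. This is an assumed aggregation, not the authors' scoring method; the names `RANKS`, `mean_rank`, and `order_by_mean_rank` are illustrative.

```python
# Sub-aspect column order used in the leaderboard.
SUB_ASPECTS = ["T.I", "T.M", "C.O", "C.S", "S.T", "S.J", "P.A", "P.L"]

# Per-sub-aspect values copied from the leaderboard, listed in Overall order.
RANKS = {
    "GPT-4o":            [1, 1, 1, 1, 1, 1, 1, 2],
    "GPT-4-turbo":       [2, 3, 2, 2, 2, 4, 2, 1],
    "Claude-3-7-sonnet": [3, 2, 4, 3, 4, 3, 3, 5],
    "Gemini-2.0-pro":    [4, 4, 3, 4, 5, 5, 4, 3],
    "Gemini-2.0-flash":  [5, 5, 5, 5, 3, 2, 5, 4],
    "LLaVA-OneVision":   [6, 6, 6, 6, 6, 8, 6, 6],
    "DeepSeek-VL2":      [7, 7, 7, 7, 10, 6, 11, 7],
    "LLaVA-NeXT":        [8, 9, 10, 10, 7, 7, 9, 9],
    "Phi-4":             [9, 10, 9, 9, 9, 10, 8, 8],
    "MiniCPM-o-2_6":     [10, 8, 8, 8, 8, 9, 10, 11],
    "Pixtral-12B":       [11, 11, 11, 11, 11, 11, 7, 10],
    "InternVL2-8B":      [12, 13, 12, 13, 12, 12, 13, 12],
    "Qwen2.5-VL":        [13, 12, 13, 12, 13, 13, 12, 13],
}


def mean_rank(ranks: list[int]) -> float:
    """Average of a model's sub-aspect ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def order_by_mean_rank(table: dict[str, list[int]]) -> list[str]:
    """Sort models by mean rank; sorted() is stable, so ties keep input order."""
    return sorted(table, key=lambda model: mean_rank(table[model]))


for position, model in enumerate(order_by_mean_rank(RANKS), start=1):
    print(f"{position:2d}  {model:<18} mean rank = {mean_rank(RANKS[model]):.3f}")
```

Under this assumption the ordering agrees with the Overall column, although Phi-4 and MiniCPM-o-2_6 tie on mean rank, so the published ranking presumably rests on underlying scores rather than on ranks alone.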

BibTeX

      
@article{yang2025mla,
  title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments},
  author={Yang, Xiao and Chen, Jiawei and Luo, Jun and Fang, Zhengwei and Dong, Yinpeng and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2506.01616},
  year={2025}
}