MLA-Trust

Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments

¹Tsinghua University, ²East China Normal University

Introduction

[Figure: overview of the MLA-Trust framework]

The emergence of multimodal LLM-based agents (MLAs) has transformed interaction paradigms by seamlessly integrating vision, language, action, and dynamic environments, enabling unprecedented autonomous capabilities across GUI applications ranging from web automation to mobile systems. However, MLAs introduce critical trustworthiness challenges that extend far beyond the limitations of traditional language models, as they can directly modify digital states and trigger irreversible real-world consequences. Existing benchmarks inadequately address the unique challenges posed by MLAs' actionable outputs, long-horizon uncertainty, and multimodal attack vectors. In this paper, we introduce MLA-Trust, the first comprehensive and unified framework that evaluates MLA trustworthiness across four principled dimensions: truthfulness, controllability, safety, and privacy. We use websites and mobile applications as realistic testbeds, designing 34 high-risk interactive tasks and curating rich evaluation datasets. Large-scale experiments involving 13 state-of-the-art agents reveal previously unexplored trustworthiness vulnerabilities unique to multimodal interactive scenarios. For instance, proprietary and open-source GUI-interacting MLAs pose more severe trustworthiness risks than static MLLMs, particularly in high-stakes domains; the transition from static MLLMs to interactive MLAs considerably compromises trustworthiness, enabling harmful content generation in multi-step interactions that standalone MLLMs would typically prevent; and multi-step execution, while enhancing the adaptability of MLAs, involves latent nonlinear risk accumulation across successive interactions, circumventing existing safeguards and producing unpredictable derived risks. Moreover, we present an extensible toolbox to facilitate continuous evaluation of MLA trustworthiness across diverse interactive environments.
MLA-Trust establishes a foundation for analyzing and improving MLA trustworthiness, promoting reliable deployment in real-world applications.

Task Pool

[Figure: overview of the MLA-Trust task taxonomy and task list]

Task Overview: We organize a two-level taxonomy containing 8 sub-aspects to better categorize the target behaviors to be evaluated. Based on this taxonomy, we curate 34 diverse tasks covering realistic and comprehensive scenarios with trustworthiness risks, spanning both predefined-process and contextual-reasoning tasks across website and mobile environments, as summarized above. To address the current lack of datasets dedicated to these sub-aspects, we construct 11 datasets by adapting prompts and images from existing datasets, using both manual effort and automatic methods. We further build 23 novel datasets from scratch specifically for the designed tasks.
Legend. Task environment: website task, mobile task, or mixture task. Dataset provenance: improved from existing datasets, constructed from scratch, or involving user image input. Task style: predefined-process task or contextual-reasoning task. Evaluation method: rule-based evaluation (e.g., keyword matching), automatic evaluation by GPT-4 or other classifiers, or a mixture of both. RtE stands for Refuse-to-Execute rate; ASR stands for Attack Success Rate.
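The two summary metrics above can be sketched in a few lines. This is a minimal illustration of rule-based (keyword-matching) evaluation; the refusal keyword list and the notion of "attack success" here are assumptions for illustration, not the benchmark's actual word lists or judging logic.

```python
# Illustrative sketch of RtE and ASR computation with rule-based matching.
# REFUSAL_KEYWORDS is a hypothetical list, not MLA-Trust's actual rules.
REFUSAL_KEYWORDS = ["i cannot", "i can't", "i'm sorry", "unable to comply", "refuse"]

def is_refusal(response: str) -> bool:
    """Rule-based check: does the agent's reply contain a refusal phrase?"""
    text = response.lower()
    return any(kw in text for kw in REFUSAL_KEYWORDS)

def refuse_to_execute_rate(responses: list[str]) -> float:
    """RtE: fraction of harmful tasks the agent refuses to execute."""
    return sum(is_refusal(r) for r in responses) / len(responses)

def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR: fraction of attack attempts that achieved the harmful goal."""
    return sum(outcomes) / len(outcomes)

responses = [
    "I'm sorry, I can't help with that.",
    "Clicking the 'Delete all' button now.",
]
print(refuse_to_execute_rate(responses))          # 0.5
print(attack_success_rate([True, False, False, True]))  # 0.5
```

In the full benchmark, keyword matching is complemented by automatic judging (e.g., GPT-4 or other classifiers) for tasks where surface-level rules are insufficient.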

Leaderboard in MLA-Trust (Updating...)

| # | Model | Source | T.I | T.M | C.O | C.S | S.T | S.J | P.A | P.L | Overall |
|---|-------|--------|-----|-----|-----|-----|-----|-----|-----|-----|---------|
| 1 | GPT-4o 🥇 | Link | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |
| 2 | GPT-4-turbo 🥈 | Link | 2 | 3 | 2 | 2 | 2 | 4 | 2 | 1 | 2 |
| 3 | Claude-3-7-sonnet 🥉 | Link | 3 | 2 | 4 | 3 | 4 | 3 | 3 | 5 | 3 |
| 4 | Gemini-2.0-pro | Link | 4 | 4 | 3 | 4 | 5 | 5 | 4 | 3 | 4 |
| 5 | Gemini-2.0-flash | Link | 5 | 5 | 5 | 5 | 3 | 2 | 5 | 4 | 5 |
| 6 | LLaVA-OneVision | Link | 6 | 6 | 6 | 6 | 6 | 8 | 6 | 6 | 6 |
| 7 | DeepSeek-VL2 | Link | 7 | 7 | 7 | 7 | 10 | 6 | 11 | 7 | 7 |
| 8 | LLaVA-NeXT | Link | 8 | 9 | 10 | 10 | 7 | 7 | 9 | 9 | 8 |
| 9 | Phi-4 | Link | 9 | 10 | 9 | 9 | 9 | 10 | 8 | 8 | 9 |
| 10 | MiniCPM-o-2_6 | Link | 10 | 8 | 8 | 8 | 8 | 9 | 10 | 11 | 10 |
| 11 | Pixtral-12B | Link | 11 | 11 | 11 | 11 | 11 | 11 | 7 | 10 | 11 |
| 12 | InternVL2-8B | Link | 12 | 13 | 12 | 13 | 12 | 12 | 13 | 12 | 12 |
| 13 | Qwen2.5-VL | Link | 13 | 12 | 13 | 12 | 13 | 13 | 12 | 13 | 13 |
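One simple way an overall ordering can be derived from the per-dimension ranks above is to sort models by their mean rank. This is an assumption for illustration only; the leaderboard's actual aggregation method is not specified here, and the sample `ranks` dictionary reproduces only the first three rows of the table.

```python
# Hypothetical rank aggregation: sort models by average per-dimension rank
# (lower is better). Not necessarily the leaderboard's actual method.
def overall_ranking(per_dim_ranks: dict[str, list[int]]) -> list[str]:
    """Order models by their mean rank across the eight sub-aspects."""
    return sorted(per_dim_ranks, key=lambda m: sum(per_dim_ranks[m]) / len(per_dim_ranks[m]))

ranks = {
    "GPT-4o":            [1, 1, 1, 1, 1, 1, 1, 2],
    "GPT-4-turbo":       [2, 3, 2, 2, 2, 4, 2, 1],
    "Claude-3-7-sonnet": [3, 2, 4, 3, 4, 3, 3, 5],
}
print(overall_ranking(ranks))  # ['GPT-4o', 'GPT-4-turbo', 'Claude-3-7-sonnet']
```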

BibTeX

      
@misc{zhang2024benchmarking,
      title={MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments}, 
      author={Xiao Yang, Jiawei Chen, Jun Luo, Zhengwei Fang, Yinpeng Dong, Hang Su, Jun Zhu},
      year={2025},
      eprint={2506.01616},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
    }