A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction

Hosei University, Tokyo, Japan
ACL Student Research Workshop (SRW) 2026

Overview of the A11y-Compressor framework. Given a linearized a11y tree, the pipeline applies modal detection, redundancy reduction, and semantic structuring to generate a compact and structure-preserving observation representation. This representation enables more efficient and effective grounding for GUI agents.

Abstract

AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y-Compressor, a framework that transforms linearized accessibility trees into compact and structured representations. Our implementation, Compressed-a11y, applies a lightweight and structured transformation pipeline with modal detection, redundancy reduction, and semantic structuring. Experiments on the OSWorld benchmark show that Compressed-a11y reduces input tokens to 22% of the original while improving task success rates by 5.1 percentage points on average.
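The three stages named above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` fields, the `"dialog"` role check, the empty-name filter, and the `row_tolerance` row-grouping rule are all hypothetical simplifications chosen only to show how modal detection, redundancy reduction, and semantic structuring might compose into one pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One row of a linearized a11y tree (field names are assumptions)."""
    role: str
    name: str
    x: int
    y: int
    w: int
    h: int


def inside(inner: Node, outer: Node) -> bool:
    """True if inner's bounding box lies fully within outer's."""
    return (inner.x >= outer.x and inner.y >= outer.y
            and inner.x + inner.w <= outer.x + outer.w
            and inner.y + inner.h <= outer.y + outer.h)


def detect_modal(nodes: list[Node]) -> list[Node]:
    """Hypothetical heuristic: if a dialog exists, keep only nodes inside it."""
    dialogs = [n for n in nodes if n.role == "dialog"]
    if not dialogs:
        return nodes
    modal = dialogs[-1]  # assume the last dialog listed is the active one
    return [n for n in nodes if inside(n, modal)]


def reduce_redundancy(nodes: list[Node]) -> list[Node]:
    """Drop unnamed or zero-size nodes and exact duplicates."""
    seen = set()
    kept = []
    for n in nodes:
        if not n.name.strip() or n.w <= 0 or n.h <= 0:
            continue
        key = (n.role, n.name, n.x, n.y)
        if key in seen:
            continue
        seen.add(key)
        kept.append(n)
    return kept


def structure_rows(nodes: list[Node], row_tolerance: int = 10) -> str:
    """Group nodes into visual rows by y, then order each row left to right."""
    rows: list[list[Node]] = []
    for n in sorted(nodes, key=lambda n: (n.y, n.x)):
        if rows and abs(rows[-1][0].y - n.y) <= row_tolerance:
            rows[-1].append(n)
        else:
            rows.append([n])
    lines = []
    for row in rows:
        row.sort(key=lambda n: n.x)
        lines.append(" | ".join(f"{n.role}:{n.name}" for n in row))
    return "\n".join(lines)


def compress(nodes: list[Node]) -> str:
    """Run the three illustrative stages in order."""
    return structure_rows(reduce_redundancy(detect_modal(nodes)))
```

In this sketch, a screen containing a menu bar plus an open dialog would be reduced to the dialog's own named elements, with buttons that share a row emitted on one line. The row grouping is one simple way to keep some spatial layout in a text-only observation; the actual framework may structure elements differently.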

Results

We evaluate A11y-Compressor on the OSWorld benchmark using 358 tasks across diverse GUI environments, including web browsing, office applications, email clients, media tools, development tools, OS operations, and multi-application tasks. We use Qwen3-VL-32B as the underlying multimodal model and compare Compressed-a11y with screenshot-based observations, linearized a11y trees, and LineRetriever.

Success Rate on OSWorld. Compressed-a11y achieves the highest overall success rate among the compared observation representations.

Average Input Token Count. Compressed-a11y consistently reduces input token usage across application domains.

Key Findings

Compressed-a11y reduces input tokens to approximately 22% of the token count of the linearized a11y tree while improving the overall task success rate by 5.1 percentage points. These results show that compressing accessibility trees while preserving structural information enables more efficient and reliable grounding for GUI agents.

Compared with retrieval-based reduction such as LineRetriever, Compressed-a11y maintains global UI structure, which helps the agent understand spatial relationships and interact with GUI elements more reliably across diverse applications.
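To make the token figure concrete: reducing input to 22% of the original is a 78% reduction, not a 22% one. The short sketch below works through that arithmetic. The helper name `token_savings` and the 10,000-token input are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def token_savings(retained_fraction: float, original_tokens: int) -> dict:
    """Summarize a compression result given the fraction of tokens retained."""
    if not 0 < retained_fraction <= 1:
        raise ValueError("retained_fraction must be in (0, 1]")
    compressed = round(original_tokens * retained_fraction)
    return {
        "compressed_tokens": compressed,
        "tokens_saved": original_tokens - compressed,
        "reduction_percent": round((1 - retained_fraction) * 100, 1),
        "compression_factor": round(1 / retained_fraction, 2),
    }


# Hypothetical 10,000-token observation compressed to 22% of its size.
summary = token_savings(0.22, 10_000)
print(summary)
```

Under these assumptions, the observation shrinks to 2,200 tokens, which corresponds to a 78% reduction, or roughly a 4.5x smaller input.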

BibTeX

@misc{takeshita2026a11ycompressorframeworkenhancingefficiency,
  title={A11y-Compressor: A Framework for Enhancing the Efficiency of GUI Agent Observations through Visual Context Reconstruction and Redundancy Reduction},
  author={Michito Takeshita and Takuro Kawada and Takumi Ohashi and Shunsuke Kitada and Hitoshi Iyatomi},
  year={2026},
  eprint={2605.00551},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.00551}
}