Autonomous Computer-Use Agents
Testing Documentation

Continuously updated as recordings are processed

Overview

This study systematically evaluates the performance, reliability, and security of Autonomous Computer-Use Agents (ACUAs): AI systems capable of operating computers end-to-end like human operators. We tested three leading agents (ChatGPT Agent, Claude 3.5 Sonnet, and Self-Operating Computer) across tasks of increasing complexity on macOS and Google Workspace. This research contributes a benchmarking and evaluation methodology and framework, a task complexity scoring scheme, an exploratory analysis of agent behavior, and evidence of significant limitations in current agent capabilities.
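
One way to think about the task complexity scoring is as collapsing a task's attributes into a single ordinal value. The sketch below is a minimal illustration under assumed attributes and weights; the Task fields and the weights STEP_W, APP_W, and AUTH_W are hypothetical, not the study's actual rubric.

    from dataclasses import dataclass

    @dataclass
    class Task:
        """One benchmark task assigned to an agent."""
        steps: int          # atomic GUI/CLI actions required
        apps: int           # distinct applications involved
        crosses_auth: bool  # whether the task crosses a login boundary

    # Hypothetical weights; heavier attributes dominate the score.
    STEP_W, APP_W, AUTH_W = 1.0, 2.0, 3.0

    def complexity_score(task: Task) -> float:
        """Collapse task attributes into a single complexity value."""
        return STEP_W * task.steps + APP_W * task.apps + AUTH_W * task.crosses_auth

    # Example: a 12-step task spanning two apps behind a login scores 19.0.
    print(complexity_score(Task(steps=12, apps=2, crosses_auth=True)))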

Key Findings

    — Overall completion rates: Self-Operating Computer (94%), ChatGPT Agent (38%), Claude 3.5 Sonnet (28%)
    — Frequent hallucinations, including false claims of task completion (see the verification sketch after this list)
    — Critical security vulnerabilities: unauthorized software installations, brute-force login attempts, and prompt injection risks
    — Terminal operations significantly outperformed GUI interactions
    — Inconsistent security judgment for phishing identification
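
False completion claims mean an agent's self-report cannot be taken at face value; each task outcome has to be checked against the machine's actual state. Below is a minimal sketch of that comparison, assuming a hypothetical check_state predicate (and an example file path) that inspects the real system rather than the agent's transcript.

    import os
    from typing import Callable

    def verified_outcome(agent_claims_done: bool,
                         check_state: Callable[[], bool]) -> str:
        """Compare the agent's self-report against ground truth."""
        actually_done = check_state()  # e.g. does the expected file exist?
        if agent_claims_done and not actually_done:
            return "hallucinated_completion"  # false claim of completion
        return "success" if agent_claims_done else "incomplete"

    # Example: the agent claims the report was saved; verify on disk.
    print(verified_outcome(True, lambda: os.path.exists("/tmp/report.pdf")))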

Based on these findings, we propose an ACUA framework featuring a multi-agent system with human-in-the-loop validation, a systematic GUI/CLI decision process, and purpose-built security controls, enabling more capable and secure ACUAs. The learning methodology will follow SMART, a short- and long-trajectory learning framework.
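
As a rough illustration of two of these components, the sketch below pairs a CLI-first routing rule (motivated by the finding that terminal operations outperform GUI interactions) with a human-in-the-loop gate on risky actions. The action names, predicates, and callbacks are hypothetical, not the framework's actual interface.

    from enum import Enum, auto
    from typing import Callable

    class Channel(Enum):
        CLI = auto()
        GUI = auto()

    def choose_channel(scriptable: bool, needs_pixels: bool) -> Channel:
        """Prefer the terminal whenever a step can be scripted reliably."""
        return Channel.CLI if scriptable and not needs_pixels else Channel.GUI

    # Hypothetical categories of actions that must never run unattended.
    RISKY = {"install_software", "submit_credentials", "delete_files"}

    def execute(action: str,
                run: Callable[[str], None],
                approve: Callable[[str], bool]) -> None:
        """Human-in-the-loop gate: risky actions need explicit approval."""
        if action in RISKY and not approve(action):
            raise PermissionError(f"reviewer denied: {action}")
        run(action)

Routing scriptable steps through the terminal exploits the stronger CLI performance observed above and, as a side effect, leaves an auditable command log for the security controls.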

View Testing Recordings

Contact the Authors