Autonomous Computer-Use Agents
Testing Documentation

Continuously updated as recordings are processed

Overview

This study systematically evaluates the performance, reliability, and security of Autonomous Computer-Use Agents (ACUAs): AI systems capable of operating computers end-to-end like human operators. We tested three leading agents (ChatGPT Agent, Claude 3.5 Sonnet, and Self-Operating Computer) across tasks of increasing complexity on macOS and Google Workspace. This research contributes a benchmarking and evaluation methodology and framework, a task complexity scoring scheme, an exploratory analysis of agent behavior, and evidence of significant limitations in current agent capabilities.
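
One way to think about the task complexity scoring is as collapsing a task's attributes into a single ordinal value. The sketch below is a minimal illustration under assumed attributes and weights; the Task fields and the weights STEP_W, APP_W, and AUTH_W are hypothetical, not the study's actual rubric.

    from dataclasses import dataclass

    @dataclass
    class Task:
        """One benchmark task assigned to an agent."""
        steps: int          # atomic GUI/CLI actions required
        apps: int           # distinct applications involved
        crosses_auth: bool  # whether the task crosses a login boundary

    # Hypothetical weights; heavier attributes dominate the score.
    STEP_W, APP_W, AUTH_W = 1.0, 2.0, 3.0

    def complexity_score(task: Task) -> float:
        """Collapse task attributes into a single complexity value."""
        return STEP_W * task.steps + APP_W * task.apps + AUTH_W * task.crosses_auth

    # Example: a 12-step task spanning two apps behind a login scores 19.0.
    print(complexity_score(Task(steps=12, apps=2, crosses_auth=True)))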

Key Findings

    — Overall completion rates: Self-Operating Computer (94%), ChatGPT Agent (38%), Claude 3.5 Sonnet (28%)
    — Frequent hallucinations, including false claims of task completion (see the verification sketch after this list)
    — Critical security vulnerabilities: unauthorized software installations, brute-force login attempts, and prompt injection risks
    — Terminal operations significantly outperformed GUI interactions
    — Inconsistent security judgment for phishing identification
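
False completion claims mean an agent's self-report cannot be taken at face value; each task outcome has to be checked against the machine's actual state. Below is a minimal sketch of that comparison, assuming a hypothetical check_state predicate (and an example file path) that inspects the real system rather than the agent's transcript.

    import os
    from typing import Callable

    def verified_outcome(agent_claims_done: bool,
                         check_state: Callable[[], bool]) -> str:
        """Compare the agent's self-report against ground truth."""
        actually_done = check_state()  # e.g. does the expected file exist?
        if agent_claims_done and not actually_done:
            return "hallucinated_completion"  # false claim of completion
        return "success" if agent_claims_done else "incomplete"

    # Example: the agent claims the report was saved; verify on disk.
    print(verified_outcome(True, lambda: os.path.exists("/tmp/report.pdf")))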

Based on these findings, we propose an ACUA framework featuring a multi-agent system with human-in-the-loop validation, a systematic GUI/CLI decision process, and purpose-built security controls, enabling more capable and secure ACUAs. The learning methodology will follow SMART, a short- and long-trajectory learning framework.
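
As a rough illustration of two of these components, the sketch below pairs a CLI-first routing rule (motivated by the finding that terminal operations outperform GUI interactions) with a human-in-the-loop gate on risky actions. The action names, predicates, and callbacks are hypothetical, not the framework's actual interface.

    from enum import Enum, auto
    from typing import Callable

    class Channel(Enum):
        CLI = auto()
        GUI = auto()

    def choose_channel(scriptable: bool, needs_pixels: bool) -> Channel:
        """Prefer the terminal whenever a step can be scripted reliably."""
        return Channel.CLI if scriptable and not needs_pixels else Channel.GUI

    # Hypothetical categories of actions that must never run unattended.
    RISKY = {"install_software", "submit_credentials", "delete_files"}

    def execute(action: str,
                run: Callable[[str], None],
                approve: Callable[[str], bool]) -> None:
        """Human-in-the-loop gate: risky actions need explicit approval."""
        if action in RISKY and not approve(action):
            raise PermissionError(f"reviewer denied: {action}")
        run(action)

Routing scriptable steps through the terminal exploits the stronger CLI performance observed above and, as a side effect, leaves an auditable command log for the security controls.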

View Testing Recordings

Contact the Authors