This study systematically evaluates the performance, reliability, and security of Autonomous Computer-Use Agents (ACUAs)—AI systems capable of operating computers end-to-end like human operators. We tested three leading agents (ChatGPT Agent, Claude 3.5 Sonnet, and Self-Operating Computer) across tasks of increasing complexity on macOS and Google Suite. This research contributes a benchmarking and evaluation methodology and framework, a task-complexity scoring scheme, an exploratory analysis, and evidence of significant limitations in current agent capabilities.
Based on these findings, we propose an ACUA framework featuring a multi-agent system with human-in-the-loop validation, systematic GUI/CLI decision processes, and purpose-designed security controls to enable more capable and secure ACUAs. The learning methodology will follow SMART, a short- and long-trajectory learning framework.