Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
Computer Science, Artificial Intelligence, Artificial Intelligence (cs.AI)
2024-03-05 00:00:00
Recent studies have demonstrated the success of foundation agents in specific tasks or scenarios. However, existing agents cannot generalize across different scenarios, mainly due to their diverse observation and action spaces and semantic gaps, or reliance on task-specific resources. In this work, we propose the General Computer Control (GCC) setting: building foundation agents that can master any computer task by taking only screen images (and possibly audio) of the computer as input, and producing keyboard and mouse operations as output, similar to human-computer interaction. To target GCC, we propose Cradle, an agent framework with strong reasoning abilities, including self-reflection, task inference, and skill curation, to ensure generalizability and self-improvement across various tasks. To demonstrate the capabilities of Cradle, we deploy it in the complex AAA game Red Dead Redemption II, serving as a preliminary attempt towards GCC with a challenging target. Our agent can follow the main storyline and finish real missions in this complex AAA game, with minimal reliance on prior knowledge and application-specific resources. The project website is at
PDF: Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study.pdf
Empowered by ChatGPT