Drive by Vision, Not APIs
Itemize the workflows where every step has an API and the count is near zero, so a general agent has to see the screen and act on it the way a human does.
For Luan, the clearest definition of useful AGI is a system that can do anything a human can do in front of a computer. Calling APIs is the easy part. Using the computer like a human, by seeing the screen and choosing where to click and type, is the hard part, and it is also the most practical path: if you itemize the workflows where every step has an API, the count is "pretty close to zero." Driving by vision also turns ordinary human computer use into a source of training data. He leans on two analogies: humanoid robots (a human-shaped body acts in a human environment without re-plumbing it) and self-driving's bet on cameras over LiDAR.
- Define the goal as "anything a human can do on a computer" and the agent design follows.
- APIs are the easy part; human-like vision-and-action is the hard, general part.
- Seeing the screen like a human converts everyday computer-use into training data.
Adept's ACT-1 work teaching a large model to actuate a browser/computer, and the realization that API-only agents cover almost no real workflows.
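The see-then-act loop described above can be sketched in a few lines. This is a minimal illustration, not ACT-1 or any real agent: every name here (`FakeDesktop`, `Action`, `scripted_policy`, `run_agent`) is hypothetical, the "screenshot" is a small dictionary rather than pixels, and a hand-written rule stands in for the learned model. The point is only that the agent's whole interface is observing the screen and emitting clicks and keystrokes, with no application API involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical screen coordinates for a toy UI: one text field, one button.
FIELD_XY = (100, 50)
BUTTON_XY = (100, 90)


@dataclass(frozen=True)
class Action:
    kind: str                               # "click", "type", or "done"
    xy: Optional[Tuple[int, int]] = None    # target for clicks
    text: str = ""                          # payload for typing


class FakeDesktop:
    """Stand-in for a real screen (assumption: real agents would see pixels)."""

    def __init__(self) -> None:
        self.field = ""
        self.focused = False
        self.submitted = False

    def screenshot(self) -> Dict[str, object]:
        # Everything the agent knows comes from this observation.
        return {
            "field_text": self.field,
            "focused": self.focused,
            "submitted": self.submitted,
        }

    def apply(self, action: Action) -> None:
        # The only way to change state is through human-style input.
        if action.kind == "click" and action.xy == FIELD_XY:
            self.focused = True
        elif action.kind == "click" and action.xy == BUTTON_XY:
            self.submitted = True
        elif action.kind == "type" and self.focused:
            self.field += action.text


Policy = Callable[[str, Dict[str, object]], Action]


def scripted_policy(goal: str, screen: Dict[str, object]) -> Action:
    """Hand-written placeholder for a learned vision-to-action model."""
    if screen["submitted"]:
        return Action("done")
    if screen["field_text"] == goal:
        return Action("click", xy=BUTTON_XY)
    if not screen["focused"]:
        return Action("click", xy=FIELD_XY)
    return Action("type", text=goal)


def run_agent(goal: str, desktop: FakeDesktop, policy: Policy,
              max_steps: int = 10) -> List[Action]:
    """Observe, decide, act; stop when the policy says it is done."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(goal, desktop.screenshot())
        if action.kind == "done":
            break
        desktop.apply(action)
        history.append(action)
    return history
```

Calling `run_agent("hello", FakeDesktop(), scripted_policy)` focuses the field, types the text, and clicks the button. The returned list of actions, paired with the observations that prompted them, is also the shape that logged human computer use would take if it were collected as training data.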