INNOVATION · Ongoing practice · 88% confidence

Drive by Vision, Not APIs

Itemize the workflows in which every step has an API and the count comes out near zero, so the agent must see and act on the screen the way a human does.

Problem it solves

How an agent should interact with software when most steps have no usable API.

Best for

Builders deciding how an agent should perceive and act on software: through APIs or by seeing the screen.

Not ideal for

Narrow, fully-API-covered workflows where every required step already exposes a clean programmatic interface.

Overview

Why this framework exists

For David Luan, the clearest definition of useful AGI is a system that can do anything a human can do in front of a computer. Calling APIs is the easy part. Using the computer like a human, by seeing the screen and choosing where to click and type, is the hard part and also the most practical path: if you itemize the workflows where every step has an API, the count is "pretty close to zero." Driving by vision also turns ordinary human computer use into a source of training data. He leans on two analogies: humanoid robots (a human-shaped body can act in a human environment without re-plumbing it) and self-driving's camera-over-LiDAR bet.
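The itemizing exercise can be made concrete with a small audit. The sketch below is only an illustration: the `Step` type, the helper names, and the three-workflow inventory are all hypothetical and are not taken from the podcast. It counts a workflow as API-covered only when every one of its steps has a usable API, which is the strict test the argument relies on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    has_api: bool  # True if a usable programmatic interface exists for this step


def fully_api_covered(workflow: list[Step]) -> bool:
    """A workflow qualifies only if it is non-empty and every step has an API."""
    return bool(workflow) and all(step.has_api for step in workflow)


def count_api_only_workflows(workflows: dict[str, list[Step]]) -> int:
    """Count the workflows an API-only agent could complete end to end."""
    return sum(fully_api_covered(steps) for steps in workflows.values())


# Hypothetical inventory, invented purely for illustration.
WORKFLOWS = {
    "file expense report": [
        Step("download receipt PDF from vendor portal", False),
        Step("create expense entry", True),
        Step("submit for approval", False),
    ],
    "update CRM record": [
        Step("look up contact", True),
        Step("edit field in legacy admin UI", False),
    ],
    "post status message": [
        Step("send chat message", True),
    ],
}

if __name__ == "__main__":
    covered = count_api_only_workflows(WORKFLOWS)
    print(f"{covered} of {len(WORKFLOWS)} workflows are fully API-covered")
```

A single step without an API disqualifies the whole workflow, which is why the count shrinks quickly as workflows get longer.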

Core principles

3 total
  1. Define the goal as "anything a human can do on a computer" and the agent design follows.
  2. APIs are the easy part; human-like vision-and-action is the hard, general part.
  3. Seeing the screen like a human converts everyday computer-use into training data.
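The principles above can be sketched as a perceive-then-act loop that also logs what it sees and does. This is a toy under stated assumptions, not Adept's implementation: `FakeScreen` stands in for a real display (its text observation would be a screenshot in a real system), `scripted_policy` stands in for a vision model, and the button coordinates are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str  # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Toy stand-in for a display: one text box and one submit button."""
    box_text: str = ""
    submitted: bool = False

    def observe(self) -> str:
        # A real agent would receive pixels here; a string keeps the sketch runnable.
        return f"box={self.box_text!r} submitted={self.submitted}"

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.box_text += action.text
        elif action.kind == "click" and (action.x, action.y) == (100, 200):
            self.submitted = True  # (100, 200) is the hypothetical submit button


Policy = Callable[[str, str], Action]


def scripted_policy(goal: str, observation: str) -> Action:
    """Hypothetical placeholder for a model that maps (goal, screen) to an action."""
    if "submitted=True" in observation:
        return Action("done")
    if f"box={goal!r}" in observation:
        return Action("click", x=100, y=200)
    return Action("type", text=goal)


def run_episode(goal: str, screen: FakeScreen, policy: Policy,
                max_steps: int = 10) -> list[tuple[str, Action]]:
    """Observe, act, repeat; return the (observation, action) trajectory."""
    trajectory: list[tuple[str, Action]] = []
    for _ in range(max_steps):
        observation = screen.observe()
        action = policy(goal, observation)
        trajectory.append((observation, action))
        if action.kind == "done":
            break
        screen.apply(action)
    return trajectory


if __name__ == "__main__":
    for observation, action in run_episode("hello", FakeScreen(), scripted_policy):
        print(observation, "->", action.kind)
```

The returned trajectory has the same (observation, action) shape a recording of a human session would have, which is the sense in which ordinary computer use can double as training data.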

Origin story

How this framework came to be

The framework grew out of Adept's ACT-1 work, which taught a large model to actuate a browser and computer, and out of the realization that API-only agents cover almost no real workflows.

Source

Traced to primary
Source · PODCAST
Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept
Latent Space (swyx & Alessio) · 2024
