Observation-Reasoning-Action Agent Workflow

Decompose any agent task into three phases: observe the environment, reason over what you know, then act in collaboration with human experts.

Problem it solves

Complex creative or analytical workflows that require both perception and judgment get stuck when one agent tries to do everything. Phasing the work makes each step debuggable and lets humans plug in where they add the most value.

Best for

Teams designing multi-agent systems for creative or knowledge work where outputs must combine machine scale with human judgment.

Not ideal for

Pure transactional tasks (form-fills, lookups) where a single tool-using agent without phase separation is faster.

Overview

Why this framework exists

The framework grew out of the Prime Video recap problem: turning hours of footage into a two-minute summary previously took weeks of manual work. The mechanism splits the agent pipeline into three sequential phases, each with a distinct cognitive job, so different agents (and humans) can specialize.

Phase one is observation: agents watch the video and produce a rich, detailed understanding of every shot, every scene, and the overall story. The output is a structured representation dense enough to support downstream decisions like story-arc definition and scene selection. Without good observation, the reasoning phase has nothing reliable to ground itself in and hallucinates.

Phase two is reasoning: agents ask 'with what I know, what do I need to do?' Reasoning layers on top of observation - for example, a reasoning agent collaborates with the observation agent to draft a voice-over script. The separation matters because reasoning quality is bounded by the observation quality it sits on.

Phase three is action: trusted human experts work with the agents to finalize the recap. The phase is explicitly human-in-the-loop, not autonomous. Swami Sivasubramanian frames this as 'human and agent collaboration' that frees people from drudge work so they can focus on what they love. The pattern generalizes beyond video: any task with rich input, structured reasoning, and judgment-heavy output benefits from the same decomposition.
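The three phases can be sketched as three functions with narrow, typed interfaces. This is a minimal illustration, not the Prime Video implementation: every name here (Scene, Observation, Draft, the importance score, the greedy selection rule) is a hypothetical stand-in, and observe() simply wraps pre-labelled data where a real system would call a video-understanding model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scene:
    """One observed scene. All fields are illustrative assumptions."""
    scene_id: int
    start_s: float
    end_s: float
    description: str
    importance: float  # hypothetical 0..1 score from the observation agent


@dataclass(frozen=True)
class Observation:
    """Structured, inspectable output of phase one."""
    scenes: tuple[Scene, ...]


@dataclass(frozen=True)
class Draft:
    """Output of phase two: a proposal for humans to edit, not a final cut."""
    scene_ids: tuple[int, ...]
    script: str


def observe(raw_footage: list[dict]) -> Observation:
    """Phase one stand-in: converts pre-labelled shots into an Observation."""
    return Observation(tuple(Scene(**shot) for shot in raw_footage))


def reason(obs: Observation, max_seconds: float) -> Draft:
    """Phase two: sees only the Observation, never the raw footage.

    Greedily keeps the most important scenes that fit the time budget,
    then restores chronological order and drafts a script from them.
    """
    chosen: list[Scene] = []
    used = 0.0
    for scene in sorted(obs.scenes, key=lambda s: s.importance, reverse=True):
        length = scene.end_s - scene.start_s
        if used + length <= max_seconds:
            chosen.append(scene)
            used += length
    chosen.sort(key=lambda s: s.start_s)
    script = " ".join(s.description for s in chosen)
    return Draft(tuple(s.scene_id for s in chosen), script)


def act(draft: Draft, review: Callable[[Draft], Draft]) -> Draft:
    """Phase three: a human reviewer edits and approves the draft."""
    return review(draft)
```

Because reason() accepts only an Observation, a bad recap can be traced either to a wrong Observation (phase one) or to a wrong selection given a correct one (phase two). The review callable is where a human expert's edits enter; in practice it would be a review tool rather than a function.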

Core principles


Separate observation from reasoning so each agent can specialize and so reasoning quality is gated by explicit, inspectable understanding of the input.
Reasoning layers on top of observation: a reasoning agent should collaborate with an observation agent rather than re-derive context from raw input.
Action is the human-collaboration phase, not full autonomy - bring trusted experts in at the moment judgment matters most.
Decomposing into phases lets non-coders (e.g. cinematography experts) participate at the action layer without learning to build agents end-to-end.
The goal is to remove drudge work, not human authorship - agents should free people to do what they love, not replace the creative decision.

Checklist


Map your workflow into three phases and write down what each phase consumes and produces.
Build or assign an observation agent that outputs a structured, inspectable description of the input domain.
Build a reasoning agent that consumes only the observation output, never the raw input, so failures are localizable.
Define the explicit human handoff in the action phase: who reviews, what they edit, and what they approve.
Measure end-to-end time and compare to the manual baseline; if the action phase dominates, push more work back into reasoning.
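The last checklist item can be sketched with a small per-phase timer. This is one possible approach under stated assumptions: PhaseTimer and its method names are hypothetical, the manual baseline is a number you supply yourself, and wall-clock time is only a rough proxy when the action phase includes people waiting on each other.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self) -> None:
        self.seconds: dict[str, float] = {}

    @contextmanager
    def phase(self, name: str):
        """Time the enclosed block and add it to the named phase's total."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def total(self) -> float:
        """End-to-end seconds across all recorded phases."""
        return sum(self.seconds.values())

    def dominant_phase(self) -> str:
        """Name of the phase with the largest share of time.

        Raises ValueError if nothing has been recorded yet.
        """
        return max(self.seconds, key=self.seconds.get)

    def speedup_over(self, manual_baseline_s: float) -> float:
        """Ratio of the manual baseline to the measured end-to-end time."""
        total = self.total()
        if total <= 0:
            raise ValueError("no time recorded")
        return manual_baseline_s / total
```

Wrapping each phase in `with timer.phase("observe"):`, `with timer.phase("reason"):`, and `with timer.phase("act"):` gives the breakdown. If dominant_phase() returns "act", that is the signal to move more of the preparatory work into the reasoning phase.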

Origin story

How this framework came to be

Built at Amazon Prime Video, where producing a series recap could take weeks because cinematography experts had to manually create story arcs and select scenes - and they were not coders.

Source

Traced to primary

Source · PODCAST

Everything You Need to Know About AI Agents

Swami Sivasubramanian

