Observation-Reasoning-Action Agent Workflow
Decompose any agent task into three phases: observe the environment, reason over what you know, then act in collaboration with humans.
The motivating case is the Prime Video recap problem: turning hours of footage into a two-minute summary previously took weeks of manual work. The workflow splits the agent pipeline into three sequential phases, each with a distinct cognitive job, so different agents (and humans) can specialize.
Phase one is observation: agents watch the video and produce a rich, detailed understanding of every shot, every scene, and the overall story. The output is a structured representation dense enough to support downstream decisions like story-arc definition and scene selection. Without good observation, reasoning hallucinates.
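As an illustration of what a "structured representation" could look like, here is a minimal sketch in Python. The class names (`Shot`, `Scene`, `Observation`), their fields, and the sample data are all hypothetical; the source does not describe the actual schema used at Prime Video. The point is only that downstream phases can query explicit structure instead of re-reading raw footage.

```python
from dataclasses import dataclass, field


@dataclass
class Shot:
    start_s: float      # shot start time in seconds (hypothetical field)
    end_s: float        # shot end time in seconds (hypothetical field)
    description: str    # what the observation agent saw in this shot


@dataclass
class Scene:
    scene_id: str
    summary: str
    characters: list[str]
    shots: list[Shot] = field(default_factory=list)

    @property
    def duration_s(self) -> float:
        # Total scene length, derived from its shots.
        return sum(shot.end_s - shot.start_s for shot in self.shots)


@dataclass
class Observation:
    story_summary: str
    scenes: list[Scene] = field(default_factory=list)

    def scenes_with(self, character: str) -> list[Scene]:
        # A query a later phase might run when selecting scenes.
        return [s for s in self.scenes if character in s.characters]


# Placeholder data, invented purely for illustration.
obs = Observation(
    story_summary="Placeholder story summary",
    scenes=[
        Scene("s1", "Two characters meet", ["A", "B"],
              [Shot(0.0, 4.5, "wide shot"), Shot(4.5, 10.0, "close-up")]),
        Scene("s2", "A leaves alone", ["A"],
              [Shot(10.0, 18.0, "tracking shot")]),
    ],
)
```

Because the representation is plain data, it can be inspected by a person or validated by a test before any reasoning runs on top of it.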
Phase two is reasoning: agents ask 'with what I know, what do I need to do?' Reasoning layers on top of observation - for example, a reasoning agent collaborates with the observation agent to draft a voice-over script. The separation matters because reasoning quality is bounded by the quality of the observation it builds on.
Phase three is action: trusted human experts work with the agents to finalize the recap. The phase is explicitly human-in-the-loop, not autonomous. Sivasubramanian frames this as 'human and agent collaboration' that frees people from drudge work so they can focus on what they love. The pattern generalizes beyond video: any task with rich input, structured reasoning, and judgment-heavy output benefits from the same decomposition.
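The three phases can be sketched as three small functions chained together. This is a toy outline under stated assumptions, not the Prime Video implementation: the function names (`observe`, `reason`, `act`, `run_pipeline`), the `importance` score, and the sample scenes are all hypothetical, and real observation and reasoning steps would call models rather than sort a list. What the sketch does show is the shape of the decomposition: reasoning consumes only the observation output, and the action phase cannot return without passing through a human review callback.

```python
from typing import Callable


def observe(raw_scenes: list[dict]) -> list[dict]:
    """Phase 1: turn raw input into structured, inspectable notes (stubbed)."""
    return [
        {"id": s["id"], "summary": s["text"].strip(), "importance": s["importance"]}
        for s in raw_scenes
    ]


def reason(observations: list[dict], max_scenes: int) -> dict:
    """Phase 2: decide what to do, using only the observations."""
    ranked = sorted(observations, key=lambda o: o["importance"], reverse=True)
    # Keep the top scenes, then restore story order by id.
    chosen = sorted(ranked[:max_scenes], key=lambda o: o["id"])
    return {
        "scene_ids": [o["id"] for o in chosen],
        "script": " ".join(o["summary"] for o in chosen),
    }


def act(draft: dict, review: Callable[[dict], dict]) -> dict:
    """Phase 3: a human expert reviews and may edit the draft."""
    final = review(draft)
    return {**final, "approved_by_human": True}


def run_pipeline(raw_scenes: list[dict], max_scenes: int,
                 review: Callable[[dict], dict]) -> dict:
    observations = observe(raw_scenes)
    draft = reason(observations, max_scenes)
    return act(draft, review)


# Hypothetical input and a stand-in for the human expert's edit.
raw = [
    {"id": 1, "text": " Heist is planned. ", "importance": 0.9},
    {"id": 2, "text": "Filler dinner scene.", "importance": 0.2},
    {"id": 3, "text": "The plan falls apart.", "importance": 0.8},
]


def expert_review(draft: dict) -> dict:
    # The expert keeps the scene selection but polishes the script wording.
    return {**draft, "script": draft["script"].replace("Heist", "The heist")}


result = run_pipeline(raw, max_scenes=2, review=expert_review)
```

Passing the reviewer in as a callback keeps the human step explicit in the function signature, so an expert who does not write code can still own the final decision through whatever interface supplies that callback.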
- Separate observation from reasoning so each agent can specialize and so reasoning quality is gated by explicit, inspectable understanding of the input.
- Reasoning layers on top of observation: a reasoning agent should collaborate with an observation agent rather than re-derive context from raw input.
- Action is the human-collaboration phase, not full autonomy - bring trusted experts in at the moment judgment matters most.
- Decomposing into phases lets non-coders (e.g. cinematography experts) participate at the action layer without learning to build agents end-to-end.
- The goal is to remove drudge work, not human authorship - agents should free people to do what they love, not replace the creative decision.
Built at Amazon Prime Video, where producing a series recap could take weeks because cinematography experts had to manually create story arcs and select scenes - and they were not coders.