INNOVATION · Ongoing practice · 85% confidence

Multimodal for Knowledge Work, Not Cats and Dogs

Economic value is in knowledge work, so train multimodal models on charts, tables, invoices, PDFs and UIs — not COCO.

Problem it solves

Off-the-shelf natural-image training does not prepare agent models for screens and documents, so they need a purpose-built multimodal corpus.

Best for

Teams pre-training multimodal models meant to do economically valuable work on screens and documents.

Not ideal for

General-purpose vision tasks (natural images, object recognition) where the value is not in screens or documents.

Overview

Why this framework exists

Most multimodal models inherit an academic focus and are trained on natural images: cat-and-dog photos straight out of a camera, as in COCO. But the majority of economic value is in knowledge work, which calls for a very different pre-training corpus of charts, graphs, tables, invoices, PDFs, receipts, unstructured data and UIs, and for models that are fast and very good at dense OCR on screens. That corpus is the "raw putty" for a useful agent, and it is what Adept built the Fuyu architecture around.

Core principles

3 total
  1. Pick the pre-training corpus by where the economic value is, not by what academia benchmarks.
  2. Knowledge-work multimodality (charts/tables/PDFs/UIs) is a different corpus from natural images.
  3. Dense OCR on screens plus speed is the base layer a reliable agent stands on.
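
One way to picture the first two principles is as a data-mixture configuration in which knowledge-work categories dominate and natural images are a small remainder. The sketch below is illustrative only: the category names, the weights, and the helper functions (`validate_mixture`, `sample_categories`, `knowledge_work_share`) are all hypothetical and do not come from the source or from Adept's actual pipeline.

```python
import random

# Hypothetical mixture weights. These numbers are illustrative assumptions,
# not figures from the source; they only show the corpus skewing toward
# knowledge-work documents and screens rather than natural images.
MIXTURE = {
    "charts_and_graphs": 0.20,
    "tables": 0.15,
    "invoices_and_receipts": 0.15,
    "pdfs": 0.20,
    "ui_screenshots": 0.25,
    "natural_images": 0.05,
}


def validate_mixture(mixture):
    """Raise ValueError unless weights are non-negative and sum to 1."""
    if any(weight < 0 for weight in mixture.values()):
        raise ValueError("mixture weights must be non-negative")
    total = sum(mixture.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mixture weights sum to {total}, expected 1.0")


def sample_categories(mixture, n, seed=0):
    """Draw n category names in proportion to their mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def knowledge_work_share(mixture):
    """Fraction of the mixture that is not natural images."""
    return 1.0 - mixture.get("natural_images", 0.0)


if __name__ == "__main__":
    validate_mixture(MIXTURE)
    draws = sample_categories(MIXTURE, n=10)
    print(draws)
    print(f"knowledge-work share: {knowledge_work_share(MIXTURE):.2f}")
```

In a real pipeline the weights would be chosen by where the downstream work is, which is the point of the first principle; the sampler itself is the unremarkable part.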

Origin story

How this framework came to be

The framework grew out of Adept's need, circa 2023, for fast multimodal models that understood screens, at a time when no off-the-shelf multimodal model was trained for that.

Source

Traced to primary
Source · PODCAST
Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept
Latent Space (swyx & Alessio) · 2024