Multimodal for Knowledge Work, Not Cats and Dogs

Most economic value is in knowledge work, so train multimodal models on charts, tables, invoices, PDFs and UIs, not on natural-image datasets like COCO.

Problem it solves

Explains why agent models need a purpose-built multimodal corpus rather than off-the-shelf natural-image training.

Best for

Teams pre-training multimodal models meant to do economically valuable work on screens and documents.

Not ideal for

General-purpose vision tasks (natural images, object recognition) where the value is not in screens or documents.

Overview

Why this framework exists

Most multimodal models inherit an academic focus and are trained on natural images: camera photos of cats and dogs, as in COCO. But the majority of economic value is in knowledge work, which calls for a very different pre-training corpus: charts, graphs, tables, invoices, PDFs, receipts, unstructured data and UIs. A model trained on that corpus also has to be fast and very good at dense OCR on screens. That corpus is the "raw putty" for a useful agent, and it is what Adept built the Fuyu architecture around.

Core principles

Pick the pre-training corpus by where the economic value is, not by what academia benchmarks.
Knowledge-work multimodality (charts/tables/PDFs/UIs) is a different corpus from natural images.
Fast, accurate dense OCR on screens is the base layer a reliable agent stands on.
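
The first two principles can be made concrete as a data-mixture specification. What follows is a minimal sketch, not Adept's actual recipe: the source names, the weights, and the helper functions (validate_mixture, sample_sources, knowledge_work_share) are all hypothetical and chosen only for illustration. It shows one way to write down a corpus that is weighted toward knowledge-work sources and to draw training examples from it in proportion.

```python
import random

# Hypothetical mixture weights, for illustration only. The source names
# mirror the document types listed above; the numbers are assumptions.
MIXTURE = {
    "charts_and_graphs": 0.20,
    "tables": 0.15,
    "invoices_and_receipts": 0.15,
    "pdf_pages": 0.20,
    "ui_screenshots": 0.25,
    "natural_images": 0.05,
}


def validate_mixture(mixture):
    """Raise ValueError unless weights are non-negative and sum to 1."""
    if any(weight < 0 for weight in mixture.values()):
        raise ValueError("mixture weights must be non-negative")
    total = sum(mixture.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"mixture weights sum to {total}, expected 1.0")


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    validate_mixture(mixture)
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def knowledge_work_share(mixture, natural_key="natural_images"):
    """Return the total weight of every source except natural images."""
    return sum(w for name, w in mixture.items() if name != natural_key)
```

Calling sample_sources(MIXTURE, 8) returns a list of eight source names to pull the next examples from, and knowledge_work_share(MIXTURE) reports how much of the mixture is screens and documents. Keeping the weights in one explicit table makes the corpus decision something a team can review and change deliberately rather than inherit from an off-the-shelf dataset.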