Multimodal for Knowledge Work, Not Cats and Dogs
Most economic value is in knowledge work, so train multimodal models on charts, tables, invoices, PDFs and UIs, not COCO.
Most multimodal models inherit an academic focus and are trained on natural images: cat-and-dog photos straight out of a camera, as in COCO. But the majority of economic value is in knowledge work, which calls for a very different pre-training corpus: charts, graphs, tables, invoices, PDFs, receipts, unstructured data and UIs, with the model made fast and very good at dense OCR on screens. That corpus is the "raw putty" for a useful agent, and it is what Adept built the Fuyu architecture around.
- Pick the pre-training corpus by where the economic value is, not by what academia benchmarks.
- Knowledge-work multimodality (charts/tables/PDFs/UIs) is a different corpus from natural images.
- Dense OCR on screens plus speed is the base layer a reliable agent stands on.
Circa 2023, Adept needed fast multimodal models that understood screens, and no off-the-shelf multimodal model was trained for that.
Source · PODCAST
Why Google failed to make GPT-3 + why Multimodal Agents are the path to AGI — with David Luan of Adept