Domain, prompt, model

The broader market narrative, driven by frontier labs, suggests that foundation models are the sole capability differentiator. I compared twenty models across five domains (thousands of extractions) to measure how much the model actually matters: less than the prompt, and far less than the problem domain.

For 48 hours earlier in June I had access to Anthropic's Fable 5 - the best AI model I've ever used. Then - on the morning of Saturday 13 June (Australian time) - the US government issued an export-control directive barring every foreign national from the model and citing national security. To comply, Anthropic disabled Fable 5 for every customer on earth.

As a founder building software that relies on AI models to operate, I wanted to measure how important the model actually is. From my perspective sovereignty is about control - it is the answer to the question "do we control the model?"

Information extraction

One of my platform's foundational capabilities is to read documents and turn them into facts you can query - like the people, organisations and events named in a page, and how they're connected to each other.

Pulling those facts out of free text is typically the first¹ structured workflow the system runs as new content is ingested, and everything downstream benefits from getting it right.

Before LLMs existed, we would have used specialised machine learning models trained specifically for each of the individual steps - entity recognition/extraction and disambiguation, relationship extraction and disambiguation, etc.

Not only did this require specialised data science expertise, each of these was a specialised system on its own, to be developed and maintained as its own special snowflake.

LLMs change the economics of deploying automated systems significantly, in my opinion, because they allow us to commoditise the entire value chain across domains and problems.

Before this benchmark, to perform structured workflows in Symplast, we had been renting various AI models from our cloud provider - models that we couldn't run ourselves because they are neither open nor downloadable.

For ordinary documents and early platform development that's actually pragmatic (avoiding the overhead of running our own LLM infrastructure, etc).

But there was always a tension - because Symplast is a sovereign platform, designed to run in any cloud as well as offline in a customer's own data centre, if required.

For the documents my customers actually care about (regulatory filings, government records) - sending them to someone else's data centre is the problem, not the solution. For many enterprise and government customers, their data cannot leave their control at all.

The major vendors currently base their marketing on the premise that the best AI is the frontier models - but those frontier models live in someone else's cloud.

So the choice ends up being - best results or control. But not both.

So last week I rigorously tested that assumption against twenty different AI models, both open (downloadable) and proprietary (tied to the public cloud).

I tested the models across 5 domains - sanctions reporting, financial system risk, procurement policy and inquiries, transfer pricing regulations and finally crime/intelligence fusion centre triage.

For this specific job - reading a document and extracting structured facts - a free model that can run anywhere turned out to be as good as the proprietary expensive ones for some of these domains.

The gap that does exist comes almost entirely from the instructions we give the model, not from the model itself.

The benchmark

Twenty models. Four closed and API-only - Gemini v3-flash, GPT-5.5, Claude Opus 4.8, Claude Haiku 4.5. The other sixteen open-weight and self-hostable: Google's Gemma, OpenAI's gpt-oss-120b, DeepSeek, Llama, Mistral, Qwen, Zhipu's GLM, Nvidia's Nemotron, Moonshot's Kimi.

The test set was 160 real extraction tasks - actual calls harvested from running Symplast environments, 40 each across entity extraction, relationship extraction, and the two resolution (disambiguation) tasks - drawn from four regulatory corpora: sanctions, systemic risk, procurement, and transfer pricing. This wasn't toy content; it was dense, jargon-heavy, real source documents.

Three design decisions are important:

An independent gold standard. Every output is scored against a consensus answer key built by a vendor-diverse panel of models, validated by a critic that checks each candidate against the source text, with genuine disagreements sent to a human (me). The panel that builds the gold is not the contender being scored - so no model grades its own homework.

One prompt for everyone. This is the commitment that decides the whole thing. To compare models, every model runs the same short, deliberately generic, model-neutral prompt. The obvious alternative - take your tuned production setup and swap models underneath it - measures how well each model copes with a prompt written for a different model. That's an accent test, not a capability test. If you want to see the model, you have to give everyone the same words.

Paired statistics. Forty items per task is not many - typical 95% confidence intervals come out around ±0.09, so the rankings within the open pack are not separable; read any sub-0.05 gap as a tie. The findings I'll stand on are the ones measured by paired comparison - same fixtures, one variable changed - because pairing cancels the per-item difficulty that drives most of the noise.
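To make the pairing concrete, here is a minimal sketch of one common way to put a confidence interval on a paired difference: a percentile bootstrap over per-item score differences. It is an illustration only. The resampling method, the function name `paired_bootstrap_ci`, and the synthetic scores are assumptions, not the benchmark's actual statistics code.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (a - b).

    Both lists must score the same fixtures in the same order, so that
    per-item difficulty cancels inside each difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs the same fixtures on both sides")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample items (not individual scores) with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Synthetic data (hypothetical): 40 items whose base scores vary widely,
# with setup A consistently about 0.10 better than setup B on each item.
rng = random.Random(42)
base_score = [rng.uniform(0.1, 0.8) for _ in range(40)]
setup_a = [s + 0.10 + rng.uniform(-0.04, 0.04) for s in base_score]
setup_b = base_score

mean_diff, lo, hi = paired_bootstrap_ci(setup_a, setup_b, n_resamples=2000)
print(f"mean difference {mean_diff:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

The base scores swing between 0.1 and 0.8, yet the interval on the difference stays narrow and clear of zero, because each item is only ever compared with itself.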

How I scored

The benchmark uses a metric called F1, which grades the model on two opposing failures: laziness and sloppiness.

A lazy model reads a dense page and returns only the one name it's completely certain of, ignoring the rest. Yes, 100% accurate, but also not very useful - it's left plenty of data "on the table".

A sloppy model panics and grabs every noun phrase on the page, catching the real entities but burying them under hallucinations and junk.

F1 is the balance of the two. It ranges from 0 (total failure) to 1 (perfection), and it can't be gamed: a high score requires being thorough (missing nothing) and disciplined (returning no garbage) at the same time.
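As a small worked example, F1 is the harmonic mean of precision (how much of what was returned is right) and recall (how much of what was there got returned). The sketch below assumes exact-match comparison between sets of extracted names; the benchmark's own matching rules may differ, and the entity names are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted items against a gold set (exact match assumed)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold answer for one page.
gold = {"Acme Ltd", "Jane Doe", "Ministry of Finance", "Bank X"}

# The lazy model returns only the one name it is sure of.
lazy = {"Acme Ltd"}

# The sloppy model returns everything real plus six junk phrases.
sloppy = gold | {"the report", "last year", "a number of parties",
                 "the page", "section 4", "the above"}

for name, output in [("lazy", lazy), ("sloppy", sloppy)]:
    p, r, f1 = precision_recall_f1(output, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# lazy:   precision=1.00 recall=0.25 F1=0.40
# sloppy: precision=0.40 recall=1.00 F1=0.57
```

Both failure modes are punished: perfect precision cannot rescue the lazy model, and perfect recall cannot rescue the sloppy one.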

The model leaderboard

Here are the results for entity extraction, the hardest and most discriminating task I tested. F1 is computed relative to the gold; higher is better. ($/M in is the per-million-input-token list price; reliability is the share of calls that completed.)

model	entity F1	open weights	$/M in	reliability
gemini v3-flash — production prompt	0.591	no	0.50	-
google / gemma-4-31b	0.465	yes	0.12	99%
anthropic / claude-opus-4.8	0.455	no	5.00	69%
anthropic / claude-haiku-4.5	0.444	no	1.00	73%
zhipu / glm-4.6	0.442	yes	0.43	68%
gemini v3-flash — neutral prompt	0.436	no	0.50	100%
openai / gpt-oss-120b	0.419	yes	0.039	100%
…thirteen more, clustered 0.08 – 0.42

Two rows are the same model. gemini — production prompt (0.591) is Gemini scored from its real recorded production output: its tuned, roughly 16,000-token prompt — the taxonomy, the rules, the worked examples, all of it. gemini — neutral prompt (0.436) is the same Gemini, same weights, handed the short generic prompt every other model got.

Strip the production prompt and Gemini lands at 0.436 - below an open-weight model you can download and run on your own GPU.

Note - the resolution tasks - deciding whether a cluster of mentions is one entity or several - are near-parity for everyone: the cluster-overlap scores (B³) bunch between 0.86 and 0.95 across the entire field, Gemini included. That half of the pipeline is already a commodity; no model has a moat there. If there's an edge to find, it's in extraction.
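For readers unfamiliar with B³ (B-cubed), here is a minimal sketch of the standard definition: for every mention, compare the cluster the system put it in with the cluster the gold put it in, then average over mentions. The function name and the toy clusters are illustrative, and the sketch assumes both clusterings cover the same mentions; whether the benchmark reports B³ precision, recall or their F1 is not stated above, so all three are returned.

```python
def b_cubed(predicted_clusters, gold_clusters):
    """B-cubed precision, recall and F1 for two clusterings of the same mentions."""
    pred_of = {m: frozenset(c) for c in predicted_clusters for m in c}
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    if set(pred_of) != set(gold_of):
        raise ValueError("both clusterings must cover the same mentions")
    mentions = list(gold_of)
    # Per mention: how much of its predicted cluster is correct (precision),
    # and how much of its gold cluster was recovered (recall).
    precision = sum(len(pred_of[m] & gold_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(pred_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: five mentions; the gold says {a, b, c} is one entity and {d, e} another.
gold_clusters = [{"a", "b", "c"}, {"d", "e"}]
# The system wrongly moves mention "c" into the second entity.
predicted_clusters = [{"a", "b"}, {"c", "d", "e"}]

p, r, f1 = b_cubed(predicted_clusters, gold_clusters)
print(f"B3 precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# B3 precision=0.733 recall=0.733 F1=0.733
```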

Where the edge actually lives

A leaderboard of averages can't separate the model from the prompt, because every row changes both at once. For that you need paired comparisons - same fixtures, one variable moved - where a difference either clears zero (real) or doesn't (noise). Three of them decompose Gemini's production lead completely:

comparison	what changes	Δ entity F1	verdict
Gemini production − Gemini neutral	the prompt (same model)	+0.155	real
Gemini neutral − gemma-4-31b	the model (same prompt)	−0.029	noise
Gemini production − gemma-4-31b	everything	+0.126	real

Hold the prompt fixed and swap Google's frontier model for an open-weight model a fraction of its size: −0.029, confidence interval straddling zero. The model contributes a statistical nothing.

Hold the model fixed and swap the prompt: +0.155, comfortably real. The entire shipped advantage is the prompt. Relationship extraction tells the same story, louder - the production prompt opens a +0.237 gap; level the prompt and the model difference is, again, noise.

You might suspect the other half of the harness is hiding the real moat. Production extraction doesn't only send a prompt - it forces the model to emit output locked to a strict schema, guaranteed-valid JSON with no free text. Surely that discipline buys quality. It doesn't: same model, same prompt, strict versus flexible output moved entity F1 by +0.006, noise. Of the 0.155-point production edge, the schema accounts for about 0.006 and the prompt for the other 0.149. (Strict mode does do one thing reliably: it locks out models whose providers don't implement it - both Claude models completed zero calls under strict routing. A constraint that buys nothing and costs you reach.)
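The arithmetic of that decomposition can be checked directly from the published numbers. This sketch uses only the entity F1 point estimates from the leaderboard and the +0.006 schema effect quoted above; the variable names are illustrative, and it says nothing about the confidence intervals, which need the per-item scores.

```python
# Entity F1 point estimates taken from the leaderboard above.
entity_f1 = {
    "gemini_production": 0.591,
    "gemini_neutral": 0.436,
    "gemma_4_31b": 0.465,
}

# Same model, different prompt.
prompt_effect = entity_f1["gemini_production"] - entity_f1["gemini_neutral"]
# Same prompt, different model.
model_effect = entity_f1["gemini_neutral"] - entity_f1["gemma_4_31b"]
# Everything changed at once.
total_gap = entity_f1["gemini_production"] - entity_f1["gemma_4_31b"]

# Strict-versus-flexible output effect, as reported in the text.
schema_effect = 0.006
prompt_text_effect = prompt_effect - schema_effect

print(f"prompt effect      {prompt_effect:+.3f}")       # +0.155
print(f"model effect       {model_effect:+.3f}")        # -0.029
print(f"total gap          {total_gap:+.3f}")           # +0.126
print(f"prompt text alone  {prompt_text_effect:+.3f}")  # +0.149
```

The prompt effect and the model effect sum to the total gap, which is why the three paired rows account for the whole production lead.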

So the moat is the prompt - a ~16,000-token text file - something that is easy to move from model to model.

The model isn't nothing - it's the domain

If swapping the model is free, why not just do it? Because "the model doesn't matter" is too strong. When I tuned a fresh prompt for an open model (Gemma) and ran it head-to-head against production Gemini per domain, a sharp pattern emerged - and it isn't about the model at all. It's about the data.

domain	character	Gemma vs Gemini
crime / intelligence triage	concrete	+0.15 (Gemma ahead)
sanctions	concrete	+0.13 (Gemma ahead)
procurement	mixed	−0.03 (tie)
transfer pricing	abstract	−0.07 (Gemma behind)
systemic risk	abstract	−0.14 (Gemma behind)

On concrete, proper-noun-heavy material - people, vehicles, weapons, jurisdictions, named incidents - the open model you can run yourself matches or beats the frontier one.

On abstract, concept-heavy material - financial frameworks, transfer-pricing principles - it still trails, and no amount of prompt-tuning closes that gap. That residual is a genuine capability difference. But notice what it's indexed to: not "frontier vs open", but what kind of text you're reading.

This is the real ranking, and it's the opposite of the one the labs sell:

Domain ≫ prompt > model.

The thing that most determines whether the job is doable is the domain - and you don't choose your domain, it's a property of your data. The biggest lever you actually pull is the prompt - and it's a text file. The model - the thing the entire industry argues about, and the thing you can change by editing one string - matters least of the three.

We shouldn't obsess over the cheapest, most replaceable part.

Is 0.591 any good?

A fair question, looking at that top number: if the best score is 0.591, is any of this good enough to ship?

It is - but only if you compare like with like. Spotting plain names in a news article - flat, four categories, clean prose - is the easy end, where fine-tuned models score 0.92–0.96.

That's a different sport.

This task is what the literature calls document-level joint entity-and-relation extraction: pull the entities, classify them into a deep taxonomy, and extract the relationships, all at once, scored so that any layer being wrong fails the whole item. There, general-purpose LLMs typically land from single digits to the mid-0.30s, and even purpose-built, fine-tuned systems top out around 0.61 (the DocRED benchmark) to 0.75 (its cleaner re-annotation). 0.591 from a general-purpose model driven by nothing but a prompt sits above where LLM extraction usually lands and within arm's reach of fine-tuned, task-specific state of the art - under stricter matching than most published numbers use.

There's a second reason not to read 0.591 as "59% right": it's the score of one stage, in isolation - a single model reading a single chunk, once, with no second look.

That isn't how the platform delivers a finished graph. The raw extractions then pass through a dedicated resolution stage (scoring B³ 0.86–0.95) that merges duplicate mentions and folds the same fact, seen on twenty different pages, into one confident node - a fact missed on page 12 is usually caught on page 40. And for high-stakes work, a human stays in the loop to confirm, correct, or add. So 0.591 is the cold, single-shot, automated-only floor - the right number for comparing models, but not the quality of what comes out the far end.

It also says where the headroom is. Fine-tuned, domain-specific models reach 0.80–0.90, and fine-tuning is how you'd close the abstract-domain gap above. We reached the top of the general-purpose band by tuning a text file. The fine-tuning headroom is still in front of us, not behind us. Hold that thought.

What you actually pay for

Once the prompt ports, you are no longer choosing a capability. You are choosing among near-equivalent extractors on the things that actually differ: price, licence, and where the weights are allowed to run. And those are not close. Gemma-4-31b is open-weight, 99% reliable, and $0.12 per million input tokens against Gemini's $0.50 - four times cheaper, on hardware you own. gpt-oss-120b is Apache-2.0 licensed, 100% reliable, and $0.039 per million - thirteen times cheaper, the cheapest credible option in the field.
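The price multiples follow from the list prices in the leaderboard. The sketch below compares only per-million-input-token list prices; output-token pricing and the cost of running your own hardware are not covered by the figures above and are left out.

```python
# $ per million input tokens, from the leaderboard above.
price_per_m_input = {
    "gemini v3-flash": 0.50,
    "gemma-4-31b": 0.12,
    "gpt-oss-120b": 0.039,
}

baseline = price_per_m_input["gemini v3-flash"]

# How many times cheaper each model is than the Gemini baseline.
times_cheaper = {name: baseline / price for name, price in price_per_m_input.items()}

for name, ratio in times_cheaper.items():
    print(f"{name}: ${price_per_m_input[name]:.3f}/M in, {ratio:.1f}x cheaper than baseline")
# gemini v3-flash: $0.500/M in, 1.0x cheaper than baseline
# gemma-4-31b: $0.120/M in, 4.2x cheaper than baseline
# gpt-oss-120b: $0.039/M in, 12.8x cheaper than baseline
```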

LLMs are a market for lemons, and the frontier premium is "insurance" against our inability to A/B test most of what we use them for. A frontier model is the reputable brand: you pay the premium partly because checking the alternatives takes extra effort. But once you run the fair test, the quality signal arrives and the asymmetry that justified the premium is gone.

In Simple Made Inevitable I argued that the learning curve was always a switching cost mistaken for the value of the asset. It's the same error here, one level up. The tuned prompt is a switching cost - real, but paid once, and then yours on every model forever. The model underneath is the asset, and for this task the asset is close to fungible. Paying a permanent frontier premium to avoid a one-off porting cost is a bad capital allocation. It just never shows up as its own line item on an invoice, so nobody books it as one.

So we ported. On the concrete domains, we now run an open model we host ourselves - matching or beating what we shipped on Gemini, several times cheaper, on infrastructure a customer can own.

For this task, the trade-off is no longer quality or sovereignty.

But porting the prompt is the small prize. The big one is what made the porting possible in the first place.

Owning the harness

Step back and ask what the frontier labs actually own.

It's not the model - models are an increasingly commoditised, frozen artefact, with open-weight options rapidly catching up to the frontier.

What they own is the harness: for software engineering, that is Anthropic's Claude Code, OpenAI's Codex, SpaceX's Cursor - these are the harnesses that take a real problem the user wants solved and then drive the model to solve it.

Owning the harness means they also own the improvement loop: collection of more data and a deep understanding of the way their models are used in practice - which they can then use to make a better model, over and over.

That loop has mostly been the preserve of organisations with a data-science department - but Symplast is built to give organisations both halves of this equation: the harness to capture the work, and the apparatus to fine-tune the model.

Put those two together, and you have every ingredient the frontier labs use: your own production data, a gold standard, and a scorer that runs continuously.

You now have the exact inputs required to fine-tune an open-weight model, and to apply reinforcement learning (RL) to it, on your own domain. The abstract-domain gap from earlier - the one no prompt could close?

That requires a fine-tune, and you now own the whole apparatus to do it.

The sovereign advantage

This entire loop runs on your data on your infrastructure.

The regulated, can-never-leave-the-building data that disqualifies you from sending anything to a frontier lab is exactly the fuel your own improvement loop needs - and the only place it can legally be used is inside your own environment.

The thing that made you "behind" is precisely the thing that lets you pull ahead.

Sovereignty stops meaning frozen, and perpetually a step behind the cloud. A sovereign system can run the same flywheel the frontier labs run - continuous evaluation, fine-tuning, and RL on its own domain - without a huge data-science team, and without a single document leaving the building.

And geez it's good to have numbers that make that argument meaningful.