---
Title: The Frikadeller Trial: How AI Judges Art
Author: Erik Benjaminson
Published: 2026-04-21T12:00:00.000Z
Modified: 2026-04-21T12:00:00.000Z
Description: I put Claude Opus 4.7 on trial against 4.6 using the Delphi method and a three-judge LLM panel, before migrating production. Here is what the protocol found.
URL: https://sapienttech.dev/posts/frikadeller-trial
---

I put a model on trial yesterday. Opus 4.7 on one side, Opus 4.6 on the other. Same prompt, same dish. The dish was frikadeller. The Danish weeknight meatball. Pork and veal, an onion softened in butter, an egg, breadcrumbs. Pan-fried in butter and oil cooked together so the crust goes nutty and the middle stays tender. Nothing heroic. The kind of thing a grandmother makes without a recipe.

I was about to migrate my production content pipeline to 4.7. Every output that pipeline writes goes to a real reader. The deterministic tests had already cleared: schema valid against the recipe contract, costs down, latency fine, token budgets healthier because 4.7's thinking is adaptive now instead of fixed. On paper, I should have flipped the switch.

But a recipe is a piece of writing before it is anything else. It has an introduction that tells a small story. It has chef tips that teach technique. It has instruction steps that cue the cook by sight and smell rather than by a timer. The schema cannot tell me whether the prose is any good. And if the prose is flat, the recipe is worthless even when every field validates.

So I built a courtroom.

---

The courtroom has three judges. A Cookbook Author, who reads for prose and voice. A Culinary Authority, who reads for accuracy and terminology. A Home Cook Tester, who reads for whether the instructions actually work in a real kitchen. Three LLMs, three personas, one rubric. Eight criteria, scored on a four-point scale. Two recipes, labeled Recipe A and Recipe B in randomized order per judge, with model identity hidden.
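As a sketch of that setup, here is what the blinded assignment might look like. The names and data shapes are illustrative, not the actual orchestration code:

```python
import random

JUDGES = ["Cookbook Author", "Culinary Authority", "Home Cook Tester"]

def blind_assignment(model_outputs, rng):
    """Randomize which model appears as Recipe A vs Recipe B for one judge,
    so model identity stays hidden behind the labels."""
    models = list(model_outputs)
    rng.shuffle(models)
    return {"Recipe A": models[0], "Recipe B": models[1]}

outputs = {"opus-4.6": "<recipe markdown>", "opus-4.7": "<recipe markdown>"}
rng = random.Random()
# each judge gets an independent coin flip on the A/B labels
assignments = {judge: blind_assignment(outputs, rng) for judge in JUDGES}
```

Because each judge draws its own shuffle, a judge cannot infer anything from another judge's labels even if their outputs were ever compared.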

The rubric rests on three old ideas.

Blinded peer review. Three specialists read the same work independently. Nobody talks until everyone submits. That is Round 1 of the protocol, almost verbatim. The judges write their tables. They quote the evidence. They do not see each other.

The Delphi method. RAND built it in the 1950s for technology forecasting. It has been used since in medicine, policy research, qualitative science, anywhere a panel of specialists has to agree on something subjective. The move is this: between independent rounds, you show each expert the others' reasoning with identities stripped, and you let them revise based on arguments rather than peer authority. Round 2 does exactly that. When the panel disagrees in Round 1, the orchestrator strips the persona headers, relabels the peers as Panelist Alpha and Panelist Beta under a fresh random mapping per receiver, scrubs any persona-revealing phrasing, and sends the tables back. The judges reconsider only the divergent criteria. Every score change requires a written rationale. "On reflection" does not count.
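A minimal sketch of that Round 2 packet-building step, assuming each judge's Round 1 table is a plain dict keyed by judge name (the phrasing scrub is a separate pass, out of scope here):

```python
import random

def round2_packet(receiver, tables, rng):
    """Build what one judge sees in Round 2: the two peers' tables,
    relabeled under a fresh random mapping per receiver, identities dropped."""
    peers = [judge for judge in tables if judge != receiver]
    rng.shuffle(peers)  # fresh Alpha/Beta mapping for this receiver
    return {label: tables[peer]
            for label, peer in zip(["Panelist Alpha", "Panelist Beta"], peers)}

tables = {
    "Cookbook Author":    {"scores": [4, 3, 2], "rationale": "quoted evidence"},
    "Culinary Authority": {"scores": [3, 3, 3], "rationale": "quoted evidence"},
    "Home Cook Tester":   {"scores": [2, 4, 4], "rationale": "quoted evidence"},
}
packet = round2_packet("Cookbook Author", tables, random.Random())
```

The receiver's own table is excluded, and because the shuffle runs per receiver, "Panelist Alpha" names a different peer for each judge, so no judge can triangulate identities across packets.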

Inter-rater reliability. The spread. Where the judges agree, the measurement is clean. Where they diverge, the rubric is telling you something the verdict is not. You report the per-criterion mean and spread alongside the verdict, and the wide spreads are often more useful than the averages.
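That per-criterion summary is a few lines of arithmetic. A sketch, using illustrative scores for three of the eight criteria:

```python
def per_criterion(scores):
    """scores: {judge: [score per criterion]} -> (mean, spread) per criterion.
    Spread here is simply max minus min across the panel."""
    columns = list(zip(*scores.values()))
    return [(sum(col) / len(col), max(col) - min(col)) for col in columns]

round1 = {
    "Cookbook Author":    [4, 3, 2],
    "Culinary Authority": [4, 2, 4],
    "Home Cook Tester":   [4, 4, 3],
}
summary = per_criterion(round1)
# [(4.0, 0), (3.0, 2), (3.0, 2)]
# first criterion: clean agreement; the other two flag divergence
```

Note that the second and third criteria have identical means but wide spreads, which is exactly the case where the average hides the disagreement the rubric is trying to surface.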

Three judges, eight criteria, two conditional rounds, and a pre-declared tiebreak hierarchy written before the trial ran, so nobody could lean on a gut feeling at the end. An auditable rationale on every score.

That is the shape of the courtroom the two models walked into. The full protocol (judge personas, rubric, Delphi orchestration code) is in [the companion gist](https://gist.github.com/lifegenieai/86d15d81d6562ef20e5aa30e8eeaf893).

---

Round 1 ran in parallel. The three judges read the rubric, read both recipes, and produced their structured tables without speaking.

The Cookbook Author picked Recipe B in a landslide. She weighted heavily toward prose.

The Culinary Authority also picked B, but by the thinnest possible margin.

The Home Cook Tester picked Recipe A. She weighted toward the sensory density of the instructions.

Two for B, one for A. The spread on the individual criteria was wide enough that the winner was unstable. Round 2 triggered.

The orchestrator did exactly what the protocol promised: stripped the persona headers, relabeled the peers as Panelist Alpha and Panelist Beta, and broadcast the tables back. Reconsider only what you disagreed on. Written rationale for every change. "On reflection" still did not count.

The revisions told the real story.

The Cookbook Author read a peer argument that the chef tips in Recipe A were not restatement but mechanism: teaching rest time, shape rationale, the chemistry of butter and oil cooking together. She raised her score on that criterion in favor of A. She also flipped on cross-field coherence, after a second peer pointed out that Recipe B's tips did not reinforce the technique triad its introduction had promised.

The Culinary Authority made the biggest revision. After reading both peers on sensory specificity and pedagogical mechanism, she raised three Recipe A scores and flipped her overall winner from B to A.

The Home Cook Tester conceded one point, that Recipe B's tips on sourcing and ratio were real mechanism and not platitude, and held firm everywhere else.

Nine revisions across the panel. The verdict moved. The margins moved. And the shape of what the two models were actually doing differently became visible in the rationales themselves, for the first time.

---

When synthesis ran, the averages came out identical. Recipe A at 25.0. Recipe B at 25.0. A perfect wash.

The tiebreak hierarchy was pre-declared for exactly this moment. Highest single-judge total came next. Both recipes had received a 26 from at least one judge. Still tied.

The final tiebreak fell to the Cookbook Author. She had been designated as the deciding vote before the trial started, on the rationale that when two recipes are substantively equivalent, the deciding axis is the writing. She had scored Recipe B at 26 and Recipe A at 23.

Recipe B won, by the narrowest possible margin, on the last available tiebreak.
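The whole hierarchy fits in one short function. The Cookbook Author's totals below are the ones quoted above; the other two judges' totals are illustrative, chosen only so the sketch reproduces the tie the trial actually hit:

```python
def verdict(totals, deciding_judge):
    """totals: {judge: {"A": total, "B": total}}.
    Pre-declared hierarchy: panel mean, then highest single-judge total,
    then the deciding judge's own scores."""
    def mean(r): return sum(t[r] for t in totals.values()) / len(totals)
    def best(r): return max(t[r] for t in totals.values())
    for rule, name in ((mean, "panel mean"), (best, "highest single total")):
        a, b = rule("A"), rule("B")
        if a != b:
            return ("A" if a > b else "B", name)
    d = totals[deciding_judge]
    return ("A" if d["A"] > d["B"] else "B", "deciding judge")

totals = {
    "Cookbook Author":    {"A": 23, "B": 26},  # as scored in the trial
    "Culinary Authority": {"A": 26, "B": 25},  # illustrative
    "Home Cook Tester":   {"A": 26, "B": 24},  # illustrative
}
winner, rule = verdict(totals, "Cookbook Author")
# both means are 25.0 and both maxima are 26, so it falls to the deciding judge
```

The point of writing it down before the trial is that the function is fixed while the scores are still unknown; nobody gets to pick the rule that favors their gut after the fact.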

Recipe B was Opus 4.7. [The winning recipe is published here.](https://culinaryexplorer.app/r/frikadeller-lund)

And I think, honestly, the verdict was the least interesting thing the trial produced.

---

I went in for a verdict. What the protocol had been quietly drawing underneath, while the judges argued over wording, was a map. That was what I took home.

Opus 4.6 is better at structural craft. It scored a unanimous 4.0 on cross-field coherence, the strongest single-criterion signal in the whole trial. The introduction's three technique beats reappeared in a dedicated chef tip and again in the instruction steps. The recipe read as a single argument. It scored highest on instruction sensory specificity, the decision-point cues like "when the edges turn golden and opaque about a third of the way up." It scored highest on chef tips pedagogical depth, where the tips taught mechanism instead of restating what the recipe already said. It scored highest on image prompt evocativeness.

Opus 4.7 is better at voice and restraint. It scored highest on persona voice consistency. The chef's voice survived intact across every field. No section sounded like a different author. It scored a unanimous 3.0 on restraint, where every sentence earned its place and no point was made twice. It scored highest on cultural and culinary authority, with regional specifics grounded enough to be verifiable. Its introduction opened on a sensory image rather than a thesis.

The rubric itself learned something. Criterion six, cross-field coherence, and criterion seven, restraint, turned out to be two faces of the same underlying quality. When a recipe reinforces its key techniques across every section, it scores high on coherence and low on restraint. When it says each thing once and stops, the scores invert. You cannot maximize both at the same time without redefining what each one is measuring.

The real work of the trial was the map underneath. The verdict came down to a single pre-declared tiebreak, which is to say, it was almost a coin flip. The question I came in with, "is the new model better?", stops being the right question. The right question is narrower: better at what I care most about?

For my production pipeline, the answer was 4.7. I shipped the migration yesterday. Eyes open, knowing which dimensions I was gaining on, which I was giving up, what to tighten in the chef persona prompts to compensate for the restraint 4.7 favors, and where I will probably be surprised by the first batch of real production output anyway.

Without the map, I would have shipped on benchmarks alone. Or held it back on a hunch.

---

The dish was beside the point.

Think about the executive deck you are sending your director on Monday. Or the proposal that goes to the customer next week. Or the performance review you are writing for someone on your team. These are creative artifacts. The facts underneath them can be nailed down. The communication layer, where the prose and the voice and the structural coherence and the restraint decide whether the work actually lands, is the part the schema will never judge for you.

A three-judge Delphi council, running on LLMs, is built to judge exactly that.

Run it on your own draft before a human ever sees it. Three specialists of your choosing. A strategy director, a customer CTO, a skeptical peer. They read the draft independently, without speaking. They argue over what it gets wrong. They revise with an auditable rationale. What comes back is a per-criterion map of where your draft is strong and where it is thin. Pre-review at scale. It generalizes.
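Concretely, such a council reduces to a small declaration of personas, rubric, and triggers, handed to the same kind of orchestrator. Everything here is an illustrative sketch, not a fixed API:

```python
# Hypothetical council config for reviewing an executive deck or proposal.
council = {
    "personas": ["Strategy Director", "Customer CTO", "Skeptical Peer"],
    "criteria": ["clarity", "evidence", "structure",
                 "voice", "coherence", "restraint"],  # your own rubric, your axes
    "scale": (1, 4),
    "round2_trigger": 2,        # rerun Delphi when any criterion spread >= 2
    "deciding_judge": "Strategy Director",  # pre-declared, before any scores
}
```

The rubric axes are the part worth sweating over; the orchestration around them stays the same from one artifact to the next.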

The Frikadeller Trial was a small courtroom for one specific question on a live production system. The method underneath it is a courtroom you can build on top of any high-stakes creative artifact you produce: an executive deck, a customer proposal, a performance review, the content your own pipelines generate at scale. The next model that ships will get put on the stand the same way.

Someone is cooking frikadeller tonight. Nobody at that table cares which model wrote the recipe. That is fine. The work on your desk is different. It has your name on it.