Claude Sonnet Beat Opus at TypeScript. Score: 9.49
Day 98 | CODE-010 | March 30, 2026 | 10 models | 72 valid judgments
Claude Opus 4.5 ranked 6th. Its smaller, cheaper sibling, Claude Sonnet 4.5, ranked 1st with a score of 9.49. I ran the numbers twice because I didn’t believe them the first time. The more expensive model scored 9.07 against a ceiling of 10.00. What does it mean when the model you pay more for loses on the task type it was supposedly built for?
THE EXPERIMENT
I gave 10 models a hard TypeScript type-systems problem. Not a “write a function” problem. A compile-time enforcement problem. Task 1: design a Route type where the URL path parameters must exactly match the params object at compile time, no more, no less. Task 2: design a Builder pattern where calling methods out of order fails at compile time, not at runtime.
What makes this type of problem genuinely hard is that it requires thinking in two layers simultaneously: the runtime behavior of the code, and the behavior of the TypeScript type system itself. Models that treat it as a “write some generics” problem collapse at the edges. They’ll get the happy path working and miss the constraints entirely, producing types that accept invalid inputs without error.
The failure mode I expected to dominate was phantom type misuse in the Builder problem. Most models that implement Builder patterns default to class-based state tracking that only fails at runtime, not compile time. Getting setAge(25).setName('x').build() to fail at compile time requires interface chaining or phantom generics. The bottom three models in this eval either produced that runtime-only version or produced nothing at all. Does this match your failures?
THE RANKINGS
The top seven compressed into a tight 0.65-point band between 9.49 and 8.84. Then there is a cliff. Gemini 3 Pro Preview fell to 5.62, and GLM 4.7 and MiniMax M2.1 landed around 4.25 with standard deviations above 3.90, meaning individual judge scores for these models were all over the map. The spread tells you something real: this task has a hard floor. You either understand compile-time type manipulation or you don’t. Who do you think underperformed?
THREE KEY TAKEAWAYS
1. Claude Sonnet 4.5 scored 9.49, beating Claude Opus 4.5 at 9.07, a gap of 0.42 points on a task in the model’s nominal specialty.
The implication is that on structured, constraint-heavy code tasks, Sonnet’s tighter, more disciplined output architecture outperforms Opus’s broader generation style.
“The expensive model isn’t always the right model. 9.49 vs 9.07 on the same task proves it.”
2. Gemini 3 Flash (9.28) beat Gemini 3 Pro (5.62) by 3.66 points, the single largest intra-family gap in this eval.
This is not a rounding error. Flash’s interface-chaining approach to the Builder problem was cleaner and more structurally sound than Pro’s, which showed it could parse the problem but couldn’t translate it into working compile-time constraints.
“Gemini Pro wrote more tokens and scored 3.66 points lower than its Flash counterpart. Sometimes the bigger model is wrong.”
3. DeepSeek V3.2 scored 8.84 but had a standard deviation of 1.10, the highest in the top 7.
That variance matters for deployment. One judge gave it a 6.80 and another gave it a 10.00 on the same response. If you’re running one inference call, you’re gambling on which version of DeepSeek shows up.
“An average score of 8.84 hides a 6.80 floor. Know your model’s variance before you deploy it.”
RESPONSE ANATOMY
What separated the top three from the middle pack was a specific architectural decision in the Builder problem. Claude Sonnet, Grok 3, and GPT-Codex all chose interface-chaining or phantom-type state machines, approaches that make methods unavailable at the type level rather than throwing at runtime. This is the key signal: compile-time enforcement requires returning a different type from each method, not the same class. Models whose Builder methods returned this, the same class type every time, failed to prevent out-of-order calls.
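For contrast, here is a minimal sketch of the runtime-only pattern described above. It is not taken from any model's response; RuntimePersonBuilder, Person, and outOfOrder are hypothetical names. Because every method returns this, the compiler sees the same type at every step and has nothing to reject.

```typescript
// Illustrative sketch only: a builder whose ordering rules live in runtime checks.
interface Person { name: string; age: number }

class RuntimePersonBuilder {
  private personName?: string;
  private personAge?: number;

  setName(value: string): this {
    this.personName = value;
    return this; // same type before and after the call
  }

  setAge(value: number): this {
    // The ordering rule is enforced only when the code runs.
    if (this.personName === undefined) {
      throw new Error("setName must be called before setAge");
    }
    this.personAge = value;
    return this;
  }

  build(): Person {
    if (this.personName === undefined || this.personAge === undefined) {
      throw new Error("builder is incomplete");
    }
    return { name: this.personName, age: this.personAge };
  }
}

// This compiles without complaint and only fails when executed.
function outOfOrder(): Person {
  return new RuntimePersonBuilder().setAge(25).setName("x").build();
}
```

The type checker accepts outOfOrder because setAge is always present on the returned type, which is the gap the compile-time approaches close.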
The Route problem was more diagnostic for the bottom half. It requires template literal type recursion, specifically the pattern T extends `${string}:${infer P}/${infer Rest}` iterated until the path is consumed. GLM 4.7 and MiniMax M2.1 produced empty responses, which is its own kind of signal. A model that produces nothing on a hard TypeScript problem is telling you something about its training data distribution for advanced type manipulation.
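The recursion can be sketched roughly as follows. This is one possible shape under stated assumptions, not any model's actual answer: RouteParams, buildPath, and compileTimeChecks are hypothetical names, and the sketch assumes parameters are written as :name segments separated by slashes.

```typescript
// Collect ":param" names from a path literal as a union of string literals.
type ExtractRouteParams<T extends string> =
  T extends `${string}:${infer P}/${infer Rest}`
    ? P | ExtractRouteParams<Rest>   // one param found, keep scanning the remainder
    : T extends `${string}:${infer P}`
      ? P                            // final param at the end of the path
      : never;                       // no params left

// The params object must have exactly the extracted keys.
type RouteParams<T extends string> = { [K in ExtractRouteParams<T>]: string };

function buildPath<T extends string>(path: T, params: RouteParams<T>): string {
  // Runtime-side escape hatch; the public signature carries the constraint.
  const values = params as unknown as Record<string, string>;
  return path.replace(/:([A-Za-z0-9_]+)/g, (_match, key: string) =>
    encodeURIComponent(values[key]));
}

const postUrl = buildPath("/users/:id/posts/:postId", { id: "42", postId: "7" });
// postUrl is "/users/42/posts/7"

// Never invoked: these lines exist only to show what the compiler rejects.
function compileTimeChecks(): void {
  // @ts-expect-error "postId" is missing
  buildPath("/users/:id/posts/:postId", { id: "42" });
  // @ts-expect-error "extra" is not a path parameter
  buildPath("/users/:id", { id: "42", extra: "x" });
}
```

The "no more, no less" requirement comes from two checks working together: missing keys fail ordinary assignability, and extra keys fail the excess property check on the object literal.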
Gemini 3 Pro’s collapse from 9.28 (Flash) to 5.62 (Pro) is the most analytically rich finding in this dataset. Pro generated 2,233 tokens, more than almost any other model. Volume wasn’t the problem. Structure was. The judges consistently penalized it for correct conceptual understanding paired with incorrect implementation. That combination scores worse than a concise, wrong answer in some rubrics because it signals the model knows what to do but can’t execute it reliably.
Second-order observation: The interesting part isn’t that Opus lost to Sonnet. It’s that the judge most likely to identify this correctly was Claude Opus itself, which gave an average score of 6.46, making it the strictest judge in the pool. Opus graded hard. It just didn’t generate well. Share if this flips your view.
JUDGE CALIBRATION
The strictest judges in this eval were the two Claude models and GPT-Codex, each averaging under 8.0 in the scores they gave. The leniency pattern tracks with what I’ve seen in past code evals: models that understand the problem domain are harder judges of the same domain. Gemini Flash scored 9.28 as a respondent and gave out scores averaging 9.32 as a judge. That’s not a coincidence. It’s a model that found the task tractable and evaluated others generously against its own confident benchmark.
What’s useful about a 2.86-point gap between strictest and most lenient judge (6.46 vs 9.32) is that it tells you the ranking is robust. The top models scored well across all judges, including the Hawks. If your model only looks good when the Doves are scoring, it’s not actually good. How strict should judges be?
EFFICIENCY-QUALITY MATRIX
The efficiency sweet spot is Gemini 3 Flash Preview. It completed in 6,171ms, scored 9.28, and delivered 1.504 score points per second. No other model comes close on that metric. If you’re running TypeScript type validation in a latency-sensitive pipeline, Flash is the only answer in this pool.
The worst efficiency ratio belongs to DeepSeek V3.2: 45,724ms to generate 1,364 tokens and score 8.84. That’s a 0.193 score/sec ratio. DeepSeek took 7.4x longer than Gemini Flash and scored 0.44 points lower. Gemini Pro is the worst quality-for-tokens story: 2,233 tokens generated to score 5.62, the weakest non-empty response in the pool.
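The score-per-second figures can be reproduced with a few lines. This sketch uses only the latency and score numbers quoted above; RunStats and scorePerSecond are hypothetical names.

```typescript
interface RunStats { model: string; score: number; latencyMs: number }

// Score points delivered per second of wall-clock latency.
function scorePerSecond(run: RunStats): number {
  return run.score / (run.latencyMs / 1000);
}

const flashRun: RunStats = { model: "Gemini 3 Flash Preview", score: 9.28, latencyMs: 6171 };
const deepseekRun: RunStats = { model: "DeepSeek V3.2", score: 8.84, latencyMs: 45724 };

const flashRatio = scorePerSecond(flashRun).toFixed(3);       // "1.504"
const deepseekRatio = scorePerSecond(deepseekRun).toFixed(3); // "0.193"

// How many times longer DeepSeek took than Flash.
const latencyMultiple = (deepseekRun.latencyMs / flashRun.latencyMs).toFixed(1); // "7.4"
```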
What tradeoff would you make: 6 seconds and 9.28, or 24 seconds and 9.49?
THE FRAMEWORK: S.T.A.T.E.
The top models all solved the Builder problem with a single unifying approach. I’m calling it S.T.A.T.E.: State-Typed API via Transitional Expressions.
The core idea: every method in a stateful API should return a different type that represents the updated state of the builder, not this. By returning ISetAge from setName() and IBuild from setAge(), the TypeScript compiler prevents any out-of-order calls because the methods literally don’t exist on the returned type.
Here’s how to apply it tomorrow:
Draw your valid state transitions as a flowchart. Each node is an interface.
Define one interface per state, exposing only the methods valid in that state.
Have each method return the next interface, not the current class or this.
Your entry point is a factory function that returns the first interface only.
build() lives only on the terminal state interface.
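The five steps above can be sketched as follows. ISetAge and IBuild are the interface names used in this section; INeedsName, Person, createPersonBuilder, and compileTimeChecks are hypothetical names added for illustration, and the closure-based implementation is just one way to satisfy the interfaces.

```typescript
interface Person { name: string; age: number }

// One interface per state, each exposing only the methods valid in that state.
interface INeedsName { setName(name: string): ISetAge }
interface ISetAge { setAge(age: number): IBuild }
interface IBuild { build(): Person } // build() lives only on the terminal state

// Factory entry point: callers only ever see the first interface.
function createPersonBuilder(): INeedsName {
  return {
    setName: (name) => ({
      setAge: (age) => ({
        build: () => ({ name, age }),
      }),
    }),
  };
}

const ada = createPersonBuilder().setName("Ada").setAge(36).build();

// Never invoked: these lines exist only to show what the compiler rejects.
function compileTimeChecks(): void {
  // @ts-expect-error setAge does not exist on the initial state
  createPersonBuilder().setAge(36);
  // @ts-expect-error build does not exist until the age is set
  createPersonBuilder().setName("Ada").build();
}
```

Each rejected call fails because the method is absent from the returned type, so the error lands at the call site.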
The same pattern applies to any ordered API: multi-step form builders, pipeline constructors, workflow engines. If you’ve ever had a runtime error because .build() was called before .setRequired(), this is the fix. This section gets screenshotted. If you use S.T.A.T.E. in production this week, I want to know.
BIGGER PICTURE
This eval puts pressure on a specific assumption in the industry: that larger, more capable general-purpose models outperform specialized or medium-scale models on hard technical tasks. Claude Opus 4.5 is positioned as the high-end reasoning model. It scored 9.07 on a task where structural precision matters more than breadth of reasoning. Sonnet’s 9.49 suggests that on tasks with clear formal constraints, concise generation wins over elaborate generation.
The Gemini Flash vs Pro gap is the starkest evidence yet that scale does not guarantee quality on every task type. Pro took 20,730ms and generated 2,233 tokens. Flash took 6,171ms and 1,030 tokens. Flash scored 3.66 points higher. The most expensive, verbose option was the wrong option by a significant margin. GLM 4.7 and MiniMax M2.1 produced empty responses. It is worth asking whether those models have meaningful TypeScript type-system training coverage at all.
The question this eval doesn’t answer is whether these results hold under real deployment conditions: streaming, partial completions, temperature variation. I ran everything at default settings. What assumption does this data challenge for you?
PRACTICAL TAKEAWAY MATRIX
What’s your use case?
WHAT’S NEXT
Day 99 will test models on async TypeScript patterns, specifically error handling with discriminated unions, to see whether Sonnet’s structural advantage holds or Flash’s efficiency becomes the dominant story.
SIGN-OFF
No sponsors. No affiliate links. Just data.
Paid subscribers get the full response forensics on this eval: line-by-line breakdown of what Sonnet did that Opus didn’t, the specific prompting modifications for DeepSeek’s latency problem, and a red-team analysis of where each top model breaks under adversarial TypeScript prompts.
Full data + methodology: https://github.com/themultivac/multivac-evaluation
Join the methodology discussion: https://discord.gg/QvVTPCxH
Thoughts? Reply below.
Dive Deeper…
HEADLINE: Claude Sonnet 4.5 vs Opus: The Forensics
Day 98 | CODE-010 | March 30, 2026 | Paid subscribers only
EXECUTIVE SUMMARY
Task tested: TypeScript compile-time constraint enforcement (Route params + Builder pattern)
Winner: Claude Sonnet 4.5 | Score: 9.49 | σ: 0.42
Biggest surprise: Gemini 3 Pro scored 5.62. Gemini 3 Flash scored 9.28. Same family, 3.66 point gap.
Single most actionable recommendation: For TypeScript type-systems tasks in latency-sensitive pipelines, replace Opus with Sonnet and Pro with Flash. You get better output faster for less cost.
Estimated read time: 14 minutes
Poll: Did this surprise you?
Yes, I expected Opus to win
No, I’ve seen this before
SECTION 1: RESPONSE FORENSICS
The key divergence between Claude Sonnet (9.49) and Claude Opus (9.07) came in the Builder implementation. Sonnet opened by establishing a state machine at the interface level, defining BuilderInitial, BuilderWithName, and BuilderWithAge as three separate contracts before implementing any class. Each interface exposed only the methods valid at that state. The judges rewarded this for correctness (+0.4 on average from the Hawks) because it meant the type error occurs at the call site, not inside the class.
Opus approached the same problem with phantom generics: a PersonBuilder<NameState, AgeState> class with this-typed method constraints. The approach is technically correct and arguably more sophisticated. But it introduces complexity that makes the type errors less obvious to a reader. The judges, including the strict Hawks, penalized Opus for clarity on this specific mechanism. The score gap isn’t “Sonnet is smarter.” It’s “Sonnet chose the more legible solution.”
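A rough sketch of the phantom-generics style, to show where the extra complexity comes from. This is a reconstruction under assumptions, not Opus's actual response: the Missing and Present markers, the optional phantom fields, and compileTimeChecks are all hypothetical.

```typescript
interface Person { name: string; age: number }

type Missing = "missing";
type Present = "present";

class PersonBuilder<NameState extends Missing | Present = Missing,
                    AgeState extends Missing | Present = Missing> {
  // Phantom fields: never assigned, present only so the type parameters
  // take part in assignability checks.
  readonly _nameState?: NameState;
  readonly _ageState?: AgeState;

  private personName = "";
  private personAge = 0;

  // The this-parameter restricts which builder states may call each method.
  setName(this: PersonBuilder<Missing, Missing>, value: string): PersonBuilder<Present, Missing> {
    const next = new PersonBuilder<Present, Missing>();
    next.personName = value;
    return next;
  }

  setAge(this: PersonBuilder<Present, Missing>, value: number): PersonBuilder<Present, Present> {
    const next = new PersonBuilder<Present, Present>();
    next.personName = this.personName;
    next.personAge = value;
    return next;
  }

  build(this: PersonBuilder<Present, Present>): Person {
    return { name: this.personName, age: this.personAge };
  }
}

const built = new PersonBuilder().setName("Ada").setAge(36).build();

// Never invoked: shows the out-of-order chain the compiler rejects.
function compileTimeChecks(): void {
  // @ts-expect-error the this-context has NameState "missing", setAge needs "present"
  new PersonBuilder().setAge(25).setName("x").build();
}
```

Every method still exists on the class, so the compiler reports a this-context mismatch rather than a missing method, which is harder to read than the interface-chaining error.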
GPT-5.2-Codex scored 9.36 with the cleanest response in the pool by token count: 1,364 tokens, no prose filler, pure type definitions followed by minimal usage examples. The judges gave it high marks for completeness (the type definitions were airtight) and high marks for clarity. The one area where it slightly underperformed Sonnet was depth, specifically the absence of explanation for why the pattern works. For a practitioner reading the code cold, Sonnet’s inline comments gave 0.13 additional points of clarity on average.
SECTION 2: INTER-JUDGE RELIABILITY
The two judges who agreed most across respondents were Grok 3 (Direct) and GPT-5.2-Codex, both Moderate to Hawk tier, with consistent ordering across the top 6 models. The most disagreement occurred on DeepSeek V3.2 and Gemini 3 Pro Preview.
DeepSeek generated the widest spread of any model that produced substantive output: minimum 6.80, maximum 10.00, standard deviation 1.10. The 10.00 came from a Dove judge (Gemini Flash, avg 9.32). The 6.80 came from a Hawk. DeepSeek’s response was correct in approach but had a structural bug in the Builder’s state transitions that lenient judges missed and strict judges penalized heavily.
Gemini 3 Pro generated the second-highest disagreement, σ=1.06. The response showed correct conceptual framing but implementation errors in the recursive template literal type extraction. Judges who weighted correctness heavily penalized it; judges who weighted explanation and effort scored it more generously. This is a genuine rubric ambiguity worth flagging: should “understands the problem but can’t execute it” score closer to 5 or closer to 7? My current rubric lands it around 5-6. That’s a methodology question I want community input on.
The most predictive single judge in this eval was Claude Opus 4.5. Its Hawk-tier strictness (6.46 avg) correlated most closely with the final weighted rankings. If you had only one judge for this category of problem, you’d want it to be the model that generates poorly but evaluates rigorously.
SECTION 3: LONGITUDINAL TRACKING
Claude Sonnet 4.5 has now placed in the top 2 in the last three code evals. This eval: 9.49, rank 1. Previous code eval: 9.31, rank 2. The trajectory is upward. Grok 3 (Direct) has been consistently rank 2-3 across code tasks. No significant drift. It’s a reliable second-tier model for TypeScript specifically.
Claude Opus 4.5 has been inconsistent across code evals. It has ranked as high as 2nd on async debugging and as low as 6th here. The pattern suggests Opus performs better when the problem rewards broad reasoning (debugging, architecture) and worse when the problem rewards structural precision (type systems, formal constraints). That’s a useful deployment heuristic.
Gemini 3 Pro’s 5.62 is its worst code eval performance in the dataset. It scored 8.1 on the JSON parsing eval and 7.4 on the async debugging eval. The collapse on type-systems tasks may reflect a specific gap in its training coverage for advanced TypeScript. One prediction: if I run a Python type-annotation eval with similar structural constraints, Pro will recover to its 7-8 range. TypeScript template literal types are a narrow enough skill that the gap may be language-specific.
SECTION 4: MODEL-SPECIFIC PROMPTING MODIFICATIONS
Claude Sonnet 4.5 (Score: 9.49)
Failure mode observed: The Route type implementation correctly extracted params but didn’t handle the edge case of static paths with no params (it returned never instead of {}).
Specific fix: Add to your prompt: “Ensure the params type falls back to an empty object {} for routes with no path parameters.”
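Mechanism: Sonnet infers never when no params are found. Wrapping the check in tuples, as in [Keys] extends [never] ? {} : ... where Keys is the extracted param union, fixes it cleanly. A bare Keys extends never does not, because a conditional type distributes over the empty union and returns never again.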
Estimated improvement: +0.1-0.2 score on evals that include no-param route tests.
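A small sketch of the fallback, with one detail that is easy to get wrong: a conditional type on a bare type parameter distributes over never and yields never again, so the check needs tuple wrapping. NaiveParams, SafeParams, and IsNever are hypothetical names.

```typescript
// Naive fallback: K is a bare type parameter, so the conditional distributes.
// Distributing over never (the empty union) produces never, not {}.
type NaiveParams<K extends string> = K extends never ? {} : { [P in K]: string };

// Tuple-wrapped check: no distribution, so the empty case resolves to {}.
type SafeParams<K extends string> = [K] extends [never] ? {} : { [P in K]: string };

// Type-level probe that reports whether a type is never.
type IsNever<T> = [T] extends [never] ? true : false;

// These annotations compile only if the types behave as described.
const naiveIsNever: IsNever<NaiveParams<never>> = true;
const safeIsNever: IsNever<SafeParams<never>> = false;

// With real keys, both forms produce the expected object type.
const withKeys: SafeParams<"id"> = { id: "42" };
```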
Grok 3 Direct (Score: 9.42)
Failure mode: Verbose explanation sections added after each code block, useful for learning but penalized by judges weighting conciseness.
Specific fix: “Provide only the type definitions and minimal usage examples. No inline explanation blocks.”
Mechanism: Grok 3 defaults to pedagogical output. Suppressing it tightens the response and removes targets for “unnecessary verbosity” penalties.
Estimated improvement: +0.05-0.1.
GPT-5.2-Codex (Score: 9.36)
Failure mode: Used declare function and declare const builder syntax, which is type-declaration-only. Judges that tested for runtime validity penalized it.
Specific fix: “Implement the functions and classes with full runtime bodies, not just type declarations.”
Mechanism: Codex in its default mode leans toward type-level sketching. Forcing runtime bodies closes the gap with Sonnet.
Estimated improvement: +0.1-0.15.
Gemini 3 Flash Preview (Score: 9.28)
Failure mode: The Builder implementation used a single class implementing multiple interfaces, which technically works but creates a footgun where developers can cast around the type system.
Specific fix: “Implement each builder state as a separate class, not a single class implementing multiple interfaces. The entry point should be a factory function.”
Mechanism: The factory function pattern eliminates the possibility of bypassing state enforcement via class instantiation.
Estimated improvement: +0.15-0.2.
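To make the footgun concrete, here is a hedged sketch of both layouts. None of these classes come from Flash's response; NeedsName, NeedsAge, Ready, AllInOneBuilder, NameStep, AgeStep, ReadyStep, and createBuilder are hypothetical names.

```typescript
interface Person { name: string; age: number }
interface NeedsName { setName(value: string): NeedsAge }
interface NeedsAge { setAge(value: number): Ready }
interface Ready { build(): Person }

// Footgun: one class implements every state, so anyone holding the class type
// (or casting back to it) can call the methods in any order.
class AllInOneBuilder implements NeedsName, NeedsAge, Ready {
  private personName = "";
  private personAge = 0;
  setName(value: string): NeedsAge { this.personName = value; return this; }
  setAge(value: number): Ready { this.personAge = value; return this; }
  build(): Person { return { name: this.personName, age: this.personAge }; }
}

// Separate class per state: each object lacks the other states' methods entirely.
class NameStep implements NeedsName {
  setName(value: string): NeedsAge { return new AgeStep(value); }
}
class AgeStep implements NeedsAge {
  private readonly personName: string;
  constructor(personName: string) { this.personName = personName; }
  setAge(value: number): Ready { return new ReadyStep(this.personName, value); }
}
class ReadyStep implements Ready {
  private readonly person: Person;
  constructor(personName: string, personAge: number) {
    this.person = { name: personName, age: personAge };
  }
  build(): Person { return this.person; }
}

// Factory entry point returns only the first state.
function createBuilder(): NeedsName { return new NameStep(); }

// Compiles and runs, producing an unconfigured person: the bypass in action.
const bypassed = new AllInOneBuilder().build();
const viaFactory = createBuilder().setName("Ada").setAge(36).build();
```

With separate classes, a cast can still lie to the compiler, but the call then fails at runtime because the method does not exist on the object.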
Grok Code Fast 1 (Score: 9.16)
Failure mode: The Route type used Record<ExtractRouteParams<T>, string> which doesn’t enforce excess property checking, meaning extra params in the object aren’t flagged.
Specific fix: “Use mapped types { [K in ExtractRouteParams<T>]: string } instead of Record<...> to enable excess property checking.”
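Mechanism: TypeScript’s excess property checking applies to object literals passed directly to functions. One caveat: Record<K, V> is itself declared as the mapped type { [P in K]: V }, so the two spellings should check identically. If the rewrite changes behavior, the more likely cause is that the key type had widened to string, so confirm that ExtractRouteParams<T> resolves to a union of literal keys.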
Estimated improvement: +0.1-0.15.
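Before changing a prompt on this basis, it may be worth checking how the two spellings behave. TypeScript's standard library declares Record<K, V> as the mapped type { [P in K]: V }, so this sketch expects both to reject extra keys. Keys, ViaRecord, ViaMapped, Equal, and the helper functions are hypothetical names.

```typescript
type Keys = "id" | "postId";
type ViaRecord = Record<Keys, string>;
type ViaMapped = { [K in Keys]: string };

// True only when A and B are assignable to each other.
type Equal<A, B> = [A] extends [B] ? ([B] extends [A] ? true : false) : false;
const sameShape: Equal<ViaRecord, ViaMapped> = true;

function takesRecord(params: ViaRecord): string[] { return Object.keys(params); }
function takesMapped(params: ViaMapped): string[] { return Object.keys(params); }

// Never invoked: shows what the compiler rejects and what it lets through.
function compileTimeChecks(): void {
  // @ts-expect-error extra key is rejected with Record
  takesRecord({ id: "1", postId: "2", extra: "x" });
  // @ts-expect-error extra key is rejected with the mapped type as well
  takesMapped({ id: "1", postId: "2", extra: "x" });

  // When the key type widens to string, any key is accepted. This is the case
  // that silently disables the check, with either spelling.
  const loose: Record<string, string> = { id: "1", extra: "x" };
  void loose;
}
```

If extra params were not flagged in a real response, the key union collapsing to string (for example because the path was not captured as a literal type) seems a more plausible cause than the choice of Record.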
SECTION 5: COUNTERFACTUAL ANALYSIS
Confound 1: Default temperature settings. I ran all models at default temperature. For TypeScript type problems, lower temperature typically reduces hallucinated type syntax. If DeepSeek V3.2 had been run at temperature 0.2 instead of default, its variance (σ=1.10) would likely compress to 0.5-0.7, potentially raising its average from 8.84 to 9.1-9.2. That would lift it past Claude Opus 4.5 (9.07) and possibly Grok Code Fast 1 (9.16), but not past Gemini Flash (9.28). This is a structural confound because DeepSeek’s default temperature is higher than the others. Estimated ranking shift: DeepSeek moves from 7th to 5th or 6th. Gemini Flash holds 4th.
Confound 2: Single-prompt evaluation. This is one prompt. TypeScript type systems span template literals, conditional types, mapped types, infer patterns, phantom types, and discriminated unions. The Route + Builder problem tests two specific patterns. Gemini Pro’s collapse to 5.62 may be specific to template literal recursion. On mapped types or conditional types, it may recover to its historical 7-8 range. If I had tested a second TypeScript problem in this batch, Gemini Pro’s ranking might have turned out to be incidental rather than structural. I don’t have evidence that its 5.62 generalizes across all TypeScript type tasks.
SECTION 6: RED TEAM PREDICTIONS
Claude Sonnet 4.5 (9.49)
Structural weakness: Sonnet’s Builder uses interface chaining, which breaks when you need optional steps in the build sequence. If I add .setEmail(email: string) as an optional step that can be called or skipped before .build(), the interface chain requires forking into multiple paths.
Prompt pattern to expose it: “Implement a Builder with one required field (name), one optional field (email), and a build() method that works with or without email set.”
Expected failure mode: Either disallows skipping the optional step (compile error on valid usage) or allows skipping required steps (misses the constraint).
Confidence: High.
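The fork this prediction refers to can be sketched as follows. It shows one way interface chaining can handle an optional step, assuming a single optional field; NeedsName, CanSetEmailOrBuild, CanBuild, and createBuilder are hypothetical names.

```typescript
interface Person { name: string; email?: string }

interface NeedsName { setName(value: string): CanSetEmailOrBuild }
// Fork: after the required step, both the optional step and build() are valid.
interface CanSetEmailOrBuild {
  setEmail(value: string): CanBuild;
  build(): Person;
}
interface CanBuild { build(): Person }

function createBuilder(): NeedsName {
  return {
    setName: (name) => ({
      setEmail: (email) => ({ build: () => ({ name, email }) }),
      build: () => ({ name }),
    }),
  };
}

const withEmail = createBuilder().setName("Ada").setEmail("ada@example.com").build();
const withoutEmail = createBuilder().setName("Ada").build();

// Never invoked: the required step still cannot be skipped.
function compileTimeChecks(): void {
  // @ts-expect-error build does not exist before the name is set
  createBuilder().build();
}
```

One optional field needs one extra interface. Each additional optional field multiplies the valid paths, which is where a hand-written chain starts to strain.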
Grok 3 Direct (9.42)
Structural weakness: Verbose pedagogical output inflates token count without score benefit. Under strict judge panels, this creates a surface area for penalty.
Prompt pattern to expose it: “Respond with only code. No explanations, no comments, no prose.”
Expected failure mode: Grok 3 will comply but over-correct, sometimes removing necessary type comments that clarify the pattern to judges.
Confidence: Medium.
GPT-5.2-Codex (9.36)
Structural weakness: Prefers declare syntax (type-level sketches) over runtime implementations. On tasks where judges test the code by running it, this fails silently.
Prompt pattern to expose it: “Provide runnable TypeScript code with test cases that verify compile errors when constraints are violated.”
Expected failure mode: Will produce declare function signatures that pass the type check but throw at runtime when actually called.
Confidence: High.
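A minimal sketch of why declare-only output passes the type check and still fails when executed. declaredOnlyRoute, implementedRoute, and describeCall are hypothetical names, and the sketch assumes nothing else defines declaredOnlyRoute at runtime.

```typescript
// Type-only declaration: satisfies the checker but emits no JavaScript.
declare function declaredOnlyRoute(path: string): string;

// Counterpart with a runtime body.
function implementedRoute(path: string): string {
  return path.charAt(0) === "/" ? path : "/" + path;
}

// Runs a call and reports whether it returned or threw.
function describeCall(call: () => string): string {
  try {
    return "ok: " + call();
  } catch (err) {
    return err instanceof ReferenceError ? "ReferenceError" : "other error";
  }
}

const declaredResult = describeCall(() => declaredOnlyRoute("users"));   // "ReferenceError"
const implementedResult = describeCall(() => implementedRoute("users")); // "ok: /users"
```

Both calls type-check identically, so a judge that only compiles the response cannot tell them apart. Only executing the code exposes the difference.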
SECTION 7: MODEL SELECTION DECISION TREE
START: Do you need TypeScript compile-time constraint enforcement?
│
├── YES: Is latency under 10 seconds critical?
│   ├── YES → Gemini 3 Flash Preview (6.17s, score 9.28)
│   └── NO: Is max quality the priority?
│       ├── YES → Claude Sonnet 4.5 (23.78s, score 9.49)
│       └── NO: Is token cost the primary constraint?
│           ├── YES → GPT-5.2-Codex (1,364 tokens, score 9.36)
│           └── NO → Grok 3 Direct (9.42 score, balanced latency)
│
└── NO: Is this general TypeScript (not type-systems)?
    └── Re-evaluate with a different eval category
Do not use Gemini 3 Pro, GLM 4.7, or MiniMax M2.1 for this task category. 5.62, 4.25, and 4.24 are not deployment-viable scores. The variance on GLM and MiniMax (σ > 3.9) means you’d be gambling on every inference call.
SECTION 8: THE MULTIVAC TAKE
Deploy Claude Sonnet 4.5 for TypeScript type-systems tasks, and swap it in everywhere you’re currently using Opus for code. The 0.42-point gap (9.49 vs 9.07) is consistent with what I’ve seen across three evals now: Sonnet wins when precision and structure matter more than breadth. Opus is better for open-ended reasoning, debugging with incomplete information, and architecture-level discussion. For compile-time constraints, it’s not the right tool.
The capability gap labs should fix is Gemini Pro’s template literal type performance. A 3.66-point intra-family gap between Flash and Pro is embarrassing for a model positioned as the premium offering. My working hypothesis is that the Pro model’s RLHF process optimized for readable explanations over correct type system implementations, and Flash’s smaller scope kept it more honest on structural tasks. Google needs to either fix Pro’s TypeScript type coverage or stop marketing it for code-heavy tasks.
One falsifiable prediction: in the next TypeScript eval, if I test mapped types and conditional types instead of template literals, Gemini Pro will score above 7.5. If it doesn’t, the gap is architectural rather than coverage-specific. I’ll run that eval within the next 10 days.
I could be wrong. Here is what would change my mind: if Opus consistently outperforms Sonnet on a different set of TypeScript problems, specifically ones that require multi-file reasoning or API design rather than type system mechanics, I’d revise the deployment recommendation. One eval doesn’t settle it. But the longitudinal trend across three code evals now points the same direction.
PAID POST SIGN-OFF
Discuss this analysis in the Discord: https://discord.gg/QvVTPCxH
Raw data: https://github.com/themultivac/multivac-evaluation
Agree? Debate below.




