AI Writing Benchmarks Explained: How We Measure Fiction Quality in 2026
A deep dive into the benchmarks that measure AI creative writing ability. What they test, how they score, and what the results actually mean for fiction.
How do you measure if AI writes good fiction?
It's not like math where answers are right or wrong. Fiction quality is subjective—one reader's compelling prose is another's purple garbage. Yet benchmarks exist, and they reveal surprising insights about which models actually write well.
Here's what the major AI writing benchmarks actually measure and what their results mean.
Why Benchmarks Matter
Most AI benchmarks test factual knowledge, reasoning, or code. Creative writing gets ignored because it's "subjective."
But that subjectivity is exactly why we need benchmarks. Without measurement:
- Marketing claims go unchallenged
- Users waste money on inferior models
- Quality improvements go unnoticed
- The field stagnates
The benchmarks that exist aren't perfect, but they're better than vibes.
The Major AI Writing Benchmarks
Lechmazur Creative Writing Benchmark
The Lechmazur benchmark is one of the most rigorous evaluations of LLM creative writing ability. Every story must meaningfully incorporate ten required elements: character, object, concept, attribute, action, method, setting, timeframe, motivation, and tone.
Methodology:
- 18-question rubric with 10-point scale (0.1 increments)
- 7 LLM graders per story (Claude Sonnet 4.5, DeepSeek V3.2 Exp, Gemini 3 Pro Preview, GPT-5.1, Grok 4.1, Kimi K2-0905, Qwen 3 Max)
- 60/40 weighting: Q1-Q8 (craft) vs Q9A-Q9J (element integration)
- Power mean (Hölder mean, p = 0.5) rewards balanced performance (sketched in code after this list)
- 2,796 samples per model in V4
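To make the aggregation concrete, here is a minimal Python sketch of a weighted Hölder mean with p = 0.5 over the 18 rubric questions, split 60/40 between craft and element integration. The helper names and exact aggregation order are assumptions for illustration; the benchmark's own pipeline may combine the seven graders' scores differently.

```python
# Illustrative sketch of Lechmazur-style score aggregation (not the official code).
# The question split and weights follow the methodology list above; everything else is assumed.

def power_mean(scores, weights, p=0.5):
    """Weighted Hölder (power) mean: (sum_i w_i * x_i**p) ** (1/p), with sum(w_i) == 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * (x ** p) for w, x in zip(weights, scores)) ** (1.0 / p)

def story_score(craft_scores, element_scores):
    """craft_scores: Q1-Q8 (each 0-10, already averaged over the 7 graders);
    element_scores: Q9A-Q9J. Craft carries 60% of the weight, elements 40%."""
    craft_w = [0.6 / len(craft_scores)] * len(craft_scores)
    elem_w = [0.4 / len(element_scores)] * len(element_scores)
    return power_mean(craft_scores + element_scores, craft_w + elem_w, p=0.5)

# With p < 1, one weak answer drags the score down more than an arithmetic mean would:
print(round(story_score([8.5] * 8, [8.0] * 9 + [3.0]), 3))  # ~8.047 vs. an 8.1 arithmetic mean
```

Because p = 0.5 penalizes any low question score disproportionately, a model can't coast on strong prose while ignoring a required element.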
Current Leaders (V4):
| Rank | Model | Score |
|---|---|---|
| 1 | GPT-5.2 (medium reasoning) | 8.511 |
| 2 | GPT-5 Pro | 8.474 |
| 3 | GPT-5.1 (medium reasoning) | 8.438 |
| 4 | GPT-5 (medium reasoning) | 8.434 |
| 5 | Kimi K2-0905 | 8.331 |
| 6 | Gemini 3 Pro Preview | 8.221 |
| 7 | Gemini 2.5 Pro | 8.219 |
| 8 | Mistral Medium 3.1 | 8.201 |
| 9 | Claude Opus 4.5 (no reasoning) | 8.195 |
| 10 | Claude Sonnet 4.5 Thinking 16K | 8.169 |
The GPT-5 family dominates the top spots, with GPT-5.2 holding a slim lead. Notably, Kimi K2-0905 (#5) outperforms all Claude models, and Mistral Medium 3.1 (#8) beats Claude Opus 4.5.
Other notable performers:
- Claude Sonnet 4.5 (no reasoning): Scores 8.112 (#11), solid but trailing the GPT and Gemini families
- Qwen 3 Max Preview: Scores 8.091 (#12), a competitive showing that Chinese labs can match Western ones
- DeepSeek V3.2 Exp: Scores 7.159 (#19), good cost-performance despite the lower ranking
- Llama 4 Maverick: Scores 5.777 (#23), a significant gap behind the other models on creative writing tasks
EQ-Bench Creative Writing (Longform)
EQ-Bench Longform tests what most benchmarks ignore: sustained quality over length. Models generate 8 chapters of approximately 1,000 words each.
Why this matters: Most AI writing sounds good for a paragraph. The real test is maintaining quality across 8,000+ words. Character consistency, plot coherence, voice stability—these only appear in longer works.
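As a rough picture of what a longform harness does, here is a toy chapter-by-chapter generation loop. The `generate` function and prompt wording are hypothetical stand-ins, not EQ-Bench's actual harness; the point is that each chapter is conditioned on everything written so far, which is where drift creeps in.

```python
# Toy longform generation loop (not EQ-Bench's harness).
# `generate` stands in for whichever model API you actually call.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def write_longform(premise: str, chapters: int = 8, words_per_chapter: int = 1000) -> list[str]:
    story: list[str] = []
    for i in range(1, chapters + 1):
        so_far = "\n\n".join(story) or "(none yet)"
        prompt = (
            f"Premise: {premise}\n\n"
            f"Story so far:\n{so_far}\n\n"
            f"Write chapter {i} of {chapters}, roughly {words_per_chapter} words. "
            "Keep characters, plot threads, and narrative voice consistent."
        )
        story.append(generate(prompt))
    return story
```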
Scoring dimensions (14 total):
- Narrative coherence
- Character development
- Dialogue authenticity
- Emotional resonance
- Prose quality
- Pacing and structure
- World consistency
- Thematic depth
- And more...
Models that score well on short-form benchmarks often collapse on longform. The correlation isn't as strong as you'd expect.
EQ-Bench Creative Writing v3
The standard EQ-Bench creative writing test uses both rubric scoring and pairwise comparison, generating ELO ratings similar to chess rankings.
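The Elo part works like chess ratings: each judged pairwise win or loss nudges both models' scores toward their demonstrated strength. A minimal sketch of the core update rule follows; EQ-Bench's real pipeline layers on tie handling, anchor models, and multiple judges, so treat this as the idea rather than the implementation.

```python
# Standard Elo update, the core of chess-style rating systems.
# EQ-Bench's actual pipeline adds its own details on top of this.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """Return both ratings after one judged pairwise comparison."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Two models start at 1200; A wins one judged comparison and gains 8 points.
print(elo_update(1200, 1200, a_won=True))  # (1208.0, 1192.0)
```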
Unique features:
- Slop score: Measures overused AI phrases ("tapestry of emotions," "I couldn't help but notice"); a toy version is sketched after this list
- Repetition metrics: Catches models that recycle sentence structures
- Bias detection: Identifies when models favor certain styles unfairly
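The real slop metric uses EQ-Bench's own phrase list and normalization; as a toy illustration of the idea, a slop score can be as simple as counting known AI-isms per thousand words:

```python
import re

# Toy slop metric: known AI-isms per 1,000 words.
# The phrase list is a tiny illustrative sample, not EQ-Bench's actual list,
# and the real benchmark normalizes differently.
SLOP_PHRASES = [
    "tapestry of emotions",
    "i couldn't help but notice",
    "a testament to",
    "sent shivers down",
]

def slop_per_1000_words(text: str) -> float:
    words = len(re.findall(r"\w+", text))
    if words == 0:
        return 0.0
    lowered = text.lower()
    hits = sum(lowered.count(phrase) for phrase in SLOP_PHRASES)
    return 1000.0 * hits / words
```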
Current Leaders (Lower Slop/Repetition = Better):
| Model | Slop | Repetition |
|---|---|---|
| horizon-alpha | 1.5 | 2.3 |
| horizon-beta | 1.6 | 2.2 |
| gpt-5-2025-08-07 | 1.6 | 2.4 |
| gpt-5-mini | 1.9 | 2.6 |
| gpt-5-nano | 2.1 | 3.0 |
| Kimi-K2-Instruct | 2.2 | 3.4 |
| claude-sonnet-4.5 | 2.2 | 3.6 |
| claude-opus-4-5 | 2.3 | 4.3 |
| gpt-5.2 | 2.3 | 3.0 |
| o3 | 2.4 | 2.7 |
The slop score is particularly revealing. Most GPT-5 variants and the new Horizon models achieve the lowest slop scores (1.5-2.1), while Claude models score slightly higher (2.2-2.3). DeepSeek-R1 (not shown in the table above) struggles with slop (4.3) despite strong reasoning capabilities.
WritingBench
WritingBench takes a different approach: 1,239 queries across 6 domains. Published at NeurIPS 2025, it's among the most comprehensive writing benchmarks available.
Domains tested:
- Academic writing
- Business communication
- Legal documents
- Literature/fiction
- Educational content
- Marketing copy
Scoring:
- 10-point scale with dynamic rubrics (a toy version is sketched after this list)
- Domain-specific evaluation criteria
- Cross-domain performance comparison
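In practice, rubric-based judging boils down to assembling a per-query prompt from domain-specific criteria and parsing a numeric score back out. The sketch below shows that shape with a hypothetical `ask_judge` call; the prompt wording and criteria are illustrative, not WritingBench's actual rubrics.

```python
import re

# Toy rubric-based judging in the spirit of dynamic, per-query rubrics.
# `ask_judge` and the prompt text are hypothetical stand-ins.

def ask_judge(prompt: str) -> str:
    raise NotImplementedError("call your judge model's API here")

def score_response(query: str, response: str, criteria: list[str]) -> float:
    rubric = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        f"Writing task:\n{query}\n\n"
        f"Response:\n{response}\n\n"
        f"Score the response from 1 to 10 against these criteria:\n{rubric}\n"
        "Reply with only the number."
    )
    match = re.search(r"\d+(\.\d+)?", ask_judge(prompt))
    return float(match.group()) if match else 0.0

# Criteria change per domain; for fiction they might look like:
fiction_criteria = ["narrative voice consistency", "show-don't-tell", "dialogue naturalism"]
```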
For fiction specifically, the literature domain tests:
- Narrative voice consistency
- Show-don't-tell execution
- Dialogue naturalism
- Scene construction
- Emotional authenticity
Current Leaders:
| Rank | Model | Organization | Overall Score |
|---|---|---|---|
| 1 | GPT-5 (Aug 2025) | OpenAI | 83.87 |
| 2 | Qwen3-235B (thinking) | Alibaba | 82.34 |
| 3 | Kimi K2 Instruct | Moonshot | 81.26 |
| 4 | Claude Sonnet 4.5 | Anthropic | 80.71 |
| 5 | o3 (Apr 2025) | OpenAI | 80.46 |
| 6 | Qwen3 Max | Alibaba | 80.26 |
| 8 | Claude Opus 4.5 | Anthropic | 79.78 |
| 10 | Gemini 2.5 Pro | Google | 79.26 |
| 14 | Gemini 3 Pro Preview | Google | 78.50 |
| 16 | Claude Haiku 4.5 | Anthropic | 77.09 |
GPT-5 leads WritingBench with 83.87, pulling ahead of Qwen3. Notably, Claude Sonnet 4.5 (#4) outperforms Claude Opus 4.5 (#8) here—suggesting the newer Sonnet is better optimized for diverse writing tasks.
NC-Bench
NC-Bench focuses on creative writing copilot capabilities—how well AI assists writers rather than writes independently.
Tests include:
- Instruction following accuracy
- Prose improvement suggestions
- Continuation quality
- Style matching
- Editing capabilities
Current Leaders:
| Model | Overall Score |
|---|---|
| o4 Mini High | 86.94% |
| o4 Mini | 85.85% |
| Gemini 3 Pro (Preview) | 85.69% |
| Gemini 2.5 Pro | 85.11% |
| Claude Sonnet 4 | 83.20% |
| GPT-4.1 | 81.18% |
| Claude 3.5 Sonnet (new) | 80.29% |
| Claude Opus 4 | 80.23% |
| Llama 3.1 405B | 75.12% |
| Claude 3.5 Haiku | 73.73% |
Interestingly, o4 Mini models lead NC-Bench for copilot tasks, suggesting smaller, faster models can excel at assisting writers. Gemini models also perform exceptionally well here.
This benchmark matters for tools like Sudowrite and Novelcrafter where AI assists human writers rather than generating complete works.
Fiction.LiveBench
Fiction.LiveBench tests something unique: long context comprehension for fiction.
Structure:
- 36 questions across 30 stories
- Context lengths from 0 to 192,000 tokens
- Tests memory and consistency over extreme lengths (a toy setup is sketched below)
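A toy version of that setup: trim a story to a target context length, then ask questions whose answers were established early in the text. Here `ask_model` is a hypothetical stand-in and words stand in for tokens; this is not Fiction.LiveBench's actual harness.

```python
# Toy long-context comprehension check (not Fiction.LiveBench's harness).
# Words stand in for tokens; `ask_model` is whatever model you're testing.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call the model under test here")

def accuracy_at_length(story: str, qa_pairs: list[tuple[str, str]], target_tokens: int) -> float:
    """Trim the story to roughly `target_tokens`, ask questions whose answers
    appear early in the text, and score simple substring matches."""
    context = " ".join(story.split()[:target_tokens])
    correct = 0
    for question, answer in qa_pairs:
        reply = ask_model(f"{context}\n\nQuestion: {question}\nAnswer briefly.")
        correct += int(answer.lower() in reply.lower())
    return correct / len(qa_pairs)
```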
Selected Results by Context Length (accuracy %):
| Model | 32k | 60k | 120k | 192k |
|---|---|---|---|---|
| o3 | 88.9 | 83.3 | 100.0 | 58.1 |
| gpt-5 | 97.2 | 100.0 | 96.9 | 87.5 |
| grok-4 | 94.4 | 91.7 | 96.9 | 84.4 |
| gemini-2.5-pro-preview | 91.7 | 83.3 | 87.5 | 90.6 |
| claude-sonnet-4:thinking | 100.0 | 91.7 | 81.3 | — |
| grok-4-fast:free | 94.4 | 80.6 | 75.0 | 78.1 |
| gpt-5-mini | 63.9 | 61.1 | 62.5 | 59.4 |
Why it matters: Writing a novel requires remembering what happened 100,000 words ago. Most models degrade severely as context grows. Fiction.LiveBench measures exactly how badly.
GPT-5 and Gemini 2.5 Pro maintain the strongest performance at 192k tokens (87.5% and 90.6% respectively). Most models drop significantly—even o3 falls to 58.1% at maximum context despite a perfect score at 120k.
What Benchmarks Miss
No benchmark fully captures "good fiction." They miss:
- Reader engagement: Does anyone actually want to keep reading?
- Genre appropriateness: Literary fiction and LitRPG have different standards.
- Originality: Following patterns scores well but produces derivative work.
- Emotional impact: Technical competence doesn't equal moving readers.
- Long-term consistency: Even 8,000 words is short for a novel.
Benchmarks measure what's measurable. The ineffable qualities that make fiction memorable often escape quantification.
What Matters for Fiction Readers
At narrator, we care most about:
- Longform consistency - Stories that don't fall apart after a few chapters
- Low slop scores - Prose that doesn't scream "AI wrote this"
- Dialogue naturalism - Characters who sound like people, not assistants
- Extended context handling - Novels where chapter 20 remembers chapter 1
We optimize for reading pleasure, not benchmark scores. The goal is fiction you actually enjoy, not fiction that scores well on tests.
Benchmark Comparison Table
| Benchmark | What It Tests | Scale | Best For |
|---|---|---|---|
| Lechmazur | General quality | 10-point | Overall comparison |
| EQ-Bench Longform | Sustained quality | Multi-dimensional | Novel writing |
| EQ-Bench v3 | Style + slop | ELO | Avoiding AI-sounding prose |
| WritingBench | Domain skills | 10-point | Specific use cases |
| NC-Bench | Copilot ability | Percentage | Writing assistants |
| Fiction.LiveBench | Long context | Accuracy | Novel-length works |
The Benchmark Arms Race
Models are increasingly optimized for benchmarks specifically. This creates problems:
- Teaching to the test: Models perform well on benchmark-style prompts but fail on real usage.
- Metric gaming: Optimizing for measurable factors while ignoring unmeasurable ones.
- Benchmark saturation: Top models cluster together, making differentiation difficult.
The best approach: use benchmarks as filters, not final answers. A model that scores poorly probably writes poorly. A model that scores well might write well—or might just game the metrics.
What Actually Matters for Fiction
After analyzing all these benchmarks, what predicts good fiction writing?
Longform consistency beats short-form brilliance. A model that writes decent prose for 50,000 words beats one that writes beautiful prose for 500 words then degrades.
Low slop matters more than high scores. Avoiding AI-isms is harder than achieving technical competence.
Context handling is underrated. Models that remember chapter 1 in chapter 20 produce dramatically better fiction.
Genre-specific evaluation is lacking. Romance, LitRPG, and literary fiction have different standards that generic benchmarks miss.
The Future of AI Writing Evaluation
Benchmarks will evolve toward:
- Reader preference studies rather than LLM judges
- Genre-specific metrics for different fiction categories
- Engagement measurement (do readers finish the story?)
- Longer evaluation contexts (full novel assessment)
Until then, use current benchmarks as rough guides. They're imperfect but better than nothing.
How Other Models Score
Beyond the top performers, several models show interesting strengths across benchmarks:
Lechmazur Creative Writing (Full Rankings):
- Kimi K2-0905: Surprise #5 performer (8.331), outperforming all Claude models on creative writing
- Mistral Medium 3.1: Solid #8 (8.201), beating Claude Opus 4.5 at a lower price point
- Qwen 3 Max Preview: Strong #12 (8.091), showing Chinese labs can compete at the top
- Mistral Large 3: Scores 7.595 (#15), decent budget option
- DeepSeek V3.2 Exp: Scores 7.159 (#19), cost-effective but significant quality gap from leaders
WritingBench Leaders:
- Qwen3-235B (thinking): #2 on WritingBench overall (82.34), demonstrating open-weight models can match proprietary ones
- Kimi K2 Instruct: Strong #3 showing (81.26), particularly good for long-form content with its 256K token context
Specialized Performers:
- Gemini 3 Pro Preview: #6 on Lechmazur (8.221), excels at long-context tasks up to 1M tokens
- Gemini 2.5 Pro: Nearly tied at #7 (8.219), excellent multimodal reasoning
- Claude Sonnet 4.5 Thinking 16K: #10 (8.169), better than non-thinking Sonnet
The Open-Source Surge: One of the most notable trends in 2025-2026 is how open-source models have closed the gap. Qwen3-235B taking the #2 spot on WritingBench, within two points of GPT-5, shows that proprietary models no longer have an insurmountable advantage in writing quality. This matters for writers who want quality without vendor lock-in.
Cost-Performance Considerations: While GPT-5.2 leads on raw quality, models like Mistral Medium 3.1 (#8), Qwen 3 Max Preview (#12), and DeepSeek V3.2 (#19) offer compelling alternatives for writers who need good quality at lower costs. The gap between "best" and "good enough" has narrowed significantly.
The Bottom Line
Benchmarks reveal that:
- GPT-5.2 leads creative writing, with the GPT-5 family dominating the top 4 spots
- Kimi K2-0905 is a surprise strong performer (#5), beating all Claude models
- Claude Opus 4.5 scores respectably at #9 (8.195), but trails Gemini and Mistral
- Chinese models like Qwen 3 Max Preview (#12) are competitive
- Longform consistency separates good from great
- Slop detection matters for natural-sounding prose
- Context handling limits novel-length quality
The best fiction comes from models that score well across multiple benchmarks—and from platforms like narrator that optimize specifically for reading pleasure rather than generic metrics.
Numbers tell part of the story. The rest you have to read for yourself.
Sources: Lechmazur Creative Writing Benchmark, EQ-Bench Creative Writing, WritingBench, NC-Bench, Fiction.LiveBench