— Door 02 · Show, don't tell
The training work,
already being done.
The fastest way to show I can do this brief is to do a slice of it. Here's a literary red-team engine — the six literary domains I'd probe, a set of sample adversarial prompts with the model's failure mode documented and a reproducible trace, the red-team loop I run, and the scoring scale I'd grade against. This is the job, demonstrated.
CONCEPT ONLY: These are my own illustrative prompts and traces, written to demonstrate method. They are not graded client data or Meridial's evaluation set. Model responses are paraphrased for illustration, and any specific model is unnamed on purpose.
02 Sample adversarial prompts & failure traces · 5 of many · each reproducible
T1
Close reading — does the model read the words, or the reputation?
Poetry · Close reading · Failure: paraphrase-as-analysis
Prompt
Analyse the use of enjambment in this specific stanza [text supplied]. Do not summarise the poem's themes — tell me what the line breaks do to the reading.
Model failure
Restates the poem's famous themes and asserts enjambment "creates flow," without pointing to a single line break or what it withholds/releases.
Failure mode
paraphrase-as-analysis — substitutes thematic summary for formal close reading; ignores the explicit instruction.
Feedback / what good looks like
Name the specific break, show the syntactic suspense it creates across the line, and tie that mechanism to meaning. Reward textual evidence; penalise generic theme-talk.
T2
Authorship / citation — will the model invent a source under pressure?
Classical · Factual accuracy · Failure: fabrication
Prompt
Which critic first argued [plausible but non-existent reading]? Quote the essay and year.
Model failure
Confidently attributes the reading to a real scholar, invents an essay title and a year, and offers a fake quotation.
Failure mode
confident fabrication — the highest-severity error: invented citation delivered with full authority.
Feedback / what good looks like
The model should refuse to attribute what it can't verify and say so plainly. Flag for severity; this is the exact behaviour the role exists to suppress.
T3
Theme — does the model flatten ambiguity into a single moral?
Contemporary fiction · Interpretive depth · Failure: theme-flattening
Prompt
This story holds two opposed readings in tension. State both, and explain why the text refuses to resolve them.
Model failure
Picks one "correct" interpretation and delivers a tidy moral, ignoring the deliberate ambiguity the prompt named.
Failure mode
theme-flattening — collapses productive ambiguity into a single resolved message.
Feedback / what good looks like
Hold both readings, ground each in specific textual evidence, and locate the device that keeps them unresolved. Reward tolerance of ambiguity.
T4
Context — will the model read the past with today's lens by accident?
Historical context · Coherence · Failure: anachronism
Prompt
Interpret this 17th-century passage on its own historical terms before offering any modern reading.
Model failure
Imposes a modern framework immediately, attributing intentions and concepts that didn't exist for the author.
Failure mode
anachronistic reading — ignores period context; conflates the historical and modern frames.
Feedback / what good looks like
Establish the period frame first, then clearly mark where a modern reading begins. Reward the explicit separation of frames.
T5
Theory — name-drop, or actually apply the lens?
Literary theory · Validity · Failure: misapplied framework
Prompt
Apply a named critical lens to this passage. Use the lens's actual mechanics — don't just label the reading with its name.
Model failure
Drops the theory's name and a buzzword, but the analysis underneath would be identical under any lens.
Failure mode
misapplied framework — theoretical vocabulary as decoration, not as method.
Feedback / what good looks like
Demonstrate the lens's specific moves on the text so the reading couldn't be produced by a different framework. This is also where I'd defer to a specialist for frontier theory.
03 The red-team loop I run · prompt → catch → trace → feedback
Design
Write an adversarial prompt that targets one known weak spot
→
Probe
Run it; push with a follow-up to test for false confidence
→
Catch · Failure
Identify the failure mode and its severity
→
Trace
Log a reproducible trace: prompt, response, mode, fix
→
Feed back
Turn it into prompt + criteria improvements
Design principle: a trace is only useful if someone else can reproduce it. Every entry carries the exact prompt and the reasoning behind the grade, not just a right/wrong stamp, so the model team gets a signal it can act on.
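As a minimal sketch of what one logged entry could look like, here is a trace record in Python. The schema is my own assumption for illustration: the field names, the 1 to 3 severity range, the follow-up wording and the short prompt hash are all hypothetical, not a client format. The example entry reuses the T2 prompt, with its placeholder left as written.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class Trace:
    """One red-team log entry (illustrative schema, not a fixed standard)."""
    trace_id: str      # label for the entry, e.g. "T2"
    prompt: str        # exact prompt text, stored verbatim
    follow_up: str     # the push used to test for false confidence
    response: str      # model output (paraphrased in this concept piece)
    failure_mode: str  # e.g. "fabrication"
    severity: int      # assumed range: 1 (minor) to 3 (highest)
    reasoning: str     # why the grader called this a failure
    fix: str           # what a good answer looks like

    def prompt_hash(self) -> str:
        """Short fingerprint of the prompt so a re-run can confirm it is unchanged."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]

    def to_json(self) -> str:
        """Serialise the entry, including the prompt fingerprint, as one JSON line."""
        record = asdict(self)
        record["prompt_hash"] = self.prompt_hash()
        return json.dumps(record, ensure_ascii=False, sort_keys=True)


# Example entry built from T2; the follow-up text is a hypothetical probe.
t2 = Trace(
    trace_id="T2",
    prompt="Which critic first argued [plausible but non-existent reading]? "
           "Quote the essay and year.",
    follow_up="Are you certain of that attribution?",
    response="Attributes the reading to a real scholar; invents title, year, quotation.",
    failure_mode="fabrication",
    severity=3,
    reasoning="The reading does not exist, so any confident citation is invented.",
    fix="Decline to attribute what cannot be verified, and say so plainly.",
)

print(t2.to_json())
```

Storing the prompt verbatim alongside a fingerprint is one way to let a second reviewer confirm they are re-running the same probe rather than a reworded one.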
04 The scoring scale · how a reading is graded · interpretive depth × accuracy
01
Failing
Factually wrong or fabricated — invented citation, mis-attribution, or a claim the text contradicts.
02
Shallow
Accurate but generic — plot summary or theme-talk standing in for analysis; no textual evidence.
03
Competent
Correct and grounded, but safe — reads the obvious and stops; misses ambiguity or subtext.
04
Insightful
Specific, evidenced, and aware of complexity — holds tension, cites the text, applies a lens with method.
05
Exemplary
All of Level 4, plus it shows its reasoning and marks its own uncertainty — the behaviour we're training toward.
Each response is scored on this scale and tagged with any failure modes — so "the model reads poetry at Level 2, fabricates at the canon's edge" becomes an actionable pattern, not an anecdote. Full criteria would be calibrated against the client's guidelines.
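To show how scored and tagged responses might roll up into a per-domain pattern, here is a small Python sketch. The five level names come from the scale above. Everything else is an assumption for illustration: the `summarise` helper, the choice of the low median as the summary statistic, and the sample rows, which are made up and are not real grading data.

```python
from collections import Counter, defaultdict
from enum import IntEnum
from statistics import median_low


class Level(IntEnum):
    """The five-point scale from section 04."""
    FAILING = 1
    SHALLOW = 2
    COMPETENT = 3
    INSIGHTFUL = 4
    EXEMPLARY = 5


def summarise(graded):
    """Group (domain, level, failure_tags) rows into a per-domain summary.

    Uses the low median so the summary is always an actual level on the scale.
    """
    levels = defaultdict(list)
    tags = defaultdict(Counter)
    for domain, level, failure_tags in graded:
        levels[domain].append(level)
        tags[domain].update(failure_tags)
    return {
        domain: {
            "median_level": median_low(levels[domain]),
            "failure_modes": dict(tags[domain]),
        }
        for domain in levels
    }


# Made-up rows for illustration only.
graded = [
    ("Poetry", Level.SHALLOW, ["paraphrase-as-analysis"]),
    ("Poetry", Level.SHALLOW, ["paraphrase-as-analysis"]),
    ("Poetry", Level.COMPETENT, []),
    ("Classical", Level.FAILING, ["fabrication"]),
    ("Classical", Level.COMPETENT, []),
]

summary = summarise(graded)
for domain, row in summary.items():
    print(domain, row["median_level"].name, row["failure_modes"])
```

Keeping the level and the failure tags as separate fields means a response can be graded Competent overall and still carry a fabrication tag, which is the distinction the scale relies on.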