— Door 03 · The feedback, structured

A repeatable way to grade how a model reads.

The concept deliverable behind the engine — a taxonomy of how language models fail at literary reasoning, a structured scoring rubric, and a sample annotated error trace — written in the format an AI trainer's structured feedback actually takes, so a single catch becomes a pattern the model team can act on.

CONCEPT ONLY: not an official client rubric, and no graded client data is shown. The annotated trace below uses an illustrative, paraphrased example. A real rubric would be calibrated against the client's evaluation guidelines and checked for inter-rater agreement.
01 Failure-mode taxonomy
FM-1
Confident fabrication
Invented citation, quotation or authorship, delivered with full authority. The highest-severity error.
Severity · High
FM-2
Paraphrase-as-analysis
Plot or theme summary standing in for genuine analysis; no engagement with the actual words.
Severity · Med
FM-3
Theme-flattening
Collapses deliberate ambiguity into one tidy moral; ignores tension the text sustains.
Severity · Med
FM-4
Anachronistic reading
Imposes a modern frame on a historical text without marking the shift; conflates frames.
Severity · Med
FM-5
Misapplied framework
Names a critical lens but doesn't use its mechanics; theory as decoration.
Severity · Med
FM-6
False confidence / no hedging
States contested or uncertain readings as settled fact; never marks its own uncertainty.
Severity · High
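
One way to keep tags consistent across annotators is to hold the taxonomy in a small lookup table. The sketch below is illustrative only, not a client schema: the names `Severity`, `FailureMode`, `TAXONOMY` and `high_severity_codes` are assumptions made for this example, while the codes, labels and severities are copied from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MED = "Med"
    HIGH = "High"


@dataclass(frozen=True)
class FailureMode:
    code: str
    label: str
    severity: Severity


# Codes, labels and severities mirror the taxonomy above.
# The data structure itself is a hypothetical illustration.
TAXONOMY = {
    fm.code: fm
    for fm in (
        FailureMode("FM-1", "Confident fabrication", Severity.HIGH),
        FailureMode("FM-2", "Paraphrase-as-analysis", Severity.MED),
        FailureMode("FM-3", "Theme-flattening", Severity.MED),
        FailureMode("FM-4", "Anachronistic reading", Severity.MED),
        FailureMode("FM-5", "Misapplied framework", Severity.MED),
        FailureMode("FM-6", "False confidence / no hedging", Severity.HIGH),
    )
}


def high_severity_codes():
    """Return the codes of all high-severity failure modes, in taxonomy order."""
    return [code for code, fm in TAXONOMY.items() if fm.severity is Severity.HIGH]


print(high_severity_codes())  # ['FM-1', 'FM-6']
```

Keying the table by code means a trace can store just `"FM-2"` and still resolve to a label and severity when feedback is aggregated.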
02 Sample annotated error trace
Literary Eval · Trace #L-0142 · CONCEPT MOCK
Annotated trace · excerpt

Domain: Poetry · Task: close reading of a supplied stanza

An illustrative trace in the exact format I'd log — the prompt, the model's response, my annotation, the failure mode, and the score.

Prompt issued

"Read the supplied stanza closely. Tell me what the line breaks and punctuation do to the meaning — do not summarise the poem's themes."

Model response (paraphrased)

"This poem explores loss and memory. The enjambment creates a sense of flow and emotion, reinforcing the themes throughout the work…"

My annotation — showing the work

Step 1
The prompt explicitly forbade theme-summary and asked for formal mechanics. The response leads with theme-summary — instruction not followed.
Step 2
"Creates flow" is a generic claim attached to no specific line break. No textual evidence; the analysis would fit any poem.
Step 3
What good looks like: name the break, show the syntactic suspense it holds across the line, then connect that mechanism to the stanza's meaning.
Score · Level 2 (Shallow)
Failure · FM-2 paraphrase-as-analysis
Secondary · instruction-non-compliance
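
For logging, the same trace could be stored as one structured record. This is a rough sketch under assumptions: the class `TraceRecord` and its field names are hypothetical, not a client format, and the prompt and annotation steps are abridged from the trace above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceRecord:
    # Field names are assumptions for illustration, not an official schema.
    trace_id: str
    domain: str
    task: str
    prompt: str
    response: str
    annotation: list  # ordered annotation steps
    score_level: int
    score_label: str
    primary_failure: str
    secondary_flags: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the record as a single JSON line."""
        return json.dumps(asdict(self), ensure_ascii=False)


record = TraceRecord(
    trace_id="L-0142",
    domain="Poetry",
    task="close reading of a supplied stanza",
    # Abridged from the prompt shown in the trace.
    prompt="Read the supplied stanza closely. Tell me what the line breaks "
           "and punctuation do to the meaning.",
    response="This poem explores loss and memory. The enjambment creates a "
             "sense of flow and emotion, reinforcing the themes throughout the work…",
    annotation=[
        "Prompt forbade theme-summary; the response leads with it.",
        "'Creates flow' is attached to no specific line break.",
        "Good: name the break, show the suspense it holds, connect it to meaning.",
    ],
    score_level=2,
    score_label="Shallow",
    primary_failure="FM-2",
    secondary_flags=["instruction-non-compliance"],
)

print(record.to_json())
```

Keeping the failure code and the secondary flag as separate fields is what lets many traces be counted later, so one catch can be reported as a recurring pattern.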

Feedback to the model team

Recurring across poetry close-reading prompts: the model defaults to thematic summary when asked for formal analysis. Suggested fix — add an evaluation criterion that rewards citation of specific textual features and penalises generic theme-talk; consider a prompt scaffold that requires the model to quote the line it's analysing.
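
The suggested criterion could start as a crude automatic pre-check: does the response quote anything that actually appears in the supplied stanza? The sketch below is a heuristic under stated assumptions, not a validated metric. The function names, the `min_words` threshold and the placeholder stanza are all invented for illustration, and a human annotator would still judge whether the quoted line is analysed well.

```python
import re


def quoted_spans(response):
    """Return the text found inside straight or curly double quotes."""
    return re.findall(r'["“]([^"“”]+)["”]', response)


def normalise(text):
    """Lower-case and collapse whitespace so line breaks do not block a match."""
    return " ".join(text.lower().split())


def cites_the_text(response, stanza, min_words=2):
    """True if the response quotes at least `min_words` consecutive words of the stanza.

    Matching is plain substring matching on normalised text, so it is only a
    first-pass filter (assumption: quotes reproduce the stanza's wording exactly).
    """
    source = normalise(stanza)
    for span in quoted_spans(response):
        candidate = normalise(span)
        if len(candidate.split()) >= min_words and candidate in source:
            return True
    return False


# Placeholder stanza invented for this example; not from any client data.
STANZA = "The gate swings shut and\nno one comes to open it."

grounded = 'The break after "swings shut and" holds the clause open for a beat.'
generic = "The enjambment creates a sense of flow and emotion."

print(cites_the_text(grounded, STANZA))  # True
print(cites_the_text(generic, STANZA))   # False
```

A check like this only detects the presence of a quotation. It rewards the behaviour the scaffold asks for, but it cannot tell a sharp reading from a quoted line followed by generic theme-talk.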

03 The scoring rubric · four dimensions
Factual accuracy
Are all citations, attributions and textual claims correct and verifiable? Fabrication caps the score.
Gate dimension
Textual grounding
Is the reading anchored in the actual words, with specific evidence — not generic summary?
Core craft
Interpretive depth
Does it engage subtext, ambiguity and complexity, or stop at the obvious?
Quality ceiling
Metacognition & hedging
Does it show its reasoning and mark its own uncertainty where the text is genuinely contested?
Target behaviour

A response is scored on each dimension and tagged with any failure modes — so feedback is specific and trainable, not a vibe. Factual accuracy is a gate: confident fabrication caps the overall score regardless of how elegant the prose is. Final weighting would be calibrated with the client.
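
The gate logic can be sketched in a few lines. Everything numeric here is an assumption: the 1 to 5 scale, the equal default weights and the cap value of 1.0 are placeholders, since final weighting would be calibrated with the client. The names `DIMENSIONS`, `FABRICATION_CAP` and `overall_score` are likewise hypothetical.

```python
DIMENSIONS = (
    "factual_accuracy",
    "textual_grounding",
    "interpretive_depth",
    "metacognition_hedging",
)

# Assumed cap on an assumed 1-5 scale; a real value would be set with the client.
FABRICATION_CAP = 1.0


def overall_score(scores, failure_modes, weights=None):
    """Weighted mean of the four dimension scores, capped if FM-1 is tagged."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")

    # Equal weights are a placeholder, not a calibrated weighting.
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    weighted = sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight

    # Gate: confident fabrication (FM-1) caps the score however strong the rest is.
    if "FM-1" in failure_modes:
        return min(weighted, FABRICATION_CAP)
    return weighted


elegant_but_fabricated = {
    "factual_accuracy": 1,
    "textual_grounding": 5,
    "interpretive_depth": 5,
    "metacognition_hedging": 5,
}

print(overall_score(elegant_but_fabricated, []))        # 4.0
print(overall_score(elegant_but_fabricated, ["FM-1"]))  # 1.0
```

Applying the cap after the weighted mean, rather than folding accuracy into the weights, is what makes it a gate: no amount of polish in the other three dimensions can lift a fabricated response above the cap.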

KHALID RIND · NEURANEST AI · MELBOURNE (REMOTE)  ·  INFO@KHALIDRIND.IO  ·  KHALIDRIND.IO
