— Door 03 · The feedback, structured

A repeatable way to grade how a model reads.

The concept deliverable behind the engine: a taxonomy of how language models fail at literary reasoning, a structured scoring rubric, and a sample annotated error trace. All three are written in the format an AI trainer's structured feedback actually takes, so a single catch becomes a pattern the model team can act on.

CONCEPT ONLY: not an official client rubric, and no graded client data is shown. The annotated trace below uses an illustrative, paraphrased example. A real rubric would be aligned with the client's evaluation guidelines and refined through inter-rater calibration.

01 · Failure-mode taxonomy

FM-1

Confident fabrication

Invented citation, quotation or authorship, delivered with full authority. The highest-severity error.

Severity · High

FM-2

Paraphrase-as-analysis

Plot or theme summary standing in for genuine analysis; no engagement with the actual words.

Severity · Med

FM-3

Theme-flattening

Collapses deliberate ambiguity into one tidy moral; ignores tension the text sustains.

Severity · Med

FM-4

Anachronistic reading

Imposes a modern frame on a historical text without marking the shift; conflates frames.

Severity · Med

FM-5

Misapplied framework

Names a critical lens but doesn't use its mechanics; theory as decoration.

Severity · Med

FM-6

False confidence / no hedging

States contested or uncertain readings as settled fact; never marks its own uncertainty.

Severity · High
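
For anyone who wants to tag responses programmatically, the six cards above can be held in a small lookup table. This is a minimal sketch, not an official schema: the class names, field names and the `high_severity_codes` helper are hypothetical, and only the codes, names and severities come from the taxonomy itself.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MED = "Med"
    HIGH = "High"


@dataclass(frozen=True)
class FailureMode:
    code: str          # e.g. "FM-1"
    name: str          # short label from the taxonomy card
    severity: Severity


# Codes, names and severities copied from the taxonomy above.
# The structure that holds them is an illustrative assumption.
TAXONOMY = {
    fm.code: fm
    for fm in [
        FailureMode("FM-1", "Confident fabrication", Severity.HIGH),
        FailureMode("FM-2", "Paraphrase-as-analysis", Severity.MED),
        FailureMode("FM-3", "Theme-flattening", Severity.MED),
        FailureMode("FM-4", "Anachronistic reading", Severity.MED),
        FailureMode("FM-5", "Misapplied framework", Severity.MED),
        FailureMode("FM-6", "False confidence / no hedging", Severity.HIGH),
    ]
}


def high_severity_codes(taxonomy=TAXONOMY):
    """Return the codes of all failure modes marked High, sorted."""
    return sorted(
        code for code, fm in taxonomy.items() if fm.severity is Severity.HIGH
    )


print(high_severity_codes())  # ['FM-1', 'FM-6']
```

Keeping the codes as stable keys means an annotation only has to store "FM-2", and the label and severity can be looked up when feedback is aggregated.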

02 · Sample annotated error trace

Literary Eval · Trace #L-0142 · CONCEPT MOCK

Annotated trace · excerpt

Domain: Poetry · Task: close reading of a supplied stanza

An illustrative trace in the exact format I'd log — the prompt, the model's response, my annotation, the failure mode, and the score.

Prompt issued

"Read the supplied stanza closely. Tell me what the line breaks and punctuation do to the meaning — do not summarise the poem's themes."

Model response (paraphrased)

"This poem explores loss and memory. The enjambment creates a sense of flow and emotion, reinforcing the themes throughout the work…"

My annotation — showing the work

Step 1

The prompt explicitly forbade theme-summary and asked for formal mechanics. The response leads with theme-summary — instruction not followed.

Step 2

"Creates flow" is a generic claim attached to no specific line break. No textual evidence; the analysis would fit any poem.

Step 3

What good looks like: name the break, show the syntactic suspense it holds across the line, then connect that mechanism to the stanza's meaning.

Score · Level 2 (Shallow)

Failure · FM-2 paraphrase-as-analysis

Secondary · instruction-non-compliance

Feedback to the model team

Recurring across poetry close-reading prompts: the model defaults to thematic summary when asked for formal analysis. Suggested fix — add an evaluation criterion that rewards citation of specific textual features and penalises generic theme-talk; consider a prompt scaffold that requires the model to quote the line it's analysing.
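
The trace above could be logged as one structured record so that repeated catches can be counted and grouped later. The sketch below is one possible shape, assuming a JSON log line; every field name and the `to_log_line` helper are hypothetical, and the text fields are abbreviated from the mock trace rather than quoted in full.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AnnotatedTrace:
    trace_id: str
    domain: str
    task: str
    prompt: str
    response_paraphrase: str
    annotation_steps: List[str]
    score_level: int
    score_label: str
    primary_failure: str                       # taxonomy code, e.g. "FM-2"
    secondary_flags: List[str] = field(default_factory=list)
    team_feedback: str = ""


def to_log_line(trace: AnnotatedTrace) -> str:
    """Serialise one annotated trace as a single JSON line."""
    return json.dumps(asdict(trace), ensure_ascii=False, sort_keys=True)


# Values taken from the concept mock above; long text is abbreviated with "...".
trace = AnnotatedTrace(
    trace_id="L-0142",
    domain="Poetry",
    task="close reading of a supplied stanza",
    prompt="Read the supplied stanza closely. ...",
    response_paraphrase="This poem explores loss and memory. ...",
    annotation_steps=[
        "Prompt forbade theme-summary; the response leads with it.",
        "'Creates flow' is attached to no specific line break.",
        "Good: name the break, show the suspense, connect it to meaning.",
    ],
    score_level=2,
    score_label="Shallow",
    primary_failure="FM-2",
    secondary_flags=["instruction-non-compliance"],
    team_feedback="Model defaults to thematic summary on formal-analysis prompts.",
)

print(to_log_line(trace))
```

Storing the failure mode as a code rather than free text is what lets a single catch roll up into a recurring pattern across prompts.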

03 · The scoring rubric · four dimensions

Factual accuracy

Are all citations, attributions and textual claims correct and verifiable? Fabrication caps the score.

Gate dimension

Textual grounding

Is the reading anchored in the actual words, with specific evidence — not generic summary?

Core craft

Interpretive depth

Does it engage subtext, ambiguity and complexity, or stop at the obvious?

Quality ceiling

Metacognition & hedging

Does it show its reasoning and mark its own uncertainty where the text is genuinely contested?

Target behaviour

A response is scored on each dimension and tagged with any failure modes — so feedback is specific and trainable, not a vibe. Factual accuracy is a gate: confident fabrication caps the overall score regardless of how elegant the prose is. Final weighting would be calibrated with the client.
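
The gate rule described above can be illustrated with a short sketch. Several things here are assumptions rather than part of the rubric: a 1-to-4 level scale, equal weights across the four dimensions, and a cap value of 1.0 when FM-1 (confident fabrication) is tagged. The function and parameter names are hypothetical, and real weights would come from calibration with the client.

```python
DIMENSIONS = (
    "factual_accuracy",
    "textual_grounding",
    "interpretive_depth",
    "metacognition_hedging",
)


def overall_score(scores, failure_modes, weights=None, fabrication_cap=1.0):
    """Weighted mean of the four dimension scores, capped if FM-1 is tagged.

    scores: dict mapping each dimension name to a numeric level (assumed 1-4).
    failure_modes: set of taxonomy codes tagged on the response.
    weights: optional dict of per-dimension weights (assumed equal by default).
    fabrication_cap: assumed ceiling applied when "FM-1" is present.
    """
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}
    total_weight = sum(weights[d] for d in DIMENSIONS)
    weighted = sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight
    if "FM-1" in failure_modes:
        # Gate: fabrication caps the score however strong the other dimensions are.
        return min(weighted, fabrication_cap)
    return weighted


scores = {
    "factual_accuracy": 4,
    "textual_grounding": 2,
    "interpretive_depth": 2,
    "metacognition_hedging": 2,
}

print(overall_score(scores, {"FM-2"}))  # 2.5
print(overall_score(scores, {"FM-1"}))  # 1.0
```

Applying the cap after the weighted mean, rather than folding fabrication into the weights, keeps the gate independent of whatever weighting is eventually agreed.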
