The concept deliverable behind the engine: a taxonomy of how language models fail at literary reasoning, a structured scoring rubric, and a sample annotated error trace. It is written in the format an AI trainer's structured feedback actually takes, so a single catch becomes a pattern the model team can act on.
"Read the supplied stanza closely. Tell me what the line breaks and punctuation do to the meaning — do not summarise the poem's themes."
Recurring across poetry close-reading prompts: the model defaults to thematic summary when asked for formal analysis. Suggested fix — add an evaluation criterion that rewards citation of specific textual features and penalises generic theme-talk; consider a prompt scaffold that requires the model to quote the line it's analysing.
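One way the suggested criterion could be checked automatically is sketched below. It is a rough illustration, not the engine's actual implementation: the function names `quoted_spans` and `cites_stanza`, the three-word minimum, and the sample stanza are all invented for this example.

```python
import re


def _normalise(text: str) -> str:
    """Lowercase and reduce to space-separated words, so line breaks
    and punctuation do not block a match."""
    return " ".join(re.findall(r"[a-z']+", text.lower()))


def quoted_spans(response: str) -> list[str]:
    """Return the text found between straight or curly double quotes."""
    return re.findall(r'["“]([^"”]+)["”]', response)


def cites_stanza(response: str, stanza: str, min_words: int = 3) -> bool:
    """True if the response quotes at least `min_words` consecutive
    words that really occur in the stanza (hypothetical threshold)."""
    stanza_norm = f" {_normalise(stanza)} "
    for span in quoted_spans(response):
        span_norm = _normalise(span)
        # Pad with spaces so a match must start and end on word boundaries.
        if len(span_norm.split()) >= min_words and f" {span_norm} " in stanza_norm:
            return True
    return False


# Invented stanza, for illustration only.
STANZA = "The gate swings shut,\nand no one hears it close."

formal = 'The break after "The gate swings shut," leaves the sound hanging.'
thematic = "The poem is about loneliness and the passing of time."
invented_quote = 'The line "the door slams open wide" shows urgency.'

print(cites_stanza(formal, STANZA))
print(cites_stanza(thematic, STANZA))
print(cites_stanza(invented_quote, STANZA))
```

A check like this only confirms that a real quotation is present. Whether the analysis of that quotation is any good still needs a human or model grader.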
A response is scored on each dimension and tagged with any failure modes — so feedback is specific and trainable, not a vibe. Factual accuracy is a gate: confident fabrication caps the overall score regardless of how elegant the prose is. Final weighting would be calibrated with the client.
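The gate can be sketched in a few lines. This is a minimal illustration under stated assumptions: the dimension names, the weights, the 0 to 5 scale, and the cap value are placeholders, since the real weighting would be calibrated with the client. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field

# Placeholder dimensions and weights (assumptions, not the calibrated rubric).
WEIGHTS = {
    "textual_evidence": 0.5,
    "formal_analysis": 0.25,
    "factual_accuracy": 0.25,
}
GATE_TAG = "confident_fabrication"  # hypothetical failure-mode tag
FABRICATION_CAP = 2.0               # assumed ceiling on a 0-5 scale


@dataclass
class Annotation:
    """One scored response: a 0-5 score per dimension plus failure-mode tags."""
    scores: dict[str, float]
    failure_modes: list[str] = field(default_factory=list)


def overall_score(annotation: Annotation) -> float:
    """Weighted mean of the dimension scores, capped if the gate is tripped."""
    missing = WEIGHTS.keys() - annotation.scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    weighted = sum(WEIGHTS[dim] * annotation.scores[dim] for dim in WEIGHTS)
    if GATE_TAG in annotation.failure_modes:
        # Elegant prose cannot lift a fabricated answer above the cap.
        return min(weighted, FABRICATION_CAP)
    return weighted


scores = {"textual_evidence": 5, "formal_analysis": 5, "factual_accuracy": 4}
clean = Annotation(scores=scores)
fabricated = Annotation(scores=scores, failure_modes=[GATE_TAG])

print(overall_score(clean))       # 4.75
print(overall_score(fabricated))  # 2.0
```

Treating the gate as a cap rather than a penalty keeps the dimension scores intact, so the model team can still see what the response did well while the overall number reflects the fabrication.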