Few-Shot Calibration for Decision-Relevant Classification
When AI classifies qualitative data (interview responses, survey answers) into decision-relevant categories, provide concrete examples of each category level — including an explanation of why each example belongs there. This prevents AI from inventing its own interpretation of what each level means.
The Technique
Build a labeled scale for your specific decision. For each level, provide:
- A label describing the category
- A concrete example from plausible participant data
- An explanation of why that example belongs there
Example: “Would a screen retain this churned user?” (5-point scale)
SOLUTION FIT SCALE (calibrated for screen decision)
1 - SCREEN WOULD RETAIN: Explicit visibility/access pain during activity
Example: "I check my phone 10 times per workout just to see my HR"
Why this is a 1: Specific friction with current workaround. Screen
directly solves this. High-value signal for screen investment.
2 - CHEAPER FIX WOULD RETAIN: Sounds screen-related but has lower-cost solution
Example: "App was too clunky to check mid-workout, too many taps"
Why this is a 2: Visibility complaint, but phone was present. App UX
fix could solve this without hardware investment.
3 - ENGAGEMENT FIX NEEDED: Stopped using, screen wouldn't change that
Example: "Just wasn't using it enough"
Why this is a 3: Self-blame framing, no feature complaint. Screen
doesn't fix habit formation.
4 - OPERATIONAL FIX NEEDED: Billing, support, reliability — not features
Example: "Canceled 3 times and kept getting charged."
Why this is a 4: Trust/process failure. Screen is irrelevant.
5 - UNRELATED COMPETITIVE LOSS: Left for alternative, no screen complaint
Example: "Switched to Apple Watch, Whoop was fine"
Why this is a 5: No negative language, no screen mention. Screen
alone unlikely to win back.
Why Examples Beat Descriptions
Descriptions of what a level means leave room for AI to apply its own interpretation. Examples do the teaching: they show AI the exact reasoning it should apply. With calibration, “22 people mentioned screens” becomes “8 would be retained by a screen, 6 just need app UX fixes, 8 are unrelated competitive losses” — a decision-relevant breakdown.
Generalizing Beyond Feature Decisions
The same structure works for any classification task:
- Demand calibration for new offers (1 = would pay for this immediately → 5 = interesting but not a need)
- Feature request prioritization (1 = blocking, core use case → 5 = nice-to-have, power user only)
- Feedback quality scoring (1 = specific, actionable → 5 = vague, unactionable)
- Churn reason classification (customer service → pricing → product gap → competitive loss)
Build the scale once for a given decision context, then reuse across analyses.
When to Apply
- Feature investment decisions driven by interview or survey data
- Churn analysis where you need to understand which reasons would be addressed by a specific fix
- Prioritization exercises across multiple customer requests
- NPS driver analysis where open-ends need to be classified by type of issue
Related
- Decision-Anchored Analysis Context — sets the decision frame that makes the scale meaningful
- Quote Selection Rules for AI Analysis — ensures evidence behind classifications is real
- Multi-Pass AI Analysis Verification — validates classification decisions after the fact
Sources
Caitlin Sullivan — “How to Do AI Analysis You Can Actually Trust” (Lenny’s Newsletter, 2026-02-18)
“The examples above do the teaching. I’m not saying, ‘This is what good output looks like.’ I’m saying, ‘This is good output, and here’s why I labeled it this way, so you can follow my process.’” “Without clarity like this, the model invents its own interpretation of what ‘screen-related churn reasons’ look like — and it won’t always match yours.” “You’ll spend a few extra minutes building a scale for your context, but you can reuse it across analyses.”
Unique contribution: Explicit framing of “examples teach, descriptions don’t”; the full 5-point scale template with rationale-per-level; the insight that the why explanation is what makes few-shot calibration work.