Output Evaluation Rubric Template
Use this rubric to assess AI output quality consistently and measurably. Deliverables: one rubric plus one evaluation tracking sheet.
What Is a Rigorous Evaluation?
A rigorous evaluation:
- Uses explicit criteria (not "seems good")
- Is repeatable (different evaluators get similar results)
- Captures nuance (not just "right/wrong")
- Feeds learning (results inform prompt updates)
- Scales (can evaluate 100s of outputs without burning out humans)
Step 1: Define Evaluation Dimensions
For Classification Tasks (e.g., ticket categorization)
| Dimension | Definition | Scoring |
|---|---|---|
| Correctness | Does the assigned category match the ground-truth label for the ticket? | 0 = Wrong, 1 = Correct |
| Confidence Calibration | Is the confidence score honest? (High confidence → high accuracy; low confidence → genuine uncertainty) | 0.0–1.0 correlation (see the sketch after this table) |
| Reasoning Quality | If the model explains its choice, is the explanation logical? | 0 = Illogical, 1 = Sound |
| Boundary Handling | If the input is ambiguous, does the model acknowledge ambiguity? | 0 = Ignores ambiguity, 1 = Flags it |
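The Confidence Calibration row scores a correlation rather than a single judgment. Here is a minimal sketch of how that number can be computed once you log each output's stated confidence and eventual correctness; the helper name and sample batch are illustrative, and `statistics.correlation` requires Python 3.10+.

```python
# A minimal calibration sketch, assuming you log (confidence, was_correct) pairs.
from statistics import correlation  # Python 3.10+

def calibration_score(confidences: list[float], correct: list[int]) -> float:
    """Pearson correlation between stated confidence and actual correctness.

    Near 1.0: the model is confident exactly when it is right.
    Near 0 (or negative): confidence carries no useful signal.
    """
    return correlation(confidences, [float(c) for c in correct])

# Illustrative batch, not real data: wrong answers carry low confidence here,
# so the score comes out strongly positive (well-calibrated).
batch = [(0.92, 1), (0.88, 1), (0.55, 0), (0.95, 1), (0.60, 0)]
print(round(calibration_score([c for c, _ in batch], [k for _, k in batch]), 2))
```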
For Text Generation Tasks (e.g., email drafts, reports)
| Dimension | Definition | Scoring |
|---|---|---|
| Accuracy | Are factual claims correct? Do they match source documents? | 0 = Major errors, 0.5 = Minor errors, 1 = Accurate |
| Relevance | Does the output address the prompt? No hallucinations or tangents? | 0 = Off-topic, 0.5 = Mostly on-topic, 1 = Directly relevant |
| Tone & Style | Does it match the intended audience/context? (professional, friendly, etc.) | 0 = Wrong tone, 0.5 = Neutral, 1 = Perfect match |
| Completeness | Does it cover all required points? Any missing information? | 0 = Incomplete, 0.5 = Mostly complete, 1 = Full |
| Clarity | Is it easy to understand? Free of jargon/confusion? | 0 = Confusing, 0.5 = Somewhat clear, 1 = Crystal clear |
For Data Analysis Tasks (e.g., summarizing trends)
| Dimension | Definition | Scoring |
|---|---|---|
| Accuracy of Facts | Are percentages, counts, and comparisons correct? | 0.0–1.0 (fraction of numeric claims that check out; see the sketch after this table) |
| Insight Depth | Does it identify non-obvious patterns? Avoid surface-level observations? | 0 = Shallow, 0.5 = Moderate, 1 = Deep |
| Actionability | Does it suggest next steps? Are recommendations realistic? | 0 = No action, 0.5 = Vague action, 1 = Specific & doable |
| Data Fidelity | Are citations/sources traceable to the data? No extrapolation without disclosure? | 0 = No sources, 0.5 = Partial sources, 1 = Fully cited |
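For the Accuracy of Facts dimension, a simple way to produce the 0.0–1.0 score is to recompute each numeric claim from the source data and count the fraction that matches within a tolerance. The sketch below assumes the claims have already been extracted into stated/recomputed pairs; the field names and the 0.5-point tolerance are illustrative, not a fixed format.

```python
# Minimal sketch: fraction of numeric claims that match the recomputed value.
def fact_accuracy(claims: list[dict], tolerance: float = 0.5) -> float:
    """claims: [{"stated": 38.0, "recomputed": 38.2}, ...] in percentage points."""
    if not claims:
        return 0.0
    matches = sum(1 for c in claims if abs(c["stated"] - c["recomputed"]) <= tolerance)
    return matches / len(claims)

claims = [
    {"stated": 38.0, "recomputed": 38.2},  # within tolerance -> counts as correct
    {"stated": 12.0, "recomputed": 17.5},  # off by 5.5 points -> counts as wrong
]
print(fact_accuracy(claims))  # 0.5
```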
Step 2: Create an Evaluation Rubric
Example Rubric: Support Ticket Classification
# Ticket Classification Evaluation Rubric
Task: Evaluate whether the AI correctly classified a support ticket.
## Dimension 1: Correctness
**Score 1 (Correct):**
- The category matches the ticket's primary issue
- If the ticket mentions multiple issues, the AI chose the most urgent one
- Example: Ticket says "can't reset password" → AI says "Account Access" ✓
**Score 0 (Incorrect):**
- The category doesn't match the ticket content
- The AI misread the ticket or hallucinated a category
- Example: Ticket says "can't reset password" → AI says "Product Feedback" ✗
## Dimension 2: Confidence Calibration
**Score 1 (Well-Calibrated):**
- High confidence (>0.85) on clear tickets; low confidence (<0.65) on ambiguous ones
- Example: "Can't log in" → confidence 0.92 ✓ (clear issue)
- Example: "Billing is confusing AND slow performance" → confidence 0.68 ✓ (ambiguous)
**Score 0.5 (Somewhat Calibrated):**
- Confidence is in the right ballpark but not precise
- Example: "Can't log in" → confidence 0.78 (should be higher)
**Score 0 (Poorly Calibrated):**
- High confidence on ambiguous tickets; low confidence on clear ones
- Example: "Can't log in" → confidence 0.50 ✗ (should be high)
## Dimension 3: Reasoning Quality
**Score 1 (Sound Reasoning):**
- The model explains its choice in a logical way
- Example: "I classified this as Account Access because the customer explicitly mentions 'password reset failure.'"
**Score 0.5 (Acceptable Reasoning):**
- The explanation is correct but generic
- Example: "This is account-related"
**Score 0 (Poor or Absent Reasoning):**
- No explanation given, or explanation contradicts the decision
- Example: "I chose Billing even though the customer is talking about account access"
## Dimension 4: Boundary Handling (Ambiguity)
**Score 1 (Handles Ambiguity):**
- When a ticket mentions multiple issues, the model explains which is primary and why
- Example: Ticket mentions billing AND technical problem → AI says "Technical Support because the billing issue is secondary to the system crash"
**Score 0.5 (Partially Handles):**
- Model picks a category but doesn't acknowledge the ambiguity
**Score 0 (Ignores Ambiguity):**
- Model confidently picks a category while ignoring conflicting signals
- Example: High confidence (0.92) on a ticket that clearly has 2 equally important issues
## Overall Score
**4/4 or 3.5/4:** Model output is production-ready
**2.5–3/4:** Model needs improvement (revise the prompt or review test cases)
**<2.5/4:** Model not ready; do not deploy
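To turn the four dimension scores into the overall verdict, a small helper like the one below is enough. This is a sketch that assumes a human (or evaluator model) has already assigned each dimension 0, 0.5, or 1 according to the rubric; the names are illustrative.

```python
# Apply the rubric's 4-point thresholds to one evaluated output.
DIMENSIONS = ("correctness", "calibration", "reasoning", "boundary_handling")

def overall_verdict(scores: dict[str, float]) -> str:
    total = sum(scores[d] for d in DIMENSIONS)  # each dimension is 0, 0.5, or 1
    if total >= 3.5:
        return f"{total}/4 - production-ready"
    if total >= 2.5:
        return f"{total}/4 - needs improvement"
    return f"{total}/4 - do not deploy"

print(overall_verdict({"correctness": 1, "calibration": 0.5,
                       "reasoning": 1, "boundary_handling": 1}))  # 3.5/4 - production-ready
```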
Step 3: Build an Evaluation Tracking Sheet
Use this template to track results over time and identify patterns:
# Evaluation Tracking: Support Ticket Classification
| Date | Ticket ID | AI Category | AI Confidence | Actual Category | Correct? (0/0.5/1) | Confidence Calibrated? | Reasoning Quality | Ambiguity Handled? | Notes |
|------|-----------|------------|---------------|------------------|---------------|--------------------|-------------------|-----------------|-------|
| 2/1 | TKT-1001 | Account Access | 0.92 | Account Access | 1 | Yes | 1 | N/A | Clear case |
| 2/1 | TKT-1002 | Billing | 0.70 | Billing | 1 | Yes | 0.5 | Yes | Generic reasoning |
| 2/1 | TKT-1003 | Tech Support | 0.88 | Account Access | 0 | No | 0 | N/A | Misread ticket |
| 2/1 | TKT-1004 | Complaint | 0.65 | Billing + Complaint | 0.5 | Yes | 1 | Yes | Ambiguous; AI picked secondary issue |
| 2/2 | TKT-1005 | Safety/Fraud | 1.0 | Safety/Fraud | 1 | Yes | 1 | N/A | Perfect detection |
## Weekly Summary (Week of 2/1)
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Accuracy (Correctness) | 70% (3.5/5) | ≥95% | ⚠️ Below target |
| Confidence Calibration | 80% | ≥90% | ⚠️ Below target |
| Reasoning Quality | 0.7 avg | 0.8+ | ⚠️ Below target |
| Ambiguity Handling | 100% (2/2) | ≥90% | ✓ On track |
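The weekly summary can be computed directly from the tracking-sheet rows rather than tallied by hand. The sketch below assumes each row is exported as a dict whose keys mirror the table columns; the export format itself is an assumption.

```python
# Aggregate tracking-sheet rows into the weekly summary metrics.
rows = [
    {"correct": 1,   "calibrated": True,  "reasoning": 1,   "ambiguity": None},
    {"correct": 1,   "calibrated": True,  "reasoning": 0.5, "ambiguity": True},
    {"correct": 0,   "calibrated": False, "reasoning": 0,   "ambiguity": None},
    {"correct": 0.5, "calibrated": True,  "reasoning": 1,   "ambiguity": True},
    {"correct": 1,   "calibrated": True,  "reasoning": 1,   "ambiguity": None},
]

accuracy    = sum(r["correct"] for r in rows) / len(rows)
calibration = sum(r["calibrated"] for r in rows) / len(rows)
reasoning   = sum(r["reasoning"] for r in rows) / len(rows)
handled     = [r["ambiguity"] for r in rows if r["ambiguity"] is not None]
ambiguity   = sum(handled) / len(handled) if handled else 0.0

print(f"Accuracy {accuracy:.0%}, calibration {calibration:.0%}, "
      f"reasoning {reasoning:.2f} avg, ambiguity handled {ambiguity:.0%}")
# Accuracy 70%, calibration 80%, reasoning 0.70 avg, ambiguity handled 100%
```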
## Root Cause Analysis
**Issue 1: Low Accuracy (70%)**
- TKT-1003 misclassified: the ticket says "I can't access my account," but the AI treated it as a technical support (API) request
- Root cause: Prompt doesn't clearly distinguish account-access issues from technical support issues
- Action: Update the prompt with examples of each category
**Issue 2: Low Confidence Calibration (80%)**
- TKT-1002: Model chose the correct category but reported only 0.70 confidence on a straightforward ticket
- Root cause: Prompt doesn't instruct the model to report high confidence on straightforward cases
- Action: Add instruction: "For clear cases (one issue, straightforward language), confidence should be >0.85"
## Next Steps
- [ ] Update prompt (add disambiguating examples)
- [ ] Re-test the updated prompt against an expanded test set (10 new examples)
- [ ] Re-evaluate model on same 5 tickets; expect improvement to ≥90%
- [ ] Expand testing to 50 tickets before production deployment
Step 4: Blind Evaluation (Gold Standard)
For high-stakes assessments, use blind evaluation:
Blind Evaluation Protocol
- Prepare: Gather 20–50 real outputs from your AI system
- Mask: Remove all identifying information (ticket IDs, timestamps, model version)
- Randomize: Shuffle the order; mix correct and incorrect examples
- Evaluate: Have 2–3 independent evaluators score each output using the rubric
- Compare: Calculate inter-rater agreement (should be >80%; see the sketch after this list)
- Analyze: Where evaluators disagree, discuss the cases and refine the rubric
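A minimal way to implement the Compare step is pairwise agreement: for each output, look at every pair of evaluator scores and count the pairs that land within a tolerance of one another. The 1-point tolerance on the 10-point scale is an assumption; chance-corrected statistics such as Cohen's kappa are stricter options once you have more data.

```python
# Pairwise inter-rater agreement across 2-3 evaluators (sketch).
from itertools import combinations

def inter_rater_agreement(scores_by_output: list[list[float]], tolerance: float = 1.0) -> float:
    """scores_by_output: one inner list of evaluator scores per output."""
    agree = total = 0
    for scores in scores_by_output:
        for a, b in combinations(scores, 2):  # every pair of evaluators
            total += 1
            agree += abs(a - b) <= tolerance
    return agree / total if total else 0.0

# Illustrative scores for three blind outputs (e.g. #1 and #7 from below).
blind_scores = [[10, 10, 10], [5, 3, 7], [8, 9, 8]]
print(f"{inter_rater_agreement(blind_scores):.0%}")  # 67% -> below the 80% bar
```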
Example Blind Evaluation
Evaluator A, Evaluator B, Evaluator C: Please score each output (1–10) on Correctness.
Output #1: "I classified this as Account Access because the customer says 'can't log in'"
Evaluator A: 10 (clearly correct, well-reasoned)
Evaluator B: 10 (clearly correct, well-reasoned)
Evaluator C: 10 (clearly correct, well-reasoned)
→ Perfect agreement; high confidence for production use
Output #7: "I classified this as Tech Support because it mentions an error message"
Evaluator A: 5 (ambiguous; could be account OR tech)
Evaluator B: 3 (looks like billing issue to me)
Evaluator C: 7 (leans tech support)
→ Disagreement; the output is genuinely ambiguous. Either refine the prompt or route such cases to a "human review" tier.
Step 5: Evaluation Automation (Scaling)
Option 1: Human + Sampling
- Evaluate 100% of outputs for the first 2 weeks (establish baseline)
- After 2 weeks, sample 10% daily (100 tickets/day → 10 evaluated) to maintain calibration
- If drift is detected (accuracy drops more than 5 percentage points below baseline), increase sampling to 25% (see the sketch below)
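A sketch of that sampling policy follows; the constants mirror the bullets above, and the function names and ticket IDs are illustrative.

```python
# Daily sampling with a simple drift rule: review 10% of traffic, escalate to 25%
# when sampled accuracy falls more than 5 points below the baseline.
import random

BASELINE_ACCURACY = 0.95
SAMPLE_RATE = 0.10
ESCALATED_RATE = 0.25
DRIFT_THRESHOLD = 0.05

def pick_sample(ticket_ids: list[str], rate: float) -> list[str]:
    k = max(1, round(len(ticket_ids) * rate))
    return random.sample(ticket_ids, k)

def next_sample_rate(sampled_accuracy: float) -> float:
    drifted = (BASELINE_ACCURACY - sampled_accuracy) > DRIFT_THRESHOLD
    return ESCALATED_RATE if drifted else SAMPLE_RATE

todays_tickets = [f"TKT-{i}" for i in range(2000, 2100)]  # 100 tickets/day
to_review = pick_sample(todays_tickets, SAMPLE_RATE)      # ~10 tickets to evaluate
print(len(to_review), next_sample_rate(sampled_accuracy=0.88))  # 10 0.25
```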
Option 2: AI-Assisted Evaluation (Meta-Evaluation)
Use a separate AI model to evaluate your main model's output:
Evaluator Prompt:
"You are a quality assessor. A ticket classification AI made the following decision.
Is the classification correct? Why or why not?
Ticket: [input]
AI Output: [classification + confidence]
Actual Category: [ground truth]
Score correctness 1–10 and explain."
Caveat: AI-assisted evaluation has blind spots of its own; spot-check it against human review at least quarterly.
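A provider-agnostic sketch of the meta-evaluation call is below. `call_model` is a placeholder for whatever client reaches your evaluator model, not a real library function; the prompt string mirrors the evaluator prompt above.

```python
# Meta-evaluation sketch: build the evaluator prompt and send it to a second model.
EVALUATOR_PROMPT = """You are a quality assessor. A ticket classification AI made the following decision.
Is the classification correct? Why or why not?
Ticket: {ticket}
AI Output: {ai_output}
Actual Category: {ground_truth}
Score correctness 1-10 and explain."""

def call_model(prompt: str) -> str:
    # Placeholder: swap in your provider's chat/completions client here.
    raise NotImplementedError

def meta_evaluate(ticket: str, ai_output: str, ground_truth: str) -> str:
    prompt = EVALUATOR_PROMPT.format(ticket=ticket, ai_output=ai_output,
                                     ground_truth=ground_truth)
    return call_model(prompt)  # expect a 1-10 score plus a short rationale
```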
Option 3: Automated Test Suite
For simple tasks, build a test harness:
```python
# Test suite for the ticket classifier. `classify` is the system under test
# (defined in your application) and is assumed to return
# {"category": str, "confidence": float}.
test_cases = [
    {
        "input": "I can't reset my password",
        "expected": "Account Access",
        "expected_confidence": (0.80, 1.0),   # acceptable range
    },
    {
        "input": "Your product is slow",
        "expected": "Product Feedback",
        "expected_confidence": (0.70, 0.95),
    },
]

for test in test_cases:
    result = classify(test["input"])
    low, high = test["expected_confidence"]
    assert result["category"] == test["expected"], test["input"]
    assert low <= result["confidence"] <= high, result["confidence"]
```
Step 6: Continuous Improvement Loop
Week 1–2: Establish baseline (evaluate 100% of outputs)
↓
Week 3+: Ongoing sampling (10% daily)
↓
Monthly: Aggregate results → identify top 3 errors
↓
Update: Revise the prompt or fine-tune the model
↓
Re-evaluate: Test on same 50-ticket baseline → measure improvement
↓
Deploy: Push updated model to production
↓
Repeat
Foundational Skills Checklist
- Critical Evaluation: Rubric defines 3+ dimensions of quality; scoring is explicit and repeatable
- Workflow Integration: Evaluation results feed back into prompt updates; a monthly improvement cycle is documented
- Prompting: Evaluation identifies specific prompt weaknesses (e.g., "doesn't distinguish Account Access from Technical Support")
- AI Strategy: Evaluation metrics tie to business goal (e.g., "accuracy ≥95%" supports "reduce ticket routing time" goal)
- Ethics & Trust: Blind evaluation ensures fairness; inter-rater agreement >80%; results are transparent to stakeholders