Prompting Toolkit: Intent-First Design
Use this toolkit to craft and evolve prompts across models and modalities. Output: 1 prompt library per use case.
What Is Intent-First Prompting?
Intent-first prompting starts with what you want the AI to do and why, then works backward to the words. It's model-agnostic—the same intent strategy works for Claude, Gemini, ChatGPT, or any multimodal system.
Key principle: The prompt is a contract. You specify what success looks like; the model delivers to spec.
Step 1: Define Intent (The Why)
Intent Statement Template
**Task:** [What is the AI doing? Be specific.]
Example: "Classify customer support tickets into 1 of 8 predefined categories."
**Success Criteria:** [How will you know it worked?]
Example: "Accuracy ≥95%, no hallucinations about categories not in the list, confidence score <0.7 triggers human review."
**Failure Mode:** [What's the worst error?]
Example: "Confidently assigning a safety-critical issue (injury, fraud) to 'general inquiry'."
**Context:** [Who uses this output? Why does it matter?]
Example: "Support team routes tickets. Wrong routing wastes 20 min/ticket and frustrates customers."
Step 2: Craft the Instruction (Model-Agnostic Core)
Use this template regardless of which model you're using:
Core Instruction Template
You are a [ROLE: expert ticket classifier / data analyst / etc.].
**Your task:** [TASK: classify the following support ticket into 1 of these categories]
**Categories (use only these):**
1. Billing & Payment
2. Technical Support
3. Account Access
4. Product Feedback
5. Complaint / Escalation
6. General Inquiry
7. Other
8. Safety / Fraud (escalate immediately)
**Instructions:**
- Read the customer message carefully.
- Identify the PRIMARY issue (if multiple topics, choose the most urgent).
- Respond in JSON format only: {"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}
- If the message doesn't fit any category, respond: {"category": "Other", "confidence": ..., "reasoning": "..."}
- If the message contains ANY mention of a safety issue (injury, fraud, threats), classify it as "Safety / Fraud" immediately and STOP.
**IMPORTANT:**
- Do NOT invent categories.
- Do NOT explain beyond the JSON response.
- Confidence must reflect your certainty (≥0.7 = high confidence; <0.7 = uncertain, triggers human review).
**Ticket:**
[INSERT TICKET TEXT HERE]
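To keep this core model-agnostic in code, store the instruction once and swap in only the ticket text. A minimal sketch; `CORE_INSTRUCTION` and `build_prompt` are this toolkit's names, not a library API:

```python
# Store the Step 2 instruction once; inject only the ticket per request.
CORE_INSTRUCTION = """\
You are an expert ticket classifier.
[... paste the full Step 2 template here: categories, instructions, IMPORTANT rules ...]

**Ticket:**
[INSERT TICKET TEXT HERE]
"""

def build_prompt(ticket_text: str) -> str:
    """Swap the template's own placeholder for one ticket's text.

    str.replace is used instead of str.format, which would choke on the
    JSON braces inside the full template.
    """
    return CORE_INSTRUCTION.replace("[INSERT TICKET TEXT HERE]", ticket_text)

print(build_prompt("I was charged twice for my subscription last month."))
```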
Step 3: Model-Specific Adjustments
For Claude (Anthropic)
Strengths: reasoning, nuance, instruction-following
Adjustments:
- Use clear structural language ("You are X. Your task is Y.")
- Ask for reasoning; Claude excels at explanation.
- Add a "think step-by-step" phrase for complex tasks.
Tweaked instruction for Claude:
You are an expert customer support analyst. Your task is to classify support tickets.
[Same categories and instructions as above, plus:]
Before responding, think through your reasoning:
1. What is the PRIMARY issue the customer is describing?
2. Which category fits best?
3. How confident are you (0.0-1.0)?
Then respond in JSON format: {"category": "...", "confidence": ..., "reasoning": "..."}
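A minimal call sketch using the Anthropic Python SDK; the model ID below is an example, so substitute whatever version you actually run:

```python
import json
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID
    max_tokens=512,
    # build_prompt is the Step 2 sketch; any prompt string works here.
    messages=[{"role": "user", "content": build_prompt("I can't log in to my account.")}],
)

# Claude returns a list of content blocks; this assumes the model obeyed
# the JSON-only instruction, so the first block parses cleanly.
result = json.loads(response.content[0].text)
print(result["category"], result["confidence"])
```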
For GPT-4 / ChatGPT
Strengths: speed, diverse modalities
Adjustments:
- Be more explicit about JSON formatting (ChatGPT can be loose with format).
- Use system/user message structure if available.
Tweaked instruction for ChatGPT:
System: You are a customer support ticket classifier. You classify tickets into exactly one category.
User:
[Same core instruction, with emphasis on JSON structure]
You MUST respond with valid JSON only:
{
"category": "[exact category name from list]",
"confidence": 0.0-1.0,
"reasoning": "[2-3 sentence explanation]"
}
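A matching sketch with the OpenAI Python SDK, using the system/user split plus the API's JSON mode to enforce valid JSON (model ID is an example):

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # example model ID
    messages=[
        {"role": "system", "content": "You are a customer support ticket "
         "classifier. You classify tickets into exactly one category."},
        # build_prompt is the Step 2 sketch; it already demands JSON output.
        {"role": "user", "content": build_prompt("My bill is wrong and I can't log in.")},
    ],
    response_format={"type": "json_object"},  # JSON mode: guarantees parseable output
)

result = json.loads(response.choices[0].message.content)
print(result["category"], result["confidence"])
```

Note that JSON mode guarantees syntactically valid JSON, not your schema; the validation guard at the end of this toolkit covers the rest.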
For Gemini (Google)
Strengths: multimodal, real-time data
Adjustments:
- Gemini handles multimodal well; if you have images/docs, include them.
- Be explicit about safety boundaries.
Tweaked instruction for Gemini:
Role: You are a content classifier.
Task: Classify the following input into one of these categories: [list].
Safety rule: If any input mentions harm, injury, or fraud, classify as "Safety / Fraud" immediately.
Output format: JSON only.
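A matching sketch with the `google-generativeai` SDK; the `response_mime_type` setting asks Gemini to emit JSON only (model ID is an example):

```python
import json
import os
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-pro")  # example model ID
response = model.generate_content(
    # build_prompt is the Step 2 sketch, with the safety rule included.
    build_prompt("Someone used my card without permission."),
    generation_config={"response_mime_type": "application/json"},
)

result = json.loads(response.text)
print(result["category"], result["confidence"])
```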
Step 4: Test & Iterate (The Evaluation Loop)
Test Dataset
Create 5-10 labeled examples that cover:
- Straightforward case (clear category)
- Ambiguous case (could fit 2 categories)
- Rare/edge case (unusual request)
- Safety/fraud case
- Out-of-scope case (should be "Other")
Evaluation Matrix
| Test Case | Input | Expected Output | Claude Output | GPT-4 Output | Gemini Output | Notes |
|---|---|---|---|---|---|---|
| 1. Straightforward | "I can't log in" | Account Access / 0.95 | ✓ / 0.92 | ✓ / 0.88 | ✓ / 0.91 | All good |
| 2. Ambiguous | "Bill wrong & can't login" | Billing / 0.65 | Billing/0.70 | Account/0.68 | Billing/0.65 | Gemini & Claude agree |
| 3. Rare case | "API returned error 503" | Tech Support / 0.8 | Tech/0.82 | Tech/0.78 | Tech/0.75 | All classify correctly |
| 4. Safety | "Someone stole my password" | Safety / Fraud / 1.0 | ✓ / 1.0 | ✓ / 1.0 | ✓ / 1.0 | Perfect |
| 5. Out-of-scope | "How's the weather?" | Other / 0.9 | Other/0.88 | Other/0.92 | Other/0.85 | All good |
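The matrix above can be filled in automatically. A minimal harness sketch, assuming a `classify` callable that wraps any of the Step 3 calls and returns the parsed JSON dict (test data mirrors the matrix):

```python
# Labeled test cases: (ticket text, expected category).
TEST_CASES = [
    ("I can't log in", "Account Access"),
    ("Bill wrong & can't login", "Billing & Payment"),
    ("API returned error 503", "Technical Support"),
    ("Someone stole my password", "Safety / Fraud"),
    ("How's the weather?", "Other"),
]

def evaluate(classify) -> float:
    """Run every labeled case through `classify` and return accuracy.

    `classify` is any callable taking ticket text and returning the parsed
    {"category", "confidence", "reasoning"} dict -- a stand-in, not a fixed API.
    """
    correct = 0
    for text, expected in TEST_CASES:
        result = classify(text)
        hit = result["category"] == expected
        correct += hit
        print(f"{'PASS' if hit else 'FAIL'} | {text!r} -> "
              f"{result['category']} ({result['confidence']:.2f})")
    return correct / len(TEST_CASES)

# accuracy = evaluate(classify_with_claude)  # plug in any model wrapper
```

Run the same harness against each model and paste the results into the matrix; disagreements between models are exactly the ambiguous cases worth inspecting by hand.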
Iteration Rules
If accuracy <95%:
- Add clarifying examples to the prompt (show the model a category it confused).
- Adjust the instruction language (e.g., "Safety is ANY mention of injury, fraud, or threat").
- Try a different model (Claude for nuance, GPT-4 for speed).
If confidence scores are wildly different across models:
- Increase the specificity of the prompt.
- Ask the model to "explain first, then classify" (forces reasoning).
Step 5: Deploy & Monitor (Continuous Learning)
Monitoring Dashboard
**Prompt Performance (Weekly)**
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| Accuracy | ≥95% | 94.2% | ⚠️ Declining |
| Confidence (avg) | ≥0.85 | 0.78 | ⚠️ Drift |
| Hallucinations | 0 | 2/500 | ⚠️ Above target |
| Human override rate | <5% | 3.2% | ✓ Good |
**Recent Changes:**
- Model upgraded to Opus 4.5 (2026-02-01). Accuracy dropped 1.5%; investigating prompt compatibility.
**Next Action:**
- Retrain prompt for Opus 4.5 using updated test set.
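These dashboard rows can be computed from a simple prediction log. A sketch, assuming each logged row carries illustrative fields for the prediction, the human-confirmed label, and override status:

```python
def weekly_metrics(rows: list[dict]) -> dict:
    """Compute the dashboard metrics from one week of logged predictions.

    Each row uses illustrative keys: 'predicted', 'actual', 'confidence',
    'overridden' (a human changed the routing), and 'hallucinated'
    (the model emitted a category not on the list).
    """
    n = len(rows)
    return {
        "accuracy": sum(r["predicted"] == r["actual"] for r in rows) / n,
        "avg_confidence": sum(r["confidence"] for r in rows) / n,
        "hallucinations": sum(r["hallucinated"] for r in rows),
        "override_rate": sum(r["overridden"] for r in rows) / n,
    }
```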
Quarterly Prompt Refresh
Every 3 months, review:
- Model accuracy drift (new model versions may need prompt tweaks).
- Category distribution changes (new ticket types emerging?).
- User feedback on misclassifications (add examples to prompt).
- Cost/latency (should we switch models?).
Prompt Library Template
Keep all production prompts in a single document (or database) for version control and comparison:
# Production Prompts (2026 Q1)
## Prompt v2.3: Support Ticket Classification
**Created:** 2026-01-15
**Last Updated:** 2026-02-01
**Model(s):** Claude 3.5 Sonnet (primary), GPT-4 (backup)
**Accuracy:** 94.8% (100 test cases)
**Status:** ACTIVE
**Full Prompt:**
[Paste complete prompt here]
**Test Results:**
[Link to evaluation matrix]
**Known Issues:**
- Gemini 3.0 struggles with "Technical Support vs. Product Feedback" distinction; not recommended for this task.
**Next Review:** 2026-05-01
---
## Prompt v2.2: Support Ticket Classification (DEPRECATED)
**Retired:** 2026-02-01
**Reason:** Accuracy <92% with new model releases.
**Archive:** [Link to archived version]
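If the library lives in code or a database rather than a document, the same fields map onto a plain record. The schema below is illustrative, not a standard:

```python
# One record per prompt version; CORE_INSTRUCTION is the Step 2 sketch.
PROMPT_LIBRARY = {
    ("support-ticket-classification", "2.3"): {
        "status": "ACTIVE",
        "models": ["Claude 3.5 Sonnet (primary)", "GPT-4 (backup)"],
        "accuracy": 0.948,  # from the 100-case evaluation matrix
        "prompt": CORE_INSTRUCTION,
        "known_issues": ["Gemini 3.0: Technical Support vs. Product Feedback"],
        "next_review": "2026-05-01",
    },
}
```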
Foundational Skills Checklist
- Prompting: Intent clearly stated, instruction model-agnostic, test cases documented
- Critical Evaluation: Accuracy measured, confidence scores validated, edge cases tested
- Workflow Integration: Prompt versioned, handoff to human review defined (e.g., confidence <0.7)
- AI Strategy: Prompt supports business goal (ROI, risk threshold, compliance requirement)
Quick Reference: Prompt Dos & Don'ts
| Do | Don't |
|---|---|
| Be specific ("classify into 8 categories") | Be vague ("improve the process") |
| Show examples | Assume the model knows your context |
| Define success criteria | Guess whether it worked |
| Version your prompts | Change them randomly |
| Test on realistic data | Test on toy examples |
| Monitor drift over time | "Set and forget" |
| Iterate based on failures | Blame the model for ambiguity |
| Ask for JSON/structured output | Accept free-form text |
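That last row is worth enforcing in code: even when you request JSON, validate the response before routing on it. A sketch combining the JSON parse, the category whitelist, and the confidence-below-0.7 human handoff from the checklist:

```python
import json

CATEGORIES = {
    "Billing & Payment", "Technical Support", "Account Access",
    "Product Feedback", "Complaint / Escalation", "General Inquiry",
    "Other", "Safety / Fraud",
}

def validate(raw_output: str) -> dict:
    """Parse and sanity-check a model response before acting on it."""
    result = json.loads(raw_output)  # raises ValueError on free-form text
    if result["category"] not in CATEGORIES:
        raise ValueError(f"Hallucinated category: {result['category']!r}")
    # Low-confidence and safety cases hand off to a human, per the checklist.
    result["needs_human_review"] = (
        result["confidence"] < 0.7 or result["category"] == "Safety / Fraud"
    )
    return result
```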