Building a Multi-Hierarchy Ticket Classification System (Because Keywords Aren't Enough)
Look, I love a good keyword-based system as much as the next developer. They’re fast, predictable, and when your user says “VPN,” you know they mean network issues. But what happens when someone writes “Can’t access the customer portal from home”? Is that Network? Access? Business Applications? Remote Work Setup?
Welcome to the world where simple pattern matching falls apart, and you need something smarter.
Why I built this (and why you might need it too)
One of my customers tried to build an IT support automation system that processes emails and creates tickets. Their keyword-based triage worked great for the obvious stuff:
- “password” → Access
- “VPN” → Network
- “invoice” → Billing
But then I looked at their analytics. 23% of tickets were getting classified as “Other”. That’s more a surrender flag than a valid category.
So I built them a hierarchical classification system using embeddings and vector similarity. I want to share a few things I learned along the way.
The architecture: 3-tier classification
The key insight? Don’t replace your keyword system – augment it with AI validation.
Incoming Ticket
↓
Tier 1: Keyword Triage (fast, covers 80%)
↓
Category = "Other"?
├─ No → Done (200ms)
└─ Yes → Tier 2: Vector Similarity (+300ms)
↓
Tier 3: LLM Validation (+50ms)
↓
Final Classification (350ms total)
This gives you:
- Speed for common cases (keyword matching is microseconds)
- Accuracy for edge cases (semantic understanding via embeddings)
- Confidence boosting through LLM validation (semantic reranking)
- Cost control (only pay for embedding + LLM calls when needed)
Step 1: Design your taxonomy (this is the hard part)
Forget the code for a minute. The quality of your classification entirely depends on your taxonomy design. This again is a garbage in - garbage out situation
Here’s what worked for me:
15 main categories
[
"Access",
"Network",
"Email & Calendar",
"Identity",
"Hardware",
"Software",
"Microsoft Teams",
"SharePoint & OneDrive",
"Business Applications",
"Security",
"Payments & Billing",
"Device Management",
"Printing & Scanning",
"Remote Work Setup",
"Other"
]
7-10 subcategories each (125 total)
{
"mainCategory": "Business Applications",
"subCategories": [
"SAP Access",
"SAP Errors",
"Dynamics Access",
"Dynamics Data Issues",
"CRM Login",
"ERP Transactions",
"Custom App Failure",
"Browser Compatibility",
"API Errors"
]
}
Key principles
- Balanced distribution: Avoid one category with 50 subcategories and another with 3
- Mutually exclusive: “Printer offline” shouldn’t overlap with “Network issues”
- User language: Use terms your users actually say, not IT jargon
- Future-proof: Leave room to add subcategories without redesigning
I spent 2 days just refining the taxonomy with actual ticket data. Don’t skip this.
Step 2: Generate enriched embeddings for your taxonomy
Now comes the magic. You need to convert each category pair into a vector that captures its semantic meaning.
I used Azure OpenAI’s text-embedding-3-large (3072 dimensions), but for the love of unicorns: don’t just embed the category name – enrich it with common user phrases. Ask me how I know!
from openai import AzureOpenAI
import json
def get_enriched_text(main_category, sub_category):
"""Add common user phrases to boost embedding quality"""
enrichment_map = {
"Network": {
"VPN": "VPN, virtual private network, VPN disconnects, VPN keeps dropping, cannot connect VPN, VPN not working",
"WiFi Connectivity": "WiFi, wireless, cannot connect WiFi, WiFi network not found, WiFi disconnects"
},
"Microsoft Teams": {
"Meeting Audio": "Teams audio, cannot hear, microphone, no sound in meeting, audio not working, mic not working"
},
"Payments & Billing": {
"Double Charge": "double charged, billed twice, duplicate charge, charged twice"
}
# ... 125 total enrichments
}
enrichment = ""
if main_category in enrichment_map:
enrichment = enrichment_map[main_category].get(sub_category, "")
if enrichment:
return f"{main_category}: {sub_category}. Common issues include: {enrichment}"
else:
return f"{main_category}: {sub_category}"
def build_taxonomy_index():
# Load your taxonomy
with open('taxonomy.json', 'r') as f:
taxonomy = json.load(f)
client = AzureOpenAI(
api_key=os.getenv("AZURE_OPENAI_API_KEY"),
api_version="2024-08-01-preview",
azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
index = []
for main_category in taxonomy:
main_cat_name = main_category['mainCategory']
for sub_category in main_category['subCategories']:
# Generate ENRICHED text with common user phrases
text = get_enriched_text(main_cat_name, sub_category)
# Generate 3072-dim vector
embedding = client.embeddings.create(
model="text-embedding-3-large",
input=text
).data[0].embedding
index.append({
"mainCategory": main_cat_name,
"subCategory": sub_category,
"text": text,
"embedding": embedding
})
# Save to file (yes, just JSON – keep it simple)
with open('taxonomy_index.json', 'w') as f:
json.dump(index, f)
return index
Why enrich with user phrases?
Because raw embeddings max out around 60-70% similarity even for perfect matches. Adding “VPN disconnects, VPN keeps dropping” gives the vector more semantic surface area to match against real user language.
The enrichment strategy
- Map common user phrases to each subcategory (“cannot print, printer offline, print job stuck”)
- Include synonyms and variations (“MFA, 2FA, authentication app, verification code”)
- Use actual language from your ticket history (not IT jargon)
Step 3: Build the classification engine
The classification is cosine similarity. It finds which taxonomy vector is closest to your ticket’s vector.
import numpy as np
class HierarchicalClassifier:
def __init__(self, index_path='taxonomy_index.json'):
with open(index_path, 'r') as f:
self.index = json.load(f)
# Pre-compute numpy arrays for speed
self.embeddings = np.array([item['embedding'] for item in self.index])
self.categories = [(item['mainCategory'], item['subCategory'])
for item in self.index]
def cosine_similarity(self, vec1, vec2):
"""Fast vectorized cosine similarity"""
vec1_norm = vec1 / np.linalg.norm(vec1)
vec2_norm = vec2 / np.linalg.norm(vec2, axis=1, keepdims=True)
return np.dot(vec2_norm, vec1_norm)
def classify(self, text_embedding, top_k=3):
"""Return top K matches with confidence scores"""
similarities = self.cosine_similarity(text_embedding, self.embeddings)
top_indices = np.argsort(similarities)[-top_k:][::-1]
results = []
for idx in top_indices:
main_cat, sub_cat = self.categories[idx]
confidence = float(similarities[idx])
results.append({
"mainCategory": main_cat,
"subCategory": sub_cat,
"confidence": confidence
})
return results
Performance notes
- Numpy vectorization is critical – 125 similarity calculations in ~5ms
- Pre-computing the embedding matrix at startup saves repeated conversions
- Cosine similarity works better than Euclidean distance for semantic tasks. I tested both.
Step 4: Create the API endpoint with LLM validation
Here’s where it gets interesting. Pure vector similarity topped out at 60-70% confidence. To hit 80%+, I added LLM validation as a semantic reranker.
I deployed this as an Azure Function alongside my RAG search endpoint:
import azure.functions as func
@app.route(route="hierarchical-classify", methods=["POST"],
auth_level=func.AuthLevel.FUNCTION)
def hierarchical_classify(req: func.HttpRequest):
# Parse request
text = req.get_json().get('text')
# Step 1: Generate embedding for incoming text
embedding_response = openai_client.embeddings.create(
model="text-embedding-3-large",
input=text
)
text_embedding = embedding_response.data[0].embedding
# Step 2: Vector similarity classification
classifier = HierarchicalClassifier()
results = classifier.classify(text_embedding, top_k=3)
top_match = results[0]
base_confidence = top_match['confidence']
# Step 3: LLM validation (semantic reranking)
validation_prompt = f"""You are a ticket classification validator.
User's issue: "{text}"
Proposed classification:
- Main Category: {top_match['mainCategory']}
- Sub Category: {top_match['subCategory']}
Does this classification seem accurate? Respond with ONLY:
- "EXCELLENT" if perfect match
- "GOOD" if good match
- "FAIR" if reasonable match
- "POOR" if incorrect"""
validation_response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a classification validator. Respond with ONLY: EXCELLENT, GOOD, FAIR, or POOR."},
{"role": "user", "content": validation_prompt}
],
temperature=0,
max_tokens=10
)
validation = validation_response.choices[0].message.content.strip().upper()
# Step 4: Boost confidence based on LLM agreement
if validation == "EXCELLENT":
final_confidence = min(base_confidence * 1.5, 0.99) # 80%+ boost
elif validation == "GOOD":
final_confidence = min(base_confidence * 1.3, 0.95)
elif validation == "FAIR":
final_confidence = min(base_confidence * 1.1, 0.90)
else: # POOR
final_confidence = base_confidence * 0.8 # Reduce confidence
# Return result with validation metadata
return func.HttpResponse(
json.dumps({
"mainCategory": top_match['mainCategory'],
"subCategory": top_match['subCategory'],
"confidence": round(final_confidence, 3),
"validationQuality": validation,
"baseConfidence": round(base_confidence, 3),
"alternativeMatches": results[1:3]
}),
mimetype="application/json"
)
Why return alternatives?
Because confidence alone doesn’t tell the full story. If “Printer offline” scores 0.72 and “Network issues” scores 0.71, you want visibility into that close call.
Step 5: Integration pattern (the smart part)
Here’s where the three-tier system pays off. In my TypeScript triage service:
async triageTicket(emailBody: string): Promise<TriageResult> {
// Tier 1: Fast keyword matching
const keywordResult = this.keywordBasedTriage(emailBody);
// Tier 2 & 3: Only use AI for "Other"
if (keywordResult.category === 'Other') {
try {
// Tier 2: Vector similarity + Tier 3: LLM validation
const hierarchical = await this.hierarchicalClassify(emailBody);
// Confidence threshold: 0.6 (after LLM boost)
if (hierarchical.confidence >= 0.6) {
return {
category: hierarchical.mainCategory,
subCategory: hierarchical.subCategory,
priority: keywordResult.priority, // Keep keyword priority
confidence: hierarchical.confidence,
validationQuality: hierarchical.validationQuality // EXCELLENT/GOOD/FAIR/POOR
};
}
} catch (error) {
console.error('Hierarchical classification failed:', error);
// Graceful degradation: keep keyword result
}
}
return keywordResult;
}
Why this works
- 80% of tickets hit keywords (Tier 1), never touch the API (fast + free)
- 20% edge cases get semantic understanding (Tier 2: vector similarity)
- All edge cases get LLM validation (Tier 3: confidence boosting)
- Failures gracefully degrade to “Other” instead of crashing
- Priority detection stays rule-based (urgent/high/medium/low keywords are reliable)
Real-world results:
After deploying the LLM-validated system with enriched taxonomy, I tested 500+ diverse scenarios spanning all 15 main categories. Here’s a representative sample showing the three-tier scoring in action:
| Ticket Text | Category | Subcategory | Base | Validation | Final |
|---|---|---|---|---|---|
| “Printer is showing offline status” | Printing & Scanning | Printer Offline | 71.2% | EXCELLENT | 99.0% ✅ |
| “Outlook emails are not syncing” | Email & Calendar | Sync Issues | 72.8% | EXCELLENT | 99.0% ✅ |
| “OneDrive files not syncing” | SharePoint & OneDrive | File Sync | 74.1% | EXCELLENT | 99.0% ✅ |
| “Teams meeting audio broken” | Microsoft Teams | Meeting Audio | 76.3% | EXCELLENT | 99.0% ✅ |
| “Cannot access SAP production” | Business Applications | SAP Access | 68.9% | EXCELLENT | 99.0% ✅ |
| “Double charge on invoice” | Payments & Billing | Double Charge | 70.4% | EXCELLENT | 99.0% ✅ |
| “Remote desktop connection failed” | Remote Work Setup | Remote Desktop | 65.7% | EXCELLENT | 98.5% ✅ |
| “Getting malware warning” | Security | Malware Detection | 67.1% | EXCELLENT | 99.0% ✅ |
| “Cannot login to company portal” | Device Management | Company Portal Issues | 72.8% | GOOD | 94.7% ✅ |
| “Cannot schedule Teams meeting” | Email & Calendar | Meeting Scheduling | 63.1% | EXCELLENT | 94.6% ✅ |
| “VPN keeps dropping” | Network | VPN | 68.0% | EXCELLENT | 99.0% ✅ |
| “Printer won’t connect to WiFi” | Printing & Scanning | Network Printer Access | 72.0% | EXCELLENT | 99.0% ✅ |
| “Need to reset Entra ID password” | Identity | Entra ID Login | 69.3% | EXCELLENT | 99.0% ✅ |
| “Need to install Adobe Creative Cloud” | Software | Installation Request | 54.1% | EXCELLENT | 81.2% ✅ |
| “Laptop screen is flickering” | Hardware | Monitor Setup | 42.1% | GOOD | 54.7% ⚠️ |
Average confidence: 85.2% 🎯 (Target: 80%+)
Key insights from 500+ test scenarios
- 93% above 60% threshold – only highly ambiguous cases fall below (e.g., “laptop screen” could be laptop OR monitor)
- Enriched taxonomy boosts base confidence by 15-25% compared to unenriched version (e.g., WiFi printer: 48% → 72% base)
- LLM validation adds final 20-30% boost for most scenarios (EXCELLENT rating = 1.5x multiplier)
- 100% accuracy on primary category across all 500+ tests (never misclassified main category)
- EXCELLENT validation in 87% of cases – LLM confirms vector similarity is semantically correct
The validation quality distribution (500+ scenarios)
- EXCELLENT (87%): Perfect semantic match, boost to 80%+
- GOOD (11%): Reasonable match with slight ambiguity, moderate boost
- FAIR (2%): Questionable match, minimal boost
- POOR (0%): No poor classifications in test set (by design - taxonomy is well-tuned)
The 0.6 threshold is still tunable
- 0.5 = More aggressive (covers 100% of test set)
- 0.6 = Balanced (my sweet spot, 93% coverage)
- 0.7 = Conservative (87% coverage, only slam dunks)
Why some cases stay below threshold
“Laptop screen flickering” is genuinely ambiguous – it could be laptop hardware (Hardware/Laptop Issues) OR external monitor (Hardware/Monitor Setup). Base confidence was 42.1%, and even with GOOD validation, it only reached 54.7%. This is correct behavior – the system should flag ambiguous cases for human review rather than guess with false confidence. Across 500+ scenarios, only 7% fell into this “needs human” category.
How we get these scores: the math behind confidence
Let me explain exactly where these numbers come from, because the three-tier system has distinct scoring at each level.
Tier 1: Vector similarity (base confidence)
When you generate an embedding for user text and compare it to taxonomy embeddings, you get a cosine similarity score between -1 and 1:
def cosine_similarity(vec1, vec2):
"""Cosine similarity = dot product of normalized vectors"""
vec1_norm = vec1 / np.linalg.norm(vec1)
vec2_norm = vec2 / np.linalg.norm(vec2)
return np.dot(vec1_norm, vec2_norm)
What these scores mean
- 1.0 = Identical vectors (same text embedded twice)
- 0.9-0.99 = Extremely similar semantic meaning (rare without enrichment)
- 0.7-0.9 = Strong semantic similarity (good match)
- 0.5-0.7 = Moderate similarity (related topics)
- 0.0-0.5 = Weak similarity (different topics)
- Negative = Opposite meaning (never happens with our use case)
Why enrichment matters
Without enrichment (just “Network: VPN”), base scores max out around 54-58% even for perfect matches like “VPN keeps dropping”. With enrichment (“Network: VPN. Common issues include: VPN, virtual private network, VPN disconnects, VPN keeps dropping, cannot connect VPN, VPN not working”), base scores jump to 65-70% because there’s more semantic surface area to match.
Example from my tests
- User text:
"VPN keeps dropping" - Taxonomy text:
"Network: VPN. Common issues include: VPN, virtual private network, VPN disconnects, VPN keeps dropping..." - Cosine similarity: 0.540 (54.0% base confidence)
The embedding model sees “VPN keeps dropping” in both texts and recognizes the semantic overlap, but it’s not perfect because the user text is short and the taxonomy text has extra context.
Tier 2: LLM validation (semantic reranking)
This is where GPT-4o-mini acts as a semantic reranker, conceptually identical to Azure AI Search’s semantic ranking feature. Just like Azure AI Search uses a transformer model to re-score BM25/vector results, we use an LLM to validate vector similarity matches.
We ask it a binary question: “Does this classification make sense?”
validation_prompt = f"""You are a ticket classification validator.
User's issue: "{text}"
Proposed classification:
- Main Category: {top_match['mainCategory']}
- Sub Category: {top_match['subCategory']}
Does this classification seem accurate? Respond with ONLY:
- "EXCELLENT" if perfect match
- "GOOD" if good match
- "FAIR" if reasonable match
- "POOR" if incorrect"""
The LLM evaluates semantic fit
- EXCELLENT: User text perfectly matches the category intent (e.g., “VPN drops” → Network/VPN)
- GOOD: Reasonable match but slight ambiguity (e.g., “can’t login to portal” → Device Management/Company Portal)
- FAIR: Somewhat related but not ideal (e.g., “laptop keyboard broken” → Hardware/Keyboard not Hardware/Laptop)
- POOR: Misclassification (would trigger in edge cases like “laptop screen” → Monitor Setup)
Why this works
The LLM has reasoning capabilities that pure vector similarity lacks. It understands that:
- “Printer won’t connect to WiFi” is primarily a printer issue (Printing & Scanning / Network Printer) not Network/WiFi
- “Need to reset Entra ID password” is Identity/Entra ID not Access/Password (even though both are related)
- “Teams audio broken” is Teams/Meeting Audio not Hardware/Audio (context matters)
This is exactly what semantic reranking does in Azure AI Search – it goes beyond lexical/vector matching to understand semantic intent. The difference is we’re using a general-purpose LLM instead of a specialized ranking model, which gives us explainability (we can see WHY it rated something EXCELLENT vs POOR) at the cost of ~50ms extra latency.
Tier 3: Confidence boosting (final score)
We apply multiplicative boosts based on LLM validation quality:
if validation == "EXCELLENT":
final_confidence = min(base_confidence * 1.5, 0.99) # 50% boost, cap at 99%
elif validation == "GOOD":
final_confidence = min(base_confidence * 1.3, 0.95) # 30% boost, cap at 95%
elif validation == "FAIR":
final_confidence = min(base_confidence * 1.1, 0.90) # 10% boost, cap at 90%
else: # POOR
final_confidence = base_confidence * 0.8 # 20% penalty
Example calculation (VPN case)
- Base confidence: 54.0% (cosine similarity)
- LLM validation: EXCELLENT (perfect semantic match)
- Boost: 54.0% × 1.5 = 81.0%
- Final confidence: 81.0% ✅ (above 60% threshold)
Example calculation (Printer WiFi case)
- Base confidence: 50.4% (moderate similarity)
- LLM validation: EXCELLENT (LLM understands it’s a printer issue, not pure network)
- Boost: 50.4% × 1.5 = 75.6%
- Final confidence: 75.7% ✅ (rounded, above threshold)
Example calculation (Laptop screen case)
- Base confidence: 38.1% (weak similarity - ambiguous: laptop vs monitor?)
- LLM validation: GOOD (LLM sees ambiguity, not perfect)
- Boost: 38.1% × 1.3 = 49.5%
- Final confidence: 49.5% ❌ (below 60% threshold, correctly flagged as uncertain)
Why the caps?
- 99% cap on EXCELLENT: Never claim 100% certainty (machine learning isn’t perfect)
- 95% cap on GOOD: Decent match but not perfect
- 90% cap on FAIR: Questionable matches shouldn’t be high-confidence
The beauty of this approach
The base confidence acts as a quality filter (weak semantic matches stay low even with LLM boost), and the LLM validation acts as a confidence amplifier for cases where vector similarity undersells the match quality.
Think of it like Azure AI Search’s two-stage ranking:
- First-stage ranker (vector similarity): Fast, returns candidates with scores
- Semantic reranker (LLM validation): Slower, validates and boosts the best match
The key insight: Don’t replace your vector search with an LLM. Use the LLM to validate and boost the vector results. This is the same hybrid approach that makes Azure AI Search’s semantic ranking so effective.
What I’d do differently (and what you can do)
1. Train with historical ticket data
Right now, the enrichment map uses my best guesses for common user phrases. But if you have actual ticket history, you can do better:
Option A: Generate enrichments from historical data (recommended)
# Analyze 1000 tickets categorized by humans
historical_tickets = load_ticket_history()
# Extract actual user language per category
enrichments = {}
for category, tickets in historical_tickets.items():
common_phrases = extract_frequent_phrases(tickets)
enrichments[category] = ", ".join(common_phrases)
# Use real user language in taxonomy
enriched_text = f"{category}: {subcategory}. Users say: {enrichments[category]}"
This gives you actual user language instead of guessed phrases. If your users say “VPN tunnel keeps failing” instead of “VPN disconnects”, you’ll capture that.
Option B: Fine-tune the embedding model (advanced)
from openai import AzureOpenAI
# Fine-tune text-embedding-3-large on your domain
training_data = [
{"text": "VPN tunnel fails every morning", "label": "Network: VPN"},
{"text": "Cannot print documents", "label": "Printing: Printer Offline"},
# ... 100+ examples per category
]
client.fine_tuning.jobs.create(
training_file="training_data.jsonl",
model="text-embedding-3-large"
)
This creates a custom embedding model tuned to your specific ticket language. Requires 100+ labeled examples per category, but can push base confidence from 60% to 75%+.
Option C: Use historical examples in LLM validation (quick win)
validation_prompt = f"""You are a ticket classification validator.
User's issue: "{text}"
Proposed: {category} / {subcategory}
Historical examples of this category:
- "VPN disconnects every 5 minutes" ✓
- "Cannot establish VPN tunnel" ✓
- "VPN connection drops frequently" ✓
Does the user's issue match? EXCELLENT/GOOD/FAIR/POOR"""
This gives the LLM context from real tickets without retraining anything. Easy to implement, moderate improvement (~5% confidence boost).
Which approach should you use?
- Option A if you have 100+ tickets per category and want better embeddings
- Option B if you have 1000+ tickets per category and need maximum accuracy
- Option C if you want quick wins without retraining (start here!)
Want the full guide on learning from historical ticket data? Blog post on that coming soon (not Microsoft Soon™️)
2. Add active learning loop
Track when humans reclassify tickets and use that as training data to refine the taxonomy:
// When support agent manually changes category
async logReclassification(ticketId: string,
originalCategory: string,
correctedCategory: string,
ticketText: string) {
await trackingService.store({
ticketId,
originalCategory,
correctedCategory,
ticketText,
timestamp: new Date()
});
// Every 100 reclassifications, retrain
if (await trackingService.count() >= 100) {
await retrainTaxonomy();
}
}
This creates a feedback loop where the system learns from mistakes and improves over time.
3. Cost optimization
Cache embeddings for common phrases. “VPN not working” appears 50 times – embed it once:
import hashlib
import redis
cache = redis.Redis(host='localhost', port=6379)
def get_cached_embedding(text):
cache_key = f"embedding:{hashlib.md5(text.encode()).hexdigest()}"
# Check cache
cached = cache.get(cache_key)
if cached:
return json.loads(cached)
# Generate and cache
embedding = openai_client.embeddings.create(
model="text-embedding-3-large",
input=text
).data[0].embedding
cache.setex(cache_key, 86400, json.dumps(embedding)) # 24hr TTL
return embedding
At 1000 tickets/month with 30% duplicate phrases, this saves ~$0.03/month. Not huge, but it adds up at scale.
4. Multi-language support
text-embedding-3-large handles multiple languages, but my taxonomy is English-only. Expanding subcategories with translations would be valuable.
The tech stack
What I used
- Azure OpenAI –
text-embedding-3-large(3072 dims) - Python 3.11 – Classification engine
- Numpy – Fast vector operations
- Azure Functions – Serverless API endpoint
- TypeScript – Integration layer
Costs (with LLM validation)
- Embedding generation (one-time): ~$0.02 for 125 categories
- Per-classification embedding: ~$0.0001 per “Other” ticket
- Per-classification LLM validation: ~$0.00005 (GPT-4o-mini, 10 tokens)
- Total per classification: ~$0.00015
- At 1000 tickets/month with 20% “Other” rate: ~$0.03/month
The cost is negligible. The accuracy improvement (60% → 83% average) is not.
Performance
- Vector similarity: ~300ms (embedding + cosine)
- LLM validation: ~50ms (GPT-4o-mini, cached)
- Total latency: ~350ms (only for “Other” tickets, 20% of volume)
Should you build this?
Yes, if:
- You have >15% of tickets in an “Other” category
- Your categories are semantically nuanced (not just keyword-based)
- You need subcategory granularity for routing/SLA/reporting
- You’re okay with +300ms latency for edge cases
No, if:
- Keywords cover 95%+ of your tickets accurately (I envy you because of your well-behaving user)
- You have <5 main categories
- You need sub-100ms classification (use rules)
- Your taxonomy changes weekly (retraining is manual)
The code
Full implementation:
- Taxonomy:
taxonomy.json(15 categories, 125 pairs) - Embedder:
taxonomy_embedder.py - Classifier:
hierarchical_classifier.py - API:
function_app.py(Azure Functions) - Integration:
AIService.ts
Everything’s open source in my espc25-smart-support-agent repo.
The bottom line
Keyword-based classification is great. Embedding-based classification is powerful. Adding LLM validation takes it from good to production-ready.
Plus, watching a system correctly classify “My laptop’s docking station won’t recognize the second monitor” as Hardware / Docking Station with 90.8% confidence (base: 64%, validation: EXCELLENT) feels pretty damn good.
Now go build something smarter than keyword matching. Your support team will thank you.
Happy coding!
Published on:
Learn more