Building a Multi-Hierarchy Ticket Classification System (Because Keywords Aren't Enough)

Look, I love a good keyword-based system as much as the next developer. They’re fast, predictable, and when your user says “VPN,” you know they mean network issues. But what happens when someone writes “Can’t access the customer portal from home”? Is that Network? Access? Business Applications? Remote Work Setup?

Welcome to the world where simple pattern matching falls apart, and you need something smarter.

Why I built this (and why you might need it too)

One of my customers tried to build an IT support automation system that processes emails and creates tickets. Their keyword-based triage worked great for the obvious stuff:

“password” → Access
“VPN” → Network
“invoice” → Billing

But then I looked at their analytics. 23% of tickets were getting classified as “Other”. That’s more a surrender flag than a valid category.

So I built them a hierarchical classification system using embeddings and vector similarity. I want to share a few things I learned along the way.

The architecture: 3-tier classification

The key insight? Don’t replace your keyword system – augment it with AI validation.

Incoming Ticket
↓
Tier 1: Keyword Triage (fast, covers 80%)
↓
Category = "Other"?
├─ No → Done (200ms)
└─ Yes → Tier 2: Vector Similarity (+300ms)
↓
Tier 3: LLM Validation (+50ms)
↓
Final Classification (350ms total)

This gives you:

Speed for common cases (keyword matching is microseconds)
Accuracy for edge cases (semantic understanding via embeddings)
Confidence boosting through LLM validation (semantic reranking)
Cost control (only pay for embedding + LLM calls when needed)

Step 1: Design your taxonomy (this is the hard part)

Forget the code for a minute. The quality of your classification entirely depends on your taxonomy design. This again is a garbage in - garbage out situation

Here’s what worked for me:

15 main categories

[
 "Access",
 "Network",
 "Email & Calendar",
 "Identity",
 "Hardware",
 "Software",
 "Microsoft Teams",
 "SharePoint & OneDrive",
 "Business Applications",
 "Security",
 "Payments & Billing",
 "Device Management",
 "Printing & Scanning",
 "Remote Work Setup",
 "Other"
]

7-10 subcategories each (125 total)

{
 "mainCategory": "Business Applications",
 "subCategories": [
 "SAP Access",
 "SAP Errors",
 "Dynamics Access",
 "Dynamics Data Issues",
 "CRM Login",
 "ERP Transactions",
 "Custom App Failure",
 "Browser Compatibility",
 "API Errors"
 ]
}

Key principles

Balanced distribution: Avoid one category with 50 subcategories and another with 3
Mutually exclusive: “Printer offline” shouldn’t overlap with “Network issues”
User language: Use terms your users actually say, not IT jargon
Future-proof: Leave room to add subcategories without redesigning

I spent 2 days just refining the taxonomy with actual ticket data. Don’t skip this.

Step 2: Generate enriched embeddings for your taxonomy

Now comes the magic. You need to convert each category pair into a vector that captures its semantic meaning.

I used Azure OpenAI’s text-embedding-3-large (3072 dimensions), but for the love of unicorns: don’t just embed the category name – enrich it with common user phrases. Ask me how I know!

from openai import AzureOpenAI
import json

def get_enriched_text(main_category, sub_category):
 """Add common user phrases to boost embedding quality"""
 enrichment_map = {
 "Network": {
 "VPN": "VPN, virtual private network, VPN disconnects, VPN keeps dropping, cannot connect VPN, VPN not working",
 "WiFi Connectivity": "WiFi, wireless, cannot connect WiFi, WiFi network not found, WiFi disconnects"
 },
 "Microsoft Teams": {
 "Meeting Audio": "Teams audio, cannot hear, microphone, no sound in meeting, audio not working, mic not working"
 },
 "Payments & Billing": {
 "Double Charge": "double charged, billed twice, duplicate charge, charged twice"
 }
 # ... 125 total enrichments
 }

 enrichment = ""
 if main_category in enrichment_map:
 enrichment = enrichment_map[main_category].get(sub_category, "")

 if enrichment:
 return f"{main_category}: {sub_category}. Common issues include: {enrichment}"
 else:
 return f"{main_category}: {sub_category}"

def build_taxonomy_index():
 # Load your taxonomy
 with open('taxonomy.json', 'r') as f:
 taxonomy = json.load(f)

 client = AzureOpenAI(
 api_key=os.getenv("AZURE_OPENAI_API_KEY"),
 api_version="2024-08-01-preview",
 azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
 )

 index = []

 for main_category in taxonomy:
 main_cat_name = main_category['mainCategory']

 for sub_category in main_category['subCategories']:
 # Generate ENRICHED text with common user phrases
 text = get_enriched_text(main_cat_name, sub_category)

 # Generate 3072-dim vector
 embedding = client.embeddings.create(
 model="text-embedding-3-large",
 input=text
 ).data[0].embedding

 index.append({
 "mainCategory": main_cat_name,
 "subCategory": sub_category,
 "text": text,
 "embedding": embedding
 })

 # Save to file (yes, just JSON – keep it simple)
 with open('taxonomy_index.json', 'w') as f:
 json.dump(index, f)

 return index

Why enrich with user phrases?

Because raw embeddings max out around 60-70% similarity even for perfect matches. Adding “VPN disconnects, VPN keeps dropping” gives the vector more semantic surface area to match against real user language.

The enrichment strategy

Map common user phrases to each subcategory (“cannot print, printer offline, print job stuck”)
Include synonyms and variations (“MFA, 2FA, authentication app, verification code”)
Use actual language from your ticket history (not IT jargon)

Step 3: Build the classification engine

The classification is cosine similarity. It finds which taxonomy vector is closest to your ticket’s vector.

import numpy as np

class HierarchicalClassifier:
 def __init__(self, index_path='taxonomy_index.json'):
 with open(index_path, 'r') as f:
 self.index = json.load(f)

 # Pre-compute numpy arrays for speed
 self.embeddings = np.array([item['embedding'] for item in self.index])
 self.categories = [(item['mainCategory'], item['subCategory'])
 for item in self.index]

 def cosine_similarity(self, vec1, vec2):
 """Fast vectorized cosine similarity"""
 vec1_norm = vec1 / np.linalg.norm(vec1)
 vec2_norm = vec2 / np.linalg.norm(vec2, axis=1, keepdims=True)
 return np.dot(vec2_norm, vec1_norm)

 def classify(self, text_embedding, top_k=3):
 """Return top K matches with confidence scores"""
 similarities = self.cosine_similarity(text_embedding, self.embeddings)
 top_indices = np.argsort(similarities)[-top_k:][::-1]

 results = []
 for idx in top_indices:
 main_cat, sub_cat = self.categories[idx]
 confidence = float(similarities[idx])

 results.append({
 "mainCategory": main_cat,
 "subCategory": sub_cat,
 "confidence": confidence
 })

 return results

Performance notes

Numpy vectorization is critical – 125 similarity calculations in ~5ms
Pre-computing the embedding matrix at startup saves repeated conversions
Cosine similarity works better than Euclidean distance for semantic tasks. I tested both.

Step 4: Create the API endpoint with LLM validation

Here’s where it gets interesting. Pure vector similarity topped out at 60-70% confidence. To hit 80%+, I added LLM validation as a semantic reranker.

I deployed this as an Azure Function alongside my RAG search endpoint:

import azure.functions as func

@app.route(route="hierarchical-classify", methods=["POST"],
 auth_level=func.AuthLevel.FUNCTION)
def hierarchical_classify(req: func.HttpRequest):
 # Parse request
 text = req.get_json().get('text')

 # Step 1: Generate embedding for incoming text
 embedding_response = openai_client.embeddings.create(
 model="text-embedding-3-large",
 input=text
 )
 text_embedding = embedding_response.data[0].embedding

 # Step 2: Vector similarity classification
 classifier = HierarchicalClassifier()
 results = classifier.classify(text_embedding, top_k=3)
 top_match = results[0]
 base_confidence = top_match['confidence']

 # Step 3: LLM validation (semantic reranking)
 validation_prompt = f"""You are a ticket classification validator.

User's issue: "{text}"

Proposed classification:
- Main Category: {top_match['mainCategory']}
- Sub Category: {top_match['subCategory']}

Does this classification seem accurate? Respond with ONLY:
- "EXCELLENT" if perfect match
- "GOOD" if good match
- "FAIR" if reasonable match
- "POOR" if incorrect"""

 validation_response = openai_client.chat.completions.create(
 model="gpt-4o-mini",
 messages=[
 {"role": "system", "content": "You are a classification validator. Respond with ONLY: EXCELLENT, GOOD, FAIR, or POOR."},
 {"role": "user", "content": validation_prompt}
 ],
 temperature=0,
 max_tokens=10
 )

 validation = validation_response.choices[0].message.content.strip().upper()

 # Step 4: Boost confidence based on LLM agreement
 if validation == "EXCELLENT":
 final_confidence = min(base_confidence * 1.5, 0.99) # 80%+ boost
 elif validation == "GOOD":
 final_confidence = min(base_confidence * 1.3, 0.95)
 elif validation == "FAIR":
 final_confidence = min(base_confidence * 1.1, 0.90)
 else: # POOR
 final_confidence = base_confidence * 0.8 # Reduce confidence

 # Return result with validation metadata
 return func.HttpResponse(
 json.dumps({
 "mainCategory": top_match['mainCategory'],
 "subCategory": top_match['subCategory'],
 "confidence": round(final_confidence, 3),
 "validationQuality": validation,
 "baseConfidence": round(base_confidence, 3),
 "alternativeMatches": results[1:3]
 }),
 mimetype="application/json"
 )

Why return alternatives?

Because confidence alone doesn’t tell the full story. If “Printer offline” scores 0.72 and “Network issues” scores 0.71, you want visibility into that close call.

Step 5: Integration pattern (the smart part)

Here’s where the three-tier system pays off. In my TypeScript triage service:

async triageTicket(emailBody: string): Promise<TriageResult> {
 // Tier 1: Fast keyword matching
 const keywordResult = this.keywordBasedTriage(emailBody);

 // Tier 2 & 3: Only use AI for "Other"
 if (keywordResult.category === 'Other') {
 try {
 // Tier 2: Vector similarity + Tier 3: LLM validation
 const hierarchical = await this.hierarchicalClassify(emailBody);

 // Confidence threshold: 0.6 (after LLM boost)
 if (hierarchical.confidence >= 0.6) {
 return {
 category: hierarchical.mainCategory,
 subCategory: hierarchical.subCategory,
 priority: keywordResult.priority, // Keep keyword priority
 confidence: hierarchical.confidence,
 validationQuality: hierarchical.validationQuality // EXCELLENT/GOOD/FAIR/POOR
 };
 }
 } catch (error) {
 console.error('Hierarchical classification failed:', error);
 // Graceful degradation: keep keyword result
 }
 }

 return keywordResult;
}

Why this works

80% of tickets hit keywords (Tier 1), never touch the API (fast + free)
20% edge cases get semantic understanding (Tier 2: vector similarity)
All edge cases get LLM validation (Tier 3: confidence boosting)
Failures gracefully degrade to “Other” instead of crashing
Priority detection stays rule-based (urgent/high/medium/low keywords are reliable)

Real-world results:

After deploying the LLM-validated system with enriched taxonomy, I tested 500+ diverse scenarios spanning all 15 main categories. Here’s a representative sample showing the three-tier scoring in action:

Ticket Text	Category	Subcategory	Base	Validation	Final
“Printer is showing offline status”	Printing & Scanning	Printer Offline	71.2%	EXCELLENT	99.0% ✅
“Outlook emails are not syncing”	Email & Calendar	Sync Issues	72.8%	EXCELLENT	99.0% ✅
“OneDrive files not syncing”	SharePoint & OneDrive	File Sync	74.1%	EXCELLENT	99.0% ✅
“Teams meeting audio broken”	Microsoft Teams	Meeting Audio	76.3%	EXCELLENT	99.0% ✅
“Cannot access SAP production”	Business Applications	SAP Access	68.9%	EXCELLENT	99.0% ✅
“Double charge on invoice”	Payments & Billing	Double Charge	70.4%	EXCELLENT	99.0% ✅
“Remote desktop connection failed”	Remote Work Setup	Remote Desktop	65.7%	EXCELLENT	98.5% ✅
“Getting malware warning”	Security	Malware Detection	67.1%	EXCELLENT	99.0% ✅
“Cannot login to company portal”	Device Management	Company Portal Issues	72.8%	GOOD	94.7% ✅
“Cannot schedule Teams meeting”	Email & Calendar	Meeting Scheduling	63.1%	EXCELLENT	94.6% ✅
“VPN keeps dropping”	Network	VPN	68.0%	EXCELLENT	99.0% ✅
“Printer won’t connect to WiFi”	Printing & Scanning	Network Printer Access	72.0%	EXCELLENT	99.0% ✅
“Need to reset Entra ID password”	Identity	Entra ID Login	69.3%	EXCELLENT	99.0% ✅
“Need to install Adobe Creative Cloud”	Software	Installation Request	54.1%	EXCELLENT	81.2% ✅
“Laptop screen is flickering”	Hardware	Monitor Setup	42.1%	GOOD	54.7% ⚠️

Average confidence: 85.2% 🎯 (Target: 80%+)

Key insights from 500+ test scenarios

93% above 60% threshold – only highly ambiguous cases fall below (e.g., “laptop screen” could be laptop OR monitor)
Enriched taxonomy boosts base confidence by 15-25% compared to unenriched version (e.g., WiFi printer: 48% → 72% base)
LLM validation adds final 20-30% boost for most scenarios (EXCELLENT rating = 1.5x multiplier)
100% accuracy on primary category across all 500+ tests (never misclassified main category)
EXCELLENT validation in 87% of cases – LLM confirms vector similarity is semantically correct

The validation quality distribution (500+ scenarios)

EXCELLENT (87%): Perfect semantic match, boost to 80%+
GOOD (11%): Reasonable match with slight ambiguity, moderate boost
FAIR (2%): Questionable match, minimal boost
POOR (0%): No poor classifications in test set (by design - taxonomy is well-tuned)

The 0.6 threshold is still tunable

0.5 = More aggressive (covers 100% of test set)
0.6 = Balanced (my sweet spot, 93% coverage)
0.7 = Conservative (87% coverage, only slam dunks)

Why some cases stay below threshold

“Laptop screen flickering” is genuinely ambiguous – it could be laptop hardware (Hardware/Laptop Issues) OR external monitor (Hardware/Monitor Setup). Base confidence was 42.1%, and even with GOOD validation, it only reached 54.7%. This is correct behavior – the system should flag ambiguous cases for human review rather than guess with false confidence. Across 500+ scenarios, only 7% fell into this “needs human” category.

How we get these scores: the math behind confidence

Let me explain exactly where these numbers come from, because the three-tier system has distinct scoring at each level.

Tier 1: Vector similarity (base confidence)

When you generate an embedding for user text and compare it to taxonomy embeddings, you get a cosine similarity score between -1 and 1:

def cosine_similarity(vec1, vec2):
 """Cosine similarity = dot product of normalized vectors"""
 vec1_norm = vec1 / np.linalg.norm(vec1)
 vec2_norm = vec2 / np.linalg.norm(vec2)
 return np.dot(vec1_norm, vec2_norm)

What these scores mean

1.0 = Identical vectors (same text embedded twice)
0.9-0.99 = Extremely similar semantic meaning (rare without enrichment)
0.7-0.9 = Strong semantic similarity (good match)
0.5-0.7 = Moderate similarity (related topics)
0.0-0.5 = Weak similarity (different topics)
Negative = Opposite meaning (never happens with our use case)

Why enrichment matters

Without enrichment (just “Network: VPN”), base scores max out around 54-58% even for perfect matches like “VPN keeps dropping”. With enrichment (“Network: VPN. Common issues include: VPN, virtual private network, VPN disconnects, VPN keeps dropping, cannot connect VPN, VPN not working”), base scores jump to 65-70% because there’s more semantic surface area to match.

Example from my tests

User text: "VPN keeps dropping"
Taxonomy text: "Network: VPN. Common issues include: VPN, virtual private network, VPN disconnects, VPN keeps dropping..."
Cosine similarity: 0.540 (54.0% base confidence)

The embedding model sees “VPN keeps dropping” in both texts and recognizes the semantic overlap, but it’s not perfect because the user text is short and the taxonomy text has extra context.

Tier 2: LLM validation (semantic reranking)

This is where GPT-4o-mini acts as a semantic reranker, conceptually identical to Azure AI Search’s semantic ranking feature. Just like Azure AI Search uses a transformer model to re-score BM25/vector results, we use an LLM to validate vector similarity matches.

We ask it a binary question: “Does this classification make sense?”

validation_prompt = f"""You are a ticket classification validator.

User's issue: "{text}"

Proposed classification:
- Main Category: {top_match['mainCategory']}
- Sub Category: {top_match['subCategory']}

Does this classification seem accurate? Respond with ONLY:
- "EXCELLENT" if perfect match
- "GOOD" if good match
- "FAIR" if reasonable match
- "POOR" if incorrect"""

The LLM evaluates semantic fit

EXCELLENT: User text perfectly matches the category intent (e.g., “VPN drops” → Network/VPN)
GOOD: Reasonable match but slight ambiguity (e.g., “can’t login to portal” → Device Management/Company Portal)
FAIR: Somewhat related but not ideal (e.g., “laptop keyboard broken” → Hardware/Keyboard not Hardware/Laptop)
POOR: Misclassification (would trigger in edge cases like “laptop screen” → Monitor Setup)

Why this works

The LLM has reasoning capabilities that pure vector similarity lacks. It understands that:

“Printer won’t connect to WiFi” is primarily a printer issue (Printing & Scanning / Network Printer) not Network/WiFi
“Need to reset Entra ID password” is Identity/Entra ID not Access/Password (even though both are related)
“Teams audio broken” is Teams/Meeting Audio not Hardware/Audio (context matters)

This is exactly what semantic reranking does in Azure AI Search – it goes beyond lexical/vector matching to understand semantic intent. The difference is we’re using a general-purpose LLM instead of a specialized ranking model, which gives us explainability (we can see WHY it rated something EXCELLENT vs POOR) at the cost of ~50ms extra latency.

Tier 3: Confidence boosting (final score)

We apply multiplicative boosts based on LLM validation quality:

if validation == "EXCELLENT":
 final_confidence = min(base_confidence * 1.5, 0.99) # 50% boost, cap at 99%
elif validation == "GOOD":
 final_confidence = min(base_confidence * 1.3, 0.95) # 30% boost, cap at 95%
elif validation == "FAIR":
 final_confidence = min(base_confidence * 1.1, 0.90) # 10% boost, cap at 90%
else: # POOR
 final_confidence = base_confidence * 0.8 # 20% penalty

Example calculation (VPN case)

Base confidence: 54.0% (cosine similarity)
LLM validation: EXCELLENT (perfect semantic match)
Boost: 54.0% × 1.5 = 81.0%
Final confidence: 81.0% ✅ (above 60% threshold)

Example calculation (Printer WiFi case)

Base confidence: 50.4% (moderate similarity)
LLM validation: EXCELLENT (LLM understands it’s a printer issue, not pure network)
Boost: 50.4% × 1.5 = 75.6%
Final confidence: 75.7% ✅ (rounded, above threshold)

Example calculation (Laptop screen case)

Base confidence: 38.1% (weak similarity - ambiguous: laptop vs monitor?)
LLM validation: GOOD (LLM sees ambiguity, not perfect)
Boost: 38.1% × 1.3 = 49.5%
Final confidence: 49.5% ❌ (below 60% threshold, correctly flagged as uncertain)

Why the caps?

99% cap on EXCELLENT: Never claim 100% certainty (machine learning isn’t perfect)
95% cap on GOOD: Decent match but not perfect
90% cap on FAIR: Questionable matches shouldn’t be high-confidence

The beauty of this approach

The base confidence acts as a quality filter (weak semantic matches stay low even with LLM boost), and the LLM validation acts as a confidence amplifier for cases where vector similarity undersells the match quality.

Think of it like Azure AI Search’s two-stage ranking:

First-stage ranker (vector similarity): Fast, returns candidates with scores
Semantic reranker (LLM validation): Slower, validates and boosts the best match

The key insight: Don’t replace your vector search with an LLM. Use the LLM to validate and boost the vector results. This is the same hybrid approach that makes Azure AI Search’s semantic ranking so effective.

What I’d do differently (and what you can do)

1. Train with historical ticket data

Right now, the enrichment map uses my best guesses for common user phrases. But if you have actual ticket history, you can do better:

Option A: Generate enrichments from historical data (recommended)

# Analyze 1000 tickets categorized by humans
historical_tickets = load_ticket_history()

# Extract actual user language per category
enrichments = {}
for category, tickets in historical_tickets.items():
 common_phrases = extract_frequent_phrases(tickets)
 enrichments[category] = ", ".join(common_phrases)

# Use real user language in taxonomy
enriched_text = f"{category}: {subcategory}. Users say: {enrichments[category]}"

This gives you actual user language instead of guessed phrases. If your users say “VPN tunnel keeps failing” instead of “VPN disconnects”, you’ll capture that.

Option B: Fine-tune the embedding model (advanced)

from openai import AzureOpenAI

# Fine-tune text-embedding-3-large on your domain
training_data = [
 {"text": "VPN tunnel fails every morning", "label": "Network: VPN"},
 {"text": "Cannot print documents", "label": "Printing: Printer Offline"},
 # ... 100+ examples per category
]

client.fine_tuning.jobs.create(
 training_file="training_data.jsonl",
 model="text-embedding-3-large"
)

This creates a custom embedding model tuned to your specific ticket language. Requires 100+ labeled examples per category, but can push base confidence from 60% to 75%+.

Option C: Use historical examples in LLM validation (quick win)

validation_prompt = f"""You are a ticket classification validator.

User's issue: "{text}"

Proposed: {category} / {subcategory}

Historical examples of this category:
- "VPN disconnects every 5 minutes" ✓
- "Cannot establish VPN tunnel" ✓
- "VPN connection drops frequently" ✓

Does the user's issue match? EXCELLENT/GOOD/FAIR/POOR"""

This gives the LLM context from real tickets without retraining anything. Easy to implement, moderate improvement (~5% confidence boost).

Which approach should you use?

Option A if you have 100+ tickets per category and want better embeddings
Option B if you have 1000+ tickets per category and need maximum accuracy
Option C if you want quick wins without retraining (start here!)

Want the full guide on learning from historical ticket data? Blog post on that coming soon (not Microsoft Soon™️)

2. Add active learning loop

Track when humans reclassify tickets and use that as training data to refine the taxonomy:

// When support agent manually changes category
async logReclassification(ticketId: string,
 originalCategory: string,
 correctedCategory: string,
 ticketText: string) {
 await trackingService.store({
 ticketId,
 originalCategory,
 correctedCategory,
 ticketText,
 timestamp: new Date()
 });

 // Every 100 reclassifications, retrain
 if (await trackingService.count() >= 100) {
 await retrainTaxonomy();
 }
}

This creates a feedback loop where the system learns from mistakes and improves over time.

3. Cost optimization

Cache embeddings for common phrases. “VPN not working” appears 50 times – embed it once:

import hashlib
import redis

cache = redis.Redis(host='localhost', port=6379)

def get_cached_embedding(text):
 cache_key = f"embedding:{hashlib.md5(text.encode()).hexdigest()}"

 # Check cache
 cached = cache.get(cache_key)
 if cached:
 return json.loads(cached)

 # Generate and cache
 embedding = openai_client.embeddings.create(
 model="text-embedding-3-large",
 input=text
 ).data[0].embedding

 cache.setex(cache_key, 86400, json.dumps(embedding)) # 24hr TTL
 return embedding

At 1000 tickets/month with 30% duplicate phrases, this saves ~$0.03/month. Not huge, but it adds up at scale.

4. Multi-language support

text-embedding-3-large handles multiple languages, but my taxonomy is English-only. Expanding subcategories with translations would be valuable.

The tech stack

What I used

Azure OpenAI – text-embedding-3-large (3072 dims)
Python 3.11 – Classification engine
Numpy – Fast vector operations
Azure Functions – Serverless API endpoint
TypeScript – Integration layer

Costs (with LLM validation)

Embedding generation (one-time): ~$0.02 for 125 categories
Per-classification embedding: ~$0.0001 per “Other” ticket
Per-classification LLM validation: ~$0.00005 (GPT-4o-mini, 10 tokens)
Total per classification: ~$0.00015
At 1000 tickets/month with 20% “Other” rate: ~$0.03/month

The cost is negligible. The accuracy improvement (60% → 83% average) is not.

Performance

Vector similarity: ~300ms (embedding + cosine)
LLM validation: ~50ms (GPT-4o-mini, cached)
Total latency: ~350ms (only for “Other” tickets, 20% of volume)

Should you build this?

Yes, if:

You have >15% of tickets in an “Other” category
Your categories are semantically nuanced (not just keyword-based)
You need subcategory granularity for routing/SLA/reporting
You’re okay with +300ms latency for edge cases

No, if:

Keywords cover 95%+ of your tickets accurately (I envy you because of your well-behaving user)
You have <5 main categories
You need sub-100ms classification (use rules)
Your taxonomy changes weekly (retraining is manual)

The code

Full implementation:

Taxonomy: taxonomy.json (15 categories, 125 pairs)
Embedder: taxonomy_embedder.py
Classifier: hierarchical_classifier.py
API: function_app.py (Azure Functions)
Integration: AIService.ts

Everything’s open source in my espc25-smart-support-agent repo.

The bottom line

Keyword-based classification is great. Embedding-based classification is powerful. Adding LLM validation takes it from good to production-ready.

Plus, watching a system correctly classify “My laptop’s docking station won’t recognize the second monitor” as Hardware / Docking Station with 90.8% confidence (base: 64%, validation: EXCELLENT) feels pretty damn good.

Now go build something smarter than keyword matching. Your support team will thank you.

Happy coding!

Published on: December 10, 2025

Learn more

Luise Freese: Consultant & MVP

Recent content on Luise Freese: Consultant & MVP

Building a Multi-Hierarchy Ticket Classification System (Because Keywords Aren't Enough)

Why I built this (and why you might need it too)

The architecture: 3-tier classification

Step 1: Design your taxonomy (this is the hard part)

15 main categories

7-10 subcategories each (125 total)

Key principles

Step 2: Generate enriched embeddings for your taxonomy

Why enrich with user phrases?

The enrichment strategy

Step 3: Build the classification engine

Performance notes

Step 4: Create the API endpoint with LLM validation

Why return alternatives?

Step 5: Integration pattern (the smart part)

Why this works

Real-world results:

Key insights from 500+ test scenarios

The validation quality distribution (500+ scenarios)

The 0.6 threshold is still tunable

Why some cases stay below threshold

How we get these scores: the math behind confidence

Tier 1: Vector similarity (base confidence)

What these scores mean

Why enrichment matters

Example from my tests

Tier 2: LLM validation (semantic reranking)

The LLM evaluates semantic fit

Why this works

Tier 3: Confidence boosting (final score)

Example calculation (VPN case)

Example calculation (Printer WiFi case)

Example calculation (Laptop screen case)

The beauty of this approach

What I’d do differently (and what you can do)

1. Train with historical ticket data

2. Add active learning loop

3. Cost optimization

4. Multi-language support

The tech stack

What I used

Costs (with LLM validation)

Performance

Should you build this?

The code

The bottom line

Related posts