LLM Routing Architecture¶
Overview¶
Aegis implements a cognitive hierarchy for LLM routing, selecting the appropriate model tier based on task complexity, cost constraints, and availability. This ensures optimal resource utilization while maintaining quality.
Cognitive Hierarchy¶
graph TB
Task[Incoming Task] --> Router{LLM Router}
Router -->|Strategic| Opus[Tier 1: Claude Opus 4.5<br/>Strategic decisions]
Router -->|Fast/Simple| Haiku[Tier 1.5: Claude Haiku 4.5<br/>High-frequency ops]
Router -->|Operational| GLM[Tier 2: GLM-4.7<br/>90% of work]
Router -->|Offline/Vision| Local[Tier 3: Ollama<br/>Local models]
Opus -->|Fallback| Haiku
Haiku -->|Fallback| GLM
GLM -->|Fallback| Local
style Opus fill:#e74c3c,color:#fff
style Haiku fill:#3498db,color:#fff
style GLM fill:#2ecc71,color:#fff
style Local fill:#95a5a6,color:#fff
Model Tiers¶
Tier 1: Claude Opus 4.5¶
Provider: Anthropic (via Claude Code)
Model ID: claude-opus-4-5-20251101
Context Window: 200K tokens
Use Cases:

- Strategic decision-making
- Architecture reviews
- Complex problem-solving
- High-stakes debugging

Cost: High ($15/MTok input, $75/MTok output)
Rate Limit: Not specified (use sparingly)
Access Pattern: Through Claude Code CLI, not programmatic API
When to Use:
# Reserved for critical decisions only
# Example: Architecture refactoring, security audits, strategic planning
Why Not for Routine Work?

- Expensive (15x the per-token price of Haiku, and far costlier per request than GLM's flat subscription)
- Rate-limited
- Overkill for most tasks
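A minimal sketch of reaching this tier programmatically through the router described later in this document, assuming `allow_paid=True` is what unlocks the paid Opus tier (parameter names follow the Selection Logic section below):

from aegis.llm.router import LLMRouter

# Opt in to paid tiers so strategic tasks can reach Opus
router = LLMRouter(allow_paid=True, allow_haiku=True)
tier = router.select_tier(task_type="strategic")  # maps to Tier 1 (Opus)
response, used_tier, cost = await router.complete(
    prompt="Review the proposed service split for the ingestion pipeline",
    system="You are reviewing architecture trade-offs.",
    tier=tier
)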
Tier 1.5: Claude Haiku 4.5¶
Provider: Anthropic (via Claude Agent SDK)
Model ID: claude-haiku-4-5-20251001
Context Window: 200K tokens
Max Output: 8,192 tokens
Use Cases:

- Classification and tagging
- Data extraction and parsing
- Quick summaries
- Validation and verification
- Formatting and conversion

Performance:

- 3x faster than Sonnet
- 1/3 cost of Sonnet
- Response time: ~500ms

Cost: Low ($1/MTok input, $5/MTok output)
Rate Limit: High (sustained throughput)
Auto-Routing: Tasks matching these types automatically use Haiku
HAIKU_TASK_TYPES = [
    "classify", "extract", "parse", "summarize", "validate",
    "tag", "format", "convert", "check", "simple"
]
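As an illustration of how this auto-routing check could work (the helper name `_prefers_haiku` is hypothetical; the actual logic lives inside `aegis.llm`):

def _prefers_haiku(task_type: str | None, prefer_haiku: bool = False) -> bool:
    # Route to Haiku when explicitly requested, or when the task type
    # appears in the Haiku-friendly list above
    return prefer_haiku or (task_type is not None and task_type in HAIKU_TASK_TYPES)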
Implementation:
from aegis.llm import query, query_haiku
# Direct Haiku usage
response = await query_haiku("Classify this email as spam or ham: ...")
# Auto-routed based on task_type
response = await query("Extract names from: ...", task_type="extract")
# Force Haiku preference
response = await query("Quick summary", prefer_haiku=True)
Example Use Cases:
# Email classification
await query_haiku("Classify: 'Meeting at 3pm today'", task_type="classify")
# Data extraction
await query_haiku("Extract phone numbers from: ...", task_type="extract")
# Format conversion
await query_haiku("Convert to JSON: name: John, age: 30", task_type="format")
# Quick validation
await query_haiku("Is this valid email? user@domain", task_type="validate")
Tier 2: GLM-4.7¶
Provider: Z.ai (Zhipu AI via Anthropic-compatible API)
Model ID: glm-4.7
Context Window: 128K tokens
Max Output: 4,096 tokens
Use Cases:

- 90%+ of operational work
- Code generation
- Debugging assistance
- Documentation writing
- Research synthesis
- Conversational tasks

Performance:

- Response time: ~1-2 seconds
- Throughput: ~8 requests/minute (rate limited)

Cost: Flat subscription (~$50/month, unlimited usage; no per-token charge)
Rate Limit: ~8 req/min (soft limit)
Access via Claude Agent SDK:
# Configuration in ~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "zai_key_here",
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7"
  }
}
# Usage
from claude_agent_sdk import query, ClaudeAgentOptions
async for message in query(prompt="Your task", options=ClaudeAgentOptions()):
    print(message)
Direct Usage via Aegis API:
from aegis.llm import query
# Default tier (GLM)
response = await query("Explain Docker Compose")
# Explicit model selection
response = await query("Generate code for...", model="glm-4.7")
GLM Clients:
1. GLM Agent Client (recommended):
from aegis.llm import GLMClient
client = GLMClient() # Uses Agent SDK
response = await client.complete("Prompt", system="System", temperature=0.7)
2. GLM Direct Client (fallback):
from aegis.llm import GLMDirectClient
client = GLMDirectClient() # Direct HTTP to Z.ai
response = await client.complete("Prompt")
API Key Rotation:
from aegis.llm import get_current_api_key, rotate_api_key
# Check current key
current = get_current_api_key()
# Rotate to backup key
rotate_api_key()
Fallback Strategy:

- Primary: Z.ai API key 1
- Backup: Z.ai API key 2 (automatically rotates on 429 errors; see the sketch below)
- Final: Ollama local models
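A minimal sketch of that rotation, assuming a rate-limit failure surfaces as an exception whose message carries the 429 status code (the exact exception type raised by the client is not specified here):

from aegis.llm import GLMDirectClient, query_ollama, rotate_api_key

async def glm_with_key_rotation(prompt: str) -> str:
    client = GLMDirectClient()
    try:
        return await client.complete(prompt)
    except Exception as exc:
        if "429" in str(exc):
            # Switch to the backup Z.ai key and retry once
            rotate_api_key()
            return await GLMDirectClient().complete(prompt)
        # Anything else falls through to the local tier
        return await query_ollama(prompt)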
Tier 3: Ollama (Local Models)¶
Provider: Ollama (localhost:11434)
Models: Multiple specialized models
| Model | Size | Purpose | Performance |
|---|---|---|---|
| qwen3:4b | 2.5 GB | General tasks | ~40 tok/s |
| qwen3:30b | 18 GB | Complex reasoning | ~8 tok/s |
| qwen3-vl:4b | 3.3 GB | Vision (multimodal) | ~15 tok/s |
| qwen3-vl:30b | 19 GB | High-quality vision | ~4 tok/s |
| deepseek-r1:32b | 19 GB | Chain-of-thought reasoning | ~6 tok/s |
| moondream:latest | 1.7 GB | Fast vision | ~20 tok/s |
| nomic-embed-text | 274 MB | Embeddings (768d) | ~1000 vecs/s |
| tinyllama | 637 MB | Edge inference | ~60 tok/s |
Use Cases:

- Offline operation (no internet)
- Vision tasks (image analysis)
- Reasoning tasks (chain-of-thought)
- Sensitive operations (no external API)
- Fallback when cloud APIs fail

Cost: Free (hardware cost only)
Rate Limit: None (local)
Implementation:
from aegis.llm import query_ollama
# General task
response = await query_ollama("Explain Docker", model="qwen3:4b")
# Vision task
response = await query_ollama(
    "What's in this image?",
    model="qwen3-vl:4b",
    images=["/path/to/image.png"]
)
# Reasoning task
response = await query_ollama(
    "Solve this step by step: ...",
    model="deepseek-r1:32b"
)
Embedding Generation:
from aegis.llm.ollama import get_embedding
# Generate embedding
embedding = await get_embedding("Text to embed", model="nomic-embed-text")
# Returns: list[float] of length 768
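Since `nomic-embed-text` returns plain 768-dimensional float vectors, they can be compared with ordinary cosine similarity; a minimal sketch using only the `get_embedding` call above:

import math

from aegis.llm.ollama import get_embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vec_a = await get_embedding("Docker Compose basics", model="nomic-embed-text")
vec_b = await get_embedding("How to use docker-compose", model="nomic-embed-text")
print(cosine_similarity(vec_a, vec_b))  # Higher value = more similar texts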
Model Selection by Task:
# Automatic model selection
from aegis.llm.router import LLMRouter
router = LLMRouter()
# Vision task → qwen3-vl:4b
tier = router.select_tier(task_type="vision")
# Reasoning task → deepseek-r1:32b
tier = router.select_tier(task_type="reasoning")
# General task → qwen3:4b
tier = router.select_tier(task_type="operational")
LLM Router Architecture¶
Implementation: /home/agent/projects/aegis-core/aegis/llm/router.py
Selection Logic¶
from aegis.llm.router import LLMRouter, Tier
router = LLMRouter(allow_paid=False, allow_haiku=True)
# Tier selection based on task type
tier = router.select_tier(
    task_type="operational",   # Task type
    require_offline=False,     # Force local models
    prefer_fast=False          # Prefer speed over capability
)
# Execute with selected tier
response, used_tier, cost = await router.complete(
    prompt="Your prompt",
    system="System instructions",
    tier=tier,                 # Optional override
    temperature=0.7,
    max_tokens=2048
)
Task Type Mapping¶
| Task Type | Selected Tier | Reasoning |
|---|---|---|
| strategic | Opus | Critical decisions |
| research | Haiku → GLM | Fast retrieval → synthesis |
| vision | Local (qwen3-vl) | No cloud vision API |
| reasoning | Local (deepseek-r1) | CoT rarely needed; local model suffices |
| classify | Haiku | Fast, cheap, accurate |
| extract | Haiku | Pattern matching |
| parse | Haiku | Structured output |
| summarize | Haiku | Quick compression |
| validate | Haiku | Boolean decisions |
| operational | GLM | Default tier |
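The mapping above can be condensed into a simple lookup; a hypothetical illustration only (the real dispatch lives in router.py, and the tier labels here are plain strings rather than the router's actual Tier values):

# Illustrative condensation of the task-type mapping table
TASK_TYPE_TO_TIER = {
    "strategic": "opus",
    "research": "haiku",                  # fast retrieval, then GLM for synthesis
    "vision": "ollama:qwen3-vl:4b",
    "reasoning": "ollama:deepseek-r1:32b",
    "classify": "haiku",
    "extract": "haiku",
    "parse": "haiku",
    "summarize": "haiku",
    "validate": "haiku",
    "operational": "glm",                 # default tier
}

def pick_tier(task_type: str) -> str:
    return TASK_TYPE_TO_TIER.get(task_type, "glm")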
Fallback Cascade¶
Automatic Fallback:
import structlog

from aegis.llm import query, query_haiku, query_ollama

logger = structlog.get_logger()

# Tier selection with fallback
async def query_with_fallback(prompt: str) -> str:
    try:
        # Try Haiku first for fast tasks
        return await query_haiku(prompt)
    except Exception as e:
        logger.warning("Haiku failed, falling back to GLM", error=str(e))
        try:
            # Fall back to GLM
            return await query(prompt, prefer_haiku=False)
        except Exception as e:
            logger.warning("GLM failed, falling back to Ollama", error=str(e))
            try:
                # Final fallback to local
                return await query_ollama(prompt)
            except Exception as e:
                logger.error("All LLM tiers failed", error=str(e))
                raise RuntimeError("No LLM tier available")
Built-in Fallback (in aegis.llm.query):
1. Tier 1.5 (Haiku) → Tier 2 (GLM) → Tier 3 (Ollama)
2. Tier 2 (GLM) → Tier 1.5 (Haiku) → Tier 3 (Ollama)
3. Tier 3 (Ollama) → Fail (no fallback)
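A hypothetical sketch of how those cascade orders could be expressed (illustrative only; the actual logic sits inside `aegis.llm.query`):

def fallback_order(prefer_haiku: bool = False, offline_only: bool = False) -> list[str]:
    # Mirrors the three cascades listed above
    if offline_only:
        return ["ollama"]                  # Tier 3 only, no further fallback
    if prefer_haiku:
        return ["haiku", "glm", "ollama"]  # Cascade 1
    return ["glm", "haiku", "ollama"]      # Cascade 2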
Model Selection Heuristics¶
Implementation: /home/agent/projects/aegis-core/aegis/llm/model_selector.py
Cognitive Hierarchy Table¶
COGNITIVE_HIERARCHY = {
    "strategic": {
        "tier": 1,
        "model": "claude-opus-4-5",
        "description": "Strategic decisions, architecture reviews",
        "cost_tier": "high"
    },
    "fast": {
        "tier": 1.5,
        "model": "claude-haiku-4-5",
        "description": "High-frequency, cost-sensitive operations",
        "cost_tier": "low"
    },
    "operational": {
        "tier": 2,
        "model": "glm-4.7",
        "description": "90% of routine work",
        "cost_tier": "free"
    },
    "vision": {
        "tier": 3,
        "model": "qwen3-vl:4b",
        "description": "Image understanding, visual tasks",
        "cost_tier": "free"
    },
    "reasoning": {
        "tier": 3,
        "model": "deepseek-r1:32b",
        "description": "Chain-of-thought, step-by-step reasoning",
        "cost_tier": "free"
    }
}
Programmatic Selection¶
from aegis.llm.model_selector import select_model_for_task, TaskType, CostTier
# Select model for task
recommendation = select_model_for_task(
    task_type=TaskType.CLASSIFICATION,
    cost_tier=CostTier.LOW,
    context_length=2000,
    require_vision=False
)
print(recommendation.model_id) # "claude-haiku-4-5"
print(recommendation.tier) # 1.5
print(recommendation.reasoning) # "Fast, cost-effective for classification"
print(recommendation.estimated_cost) # 0.002
Recommendation Table¶
from aegis.llm.model_selector import format_recommendation_table
# Generate markdown table
table = format_recommendation_table([
    TaskType.CLASSIFICATION,
    TaskType.EXTRACTION,
    TaskType.CODE_GENERATION,
    TaskType.REASONING
])
print(table)
Output:
| Task Type | Model | Tier | Cost | Reasoning |
|-----------|-------|------|------|-----------|
| Classification | claude-haiku-4-5 | 1.5 | Low | Fast, accurate |
| Extraction | claude-haiku-4-5 | 1.5 | Low | Pattern matching |
| Code Generation | glm-4.7 | 2 | Free | Sufficient quality |
| Reasoning | deepseek-r1:32b | 3 | Free | CoT capabilities |
Response Caching¶
Purpose: Reduce API costs and latency by memoizing responses
Implementation: /home/agent/projects/aegis-core/aegis/memory/cache.py
Cache Architecture¶
Storage: PostgreSQL llm_cache table
Key: SHA256 hash of (prompt + model + system + temperature)
TTL: 24 hours (default, configurable)
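A minimal sketch of how such a key can be derived, assuming a simple string concatenation before hashing (the exact serialization used by aegis.memory.cache may differ):

import hashlib

def llm_cache_key(prompt: str, model: str, system: str = "", temperature: float = 0.7) -> str:
    # SHA256 over the concatenated request parameters, per the description above
    raw = f"{prompt}|{model}|{system}|{temperature}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()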
Usage¶
Automatic Caching:
from aegis.llm import query
# Automatically cached
response = await query(
    "What is Docker?",
    use_cache=True,   # Default
    cache_ttl=86400   # 24 hours
)
# Cache hit on second call
response = await query("What is Docker?") # Instant response
Manual Cache Control:
from aegis.memory.cache import cache_llm_response, get_cached_llm_response
# Cache response
cache_llm_response(
    prompt="What is Docker?",
    response="Docker is...",
    model="glm-4.7",
    ttl=3600  # 1 hour
)
# Retrieve cached response
cached = get_cached_llm_response(
    prompt="What is Docker?",
    model="glm-4.7"
)
Decorator Pattern:
from aegis.llm import query
from aegis.memory.cache import cached_query

@cached_query(ttl=3600)
async def expensive_llm_call(prompt: str) -> str:
    return await query(prompt)
# First call: hits API
result = await expensive_llm_call("Complex question")
# Second call: cached
result = await expensive_llm_call("Complex question") # Instant
Cache Statistics¶
Metrics:

- Cache hit rate: ~40-60%
- Average latency reduction: 95% (2000ms → 100ms)
- API cost savings: ~$5-10/month
Cache Invalidation:
# Clear specific cache entry
db.execute("DELETE FROM llm_cache WHERE prompt_hash = %s", (hash,))
# Clear old entries (daily cron)
db.execute("DELETE FROM llm_cache WHERE created_at < NOW() - INTERVAL '7 days'")
Token Usage Tracking¶
Purpose: Monitor and bill token usage per customer
Implementation: /home/agent/projects/aegis-core/aegis/monetization/metering.py
Usage Recording¶
from aegis.monetization.metering import record_llm_tokens
# Automatically recorded in query()
response = await query(
    "Generate code",
    customer_id="cust-123",  # For billing
    track_usage=True         # Default
)
# Manual recording
record_llm_tokens(
    customer_id="cust-123",
    tokens=1500,
    model="glm-4.7",
    prompt_tokens=1000,
    completion_tokens=500
)
Token Counting¶
Haiku/Opus: Actual tokens from API response
GLM: Estimated ((prompt length + completion length) / 4)
Ollama: Actual tokens from API response
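The GLM figure is a rough character-count heuristic (roughly four characters per token); a minimal sketch of that estimate, with the helper name chosen here only for illustration:

def estimate_glm_tokens(prompt: str, completion: str) -> int:
    # ~4 characters per token, applied to the prompt plus the completion text
    return (len(prompt) + len(completion)) // 4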
Storage: api_usage_events table
Cost Analysis¶
Monthly Cost Breakdown¶
| Tier | Provider | Model | Volume | Cost/Month |
|---|---|---|---|---|
| 1 | Anthropic | Opus 4.5 | 10K tokens | $1.50 |
| 1.5 | Anthropic | Haiku 4.5 | 500K tokens | $3.00 |
| 2 | Z.ai | GLM-4.7 | Unlimited | $50.00 |
| 3 | Ollama | Local | Unlimited | $0.00 |
| Total | | | | ~$55/month |
Cost Optimization:

- 90% of requests use GLM (flat fee)
- 8% use Haiku (low cost)
- 2% use Opus (high cost, sparingly)
- 0% use Ollama (fallback only)

Cost per Request:

- Average: $0.0001 (mostly GLM flat fee amortized)
- Haiku request: $0.0002
- Opus request: $0.05
- Ollama request: $0.00
ROI Analysis¶
Without Routing (All Opus):

- 10,000 requests/month
- Average 1,000 tokens each
- Cost: 10M tokens × $15/MTok = $150/month

With Routing (Current):

- 10,000 requests/month
- 90% GLM (flat $50), 8% Haiku ($3), 2% Opus ($3)
- Cost: $56/month
- Savings: $94/month (63%)
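A quick arithmetic check of those figures (input-token pricing only, matching the assumptions above):

requests_per_month = 10_000
tokens_per_request = 1_000

# All-Opus baseline: 10M tokens at $15/MTok
all_opus_cost = requests_per_month * tokens_per_request / 1_000_000 * 15  # $150.00

# Routed: GLM flat fee + Haiku slice + Opus slice (figures from the list above)
routed_cost = 50 + 3 + 3                                                  # $56.00

savings = all_opus_cost - routed_cost                                     # $94.00
savings_pct = savings / all_opus_cost * 100                               # ~63%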
Monitoring and Observability¶
LLM Metrics¶
Collected Metrics:

- Requests per tier (count)
- Response latency (ms, p50/p95/p99)
- Token usage (prompt + completion)
- Error rate by tier (%)
- Cache hit rate (%)
- Fallback frequency (%)
Storage: llm_metrics table (future)
Dashboard: /admin/llm-metrics (future)
Logging¶
Structured Logs (via structlog):
{
  "event": "llm_query",
  "tier": "glm",
  "model": "glm-4.7",
  "prompt_tokens": 150,
  "completion_tokens": 300,
  "latency_ms": 1250,
  "cache_hit": false,
  "customer_id": "aegis-internal"
}
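A minimal sketch of emitting that event with structlog (field values are the example values above):

import structlog

logger = structlog.get_logger()

logger.info(
    "llm_query",
    tier="glm",
    model="glm-4.7",
    prompt_tokens=150,
    completion_tokens=300,
    latency_ms=1250,
    cache_hit=False,
    customer_id="aegis-internal",
)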
Log Levels:

- DEBUG: All LLM calls with full context
- INFO: Tier selection, fallback events
- WARNING: API failures, quota warnings
- ERROR: All tiers failed
Future LLM Routing Improvements¶
Q1 2026: Intelligence¶
- A/B testing between models
- Quality scoring (compare outputs)
- Automatic tier promotion/demotion
- Cost-quality Pareto frontier
Q2 2026: Advanced Routing¶
- Multi-model ensembles (combine outputs)
- Speculative execution (run multiple tiers, use fastest)
- Adaptive routing (learn from feedback)
- User preference learning
Q3 2026: Scale¶
- Batch inference (group similar requests)
- Request deduplication (merge identical queries)
- Priority queuing (urgent vs. background)
- Regional routing (use closest API endpoint)
Q4 2026: Optimization¶
- Model fine-tuning (GLM on Aegis data)
- Distillation (compress Opus → Haiku knowledge)
- Prompt optimization (genetic algorithms)
- Token compression (summarize long contexts)