LLM Routing Architecture¶
Overview¶
Aegis implements a cognitive hierarchy for LLM routing, selecting the appropriate model tier based on task complexity, cost constraints, and availability. This ensures optimal resource utilization while maintaining quality.
Cognitive Hierarchy¶
graph TB
Task[Incoming Task] --> Router{LLM Router}
Router -->|Strategic| Opus[Tier 1: Claude Opus 4.5<br/>Strategic decisions]
Router -->|Fast/Simple| Haiku[Tier 1.5: Claude Haiku 4.5<br/>High-frequency ops]
Router -->|Operational| GLM[Tier 2: GLM-4.7<br/>90% of work]
Router -->|Offline/Vision| Local[Tier 3: Ollama<br/>Local models]
Opus -->|Fallback| Haiku
Haiku -->|Fallback| GLM
GLM -->|Fallback| Local
style Opus fill:#e74c3c,color:#fff
style Haiku fill:#3498db,color:#fff
style GLM fill:#2ecc71,color:#fff
style Local fill:#95a5a6,color:#fff
Model Tiers¶
Tier 1: Claude Opus 4.5¶
Provider: Anthropic (via Claude Code)
Model ID: claude-opus-4-5-20251101
Context Window: 200K tokens
Use Cases:

- Strategic decision-making
- Architecture reviews
- Complex problem-solving
- High-stakes debugging

Cost: High ($15/MTok input, $75/MTok output)
Rate Limit: Not specified (use sparingly)
Access Pattern: Through Claude Code CLI, not programmatic API
When to Use:
# Reserved for critical decisions only
# Example: Architecture refactoring, security audits, strategic planning
Why Not for Routine Work?

- Expensive (15x the per-token price of Haiku, and far costlier per request than GLM's flat subscription)
- Rate-limited
- Overkill for most tasks
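A minimal sketch of reaching this tier programmatically through the router described later in this document, assuming `allow_paid=True` is what unlocks the paid Opus tier (parameter names follow the Selection Logic section below):

from aegis.llm.router import LLMRouter

# Opt in to paid tiers so strategic tasks can reach Opus
router = LLMRouter(allow_paid=True, allow_haiku=True)
tier = router.select_tier(task_type="strategic")  # maps to Tier 1 (Opus)
response, used_tier, cost = await router.complete(
    prompt="Review the proposed service split for the ingestion pipeline",
    system="You are reviewing architecture trade-offs.",
    tier=tier
)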
Tier 1.5: Claude Haiku 4.5¶
Provider: Anthropic (via Claude Agent SDK)
Model ID: claude-haiku-4-5-20251001
Context Window: 200K tokens
Max Output: 8,192 tokens
Use Cases:

- Classification and tagging
- Data extraction and parsing
- Quick summaries
- Validation and verification
- Formatting and conversion

Performance:

- 3x faster than Sonnet
- 1/3 cost of Sonnet
- Response time: ~500ms

Cost: Low ($1/MTok input, $5/MTok output)
Rate Limit: High (sustained throughput)
Auto-Routing: Tasks matching these types automatically use Haiku
HAIKU_TASK_TYPES = [
    "classify", "extract", "parse", "summarize", "validate",
    "tag", "format", "convert", "check", "simple"
]
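As an illustration of how this auto-routing check could work (the helper name `_prefers_haiku` is hypothetical; the actual logic lives inside `aegis.llm`):

def _prefers_haiku(task_type: str | None, prefer_haiku: bool = False) -> bool:
    # Route to Haiku when explicitly requested, or when the task type
    # appears in the Haiku-friendly list above
    return prefer_haiku or (task_type is not None and task_type in HAIKU_TASK_TYPES)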
Implementation:
from aegis.llm import query, query_haiku
# Direct Haiku usage
response = await query_haiku("Classify this email as spam or ham: ...")
# Auto-routed based on task_type
response = await query("Extract names from: ...", task_type="extract")
# Force Haiku preference
response = await query("Quick summary", prefer_haiku=True)
Example Use Cases:
# Email classification
await query_haiku("Classify: 'Meeting at 3pm today'", task_type="classify")
# Data extraction
await query_haiku("Extract phone numbers from: ...", task_type="extract")
# Format conversion
await query_haiku("Convert to JSON: name: John, age: 30", task_type="format")
# Quick validation
await query_haiku("Is this valid email? user@domain", task_type="validate")
Tier 2: GLM-4.7¶
Provider: Z.ai (Zhipu AI via Anthropic-compatible API)
Model ID: glm-4.7
Context Window: 128K tokens
Max Output: 4,096 tokens
Use Cases:

- 90%+ of operational work
- Code generation
- Debugging assistance
- Documentation writing
- Research synthesis
- Conversational tasks

Performance:

- Response time: ~1-2 seconds
- Throughput: ~8 requests/minute (rate limited)

Cost: Flat subscription (~$50/month, unlimited usage; no per-token charge)
Rate Limit: ~8 req/min (soft limit)
Access via Claude Agent SDK:
# Configuration in ~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "zai_key_here",
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7"
  }
}
# Usage
from claude_agent_sdk import query, ClaudeAgentOptions
async for message in query(prompt="Your task", options=ClaudeAgentOptions()):
    print(message)
Direct Usage via Aegis API:
from aegis.llm import query
# Default tier (GLM)
response = await query("Explain Docker Compose")
# Explicit model selection
response = await query("Generate code for...", model="glm-4.7")
GLM Clients:
1. GLM Agent Client (recommended):
from aegis.llm import GLMClient
client = GLMClient() # Uses Agent SDK
response = await client.complete("Prompt", system="System", temperature=0.7)
2. GLM Direct Client (fallback):
from aegis.llm import GLMDirectClient
client = GLMDirectClient() # Direct HTTP to Z.ai
response = await client.complete("Prompt")
API Key Rotation:
from aegis.llm import get_current_api_key, rotate_api_key
# Check current key
current = get_current_api_key()
# Rotate to backup key
rotate_api_key()
Fallback Strategy:

- Primary: Z.ai API key 1
- Backup: Z.ai API key 2 (automatically rotates on 429 errors; see the sketch below)
- Final: Ollama local models
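A minimal sketch of that rotation, assuming a rate-limit failure surfaces as an exception whose message carries the 429 status code (the exact exception type raised by the client is not specified here):

from aegis.llm import GLMDirectClient, query_ollama, rotate_api_key

async def glm_with_key_rotation(prompt: str) -> str:
    client = GLMDirectClient()
    try:
        return await client.complete(prompt)
    except Exception as exc:
        if "429" in str(exc):
            # Switch to the backup Z.ai key and retry once
            rotate_api_key()
            return await GLMDirectClient().complete(prompt)
        # Anything else falls through to the local tier
        return await query_ollama(prompt)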
Tier 3: Ollama (Local Models)¶
Provider: Ollama (localhost:11434)
Models: Multiple specialized models
| Model | Size | Purpose | Performance |
|---|---|---|---|
| qwen3:4b | 2.5 GB | General tasks | ~40 tok/s |
| qwen3:30b | 18 GB | Complex reasoning | ~8 tok/s |
| qwen3-vl:4b | 3.3 GB | Vision (multimodal) | ~15 tok/s |
| qwen3-vl:30b | 19 GB | High-quality vision | ~4 tok/s |
| deepseek-r1:32b | 19 GB | Chain-of-thought reasoning | ~6 tok/s |
| moondream:latest | 1.7 GB | Fast vision | ~20 tok/s |
| nomic-embed-text | 274 MB | Embeddings (768d) | ~1000 vecs/s |
| tinyllama | 637 MB | Edge inference | ~60 tok/s |
Use Cases:

- Offline operation (no internet)
- Vision tasks (image analysis)
- Reasoning tasks (chain-of-thought)
- Sensitive operations (no external API)
- Fallback when cloud APIs fail

Cost: Free (hardware cost only)
Rate Limit: None (local)
Implementation:
from aegis.llm import query_ollama
# General task
response = await query_ollama("Explain Docker", model="qwen3:4b")
# Vision task
response = await query_ollama(
    "What's in this image?",
    model="qwen3-vl:4b",
    images=["/path/to/image.png"]
)
# Reasoning task
response = await query_ollama(
    "Solve this step by step: ...",
    model="deepseek-r1:32b"
)
Embedding Generation:
from aegis.llm.ollama import get_embedding
# Generate embedding
embedding = await get_embedding("Text to embed", model="nomic-embed-text")
# Returns: list[float] of length 768
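Since `nomic-embed-text` returns plain 768-dimensional float vectors, they can be compared with ordinary cosine similarity; a minimal sketch using only the `get_embedding` call above:

import math

from aegis.llm.ollama import get_embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vec_a = await get_embedding("Docker Compose basics", model="nomic-embed-text")
vec_b = await get_embedding("How to use docker-compose", model="nomic-embed-text")
print(cosine_similarity(vec_a, vec_b))  # Higher value = more similar texts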
Model Selection by Task:
# Automatic model selection
from aegis.llm.router import LLMRouter
router = LLMRouter()
# Vision task → qwen3-vl:4b
tier = router.select_tier(task_type="vision")
# Reasoning task → deepseek-r1:32b
tier = router.select_tier(task_type="reasoning")
# General task → qwen3:4b
tier = router.select_tier(task_type="operational")
LLM Router Architecture¶
Implementation: /home/agent/projects/aegis-core/aegis/llm/router.py
Selection Logic¶
from aegis.llm.router import LLMRouter, Tier
router = LLMRouter(allow_paid=False, allow_haiku=True)
# Tier selection based on task type
tier = router.select_tier(
    task_type="operational",   # Task type
    require_offline=False,     # Force local models
    prefer_fast=False          # Prefer speed over capability
)
# Execute with selected tier
response, used_tier, cost = await router.complete(
    prompt="Your prompt",
    system="System instructions",
    tier=tier,                 # Optional override
    temperature=0.7,
    max_tokens=2048
)
Task Type Mapping¶
| Task Type | Selected Tier | Reasoning |
|---|---|---|
| strategic | Opus | Critical decisions |
| research | Haiku → GLM | Fast retrieval → synthesis |
| vision | Local (qwen3-vl) | No cloud vision API |
| reasoning | Local (deepseek-r1) | CoT rarely needed; local model suffices |
| classify | Haiku | Fast, cheap, accurate |
| extract | Haiku | Pattern matching |
| parse | Haiku | Structured output |
| summarize | Haiku | Quick compression |
| validate | Haiku | Boolean decisions |
| operational | GLM | Default tier |
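The mapping above can be condensed into a simple lookup; a hypothetical illustration only (the real dispatch lives in router.py, and the tier labels here are plain strings rather than the router's actual Tier values):

# Illustrative condensation of the task-type mapping table
TASK_TYPE_TO_TIER = {
    "strategic": "opus",
    "research": "haiku",                  # fast retrieval, then GLM for synthesis
    "vision": "ollama:qwen3-vl:4b",
    "reasoning": "ollama:deepseek-r1:32b",
    "classify": "haiku",
    "extract": "haiku",
    "parse": "haiku",
    "summarize": "haiku",
    "validate": "haiku",
    "operational": "glm",                 # default tier
}

def pick_tier(task_type: str) -> str:
    return TASK_TYPE_TO_TIER.get(task_type, "glm")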
Fallback Cascade¶
Automatic Fallback:
import structlog

from aegis.llm import query, query_haiku, query_ollama

logger = structlog.get_logger()

# Tier selection with fallback
async def query_with_fallback(prompt: str) -> str:
    try:
        # Try Haiku first for fast tasks
        return await query_haiku(prompt)
    except Exception as e:
        logger.warning("Haiku failed, falling back to GLM", error=str(e))
        try:
            # Fall back to GLM
            return await query(prompt, prefer_haiku=False)
        except Exception as e:
            logger.warning("GLM failed, falling back to Ollama", error=str(e))
            try:
                # Final fallback to local
                return await query_ollama(prompt)
            except Exception as e:
                logger.error("All LLM tiers failed", error=str(e))
                raise RuntimeError("No LLM tier available")
Built-in Fallback (in aegis.llm.query):
1. Tier 1.5 (Haiku) → Tier 2 (GLM) → Tier 3 (Ollama)
2. Tier 2 (GLM) → Tier 1.5 (Haiku) → Tier 3 (Ollama)
3. Tier 3 (Ollama) → Fail (no fallback)
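A hypothetical sketch of how those cascade orders could be expressed (illustrative only; the actual logic sits inside `aegis.llm.query`):

def fallback_order(prefer_haiku: bool = False, offline_only: bool = False) -> list[str]:
    # Mirrors the three cascades listed above
    if offline_only:
        return ["ollama"]                  # Tier 3 only, no further fallback
    if prefer_haiku:
        return ["haiku", "glm", "ollama"]  # Cascade 1
    return ["glm", "haiku", "ollama"]      # Cascade 2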
Model Selection Heuristics¶
Implementation: /home/agent/projects/aegis-core/aegis/llm/model_selector.py
Cognitive Hierarchy Table¶
COGNITIVE_HIERARCHY = {
    "strategic": {
        "tier": 1,
        "model": "claude-opus-4-5",
        "description": "Strategic decisions, architecture reviews",
        "cost_tier": "high"
    },
    "fast": {
        "tier": 1.5,
        "model": "claude-haiku-4-5",
        "description": "High-frequency, cost-sensitive operations",
        "cost_tier": "low"
    },
    "operational": {
        "tier": 2,
        "model": "glm-4.7",
        "description": "90% of routine work",
        "cost_tier": "free"
    },
    "vision": {
        "tier": 3,
        "model": "qwen3-vl:4b",
        "description": "Image understanding, visual tasks",
        "cost_tier": "free"
    },
    "reasoning": {
        "tier": 3,
        "model": "deepseek-r1:32b",
        "description": "Chain-of-thought, step-by-step reasoning",
        "cost_tier": "free"
    }
}
Programmatic Selection¶
from aegis.llm.model_selector import select_model_for_task, TaskType, CostTier
# Select model for task
recommendation = select_model_for_task(
    task_type=TaskType.CLASSIFICATION,
    cost_tier=CostTier.LOW,
    context_length=2000,
    require_vision=False
)
print(recommendation.model_id) # "claude-haiku-4-5"
print(recommendation.tier) # 1.5
print(recommendation.reasoning) # "Fast, cost-effective for classification"
print(recommendation.estimated_cost) # 0.002
Recommendation Table¶
from aegis.llm.model_selector import format_recommendation_table
# Generate markdown table
table = format_recommendation_table([
    TaskType.CLASSIFICATION,
    TaskType.EXTRACTION,
    TaskType.CODE_GENERATION,
    TaskType.REASONING
])
print(table)
Output:
| Task Type | Model | Tier | Cost | Reasoning |
|-----------|-------|------|------|-----------|
| Classification | claude-haiku-4-5 | 1.5 | Low | Fast, accurate |
| Extraction | claude-haiku-4-5 | 1.5 | Low | Pattern matching |
| Code Generation | glm-4.7 | 2 | Free | Sufficient quality |
| Reasoning | deepseek-r1:32b | 3 | Free | CoT capabilities |
Response Caching¶
Purpose: Reduce API costs and latency by memoizing responses
Implementation: /home/agent/projects/aegis-core/aegis/memory/cache.py
Cache Architecture¶
Storage: PostgreSQL llm_cache table
Key: SHA256 hash of (prompt + model + system + temperature)
TTL: 24 hours (default, configurable)
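A minimal sketch of how such a key can be derived, assuming a simple string concatenation before hashing (the exact serialization used by aegis.memory.cache may differ):

import hashlib

def llm_cache_key(prompt: str, model: str, system: str = "", temperature: float = 0.7) -> str:
    # SHA256 over the concatenated request parameters, per the description above
    raw = f"{prompt}|{model}|{system}|{temperature}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()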
Usage¶
Automatic Caching:
from aegis.llm import query
# Automatically cached
response = await query(
    "What is Docker?",
    use_cache=True,   # Default
    cache_ttl=86400   # 24 hours
)
# Cache hit on second call
response = await query("What is Docker?") # Instant response
Manual Cache Control:
from aegis.memory.cache import cache_llm_response, get_cached_llm_response
# Cache response
cache_llm_response(
    prompt="What is Docker?",
    response="Docker is...",
    model="glm-4.7",
    ttl=3600  # 1 hour
)
# Retrieve cached response
cached = get_cached_llm_response(
    prompt="What is Docker?",
    model="glm-4.7"
)
Decorator Pattern:
from aegis.llm import query
from aegis.memory.cache import cached_query

@cached_query(ttl=3600)
async def expensive_llm_call(prompt: str) -> str:
    return await query(prompt)
# First call: hits API
result = await expensive_llm_call("Complex question")
# Second call: cached
result = await expensive_llm_call("Complex question") # Instant
Cache Statistics¶
Metrics:

- Cache hit rate: ~40-60%
- Average latency reduction: 95% (2000ms → 100ms)
- API cost savings: ~$5-10/month
Cache Invalidation:
# Clear specific cache entry
db.execute("DELETE FROM llm_cache WHERE prompt_hash = %s", (hash,))
# Clear old entries (daily cron)
db.execute("DELETE FROM llm_cache WHERE created_at < NOW() - INTERVAL '7 days'")
Token Usage Tracking¶
Purpose: Monitor and bill token usage per customer
Implementation: /home/agent/projects/aegis-core/aegis/monetization/metering.py
Usage Recording¶
from aegis.monetization.metering import record_llm_tokens
# Automatically recorded in query()
response = await query(
    "Generate code",
    customer_id="cust-123",  # For billing
    track_usage=True         # Default
)
# Manual recording
record_llm_tokens(
    customer_id="cust-123",
    tokens=1500,
    model="glm-4.7",
    prompt_tokens=1000,
    completion_tokens=500
)
Token Counting¶
Haiku/Opus: Actual tokens from API response
GLM: Estimated ((prompt length + completion length) / 4)
Ollama: Actual tokens from API response
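The GLM figure is a rough character-count heuristic (roughly four characters per token); a minimal sketch of that estimate, with the helper name chosen here only for illustration:

def estimate_glm_tokens(prompt: str, completion: str) -> int:
    # ~4 characters per token, applied to the prompt plus the completion text
    return (len(prompt) + len(completion)) // 4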
Storage: api_usage_events table
Cost Analysis¶
Monthly Cost Breakdown¶
| Tier | Provider | Model | Volume | Cost/Month |
|---|---|---|---|---|
| 1 | Anthropic | Opus 4.5 | 10K tokens | $1.50 |
| 1.5 | Anthropic | Haiku 4.5 | 500K tokens | $3.00 |
| 2 | Z.ai | GLM-4.7 | Unlimited | $50.00 |
| 3 | Ollama | Local | Unlimited | $0.00 |
| Total | | | | ~$55/month |
Cost Optimization:

- 90% of requests use GLM (flat fee)
- 8% use Haiku (low cost)
- 2% use Opus (high cost, sparingly)
- 0% use Ollama (fallback only)

Cost per Request:

- Average: $0.0001 (mostly GLM flat fee amortized)
- Haiku request: $0.0002
- Opus request: $0.05
- Ollama request: $0.00
ROI Analysis¶
Without Routing (All Opus):

- 10,000 requests/month
- Average 1,000 tokens each
- Cost: 10M tokens × $15/MTok = $150/month

With Routing (Current):

- 10,000 requests/month
- 90% GLM (flat $50), 8% Haiku ($3), 2% Opus ($3)
- Cost: $56/month
- Savings: $94/month (63%)
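A quick arithmetic check of those figures (input-token pricing only, matching the assumptions above):

requests_per_month = 10_000
tokens_per_request = 1_000

# All-Opus baseline: 10M tokens at $15/MTok
all_opus_cost = requests_per_month * tokens_per_request / 1_000_000 * 15  # $150.00

# Routed: GLM flat fee + Haiku slice + Opus slice (figures from the list above)
routed_cost = 50 + 3 + 3                                                  # $56.00

savings = all_opus_cost - routed_cost                                     # $94.00
savings_pct = savings / all_opus_cost * 100                               # ~63%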
Monitoring and Observability¶
LLM Metrics¶
Collected Metrics:

- Requests per tier (count)
- Response latency (ms, p50/p95/p99)
- Token usage (prompt + completion)
- Error rate by tier (%)
- Cache hit rate (%)
- Fallback frequency (%)
Storage: llm_metrics table (future)
Dashboard: /admin/llm-metrics (future)
Logging¶
Structured Logs (via structlog):
{
  "event": "llm_query",
  "tier": "glm",
  "model": "glm-4.7",
  "prompt_tokens": 150,
  "completion_tokens": 300,
  "latency_ms": 1250,
  "cache_hit": false,
  "customer_id": "aegis-internal"
}
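A minimal sketch of emitting that event with structlog (field values are the example values above):

import structlog

logger = structlog.get_logger()

logger.info(
    "llm_query",
    tier="glm",
    model="glm-4.7",
    prompt_tokens=150,
    completion_tokens=300,
    latency_ms=1250,
    cache_hit=False,
    customer_id="aegis-internal",
)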
Log Levels:

- DEBUG: All LLM calls with full context
- INFO: Tier selection, fallback events
- WARNING: API failures, quota warnings
- ERROR: All tiers failed
Future LLM Routing Improvements¶
Q1 2026: Intelligence¶
- A/B testing between models
- Quality scoring (compare outputs)
- Automatic tier promotion/demotion
- Cost-quality Pareto frontier
Q2 2026: Advanced Routing¶
- Multi-model ensembles (combine outputs)
- Speculative execution (run multiple tiers, use fastest)
- Adaptive routing (learn from feedback)
- User preference learning
Q3 2026: Scale¶
- Batch inference (group similar requests)
- Request deduplication (merge identical queries)
- Priority queuing (urgent vs. background)
- Regional routing (use closest API endpoint)
Q4 2026: Optimization¶
- Model fine-tuning (GLM on Aegis data)
- Distillation (compress Opus → Haiku knowledge)
- Prompt optimization (genetic algorithms)
- Token compression (summarize long contexts)