LLM Routing Architecture

Overview

Aegis implements a cognitive hierarchy for LLM routing, selecting a model tier based on task complexity, cost constraints, and availability, so that inexpensive or local models handle routine work and the expensive models are reserved for tasks that need them.

Cognitive Hierarchy

graph TB
    Task[Incoming Task] --> Router{LLM Router}

    Router -->|Strategic| Opus[Tier 1: Claude Opus 4.5<br/>Strategic decisions]
    Router -->|Fast/Simple| Haiku[Tier 1.5: Claude Haiku 4.5<br/>High-frequency ops]
    Router -->|Operational| GLM[Tier 2: GLM-4.7<br/>90% of work]
    Router -->|Offline/Vision| Local[Tier 3: Ollama<br/>Local models]

    Opus -->|Fallback| Haiku
    Haiku -->|Fallback| GLM
    GLM -->|Fallback| Local

    style Opus fill:#e74c3c,color:#fff
    style Haiku fill:#3498db,color:#fff
    style GLM fill:#2ecc71,color:#fff
    style Local fill:#95a5a6,color:#fff

Model Tiers

Tier 1: Claude Opus 4.5

Provider: Anthropic (via Claude Code)
Model ID: claude-opus-4-5-20251101
Context Window: 200K tokens

Use Cases:
- Strategic decision-making
- Architecture reviews
- Complex problem-solving
- High-stakes debugging

Cost: High ($15/MTok input, $75/MTok output)
Rate Limit: Not specified (use sparingly)

Access Pattern: Through Claude Code CLI, not programmatic API

When to Use:

# Reserved for critical decisions only
# Example: Architecture refactoring, security audits, strategic planning

Why Not for Routine Work?
- Expensive (15x the per-token cost of Haiku at the prices above, ~50x the cost of GLM)
- Rate-limited
- Overkill for most tasks


Tier 1.5: Claude Haiku 4.5

Provider: Anthropic (via Claude Agent SDK)
Model ID: claude-haiku-4-5-20251001
Context Window: 200K tokens
Max Output: 8,192 tokens

Use Cases:
- Classification and tagging
- Data extraction and parsing
- Quick summaries
- Validation and verification
- Formatting and conversion

Performance:
- 3x faster than Sonnet
- 1/3 the cost of Sonnet
- Response time: ~500ms

Cost: Low ($1/MTok input, $5/MTok output)
Rate Limit: High (sustained throughput)

Auto-Routing: Tasks matching these types automatically use Haiku

HAIKU_TASK_TYPES = [
    "classify", "extract", "parse", "summarize", "validate",
    "tag", "format", "convert", "check", "simple"
]

Implementation:

from aegis.llm import query_haiku

# Direct Haiku usage
response = await query_haiku("Classify this email as spam or ham: ...")

# Auto-routed based on task_type
response = await query("Extract names from: ...", task_type="extract")

# Force Haiku preference
response = await query("Quick summary", prefer_haiku=True)

Example Use Cases:

# Email classification
await query_haiku("Classify: 'Meeting at 3pm today'", task_type="classify")

# Data extraction
await query_haiku("Extract phone numbers from: ...", task_type="extract")

# Format conversion
await query_haiku("Convert to JSON: name: John, age: 30", task_type="format")

# Quick validation
await query_haiku("Is this valid email? user@domain", task_type="validate")


Tier 2: GLM-4.7

Provider: Z.ai (Zhipu AI via Anthropic-compatible API)
Model ID: glm-4.7
Context Window: 128K tokens
Max Output: 4,096 tokens

Use Cases:
- 90%+ of operational work
- Code generation
- Debugging assistance
- Documentation writing
- Research synthesis
- Conversational tasks

Performance:
- Response time: ~1-2 seconds
- Throughput: ~8 requests/minute (rate limited)

Cost: Flat fee (~$50/month subscription, unlimited usage; zero marginal cost per request)
Rate Limit: ~8 req/min (soft limit)
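
The ~8 req/min soft limit is easy to exceed in bursts. A minimal client-side throttle (a sketch only, not part of the Aegis API; the class name is illustrative) can space requests out:

import asyncio
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between calls (~8 req/min -> 7.5s interval)."""

    def __init__(self, interval: float = 7.5):
        self.interval = interval
        self._last = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # Sleep just long enough so consecutive calls are `interval` seconds apart.
        async with self._lock:
            delay = self.interval - (time.monotonic() - self._last)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

Callers would await limiter.wait() before each GLM request.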

Access via Claude Agent SDK:

# Configuration in ~/.claude/settings.json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "zai_key_here",
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-4.7"
  }
}

# Usage
from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(prompt="Your task", options=ClaudeAgentOptions()):
    print(message)

Direct Usage via Aegis API:

from aegis.llm import query

# Default tier (GLM)
response = await query("Explain Docker Compose")

# Explicit model selection
response = await query("Generate code for...", model="glm-4.7")

GLM Clients:

1. GLM Agent Client (recommended):

from aegis.llm import GLMClient

client = GLMClient()  # Uses Agent SDK
response = await client.complete("Prompt", system="System", temperature=0.7)

2. GLM Direct Client (fallback):

from aegis.llm import GLMDirectClient

client = GLMDirectClient()  # Direct HTTP to Z.ai
response = await client.complete("Prompt")

API Key Rotation:

from aegis.llm import get_current_api_key, rotate_api_key

# Check current key
current = get_current_api_key()

# Rotate to backup key
rotate_api_key()

Fallback Strategy:
- Primary: Z.ai API key 1
- Backup: Z.ai API key 2 (automatically rotates on 429 errors)
- Final: Ollama local models
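
As a rough illustration of this cascade (a sketch only; the real rotation logic lives inside the GLM clients, and checking the exception message for "429" is an assumption):

from aegis.llm import GLMDirectClient, query_ollama, rotate_api_key


async def complete_with_rotation(prompt: str) -> str:
    # Try the current Z.ai key, rotate to the backup on a rate-limit error,
    # then fall back to a local Ollama model if both keys are exhausted.
    for _ in range(2):  # primary key, then backup key
        try:
            return await GLMDirectClient().complete(prompt)
        except Exception as e:
            if "429" in str(e):
                rotate_api_key()
                continue
            raise
    return await query_ollama(prompt, model="qwen3:4b")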


Tier 3: Ollama (Local Models)

Provider: Ollama (localhost:11434)
Models: Multiple specialized models

| Model | Size | Purpose | Performance |
|-------|------|---------|-------------|
| qwen3:4b | 2.5 GB | General tasks | ~40 tok/s |
| qwen3:30b | 18 GB | Complex reasoning | ~8 tok/s |
| qwen3-vl:4b | 3.3 GB | Vision (multimodal) | ~15 tok/s |
| qwen3-vl:30b | 19 GB | High-quality vision | ~4 tok/s |
| deepseek-r1:32b | 19 GB | Chain-of-thought reasoning | ~6 tok/s |
| moondream:latest | 1.7 GB | Fast vision | ~20 tok/s |
| nomic-embed-text | 274 MB | Embeddings (768d) | ~1000 vecs/s |
| tinyllama | 637 MB | Edge inference | ~60 tok/s |

Use Cases:
- Offline operation (no internet)
- Vision tasks (image analysis)
- Reasoning tasks (chain-of-thought)
- Sensitive operations (no external API)
- Fallback when cloud APIs fail

Cost: Free (hardware cost only)
Rate Limit: None (local)

Implementation:

from aegis.llm import query_ollama

# General task
response = await query_ollama("Explain Docker", model="qwen3:4b")

# Vision task
response = await query_ollama(
    "What's in this image?",
    model="qwen3-vl:4b",
    images=["/path/to/image.png"]
)

# Reasoning task
response = await query_ollama(
    "Solve this step by step: ...",
    model="deepseek-r1:32b"
)

Embedding Generation:

from aegis.llm.ollama import get_embedding

# Generate embedding
embedding = await get_embedding("Text to embed", model="nomic-embed-text")
# Returns: list[float] of length 768
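
For example, two embeddings can be compared with plain cosine similarity (standard math, not an Aegis helper):

import math

from aegis.llm.ollama import get_embedding


async def similarity(text_a: str, text_b: str) -> float:
    # Cosine similarity between two 768-dimensional nomic-embed-text vectors.
    a = await get_embedding(text_a, model="nomic-embed-text")
    b = await get_embedding(text_b, model="nomic-embed-text")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm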

Model Selection by Task:

# Automatic model selection
from aegis.llm.router import LLMRouter

router = LLMRouter()

# Vision task → qwen3-vl:4b
tier = router.select_tier(task_type="vision")

# Reasoning task → deepseek-r1:32b
tier = router.select_tier(task_type="reasoning")

# General task → qwen3:4b
tier = router.select_tier(task_type="operational")


LLM Router Architecture

Implementation: /home/agent/projects/aegis-core/aegis/llm/router.py

Selection Logic

from aegis.llm.router import LLMRouter, Tier

router = LLMRouter(allow_paid=False, allow_haiku=True)

# Tier selection based on task type
tier = router.select_tier(
    task_type="operational",  # Task type
    require_offline=False,     # Force local models
    prefer_fast=False          # Prefer speed over capability
)

# Execute with selected tier
response, used_tier, cost = await router.complete(
    prompt="Your prompt",
    system="System instructions",
    tier=tier,  # Optional override
    temperature=0.7,
    max_tokens=2048
)

Task Type Mapping

| Task Type | Selected Tier | Reasoning |
|-----------|---------------|-----------|
| strategic | Opus | Critical decisions |
| research | Haiku → GLM | Fast retrieval → synthesis |
| vision | Local (qwen3-vl) | No cloud vision API |
| reasoning | Local (deepseek-r1) | Local CoT model is sufficient |
| classify | Haiku | Fast, cheap, accurate |
| extract | Haiku | Pattern matching |
| parse | Haiku | Structured output |
| summarize | Haiku | Quick compression |
| validate | Haiku | Boolean decisions |
| operational | GLM | Default tier |

Fallback Cascade

Automatic Fallback:

# Tier selection with fallback
import structlog

from aegis.llm import query, query_haiku, query_ollama

logger = structlog.get_logger()


async def query_with_fallback(prompt: str) -> str:
    try:
        # Try Haiku first for fast tasks
        return await query_haiku(prompt)
    except Exception as e:
        logger.warning("Haiku failed, falling back to GLM", error=str(e))

    try:
        # Fall back to GLM
        return await query(prompt, prefer_haiku=False)
    except Exception as e:
        logger.warning("GLM failed, falling back to Ollama", error=str(e))

    try:
        # Final fallback to local
        return await query_ollama(prompt)
    except Exception as e:
        logger.error("All LLM tiers failed", error=str(e))
        raise RuntimeError("No LLM tier available")

Built-in Fallback (in aegis.llm.query):
1. Tier 1.5 (Haiku) → Tier 2 (GLM) → Tier 3 (Ollama)
2. Tier 2 (GLM) → Tier 1.5 (Haiku) → Tier 3 (Ollama)
3. Tier 3 (Ollama) → Fail (no fallback)


Model Selection Heuristics

Implementation: /home/agent/projects/aegis-core/aegis/llm/model_selector.py

Cognitive Hierarchy Table

COGNITIVE_HIERARCHY = {
    "strategic": {
        "tier": 1,
        "model": "claude-opus-4-5",
        "description": "Strategic decisions, architecture reviews",
        "cost_tier": "high"
    },
    "fast": {
        "tier": 1.5,
        "model": "claude-haiku-4-5",
        "description": "High-frequency, cost-sensitive operations",
        "cost_tier": "low"
    },
    "operational": {
        "tier": 2,
        "model": "glm-4.7",
        "description": "90% of routine work",
        "cost_tier": "free"
    },
    "vision": {
        "tier": 3,
        "model": "qwen3-vl:4b",
        "description": "Image understanding, visual tasks",
        "cost_tier": "free"
    },
    "reasoning": {
        "tier": 3,
        "model": "deepseek-r1:32b",
        "description": "Chain-of-thought, step-by-step reasoning",
        "cost_tier": "free"
    }
}

Programmatic Selection

from aegis.llm.model_selector import select_model_for_task, TaskType, CostTier

# Select model for task
recommendation = select_model_for_task(
    task_type=TaskType.CLASSIFICATION,
    cost_tier=CostTier.LOW,
    context_length=2000,
    require_vision=False
)

print(recommendation.model_id)         # "claude-haiku-4-5"
print(recommendation.tier)             # 1.5
print(recommendation.reasoning)        # "Fast, cost-effective for classification"
print(recommendation.estimated_cost)   # 0.002

Recommendation Table

from aegis.llm.model_selector import format_recommendation_table

# Generate markdown table
table = format_recommendation_table([
    TaskType.CLASSIFICATION,
    TaskType.EXTRACTION,
    TaskType.CODE_GENERATION,
    TaskType.REASONING
])

print(table)

Output:

| Task Type | Model | Tier | Cost | Reasoning |
|-----------|-------|------|------|-----------|
| Classification | claude-haiku-4-5 | 1.5 | Low | Fast, accurate |
| Extraction | claude-haiku-4-5 | 1.5 | Low | Pattern matching |
| Code Generation | glm-4.7 | 2 | Free | Sufficient quality |
| Reasoning | deepseek-r1:32b | 3 | Free | CoT capabilities |


Response Caching

Purpose: Reduce API costs and latency by memoizing responses

Implementation: /home/agent/projects/aegis-core/aegis/memory/cache.py

Cache Architecture

Storage: PostgreSQL llm_cache table
Key: SHA256 hash of (prompt + model + system + temperature)
TTL: 24 hours (default, configurable)
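
A minimal sketch of the key derivation (the exact serialization used in aegis/memory/cache.py may differ):

import hashlib


def cache_key(prompt: str, model: str, system: str = "", temperature: float = 0.7) -> str:
    # SHA256 over the fields that determine the response.
    payload = f"{prompt}|{model}|{system}|{temperature}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()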

Usage

Automatic Caching:

from aegis.llm import query

# Automatically cached
response = await query(
    "What is Docker?",
    use_cache=True,  # Default
    cache_ttl=86400  # 24 hours
)

# Cache hit on second call
response = await query("What is Docker?")  # Instant response

Manual Cache Control:

from aegis.memory.cache import cache_llm_response, get_cached_llm_response

# Cache response
cache_llm_response(
    prompt="What is Docker?",
    response="Docker is...",
    model="glm-4.7",
    ttl=3600  # 1 hour
)

# Retrieve cached response
cached = get_cached_llm_response(
    prompt="What is Docker?",
    model="glm-4.7"
)

Decorator Pattern:

from aegis.memory.cache import cached_query

@cached_query(ttl=3600)
async def expensive_llm_call(prompt: str) -> str:
    return await query(prompt)

# First call: hits API
result = await expensive_llm_call("Complex question")

# Second call: cached
result = await expensive_llm_call("Complex question")  # Instant
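
A possible implementation of such a decorator, built on the manual cache helpers above (a sketch, not the shipped code; the shipped decorator may derive the model and TTL differently):

import functools

from aegis.memory.cache import cache_llm_response, get_cached_llm_response


def cached_query(ttl: int = 86400, model: str = "glm-4.7"):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(prompt: str) -> str:
            # Serve from cache when possible, otherwise call through and store.
            cached = get_cached_llm_response(prompt=prompt, model=model)
            if cached is not None:
                return cached
            response = await func(prompt)
            cache_llm_response(prompt=prompt, response=response, model=model, ttl=ttl)
            return response
        return wrapper
    return decorator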

Cache Statistics

Metrics:
- Cache hit rate: ~40-60%
- Average latency reduction: 95% (2000ms → 100ms)
- API cost savings: ~$5-10/month

Cache Invalidation:

# Clear specific cache entry
db.execute("DELETE FROM llm_cache WHERE prompt_hash = %s", (hash,))

# Clear old entries (daily cron)
db.execute("DELETE FROM llm_cache WHERE created_at < NOW() - INTERVAL '7 days'")


Token Usage Tracking

Purpose: Monitor and bill token usage per customer

Implementation: /home/agent/projects/aegis-core/aegis/monetization/metering.py

Usage Recording

from aegis.monetization.metering import record_llm_tokens

# Automatically recorded in query()
response = await query(
    "Generate code",
    customer_id="cust-123",  # For billing
    track_usage=True         # Default
)

# Manual recording
record_llm_tokens(
    customer_id="cust-123",
    tokens=1500,
    model="glm-4.7",
    prompt_tokens=1000,
    completion_tokens=500
)

Token Counting

Haiku/Opus: Actual tokens from API response
GLM: Estimated ((prompt + completion length) / 4)
Ollama: Actual tokens from API response
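
The GLM estimate is the common ~4 characters per token heuristic, roughly:

def estimate_glm_tokens(prompt: str, completion: str) -> int:
    # Character-based estimate used when the API does not report token usage.
    return (len(prompt) + len(completion)) // 4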

Storage: api_usage_events table


Cost Analysis

Monthly Cost Breakdown

| Tier | Provider | Model | Volume | Cost/Month |
|------|----------|-------|--------|------------|
| 1 | Anthropic | Opus 4.5 | 10K tokens | $1.50 |
| 1.5 | Anthropic | Haiku 4.5 | 500K tokens | $3.00 |
| 2 | Z.ai | GLM-4.7 | Unlimited | $50.00 |
| 3 | Ollama | Local | Unlimited | $0.00 |
| Total | | | | ~$55/month |

Cost Optimization:
- 90% of requests use GLM (flat fee)
- 8% use Haiku (low cost)
- 2% use Opus (high cost, sparingly)
- 0% use Ollama (fallback only)

Cost per Request:
- Average: $0.0001 (mostly GLM flat fee amortized)
- Haiku request: $0.0002
- Opus request: $0.05
- Ollama request: $0.00

ROI Analysis

Without Routing (All Opus):
- 10,000 requests/month
- Average 1,000 tokens each
- Cost: 10M tokens × $15/MTok = $150/month

With Routing (Current):
- 10,000 requests/month
- 90% GLM (flat $50), 8% Haiku ($3), 2% Opus ($3)
- Cost: $56/month
- Savings: $94/month (63%)
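
The savings figure follows directly from the numbers above (a quick arithmetic check):

# 10,000 requests × 1,000 tokens = 10 MTok/month
all_opus = 10 * 15.0          # 10 MTok × $15/MTok (input pricing) = $150
routed = 50.0 + 3.0 + 3.0     # GLM flat fee + Haiku + Opus = $56
savings = all_opus - routed
print(f"${savings:.0f}/month ({savings / all_opus:.0%} saved)")  # $94/month (63% saved)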


Monitoring and Observability

LLM Metrics

Collected Metrics:
- Requests per tier (count)
- Response latency (ms, p50/p95/p99)
- Token usage (prompt + completion)
- Error rate by tier (%)
- Cache hit rate (%)
- Fallback frequency (%)

Storage: llm_metrics table (future)

Dashboard: /admin/llm-metrics (future)

Logging

Structured Logs (via structlog):

{
  "event": "llm_query",
  "tier": "glm",
  "model": "glm-4.7",
  "prompt_tokens": 150,
  "completion_tokens": 300,
  "latency_ms": 1250,
  "cache_hit": false,
  "customer_id": "aegis-internal"
}
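
An event like this would be emitted with structlog's keyword-argument API (illustrative; field names mirror the example above):

import structlog

logger = structlog.get_logger()

logger.info(
    "llm_query",
    tier="glm",
    model="glm-4.7",
    prompt_tokens=150,
    completion_tokens=300,
    latency_ms=1250,
    cache_hit=False,
    customer_id="aegis-internal",
)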

Log Levels:
- DEBUG: All LLM calls with full context
- INFO: Tier selection, fallback events
- WARNING: API failures, quota warnings
- ERROR: All tiers failed


Future LLM Routing Improvements

Q1 2026: Intelligence

  • A/B testing between models
  • Quality scoring (compare outputs)
  • Automatic tier promotion/demotion
  • Cost-quality Pareto frontier

Q2 2026: Advanced Routing

  • Multi-model ensembles (combine outputs)
  • Speculative execution (run multiple tiers, use fastest)
  • Adaptive routing (learn from feedback)
  • User preference learning

Q3 2026: Scale

  • Batch inference (group similar requests)
  • Request deduplication (merge identical queries)
  • Priority queuing (urgent vs. background)
  • Regional routing (use closest API endpoint)

Q4 2026: Optimization

  • Model fine-tuning (GLM on Aegis data)
  • Distillation (compress Opus → Haiku knowledge)
  • Prompt optimization (genetic algorithms)
  • Token compression (summarize long contexts)