Aegis System Architecture

Comprehensive technical documentation for disaster recovery and system understanding

System Overview

Aegis is an autonomous AI agent platform that runs inside an isolated LXC container on a Hetzner EX130-R dedicated server. The system combines multi-tier memory, tiered LLM routing, graph-based workflows, and multi-channel communication.

flowchart TB
    subgraph Internet["Internet"]
        Users[Users]
        APIs[External APIs]
    end

    subgraph Hetzner["Hetzner EX130-R (157.180.63.15)"]
        subgraph Proxmox["Proxmox Host"]
            direction TB

            subgraph Dockerhost["Dockerhost LXC (10.10.10.10)"]
                TraefikMain["Traefik Proxy<br/>TCP Passthrough<br/>:80, :443"]
            end

            subgraph AegisLXC["Aegis LXC (10.10.10.103)"]
                direction TB

                subgraph Docker["Docker Compose Stack"]
                    TraefikLocal["Traefik<br/>Local Routing"]
                    Dashboard["Dashboard<br/>FastAPI<br/>:8080"]
                    Scheduler["Scheduler<br/>APScheduler"]
                    Playwright["Playwright<br/>Screenshots<br/>:3002"]
                    FalkorDB["FalkorDB<br/>Knowledge Graph<br/>:6379"]
                end

                subgraph Host["Host Services"]
                    PostgreSQL["PostgreSQL<br/>:5432"]
                    Ollama["Ollama<br/>Local LLMs<br/>:11434"]
                    FileSystem["Filesystem<br/>/home/agent"]
                end
            end
        end
    end

    Internet --> TraefikMain
    TraefikMain --> TraefikLocal
    TraefikLocal --> Dashboard
    Dashboard --> Scheduler
    Dashboard --> PostgreSQL
    Dashboard --> Ollama
    Dashboard --> FalkorDB
    Dashboard --> Playwright
    Scheduler --> PostgreSQL
    Scheduler --> Playwright

Core Architecture Principles

1. Containment & Autonomy

LXC-based isolation with controlled resource limits (110GB memory)
Docker Compose for service orchestration
Host-gateway networking for PostgreSQL and Ollama access
Traefik for SSL termination and routing

2. Multi-Tier Cognition

Tier	Model	Use Case	Usage / Rate Limit
1	Claude Opus 4.5	Strategic decisions, architecture	Rare
1.5	Claude Haiku 4.5	Fast ops: classify, extract, parse	High
2	GLM-4.7 via Z.ai	90%+ operational work	~8 req/min
3	Ollama Local	Fallback, offline, vision	Unlimited
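
The tier table can be read as a routing policy. The sketch below is one possible reading, not the Aegis router itself: the task categories, the `select_tier` function, and its `available` parameter are hypothetical names introduced for illustration, and no model API is called.

```python
from enum import Enum


class Tier(Enum):
    """Cognitive tiers from the table above (labels only, no API calls)."""
    OPUS = "Claude Opus 4.5"      # Tier 1
    HAIKU = "Claude Haiku 4.5"    # Tier 1.5
    GLM = "GLM-4.7 via Z.ai"      # Tier 2
    LOCAL = "Ollama Local"        # Tier 3


# Hypothetical task categories, derived from the "Use Case" column.
STRATEGIC = {"strategy", "architecture"}
FAST_OPS = {"classify", "extract", "parse"}
LOCAL_ONLY = {"vision", "offline"}


def select_tier(task_kind: str, available: frozenset = frozenset(Tier)) -> Tier:
    """Pick the preferred tier for a task, falling back to local models."""
    if task_kind in LOCAL_ONLY:
        preferred = Tier.LOCAL
    elif task_kind in STRATEGIC:
        preferred = Tier.OPUS
    elif task_kind in FAST_OPS:
        preferred = Tier.HAIKU
    else:
        preferred = Tier.GLM  # default for general operational work
    # Tier 3 is the documented fallback when the preferred tier is unreachable.
    return preferred if preferred in available else Tier.LOCAL
```

For example, `select_tier("parse")` returns `Tier.HAIKU`, while `select_tier("summarize", available=frozenset({Tier.LOCAL}))` falls back to `Tier.LOCAL`.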

3. Persistent Memory Architecture

Episodic: Events, decisions, interactions (PostgreSQL)
Semantic: Knowledge, facts, learnings (PostgreSQL + FalkorDB)
Procedural: Workflows, how-tos, templates (filesystem)
Cache: LLM responses, tool outputs (in-memory)
Knowledge Graph: Entity-relationship mapping (FalkorDB)
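
For disaster recovery it helps to know which memory types depend on which storage backend. The following sketch restates the list above as data; the `MemoryTier` class and the `tiers_using` helper are hypothetical names and are not part of the Aegis codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTier:
    name: str
    contents: str
    backends: tuple  # storage backends named in the list above


MEMORY_TIERS = (
    MemoryTier("episodic", "events, decisions, interactions", ("postgresql",)),
    MemoryTier("semantic", "knowledge, facts, learnings", ("postgresql", "falkordb")),
    MemoryTier("procedural", "workflows, how-tos, templates", ("filesystem",)),
    MemoryTier("cache", "LLM responses, tool outputs", ("in-memory",)),
    MemoryTier("knowledge_graph", "entity-relationship mapping", ("falkordb",)),
)


def tiers_using(backend: str) -> list:
    """Return the names of memory tiers that depend on the given backend."""
    return [tier.name for tier in MEMORY_TIERS if backend in tier.backends]
```

Calling `tiers_using("falkordb")` lists the tiers affected by a FalkorDB outage. The in-memory cache is the only tier without a persistent backend, so it should be treated as lost on every restart.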

4. Graph-Based Execution

LangGraph-inspired workflow engine
PostgreSQL-backed checkpointing for crash recovery
Human-in-the-loop interrupt nodes
Conditional branching and error handling
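
The four properties above can be illustrated with a very small engine. This is a sketch under stated assumptions, not the Aegis workflow engine: a plain dict stands in for the PostgreSQL checkpoint table, and `run`, `INTERRUPT`, `END`, and the example nodes are hypothetical names.

```python
INTERRUPT = "__interrupt__"  # hypothetical sentinel: pause for human input
END = "__end__"              # hypothetical sentinel: workflow finished


def run(nodes: dict, start: str, checkpoints: dict, run_id: str):
    """Run until END or an interrupt, resuming from the last checkpoint.

    Each node is a callable that mutates the state dict and returns the
    name of the next node. Returns the node name the run paused at, or
    None once the workflow has finished.
    """
    current, state = checkpoints.get(run_id, (start, {}))
    while current != END:
        try:
            nxt = nodes[current](state)
        except Exception as exc:  # error handling: record the failure and stop
            state["error"] = repr(exc)
            nxt = END
        if nxt == INTERRUPT:
            checkpoints[run_id] = (current, state)  # re-enter this node later
            return current
        current = nxt
        checkpoints[run_id] = (current, state)  # checkpoint after every step
    return None


def draft(state):
    state["draft"] = "restart falkordb"
    return "approve"


def approve(state):
    # Human-in-the-loop: pause until a decision has been recorded.
    if "approved" not in state:
        return INTERRUPT
    return "execute" if state["approved"] else END  # conditional branch


def execute(state):
    state["done"] = True
    return END


NODES = {"draft": draft, "approve": approve, "execute": execute}
```

A first call to `run(NODES, "draft", store, "run-1")` with an empty `store` pauses at `"approve"`. After a human sets `store["run-1"][1]["approved"] = True`, a second call with the same `run_id` resumes from the checkpoint and completes. A database-backed store would serialize the state rather than share the dict object as this sketch does.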

5. Multi-Channel Communication

Channel	Purpose
Discord	Primary (status, logs, alerts, journal, tasks)
WhatsApp	Two-way command channel (Vonage WABA)
Voice	Inbound calls with ASR (Vonage)
Telegram	Time-sensitive alerts
Email	Gmail triage and automation

Quick Reference

Network Topology

IP	Host	Role
157.180.63.15	Proxmox Host	Public IP
10.10.10.10	Dockerhost LXC	Main Traefik (TCP passthrough)
10.10.10.103	Aegis LXC	This instance

Public Domains

Primary: aegisagent.ai

https://aegisagent.ai - Dashboard
https://intel.aegisagent.ai - Geopolitical Intelligence
https://notebooks.aegisagent.ai - Open Notebook Research
https://code.aegisagent.ai - VS Code
https://vnc.aegisagent.ai - VNC Access

Container Stack

Container	Image	Ports	Purpose
aegis-dashboard	aegis-core-dashboard	8080	FastAPI web app
aegis-scheduler	aegis-core-scheduler	-	APScheduler daemon
aegis-playwright	playwright-screenshot-api	3002	Browser automation
falkordb	falkordb/falkordb	6379, 3001	Knowledge graph
traefik	traefik:v2.11	80, 443	Reverse proxy
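
During recovery, a quick TCP probe of the listed ports can confirm which services are reachable. The sketch below uses only the Python standard library. It assumes the ports in the table are reachable from wherever the script runs, which the table does not state; `SERVICE_PORTS`, `port_open`, and `stack_status` are hypothetical names.

```python
import socket

# Ports copied from the container table above. aegis-scheduler exposes none.
SERVICE_PORTS = {
    "aegis-dashboard": [8080],
    "aegis-playwright": [3002],
    "falkordb": [6379, 3001],
    "traefik": [80, 443],
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host: str = "127.0.0.1") -> dict:
    """Probe every listed port and map (container, port) to reachability."""
    return {
        (name, port): port_open(host, port)
        for name, ports in SERVICE_PORTS.items()
        for port in ports
    }
```

An open port only shows that something is listening. It does not replace the application-level health checks described in containers.md.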

Resource Usage

Resource	Total	Used	Available
Memory	110 GiB	~16 GiB	~93 GiB
Disk	196 GB	164 GB	23 GB
CPU Cores	16	-	-

Key File Locations

/home/agent/
├── .secure/                    # Credentials (never commit)
├── .claude/                    # Claude Code configuration
│   ├── settings.json           # Agent SDK config
│   ├── history.jsonl           # Session history
│   └── hooks/                  # Pre/post tool hooks
├── memory/                     # Multi-tier memory
│   ├── episodic/               # Events, decisions
│   ├── semantic/               # Knowledge, learnings
│   ├── procedural/             # Workflows, how-tos
│   └── journal/                # Daily journals
├── projects/
│   └── aegis-core/             # Main codebase
│       ├── aegis/              # Python package (78 modules)
│       ├── docker-compose.yml  # Service definitions
│       └── docs/               # Documentation
└── stacks/                     # Docker stacks

Documentation Index

infrastructure.md - Hetzner server, LXC config, resources
network.md - Network topology, DNS, SSL/TLS
containers.md - Docker stack, health checks
database.md - PostgreSQL schema (137 tables)
memory.md - Memory system architecture
llm-routing.md - Cognitive hierarchy

Design Philosophy

Value Creation Over Feature Accumulation

Problem: 60-70% of the Aegis codebase (144K lines) is unused.

Solution: Prioritize SHIP (70%) > REACTIVE (20%) > PROACTIVE (10%)

Wire built features to commands/workflows/cron
Fix bugs in shipped features
Run demos, find breakage, fix it

Resource Discipline

110GB memory limit enforced by LXC
$50/month API budget via Privacy.com
Tier selection protocol minimizes Claude API costs
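
One way to picture the budget constraint is a guard that sends work to the unlimited local tier once the monthly cap would be exceeded. This is an illustrative sketch only: `MonthlyBudget`, `pick_backend`, and the per-call cost estimates are assumptions, and the source does not describe how Aegis tracks spending.

```python
from dataclasses import dataclass


@dataclass
class MonthlyBudget:
    limit_usd: float = 50.0  # monthly API cap stated above
    spent_usd: float = 0.0

    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

    def charge(self, cost_usd: float) -> bool:
        """Record a paid call if it fits; otherwise leave the total unchanged."""
        if cost_usd > self.remaining():
            return False
        self.spent_usd += cost_usd
        return True


def pick_backend(budget: MonthlyBudget, estimated_cost_usd: float) -> str:
    """Use a paid tier while the budget allows, else the local tier."""
    return "paid" if budget.charge(estimated_cost_usd) else "local"
```

With `MonthlyBudget(spent_usd=49.5)`, a call estimated at $0.25 is routed to a paid tier, while a following call estimated at $1.00 is routed to the local tier.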

Three-Strike Debug Protocol

Strike 1: Retry with modified approach
Strike 2: Switch to local reasoning model
Strike 3: STOP, document, escalate to human
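
The protocol can be sketched as a small escalation loop. This assumes each strike corresponds to one failed attempt, which is one reading of the list above; `run_with_strikes`, `Escalation`, and the three callables are hypothetical names.

```python
class Escalation(Exception):
    """Raised on strike 3: stop and hand the documented failures to a human."""


def run_with_strikes(attempt, retry_modified, local_model, notes: list):
    """Try up to three approaches, recording each failure in notes.

    attempt:        the original approach
    retry_modified: response to strike 1 (modified approach)
    local_model:    response to strike 2 (local reasoning model)
    """
    responses = [retry_modified, local_model]
    step = attempt
    for strike in (1, 2, 3):
        try:
            return step()
        except Exception as exc:
            notes.append(f"strike {strike}: {exc!r}")  # document every failure
            if strike == 3:
                # STOP: no further automated retries after the third failure.
                raise Escalation("; ".join(notes)) from exc
            step = responses[strike - 1]
```

The caller is expected to catch `Escalation` and forward its message, which contains all three recorded failures, to a human operator.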

Last Updated: 2026-01-25