Infrastructure Architecture¶

Overview¶

Project Aegis runs inside an LXC container on a dedicated Hetzner bare-metal server managed by Proxmox VE. This page covers the hardware, container configuration, backups, security, and monitoring that support its autonomous AI operations.

Hardware Stack¶

Hetzner EX130-R Dedicated Server¶

Specifications:
- CPU: Intel Core i9-14900 (24 cores, 32 threads)
    - Base Clock: 2.0 GHz
    - Boost Clock: Up to 5.8 GHz
    - Cache: 36MB L3
- RAM: 128GB DDR5-4800 ECC
- Storage: 2x 1.92TB NVMe SSD (Enterprise Grade)
    - RAID-1 configuration for redundancy
    - Read: ~3,500 MB/s per drive
    - Write: ~3,000 MB/s per drive
- Network: 1 Gbit/s connection
- Location: Falkenstein, Germany (FSN1-DC14)

Public IP: 157.180.63.15

Virtualization Layer¶

Proxmox VE 8.x¶

Proxmox Virtual Environment provides the hypervisor layer, enabling:
- LXC container virtualization for Aegis
- KVM virtual machines for additional services
- ZFS storage management
- Snapshot and backup capabilities
- Web-based management interface

Proxmox Access:
- Web UI: https://157.180.63.15:8006
- SSH: Root access via key authentication

Aegis LXC Container¶

Container Specifications¶

Container Type: Unprivileged LXC (Linux Container)

Resource Allocation:
- vCPU: 32 cores (proportional scheduling)
- RAM: 110GB (of 128GB host total, leaving headroom for the host)
- Storage: 500GB ZFS dataset
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Kernel: Linux 6.8.12-14-pve (Proxmox kernel)

Network Configuration:
- Internal IP: 10.10.10.103/24
- Gateway: 10.10.10.1 (Proxmox bridge)
- DNS: 1.1.1.1, 8.8.8.8
- Tailscale VPN: 100.114.189.93

Container Features:
- Nesting enabled (for Docker)
- Fuse support enabled
- Keyctl enabled
- Unprivileged user namespacing
- cgroup v2 hierarchy

Container Layout¶

/home/agent/
├── .claude/           # Claude Code configuration
├── .secure/           # Encrypted credentials
├── memory/            # Persistent memory storage
├── projects/          # Code repositories
│   └── aegis-core/    # Main Aegis codebase
├── stacks/            # Docker Compose stacks
└── downloads/         # Temporary downloads

Resource Monitoring¶

Current Utilization (Typical)¶

Resource	Allocated	Used	Available
CPU	32 cores	~8-12 cores	~20-24 cores
RAM	110GB	~40-60GB	50-70GB
Storage	500GB	~180GB	320GB
Network	1 Gbit/s	~10-50 Mbit/s	950+ Mbit/s
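The Available column is simply the allocation minus typical use. As a small sketch of that arithmetic, using the figures from the table (the dictionary and function names are illustrative, not part of any Aegis tooling):

```python
# Allocated capacity and typical usage ranges, copied from the table above.
ALLOCATED = {"cpu_cores": 32, "ram_gb": 110, "storage_gb": 500}
TYPICAL_USE = {
    "cpu_cores": (8, 12),
    "ram_gb": (40, 60),
    "storage_gb": (180, 180),
}


def headroom(allocated, used_range):
    """Return (min_free, max_free) for a resource given its (low, high) usage."""
    low, high = used_range
    return allocated - high, allocated - low


for name, total in ALLOCATED.items():
    free_min, free_max = headroom(total, TYPICAL_USE[name])
    print(f"{name}: {free_min}-{free_max} free of {total}")
```

The lower bound of each range is the number to plan against, since it corresponds to the busiest typical load.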

Key Services Resource Usage¶

Service	CPU	RAM	Storage	Purpose
FalkorDB	~2%	~500MB	~2GB	Knowledge graph
PostgreSQL	~3%	~1GB	~15GB	Relational data
Traefik	~1%	~100MB	~50MB	Reverse proxy
Dashboard	~5%	~800MB	~100MB	Web interface
Ollama	~10-50%	~8-16GB	~50GB	Local LLM inference
Docker	~2%	~500MB	~10GB	Container runtime

Backup Strategy¶

Automated Backups¶

Proxmox Snapshots:
- Daily snapshots of LXC container (last 7 days retained)
- Weekly full container backups (last 4 weeks retained)
- Backup destination: Hetzner Storage Box (1TB)
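To make the retention windows concrete, the following sketch selects which backups a "last 7 days" or "last 4 weeks" policy would keep. It only illustrates the policy; the actual pruning is presumably handled by the backup tooling itself, and the function name and parameters here are hypothetical.

```python
from datetime import date, timedelta

DAILY_KEEP = 7   # daily snapshots retained (from the policy above)
WEEKLY_KEEP = 4  # weekly full backups retained (from the policy above)


def select_retained(backup_dates, today, keep, step_days):
    """Return the newest `keep` backups that fall inside the retention window.

    The window covers keep * step_days days ending at `today`.
    """
    cutoff = today - timedelta(days=keep * step_days)
    recent = sorted((d for d in backup_dates if cutoff < d <= today), reverse=True)
    return recent[:keep]


def select_pruned(backup_dates, today, keep, step_days):
    """Return the backups that fall outside the retention window, oldest first."""
    kept = set(select_retained(backup_dates, today, keep, step_days))
    return sorted(d for d in backup_dates if d not in kept)
```

For example, `select_retained(dailies, date.today(), DAILY_KEEP, 1)` keeps today's snapshot and the six before it, while `select_pruned` lists the candidates for deletion.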

Data Backups:
- Git repositories: Pushed to GitHub (real-time)
- Memory files: Daily rsync to Storage Box
- Database: Nightly pg_dump to /home/agent/backups/
- Configuration: Tracked in git, pushed hourly
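As an illustration of the nightly database dump, the sketch below builds a `pg_dump` command line that writes a date-stamped, custom-format archive into the backup directory. The database name, file naming scheme, and helper name are assumptions; the source only states that `pg_dump` runs nightly into `/home/agent/backups/`.

```python
from datetime import date
from pathlib import PurePosixPath

BACKUP_DIR = PurePosixPath("/home/agent/backups")  # destination named in the list above


def pg_dump_command(db_name, day, backup_dir=BACKUP_DIR):
    """Build the argv for a custom-format pg_dump into a date-stamped file.

    `db_name` is a hypothetical database name supplied by the caller.
    """
    target = backup_dir / f"{db_name}-{day.isoformat()}.dump"
    return ["pg_dump", "--format=custom", f"--file={target}", db_name]


print(" ".join(pg_dump_command("aegis", date(2026, 1, 15))))
```

A scheduler could pass the returned list to `subprocess.run(cmd, check=True)`. Connection details would come from the standard libpq environment variables, which fits the practice of keeping database passwords in environment variables only.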

Recovery Time Objective (RTO): < 30 minutes
Recovery Point Objective (RPO): < 24 hours

Security Posture¶

Container Isolation¶

Unprivileged container: Root inside container != root on host
AppArmor profiles: Mandatory Access Control
seccomp filters: System call restrictions
Namespacing: PID, network, mount, UTS, IPC isolation
cgroups: Resource limits enforced

Network Security¶

Firewall: Proxmox firewall + container iptables
SSH: Key-only authentication, no password login
Fail2ban: Intrusion prevention (3 strikes, 1 hour ban)
TLS: All public services behind Traefik with Let's Encrypt

Credential Management¶

API keys: ~/.secure/ (encrypted at rest)
SSH keys: ~/.ssh/ (ed25519, which uses fixed-size 256-bit keys)
Database passwords: Environment variables only
GitHub tokens: Stored in git-credentials (local only)

Disaster Recovery¶

Scenario 1: Container Failure¶

Recovery Steps:
1. Restore latest Proxmox snapshot (~5 minutes)
2. Verify network connectivity
3. Start Docker services: cd /home/agent/projects/aegis-core && docker compose up -d
4. Verify health endpoints
5. Resume autonomous operations
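The first four steps can be sketched as an ordered command plan. This is a dry-run helper that only builds the commands; it does not execute anything. The container ID, snapshot name, and health URL are hypothetical parameters, and the `ping` and `curl` checks are one possible way to verify connectivity and health, not necessarily what Aegis uses.

```python
from typing import NamedTuple


class Step(NamedTuple):
    where: str    # "host" = Proxmox host, "container" = inside the Aegis LXC
    command: str  # shell command to run at that location


def container_recovery_plan(vmid: int, snapshot: str, health_url: str) -> list[Step]:
    """Return the ordered recovery commands for a failed container."""
    return [
        # 1. Roll the container back to the chosen snapshot and start it.
        Step("host", f"pct rollback {vmid} {snapshot}"),
        Step("host", f"pct start {vmid}"),
        # 2. Check that the gateway on the Proxmox bridge is reachable.
        Step("container", "ping -c 3 10.10.10.1"),
        # 3. Bring the Docker services back up (command from the steps above).
        Step("container", "cd /home/agent/projects/aegis-core && docker compose up -d"),
        # 4. Probe a health endpoint; --fail makes curl exit non-zero on HTTP errors.
        Step("container", f"curl --fail --silent {health_url}"),
    ]
```

Printing the plan before running it gives an operator a checklist to confirm, which is useful when the rollback target has to be picked by hand.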

Expected Downtime: 10-15 minutes

Scenario 2: Host Hardware Failure¶

Recovery Steps:
1. Provision new Hetzner server (1-2 hours)
2. Install Proxmox VE
3. Restore container from Hetzner Storage Box backup
4. Update DNS records (10 minute TTL)
5. Resume operations

Expected Downtime: 2-4 hours

Scenario 3: Data Corruption¶

Recovery Steps:
1. Identify affected system (database, filesystem, git)
2. Restore from nightly backup
3. Replay transactions from episodic memory logs
4. Validate data integrity
5. Resume operations

Expected Downtime: 30 minutes to 2 hours

Performance Characteristics¶

Network Latency¶

Destination	Latency	Bandwidth
Hetzner Cloud (same DC)	~0.5ms	1 Gbit/s
Europe	~10-30ms	500+ Mbit/s
US East Coast	~80-100ms	300+ Mbit/s
US West Coast	~150-170ms	200+ Mbit/s
Asia Pacific	~200-300ms	100+ Mbit/s

Storage Performance¶

NVMe SSD (RAID-1):
- Sequential Read: 3,500 MB/s
- Sequential Write: 3,000 MB/s
- Random Read (4K): 600K IOPS
- Random Write (4K): 550K IOPS
- Latency: < 100μs

Compute Performance¶

LLM Inference (Ollama):
- Qwen3:4B: ~40 tokens/second
- Qwen3:30B: ~8 tokens/second
- DeepSeek-R1:32B: ~6 tokens/second
- Moondream (vision): ~15 tokens/second
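These throughput figures translate directly into expected generation time. A rough sketch, assuming the rates above hold steady and ignoring prompt processing and model load time (the lowercase model keys are just dictionary labels chosen for this example):

```python
# Approximate decode throughput in tokens per second, from the list above.
THROUGHPUT_TPS = {
    "qwen3:4b": 40,
    "qwen3:30b": 8,
    "deepseek-r1:32b": 6,
    "moondream": 15,
}


def generation_seconds(model, output_tokens):
    """Estimate the seconds needed to generate `output_tokens` with `model`."""
    return output_tokens / THROUGHPUT_TPS[model]


for model in THROUGHPUT_TPS:
    print(f"{model}: {generation_seconds(model, 400):.0f}s for 400 tokens")
```

At these rates a 400-token reply takes on the order of ten seconds on the 4B model and closer to a minute on the 30B-class models, which is one reason to route short, latency-sensitive tasks to the smaller model.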

Database Performance:
- PostgreSQL: 10K queries/second (typical load)
- FalkorDB: 5K graph traversals/second
- Redis cache hit rate: 95%+

Scaling Strategy¶

Horizontal Scaling (Future)¶

Current: Single LXC container on single host

Phase 1 (< 6 months):
- Add second LXC for separate services (monitoring, CI/CD)
- Keep Aegis core on primary container
- Shared PostgreSQL, separate Redis instances

Phase 2 (6-12 months):
- Kubernetes cluster on Hetzner Cloud
- Aegis core as StatefulSet
- Managed PostgreSQL (Hetzner)
- Multi-region deployment for resilience

Vertical Scaling (Current Headroom)¶

CPU: Can allocate up to 32 cores (currently using ~12)
RAM: Can increase to 120GB (currently 110GB)
Storage: Can expand ZFS dataset to 1.5TB

Cost Analysis¶

Monthly Infrastructure Costs¶

Item	Cost	Notes
Hetzner EX130-R	€89.00	Bare metal server
Hetzner Storage Box (1TB)	€3.81	Backup storage
Cloudflare DNS	€0.00	Free plan
Domain (aegisagent.ai)	€1.00	Amortized annual cost
Total	€93.81	~$102 USD/month

Cost per vCPU: €2.78/month
Cost per GB RAM: €0.82/month
Cost per TB storage: €1.91/month

Maintenance Windows¶

Scheduled Maintenance¶

Daily (00:00-06:00 UTC):
- Database backups and optimization
- Log rotation and cleanup
- Docker image updates (security patches)
- System package updates (unattended-upgrades)

Weekly (Sunday 02:00 UTC):
- Proxmox snapshot creation
- Full container backup to Storage Box
- Disk usage analysis and cleanup

Monthly (First Sunday 03:00 UTC):
- Kernel updates (requires reboot)
- Certificate renewal checks
- Dependency audit (Python, Node.js)

Emergency Maintenance¶

Triggers:
- Critical security vulnerabilities (CVSS score > 9.0)
- Service outages affecting > 50% of functionality
- Data corruption detected
- Resource exhaustion (> 95% RAM or disk)
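The trigger conditions can be expressed as a single check that returns the reasons an emergency window should open. This is a sketch under stated assumptions: the vulnerability score is taken to be a CVSS base score, and the function name, parameters, and reason labels are illustrative.

```python
def emergency_reasons(max_vuln_score, affected_fraction, corruption_detected,
                      ram_pct, disk_pct):
    """Return the list of emergency triggers that currently apply.

    max_vuln_score: highest severity score among unpatched vulnerabilities
    affected_fraction: share of functionality affected by an outage (0.0-1.0)
    ram_pct, disk_pct: current utilization in percent
    """
    reasons = []
    if max_vuln_score > 9.0:
        reasons.append("critical vulnerability")
    if affected_fraction > 0.5:
        reasons.append("major outage")
    if corruption_detected:
        reasons.append("data corruption")
    if ram_pct > 95 or disk_pct > 95:
        reasons.append("resource exhaustion")
    return reasons
```

An empty list means routine maintenance windows are sufficient; any entry would justify immediate action and a notification.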

Notification Channels:
- Discord #alerts channel
- Telegram bot (Chat ID: 1275129801)
- WhatsApp (WABA: +447441443388)

Monitoring and Alerting¶

Health Checks¶

Service-Level:
- HTTP health endpoints every 30 seconds
- Database connection pools
- Docker container status
- Disk I/O wait times
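A minimal sketch of an HTTP health probe, using only the Python standard library. The endpoint URLs are not given in this document, so any URL passed in is an assumption, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request

CHECK_INTERVAL_SECONDS = 30  # polling interval from the list above


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


def check_all(urls, opener=urllib.request.urlopen):
    """Probe every URL once and return a {url: healthy} mapping."""
    return {url: check_endpoint(url, opener=opener) for url in urls}
```

A monitoring loop would call `check_all` and then sleep for `CHECK_INTERVAL_SECONDS` between rounds, raising an alert when an endpoint stays unhealthy.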

System-Level:
- CPU utilization (alert > 90% for 5 minutes)
- RAM usage (alert > 95%)
- Disk usage (alert > 85%)
- Network errors (alert > 1% packet loss)
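The CPU rule differs from the others because it requires the condition to hold for five minutes rather than at a single instant. The sketch below shows one way to model both kinds of rule, assuming one sample every 30 seconds so that five minutes corresponds to 10 consecutive samples. Class and function names are illustrative.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when every sample in a full window exceeds the limit."""

    def __init__(self, limit, window_samples):
        self.limit = limit
        self.samples = deque(maxlen=window_samples)

    def update(self, value):
        """Record a sample and return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.limit for v in self.samples)


def instant_alerts(ram_pct, disk_pct, packet_loss_pct):
    """Return the names of the instantaneous thresholds that are exceeded."""
    alerts = []
    if ram_pct > 95:
        alerts.append("ram")
    if disk_pct > 85:
        alerts.append("disk")
    if packet_loss_pct > 1:
        alerts.append("network")
    return alerts


# 5 minutes of 30-second samples = 10 samples (the sampling rate is an assumption).
cpu_alert = SustainedThreshold(limit=90, window_samples=10)
```

Requiring a full window avoids paging on short CPU spikes, such as a burst of LLM inference, while still catching sustained saturation.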

Metrics Collection¶

Prometheus: Metrics scraping (30s interval)
PostgreSQL: Query logs and slow query analysis
Docker stats: Container resource usage
Structlog: JSON-formatted application logs

Dashboards¶

Traefik: https://traefik.rbnk.uk (internal)
Aegis Dashboard: https://aegisagent.ai (public)
FalkorDB Browser: http://localhost:3001 (internal)

Future Infrastructure Roadmap¶

Q1 2026: Optimization Phase¶

Implement caching layer (Redis Cluster)
Optimize Ollama model loading (shared memory)
Add read replicas for PostgreSQL
Implement rate limiting per API key

Q2 2026: High Availability¶

Second Aegis instance (active-passive)
Load balancer (HAProxy or Traefik Enterprise)
Distributed logging (Loki)
Multi-region DNS failover

Q3 2026: Edge Computing¶

Edge nodes for regional LLM inference
CDN for static assets (Cloudflare)
WebSocket gateway for real-time updates
GraphQL federation

Q4 2026: Kubernetes Migration¶

Migrate to Hetzner Cloud Kubernetes
Helm charts for all services
GitOps deployment (ArgoCD)
Service mesh (Istio or Linkerd)