Infrastructure Architecture
Overview
Project Aegis runs on a single dedicated Hetzner bare-metal server. Proxmox VE provides the virtualization layer, and the Aegis agent runs inside an LXC container on that host, which keeps autonomous AI operations isolated from the hypervisor.
Hardware Stack
Hetzner EX130-R Dedicated Server
Specifications:
- CPU: Intel Core i9-14900 (24 cores, 32 threads)
  - Base Clock: 2.0 GHz
  - Boost Clock: Up to 5.8 GHz
  - Cache: 36MB L3
- RAM: 128GB DDR5-4800 ECC
- Storage: 2x 1.92TB NVMe SSD (Enterprise Grade)
  - RAID-1 configuration for redundancy
  - Read: ~3,500 MB/s per drive
  - Write: ~3,000 MB/s per drive
- Network: 1 Gbit/s connection
- Location: Falkenstein, Germany (FSN1-DC14)
Public IP: 157.180.63.15
Virtualization Layer
Proxmox VE 8.x
Proxmox Virtual Environment provides the hypervisor layer, enabling:
- LXC container virtualization for Aegis
- KVM virtual machines for additional services
- ZFS storage management
- Snapshot and backup capabilities
- Web-based management interface
Proxmox Access:
- Web UI: https://157.180.63.15:8006
- SSH: Root access via key authentication
Aegis LXC Container
Container Specifications
Container Type: Unprivileged LXC (Linux Container)
Resource Allocation:
- vCPU: 32 (one per host hardware thread, proportional scheduling)
- RAM: 110GB of the host's 128GB, leaving 18GB of headroom for the host
- Storage: 500GB ZFS dataset
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Kernel: Linux 6.8.12-14-pve (Proxmox kernel)
Network Configuration:
- Internal IP: 10.10.10.103/24
- Gateway: 10.10.10.1 (Proxmox bridge)
- DNS: 1.1.1.1, 8.8.8.8
- Tailscale VPN: 100.114.189.93
Container Features:
- Nesting enabled (for Docker)
- Fuse support enabled
- Keyctl enabled
- Unprivileged user namespacing
- cgroup v2 hierarchy
Container Layout
/home/agent/
├── .claude/ # Claude Code configuration
├── .secure/ # Encrypted credentials
├── memory/ # Persistent memory storage
├── projects/ # Code repositories
│ └── aegis-core/ # Main Aegis codebase
├── stacks/ # Docker Compose stacks
└── downloads/ # Temporary downloads
Resource Monitoring
Current Utilization (Typical)
| Resource | Allocated | Used | Available |
|---|---|---|---|
| CPU | 32 cores | ~8-12 cores | 20-24 cores |
| RAM | 110GB | ~40-60GB | 50-70GB |
| Storage | 500GB | ~180GB | 320GB |
| Network | 1 Gbit/s | ~10-50 Mbit/s | 950+ Mbit/s |
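The Available column is allocation minus typical usage. A minimal sketch of that arithmetic follows, with the ranges from the table hard-coded; the helper name `available_range` is chosen here for illustration and is not part of the Aegis codebase.

```python
def available_range(allocated, used_low, used_high):
    """Return (min_available, max_available) for a typical usage range."""
    return allocated - used_high, allocated - used_low


# Figures copied from the utilization table: (allocated, used_low, used_high).
typical = {
    "cpu_cores": (32, 8, 12),
    "ram_gb": (110, 40, 60),
    "storage_gb": (500, 180, 180),
}

headroom = {name: available_range(*values) for name, values in typical.items()}

for name, (low, high) in headroom.items():
    print(f"{name}: {low}-{high} available")
```

The lower bound of each range is the figure to plan against, since it corresponds to the busiest typical load.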
Key Services Resource Usage
| Service | CPU | RAM | Storage | Purpose |
|---|---|---|---|---|
| FalkorDB | ~2% | ~500MB | ~2GB | Knowledge graph |
| PostgreSQL | ~3% | ~1GB | ~15GB | Relational data |
| Traefik | ~1% | ~100MB | ~50MB | Reverse proxy |
| Dashboard | ~5% | ~800MB | ~100MB | Web interface |
| Ollama | ~10-50% | ~8-16GB | ~50GB | Local LLM inference |
| Docker | ~2% | ~500MB | ~10GB | Container runtime |
Backup Strategy
Automated Backups
Proxmox Snapshots:
- Daily snapshots of LXC container (last 7 days retained)
- Weekly full container backups (last 4 weeks retained)
- Backup destination: Hetzner Storage Box (1TB)
Data Backups:
- Git repositories: Pushed to GitHub (real-time)
- Memory files: Daily rsync to Storage Box
- Database: Nightly pg_dump to /home/agent/backups/
- Configuration: Tracked in git, pushed hourly
Recovery Time Objective (RTO): < 30 minutes for a container-level failure (host hardware failure takes longer; see Disaster Recovery)
Recovery Point Objective (RPO): < 24 hours
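The retention rules above (7 daily snapshots, 4 weekly backups) amount to "keep the newest N, prune the rest". The sketch below illustrates that rule only; it is not how Proxmox implements pruning, and the function name and example dates are hypothetical.

```python
from datetime import date, timedelta


def split_by_retention(backup_dates, keep_last):
    """Return (kept, pruned); the newest `keep_last` dates are kept."""
    ordered = sorted(backup_dates, reverse=True)
    return ordered[:keep_last], ordered[keep_last:]


today = date(2026, 1, 11)  # arbitrary example date
daily = [today - timedelta(days=n) for n in range(10)]
weekly = [today - timedelta(weeks=n) for n in range(6)]

kept_daily, pruned_daily = split_by_retention(daily, keep_last=7)
kept_weekly, pruned_weekly = split_by_retention(weekly, keep_last=4)

print(f"daily: keep {len(kept_daily)}, prune {len(pruned_daily)}")
print(f"weekly: keep {len(kept_weekly)}, prune {len(pruned_weekly)}")
```

With daily snapshots, the newest restore point is at most one day old, which is what keeps the RPO under 24 hours.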
Security Posture
Container Isolation
- Unprivileged container: Root inside container != root on host
- AppArmor profiles: Mandatory Access Control
- seccomp filters: System call restrictions
- Namespacing: PID, network, mount, UTS, IPC isolation
- cgroups: Resource limits enforced
Network Security
- Firewall: Proxmox firewall + container iptables
- SSH: Key-only authentication, no password login
- Fail2ban: Intrusion prevention (3 strikes, 1 hour ban)
- TLS: All public services behind Traefik with Let's Encrypt
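The "3 strikes, 1 hour ban" policy can be modelled in a few lines. This is an illustrative sketch, not Fail2ban's implementation; in Fail2ban itself the equivalent knobs are the `maxretry`, `findtime`, and `bantime` jail settings. The 10-minute counting window used here is an assumption, since this document does not state one.

```python
from collections import defaultdict, deque

MAX_RETRY = 3        # "3 strikes" from the policy above
BAN_SECONDS = 3600   # "1 hour ban"
FIND_SECONDS = 600   # assumed counting window; not stated in this document


class BanTracker:
    def __init__(self):
        self._failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self._banned_until = {}              # ip -> time at which the ban ends

    def is_banned(self, ip, now):
        return self._banned_until.get(ip, 0) > now

    def record_failure(self, ip, now):
        """Register a failed login; return True if the IP is banned afterwards."""
        if self.is_banned(ip, now):
            return True
        recent = self._failures[ip]
        recent.append(now)
        # Drop failures that have fallen out of the counting window.
        while recent and recent[0] <= now - FIND_SECONDS:
            recent.popleft()
        if len(recent) >= MAX_RETRY:
            self._banned_until[ip] = now + BAN_SECONDS
            recent.clear()
            return True
        return False
```

Timestamps are plain seconds, so the tracker can be driven by `time.time()` in practice or by fixed numbers when reasoning about the policy.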
Credential Management
- API keys: ~/.secure/ (encrypted at rest)
- SSH keys: ~/.ssh/ (ed25519)
- Database passwords: Environment variables only
- GitHub tokens: Stored in git-credentials (local only)
Disaster Recovery
Scenario 1: Container Failure
Recovery Steps:
1. Restore latest Proxmox snapshot (~5 minutes)
2. Verify network connectivity
3. Start Docker services: cd /home/agent/projects/aegis-core && docker compose up -d
4. Verify health endpoints
5. Resume autonomous operations
Expected Downtime: 10-15 minutes
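Step 4, verifying health endpoints, is easy to script. The sketch below uses only the Python standard library; the endpoint URLs are hypothetical placeholders because this document does not list the real health paths, and the probe is injectable so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def failing_services(endpoints, probe=http_probe):
    """Return the sorted names of services whose health check did not pass."""
    return sorted(name for name, url in endpoints.items() if not probe(url))


# Hypothetical endpoints; replace with the real health URLs for each service.
ENDPOINTS = {
    "dashboard": "http://localhost:8080/health",
    "traefik": "http://localhost:8081/ping",
}
```

Calling `failing_services(ENDPOINTS)` after `docker compose up -d` returns an empty list when every service is healthy, which is the signal to move on to step 5.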
Scenario 2: Host Hardware Failure
Recovery Steps:
1. Provision new Hetzner server (1-2 hours)
2. Install Proxmox VE
3. Restore container from Hetzner Storage Box backup
4. Update DNS records (10 minute TTL)
5. Resume operations
Expected Downtime: 2-4 hours
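Part of this downtime is the time needed to pull the backup from the Storage Box. A rough lower bound can be estimated as below, assuming a backup of about 180GB (the typical used storage), decimal units, and the server's 1 Gbit/s link; the 50% efficiency case is an illustrative assumption, not a measured value.

```python
def transfer_seconds(size_gb, link_gbit_per_s, efficiency=1.0):
    """Seconds needed to move size_gb (decimal GB) over the link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    gigabits = size_gb * 8
    return gigabits / (link_gbit_per_s * efficiency)


best_case = transfer_seconds(180, 1.0)                   # full line rate
realistic = transfer_seconds(180, 1.0, efficiency=0.5)   # assumed 50% of line rate

print(f"best case: {best_case / 60:.0f} min, at 50% efficiency: {realistic / 60:.0f} min")
```

Compression or a smaller backup would shorten the transfer, so these numbers are only a planning aid.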
Scenario 3: Data Corruption
Recovery Steps:
1. Identify affected system (database, filesystem, git)
2. Restore from nightly backup
3. Replay transactions from episodic memory logs
4. Validate data integrity
5. Resume operations
Expected Downtime: 30 minutes to 2 hours
Performance Characteristics
Network Latency
| Destination | Latency | Bandwidth |
|---|---|---|
| Hetzner Cloud (same DC) | ~0.5ms | 1 Gbit/s |
| Europe | ~10-30ms | 500+ Mbit/s |
| US East Coast | ~80-100ms | 300+ Mbit/s |
| US West Coast | ~150-170ms | 200+ Mbit/s |
| Asia Pacific | ~200-300ms | 100+ Mbit/s |
Storage Performance
NVMe SSD (RAID-1):
- Sequential Read: 3,500 MB/s
- Sequential Write: 3,000 MB/s
- Random Read (4K): 600K IOPS
- Random Write (4K): 550K IOPS
- Latency: < 100μs
Compute Performance
LLM Inference (Ollama):
- Qwen3:4B: ~40 tokens/second
- Qwen3:30B: ~8 tokens/second
- DeepSeek-R1:32B: ~6 tokens/second
- Moondream (vision): ~15 tokens/second
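These throughput figures translate directly into response latency. The sketch below estimates generation time for a response of a given length; it ignores prompt processing and model load time, and the 480-token response length is an arbitrary example.

```python
# Approximate throughput from the list above, in tokens per second.
TOKENS_PER_SECOND = {
    "Qwen3:4B": 40,
    "Qwen3:30B": 8,
    "DeepSeek-R1:32B": 6,
    "Moondream": 15,
}


def generation_seconds(model, output_tokens):
    """Estimated seconds to generate output_tokens with the given model."""
    return output_tokens / TOKENS_PER_SECOND[model]


for model in TOKENS_PER_SECOND:
    print(f"{model}: {generation_seconds(model, 480):.1f} s for 480 tokens")
```

The gap between the 4B and 30B-class models is roughly a factor of five to seven, which matters when choosing a model for interactive use.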
Database Performance:
- PostgreSQL: 10K queries/second (typical load)
- FalkorDB: 5K graph traversals/second
- Redis cache hit rate: 95%+
Scaling Strategy
Horizontal Scaling (Future)
Current: Single LXC container on single host
Phase 1 (< 6 months):
- Add second LXC for separate services (monitoring, CI/CD)
- Keep Aegis core on primary container
- Shared PostgreSQL, separate Redis instances
Phase 2 (6-12 months):
- Kubernetes cluster on Hetzner Cloud
- Aegis core as StatefulSet
- Managed PostgreSQL (Hetzner)
- Multi-region deployment for resilience
Vertical Scaling (Current Headroom)
CPU: Can allocate up to 32 vCPUs (currently using ~12)
RAM: Can increase to 120GB (currently 110GB)
Storage: Can expand ZFS dataset to 1.5TB
Cost Analysis
Monthly Infrastructure Costs
| Item | Cost | Notes |
|---|---|---|
| Hetzner EX130-R | €89.00 | Bare metal server |
| Hetzner Storage Box (1TB) | €3.81 | Backup storage |
| Cloudflare DNS | €0.00 | Free plan |
| Domain (aegisagent.ai) | €1.00 | Amortized annual cost |
| Total | €93.81 | ~$102 USD/month |
Cost per vCPU: €2.78/month
Cost per GB RAM: €0.82/month
Cost per TB storage: €1.91/month
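The monthly total and the per-vCPU figure can be reproduced from the table. The sketch below assumes the per-vCPU cost divides the server price alone by the 32 allocated vCPUs, which matches the stated €2.78. The per-GB and per-TB figures depend on a cost base and capacity that this document does not state, so they are not reproduced here.

```python
from decimal import Decimal, ROUND_HALF_UP

# Monthly line items from the cost table, in euros.
MONTHLY_COSTS_EUR = {
    "Hetzner EX130-R": Decimal("89.00"),
    "Hetzner Storage Box (1TB)": Decimal("3.81"),
    "Cloudflare DNS": Decimal("0.00"),
    "Domain (aegisagent.ai)": Decimal("1.00"),
}
ALLOCATED_VCPUS = 32


def to_cents(amount):
    """Round a Decimal amount to two places, half up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


total = to_cents(sum(MONTHLY_COSTS_EUR.values(), Decimal("0")))
per_vcpu = to_cents(MONTHLY_COSTS_EUR["Hetzner EX130-R"] / ALLOCATED_VCPUS)

print(f"total: €{total}/month, per vCPU: €{per_vcpu}/month")
```

`Decimal` is used instead of floats so that currency sums stay exact.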
Maintenance Windows
Scheduled Maintenance
Daily (00:00-06:00 UTC):
- Database backups and optimization
- Log rotation and cleanup
- Docker image updates (security patches)
- System package updates (unattended-upgrades)
Weekly (Sunday 02:00 UTC):
- Proxmox snapshot creation
- Full container backup to Storage Box
- Disk usage analysis and cleanup
Monthly (First Sunday 03:00 UTC):
- Kernel updates (requires reboot)
- Certificate renewal checks
- Dependency audit (Python, Node.js)
Emergency Maintenance
Triggers:
- Critical security vulnerabilities (CVSS score of 9.0 or higher)
- Service outages affecting > 50% of functionality
- Data corruption detected
- Resource exhaustion (> 95% RAM or disk)
Notification Channels:
- Discord #alerts channel
- Telegram bot (Chat ID: 1275129801)
- WhatsApp (WABA: +447441443388)
Monitoring and Alerting
Health Checks
Service-Level:
- HTTP health endpoints every 30 seconds
- Database connection pools
- Docker container status
- Disk I/O wait times
System-Level:
- CPU utilization (alert > 90% for 5 minutes)
- RAM usage (alert > 95%)
- Disk usage (alert > 85%)
- Network errors (alert > 1% packet loss)
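The CPU rule differs from the others because it must hold for a sustained period rather than on a single sample. The sketch below shows one way to evaluate both kinds of rule, assuming samples arrive every 30 seconds (the Prometheus scrape interval), so that 5 minutes corresponds to 10 consecutive samples. The class and function names are hypothetical and do not describe the actual alerting configuration.

```python
from collections import deque

SCRAPE_INTERVAL_S = 30
CPU_WINDOW_SAMPLES = (5 * 60) // SCRAPE_INTERVAL_S  # 10 samples cover 5 minutes

# Instantaneous thresholds from the system-level list, in percent.
INSTANT_LIMITS = {"ram": 95.0, "disk": 85.0, "packet_loss": 1.0}


class CpuSustainedAlert:
    """Fires only when every sample in the last 5 minutes exceeds the limit."""

    def __init__(self, limit=90.0, window=CPU_WINDOW_SAMPLES):
        self.limit = limit
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent):
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.limit for s in self.samples)


def instant_alerts(metrics):
    """Return the sorted names of metrics above their instantaneous limit."""
    return sorted(
        name for name, limit in INSTANT_LIMITS.items()
        if metrics.get(name, 0.0) > limit
    )
```

A short CPU spike never fills the window with high samples, so it does not trigger the sustained alert, while a single RAM reading above 95% triggers immediately.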
Metrics Collection
- Prometheus: Metrics scraping (30s interval)
- PostgreSQL: Query logs and slow query analysis
- Docker stats: Container resource usage
- Structlog: JSON-formatted application logs
Dashboards
- Traefik: https://traefik.rbnk.uk (internal)
- Aegis Dashboard: https://aegisagent.ai (public)
- FalkorDB Browser: http://localhost:3001 (internal)
Future Infrastructure Roadmap
Q1 2026: Optimization Phase
- Implement caching layer (Redis Cluster)
- Optimize Ollama model loading (shared memory)
- Add read replicas for PostgreSQL
- Implement rate limiting per API key
Q2 2026: High Availability
- Second Aegis instance (active-passive)
- Load balancer (HAProxy or Traefik Enterprise)
- Distributed logging (Loki)
- Multi-region DNS failover
Q3 2026: Edge Computing
- Edge nodes for regional LLM inference
- CDN for static assets (Cloudflare)
- WebSocket gateway for real-time updates
- GraphQL federation
Q4 2026: Kubernetes Migration
- Migrate to Hetzner Cloud Kubernetes
- Helm charts for all services
- GitOps deployment (ArgoCD)
- Service mesh (Istio or Linkerd)