Infrastructure Architecture

Overview

Project Aegis runs in an unprivileged LXC container on a dedicated Hetzner bare-metal server virtualized with Proxmox VE. A single host provides the compute, storage, and networking for autonomous AI operations.

Hardware Stack

Hetzner EX130-R Dedicated Server

Specifications:
  • CPU: Intel Core i9-14900 (24 cores, 32 threads)
      • Base Clock: 2.0 GHz
      • Boost Clock: Up to 5.8 GHz
      • Cache: 36MB L3
  • RAM: 128GB DDR5-4800 ECC
  • Storage: 2x 1.92TB NVMe SSD (Enterprise Grade)
      • RAID-1 configuration for redundancy
      • Read: ~3,500 MB/s per drive
      • Write: ~3,000 MB/s per drive
  • Network: 1 Gbit/s connection
  • Location: Falkenstein, Germany (FSN1-DC14)

Public IP: 157.180.63.15

Virtualization Layer

Proxmox VE 8.x

Proxmox Virtual Environment provides the hypervisor layer, enabling:
  • LXC container virtualization for Aegis
  • KVM virtual machines for additional services
  • ZFS storage management
  • Snapshot and backup capabilities
  • Web-based management interface

Proxmox Access:
  • Web UI: https://157.180.63.15:8006
  • SSH: Root access via key authentication

Aegis LXC Container

Container Specifications

Container Type: Unprivileged LXC (Linux Container)

Resource Allocation:
  • vCPU: 32 cores (proportional scheduling)
  • RAM: 110GB (of 128GB host total, leaving headroom for the host)
  • Storage: 500GB ZFS dataset
  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Kernel: Linux 6.8.12-14-pve (Proxmox kernel)

Network Configuration:
  • Internal IP: 10.10.10.103/24
  • Gateway: 10.10.10.1 (Proxmox bridge)
  • DNS: 1.1.1.1, 8.8.8.8
  • Tailscale VPN: 100.114.189.93
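The addressing above can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch using only the values listed in this section; the `check_addressing` helper is a hypothetical name, and the check assumes the Tailscale address should fall in the 100.64.0.0/10 shared address space that Tailscale assigns from.

```python
import ipaddress

# Values copied from the network configuration in this section.
container = ipaddress.ip_interface("10.10.10.103/24")
gateway = ipaddress.ip_address("10.10.10.1")
tailscale_ip = ipaddress.ip_address("100.114.189.93")

# Shared address space (RFC 6598); assumed here to be the Tailscale range.
CGNAT = ipaddress.ip_network("100.64.0.0/10")


def check_addressing():
    """Return a dict of named checks mapped to True/False."""
    return {
        "gateway_on_container_subnet": gateway in container.network,
        "container_ip_is_private": container.ip.is_private,
        "tailscale_ip_in_cgnat_range": tailscale_ip in CGNAT,
    }


if __name__ == "__main__":
    for name, ok in check_addressing().items():
        print(f"{name}: {ok}")
```

A failing check would suggest a typo in the container's network definition rather than a routing problem.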

Container Features:
  • Nesting enabled (for Docker)
  • Fuse support enabled
  • Keyctl enabled
  • Unprivileged user namespacing
  • cgroup v2 hierarchy

Container Layout

/home/agent/
├── .claude/           # Claude Code configuration
├── .secure/           # Encrypted credentials
├── memory/            # Persistent memory storage
├── projects/          # Code repositories
│   └── aegis-core/    # Main Aegis codebase
├── stacks/            # Docker Compose stacks
└── downloads/         # Temporary downloads

Resource Monitoring

Current Utilization (Typical)

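| Resource | Allocated | Used | Available |
|----------|-----------|------|-----------|
| CPU | 32 cores | ~8-12 cores | ~20-24 cores |
| RAM | 110GB | ~40-60GB | ~50-70GB |
| Storage | 500GB | ~180GB | ~320GB |
| Network | 1 Gbit/s | ~10-50 Mbit/s | 950+ Mbit/s |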
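The Available column is the allocation minus typical usage. The sketch below reproduces that arithmetic from the figures in the table; the `headroom` function is a hypothetical helper, not part of the Aegis codebase.

```python
def headroom(allocated, used_low, used_high):
    """Return (min_available, max_available) for one resource."""
    return allocated - used_high, allocated - used_low


# (allocated, typical low usage, typical high usage), taken from the table.
resources = {
    "CPU (cores)": (32, 8, 12),
    "RAM (GB)": (110, 40, 60),
    "Network (Mbit/s)": (1000, 10, 50),
}

for name, (allocated, low_use, high_use) in resources.items():
    low, high = headroom(allocated, low_use, high_use)
    print(f"{name}: {low}-{high} available")
```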

Key Services Resource Usage

| Service | CPU | RAM | Storage | Purpose |
|---------|-----|-----|---------|---------|
| FalkorDB | ~2% | ~500MB | ~2GB | Knowledge graph |
| PostgreSQL | ~3% | ~1GB | ~15GB | Relational data |
| Traefik | ~1% | ~100MB | ~50MB | Reverse proxy |
| Dashboard | ~5% | ~800MB | ~100MB | Web interface |
| Ollama | ~10-50% | ~8-16GB | ~50GB | Local LLM inference |
| Docker | ~2% | ~500MB | ~10GB | Container runtime |

Backup Strategy

Automated Backups

Proxmox Snapshots:
  • Daily snapshots of LXC container (last 7 days retained)
  • Weekly full container backups (last 4 weeks retained)
  • Backup destination: Hetzner Storage Box (1TB)
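The retention rules above amount to "keep the newest N, prune the rest". A minimal sketch of that policy follows; it illustrates the logic only and is not how Proxmox implements pruning. The `split_retained` name and the example dates are hypothetical.

```python
from datetime import date, timedelta


def split_retained(taken_on, keep_last):
    """Split backup dates into (kept, pruned), both ordered newest first."""
    ordered = sorted(taken_on, reverse=True)
    return ordered[:keep_last], ordered[keep_last:]


# Hypothetical example data: ten consecutive daily snapshots.
newest = date(2026, 1, 10)
daily = [newest - timedelta(days=n) for n in range(10)]

kept, pruned = split_retained(daily, keep_last=7)
print("kept:  ", [d.isoformat() for d in kept])
print("pruned:", [d.isoformat() for d in pruned])
```

The same function applies to the weekly backups with `keep_last=4`.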

Data Backups:
  • Git repositories: Pushed to GitHub (real-time)
  • Memory files: Daily rsync to Storage Box
  • Database: Nightly pg_dump to /home/agent/backups/
  • Configuration: Tracked in git, pushed hourly
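As an illustration of the nightly database dump, the sketch below builds a dated `pg_dump` command line targeting /home/agent/backups/. The database name `aegis`, the file-naming scheme, and the custom archive format are assumptions; the actual backup script is not shown in this document.

```python
from datetime import date
from pathlib import PurePosixPath

BACKUP_DIR = PurePosixPath("/home/agent/backups")


def pg_dump_command(dbname, day, backup_dir=BACKUP_DIR):
    """Build the argument list for a dated pg_dump in custom archive format."""
    target = backup_dir / f"{dbname}-{day.isoformat()}.dump"
    # --format=custom produces a compressed archive restorable with pg_restore.
    return ["pg_dump", "--format=custom", f"--file={target}", dbname]


# "aegis" is a hypothetical database name used only for illustration.
print(" ".join(pg_dump_command("aegis", date(2026, 1, 5))))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from a scheduled job, with connection credentials supplied through the environment.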

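  • Recovery Time Objective (RTO): < 30 minutes for container-level failures (host hardware failure and data corruption can take longer; see Disaster Recovery)
  • Recovery Point Objective (RPO): < 24 hours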

Security Posture

Container Isolation

  • Unprivileged container: Root inside container != root on host
  • AppArmor profiles: Mandatory Access Control
  • seccomp filters: System call restrictions
  • Namespacing: PID, network, mount, UTS, IPC isolation
  • cgroups: Resource limits enforced

Network Security

  • Firewall: Proxmox firewall + container iptables
  • SSH: Key-only authentication, no password login
  • Fail2ban: Intrusion prevention (3 strikes, 1 hour ban)
  • TLS: All public services behind Traefik with Let's Encrypt

Credential Management

  • API keys: ~/.secure/ (encrypted at rest)
  • SSH keys: ~/.ssh/ (ed25519)
  • Database passwords: Environment variables only
  • GitHub tokens: Stored in git-credentials (local only)

Disaster Recovery

Scenario 1: Container Failure

Recovery Steps:
  1. Restore latest Proxmox snapshot (~5 minutes)
  2. Verify network connectivity
  3. Start Docker services: cd /home/agent/projects/aegis-core && docker compose up -d
  4. Verify health endpoints
  5. Resume autonomous operations
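The "verify health endpoints" step can be automated by polling until every service answers. The following is a sketch under assumptions: the function names are hypothetical, the endpoint URLs are not specified in this document, and the retry count and delay are illustrative defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts.
        return False


def wait_until_healthy(urls, probe=http_ok, attempts=10, delay=3.0,
                       sleep=time.sleep):
    """Poll the URLs until all pass or attempts run out.

    Returns the URLs that are still failing; an empty list means healthy.
    """
    pending = list(urls)
    for attempt in range(attempts):
        # Only re-probe endpoints that have not passed yet.
        pending = [u for u in pending if not probe(u)]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay)
    return pending
```

Calling `wait_until_healthy([...])` with the deployment's real health URLs after `docker compose up -d` returns an empty list once every service is up, so operations resume only on a clean result.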

Expected Downtime: 10-15 minutes

Scenario 2: Host Hardware Failure

Recovery Steps:
  1. Provision new Hetzner server (1-2 hours)
  2. Install Proxmox VE
  3. Restore container from Hetzner Storage Box backup
  4. Update DNS records (10 minute TTL)
  5. Resume operations

Expected Downtime: 2-4 hours

Scenario 3: Data Corruption

Recovery Steps:
  1. Identify affected system (database, filesystem, git)
  2. Restore from nightly backup
  3. Replay transactions from episodic memory logs
  4. Validate data integrity
  5. Resume operations

Expected Downtime: 30 minutes to 2 hours

Performance Characteristics

Network Latency

| Destination | Latency | Bandwidth |
|-------------|---------|-----------|
| Hetzner Cloud (same DC) | ~0.5ms | 1 Gbit/s |
| Europe | ~10-30ms | 500+ Mbit/s |
| US East Coast | ~80-100ms | 300+ Mbit/s |
| US West Coast | ~150-170ms | 200+ Mbit/s |
| Asia Pacific | ~200-300ms | 100+ Mbit/s |

Storage Performance

NVMe SSD (RAID-1):
  • Sequential Read: 3,500 MB/s
  • Sequential Write: 3,000 MB/s
  • Random Read (4K): 600K IOPS
  • Random Write (4K): 550K IOPS
  • Latency: < 100μs

Compute Performance

LLM Inference (Ollama):
  • Qwen3:4B: ~40 tokens/second
  • Qwen3:30B: ~8 tokens/second
  • DeepSeek-R1:32B: ~6 tokens/second
  • Moondream (vision): ~15 tokens/second

Database Performance:
  • PostgreSQL: 10K queries/second (typical load)
  • FalkorDB: 5K graph traversals/second
  • Redis cache hit rate: 95%+
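To put the LLM inference throughput figures in terms of response time, the sketch below converts tokens per second into generation time for a given output length. The 500-token response size is an illustrative assumption, and the estimate ignores prompt processing and model load time.

```python
# Approximate generation rates from the LLM Inference list (tokens/second).
TOKENS_PER_SECOND = {
    "Qwen3:4B": 40,
    "Qwen3:30B": 8,
    "DeepSeek-R1:32B": 6,
    "Moondream": 15,
}


def generation_seconds(model, output_tokens):
    """Estimate seconds needed to generate `output_tokens` tokens."""
    return output_tokens / TOKENS_PER_SECOND[model]


for model in TOKENS_PER_SECOND:
    seconds = generation_seconds(model, 500)
    print(f"{model}: {seconds:.1f} s for a 500-token response")
```

The gap between the 4B and 30B-class models is roughly a factor of five to seven, which matters when choosing a model for latency-sensitive tasks.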

Scaling Strategy

Horizontal Scaling (Future)

Current: Single LXC container on single host

Phase 1 (< 6 months):
  • Add second LXC for separate services (monitoring, CI/CD)
  • Keep Aegis core on primary container
  • Shared PostgreSQL, separate Redis instances

Phase 2 (6-12 months):
  • Kubernetes cluster on Hetzner Cloud
  • Aegis core as StatefulSet
  • Managed PostgreSQL (Hetzner)
  • Multi-region deployment for resilience

Vertical Scaling (Current Headroom)

  • CPU: 32 cores allocated, the host maximum (typical usage ~8-12 cores)
  • RAM: Can increase to 120GB (currently 110GB)
  • Storage: Can expand ZFS dataset to 1.5TB (currently 500GB)

Cost Analysis

Monthly Infrastructure Costs

| Item | Cost | Notes |
|------|------|-------|
| Hetzner EX130-R | €89.00 | Bare metal server |
| Hetzner Storage Box (1TB) | €3.81 | Backup storage |
| Cloudflare DNS | €0.00 | Free plan |
| Domain (aegisagent.ai) | €1.00 | Amortized annual cost |
| Total | €93.81 | ~$102 USD/month |

  • Cost per vCPU: €2.78/month
  • Cost per GB RAM: €0.82/month
  • Cost per TB storage: €1.91/month
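The monthly total and the per-vCPU figure can be reproduced from the table. The sketch below uses `Decimal` to avoid floating-point rounding on currency; it assumes the per-vCPU figure is the server price divided by 32 vCPUs, which matches the quoted value, and `unit_cost` is a hypothetical helper. The basis for the RAM and storage ratios is not stated in this document, so they are not recomputed here.

```python
from decimal import Decimal, ROUND_HALF_UP

# Monthly line items in EUR, copied from the cost table.
MONTHLY = {
    "Hetzner EX130-R": Decimal("89.00"),
    "Hetzner Storage Box (1TB)": Decimal("3.81"),
    "Cloudflare DNS": Decimal("0.00"),
    "Domain (aegisagent.ai)": Decimal("1.00"),
}


def unit_cost(cost, units):
    """Divide a monthly cost by a unit count, rounded to whole cents."""
    return (cost / units).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


total = sum(MONTHLY.values(), Decimal("0"))
per_vcpu = unit_cost(MONTHLY["Hetzner EX130-R"], 32)

print(f"Total: EUR {total}/month")
print(f"Per vCPU: EUR {per_vcpu}/month")
```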

Maintenance Windows

Scheduled Maintenance

Daily (00:00-06:00 UTC):
  • Database backups and optimization
  • Log rotation and cleanup
  • Docker image updates (security patches)
  • System package updates (unattended-upgrades)

Weekly (Sunday 02:00 UTC):
  • Proxmox snapshot creation
  • Full container backup to Storage Box
  • Disk usage analysis and cleanup

Monthly (First Sunday 03:00 UTC):
  • Kernel updates (requires reboot)
  • Certificate renewal checks
  • Dependency audit (Python, Node.js)

Emergency Maintenance

Triggers:
  • Critical security vulnerabilities (CVSS score ≥ 9.0)
  • Service outages affecting > 50% functionality
  • Data corruption detected
  • Resource exhaustion (> 95% RAM/disk)

Notification Channels:
  • Discord #alerts channel
  • Telegram bot (Chat ID: 1275129801)
  • WhatsApp (WABA: +447441443388)

Monitoring and Alerting

Health Checks

Service-Level:
  • HTTP health endpoints every 30 seconds
  • Database connection pools
  • Docker container status
  • Disk I/O wait times

System-Level:
  • CPU utilization (alert > 90% for 5 minutes)
  • RAM usage (alert > 95%)
  • Disk usage (alert > 85%)
  • Network errors (alert > 1% packet loss)
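The CPU rule differs from the others in that it fires only when the threshold is exceeded for a sustained period. One way to express "above 90% for 5 minutes" is a sliding window over samples, sketched below. The class name is hypothetical, and the 30-second sampling interval is an assumption borrowed from the health-check cadence; this is not the alerting rule actually deployed.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when every sample in the window exceeds the limit."""

    def __init__(self, limit, window_seconds, interval_seconds):
        self.limit = limit
        # Number of consecutive samples that span the window.
        self.needed = window_seconds // interval_seconds
        self.samples = deque(maxlen=self.needed)

    def observe(self, value):
        """Record one sample and return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.needed
        return window_full and all(v > self.limit for v in self.samples)


# CPU rule: above 90% for 5 minutes, sampled every 30 seconds (10 samples).
cpu_alert = SustainedThreshold(limit=90, window_seconds=300, interval_seconds=30)
```

A single sample at or below the limit resets the condition, so short spikes from LLM inference do not trigger an alert.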

Metrics Collection

  • Prometheus: Metrics scraping (30s interval)
  • PostgreSQL: Query logs and slow query analysis
  • Docker stats: Container resource usage
  • Structlog: JSON-formatted application logs

Dashboards

  • Traefik: https://traefik.rbnk.uk (internal)
  • Aegis Dashboard: https://aegisagent.ai (public)
  • FalkorDB Browser: http://localhost:3001 (internal)

Future Infrastructure Roadmap

Q1 2026: Optimization Phase

  • Implement caching layer (Redis Cluster)
  • Optimize Ollama model loading (shared memory)
  • Add read replicas for PostgreSQL
  • Implement rate limiting per API key

Q2 2026: High Availability

  • Second Aegis instance (active-passive)
  • Load balancer (HAProxy or Traefik Enterprise)
  • Distributed logging (Loki)
  • Multi-region DNS failover

Q3 2026: Edge Computing

  • Edge nodes for regional LLM inference
  • CDN for static assets (Cloudflare)
  • WebSocket gateway for real-time updates
  • GraphQL federation

Q4 2026: Kubernetes Migration

  • Migrate to Hetzner Cloud Kubernetes
  • Helm charts for all services
  • GitOps deployment (ArgoCD)
  • Service mesh (Istio or Linkerd)