Emergency Runbook¶
Step-by-step procedures for recovering from various failure scenarios.
A. Complete System Loss Recovery¶
Prerequisites¶
- Access to Proxmox host (157.180.63.15)
- Backup archives (PostgreSQL dump, config backup)
- GitHub access for source code
Step 1: Provision New LXC Container¶
# On Proxmox host
pct create 103 local:vztmpl/ubuntu-24.04-standard_24.04-1_amd64.tar.gz \
--hostname aegis \
--memory 112640 \
--cores 16 \
--rootfs local-zfs:200 \
--net0 name=eth0,bridge=vmbr1,ip=10.10.10.103/24,gw=10.10.10.1 \
--features nesting=1 \
--unprivileged 1
pct start 103
Step 2: Install Base Packages¶
# Inside container
apt update && apt upgrade -y
apt install -y \
python3.12 python3.12-venv python3-pip \
docker.io docker-compose-plugin \
postgresql-client \
git curl wget htop vim \
build-essential libpq-dev
# Add user
useradd -m -s /bin/bash agent
usermod -aG docker agent
# Enable Docker
systemctl enable docker
systemctl start docker
Step 3: Restore Configuration¶
# As agent user
su - agent
# Create directory structure
mkdir -p ~/.secure ~/.claude/hooks ~/memory/{episodic,semantic,procedural,journal}
mkdir -p ~/projects ~/stacks ~/downloads ~/logs
# Restore credentials (from encrypted backup)
# CRITICAL: These must be restored from secure backup
# - ~/.secure/.env
# - ~/.secure/stripe*.json
# - ~/.claude.json
# - ~/.claude/settings.json
Step 4: Clone Source Code¶
Step 5: Restore PostgreSQL Database¶
# Install PostgreSQL server
sudo apt install -y postgresql-16 postgresql-16-pgvector
# Create database and user
sudo -u postgres psql << EOF
CREATE USER agent WITH PASSWORD 'agent';
CREATE DATABASE aegis OWNER agent;
GRANT ALL PRIVILEGES ON DATABASE aegis TO agent;
EOF
# Enable pgvector
sudo -u postgres psql -d aegis -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Restore from backup
psql -h localhost -U agent -d aegis < backup/aegis_dump.sql
Step 6: Create Docker Network¶
Step 7: Start Services¶
cd ~/projects/aegis-core
# Build images
docker compose build
# Start services in order
docker compose up -d falkordb
sleep 10 # Wait for FalkorDB
docker compose up -d traefik
sleep 5
docker compose up -d dashboard
sleep 10 # Wait for health check
docker compose up -d scheduler playwright
Step 8: Verify Services¶
# Check container health
docker ps
# Expected output:
# aegis-dashboard ... (healthy)
# aegis-scheduler ... (healthy)
# aegis-playwright ... (healthy)
# falkordb ... (healthy)
# traefik ... Up
# Test endpoints
curl -s http://localhost:8080/health | jq .
curl -s https://aegisagent.ai/health | jq .
Step 9: Restore MCP Configuration¶
# Verify MCP config exists
cat ~/.claude.json | jq '.mcpServers | keys'
# Expected servers:
# discord, telegram, github, google-workspace, docker,
# filesystem, postgres, ollama, playwright, vonage
Step 10: Test Integrations¶
# Test Discord
# Post to #alerts: "System recovered from backup"
# Test database connection
psql -h localhost -U agent -d aegis -c "SELECT COUNT(*) FROM api_keys;"
# Test FalkorDB
redis-cli -p 6379 ping # Should return PONG
Step 11: Resume Operations¶
# Check scheduled jobs
docker logs aegis-scheduler --tail 50
# Run health check
curl https://aegisagent.ai/api/health
# Post recovery notice to Discord
B. Database Recovery Only¶
Step 1: Stop Services¶
Step 2: Backup Current Database (if accessible)¶
Step 3: Restore from Backup¶
# Drop and recreate database
sudo -u postgres psql << EOF
DROP DATABASE IF EXISTS aegis;
CREATE DATABASE aegis OWNER agent;
GRANT ALL PRIVILEGES ON DATABASE aegis TO agent;
EOF
# Enable extensions
sudo -u postgres psql -d aegis -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Restore
psql -h localhost -U agent -d aegis < backup/aegis_dump.sql
Step 4: Run Migrations¶
Step 5: Restart Services¶
C. Container Recovery¶
Dashboard Not Starting¶
# Check logs
docker logs aegis-dashboard --tail 100
# Common issues:
# 1. Database connection failed → Check PostgreSQL running
# 2. Missing env vars → Check .env file
# 3. Port conflict → Check nothing else on 8080
# Rebuild and restart
docker compose build dashboard
docker compose up -d dashboard
Scheduler Not Starting¶
# Check logs
docker logs aegis-scheduler --tail 100
# Common issues:
# 1. Dashboard not healthy → Wait for dashboard
# 2. Playwright not reachable → Start playwright first
# Restart with dependencies
docker compose up -d playwright
sleep 5
docker compose up -d scheduler
FalkorDB Data Loss¶
# Stop services
docker compose stop dashboard scheduler
# Remove corrupted volume
docker volume rm aegis-core_falkordb_data
# Restart (will recreate empty)
docker compose up -d falkordb
# Re-ingest data from transcripts
python scripts/ingest_transcripts.py
D. Configuration Recovery¶
Lost ~/.secure/.env¶
# This file contains all API keys and secrets
# Must be restored from secure backup or regenerated
# Required variables (get new values from each service):
cat > ~/.secure/.env << 'EOF'
# Database
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=agent
POSTGRES_PASSWORD=agent
POSTGRES_DB=aegis
# LLM APIs
ZAI_API_KEY=<get-from-z.ai>
ANTHROPIC_API_KEY=<get-from-anthropic>
PERPLEXITY_API_KEY=<get-from-perplexity>
# Communication
DISCORD_BOT_TOKEN=<get-from-discord-developer-portal>
TELEGRAM_BOT_TOKEN=<get-from-botfather>
VONAGE_API_KEY=<get-from-vonage>
VONAGE_API_SECRET=<get-from-vonage>
VONAGE_APPLICATION_ID=<get-from-vonage>
# Payment
STRIPE_SECRET_KEY=<get-from-stripe>
STRIPE_WEBHOOK_SECRET=<get-from-stripe>
# Email
RESEND_API_KEY=<get-from-resend>
EOF
chmod 600 ~/.secure/.env
Lost ~/.claude.json¶
# MCP server configuration
# Restore from backup or recreate with tokens
# Template structure:
{
"mcpServers": {
"discord": {
"command": "npx",
"args": ["-y", "@aegis/mcp-discord"],
"env": {
"DISCORD_BOT_TOKEN": "<token>"
}
},
// ... other servers
}
}
E. Verification Checklist¶
After any recovery, verify:
# Services running
docker ps | grep healthy
# Database connectivity
psql -h localhost -U agent -d aegis -c "SELECT 1;"
# API responding
curl -s https://aegisagent.ai/health
# MCP servers available
# (verify in Claude Code that tools work)
# Scheduled jobs running
docker logs aegis-scheduler --tail 20 | grep "Job executed"
# Memory accessible
ls ~/memory/journal/
ls ~/memory/semantic/
# Git status clean
git -C ~/projects/aegis-core status
F. Post-Recovery Actions¶
- Document incident in journal
- Post to Discord #alerts with recovery status
- Verify scheduled jobs are running
- Run /status to confirm health
- Check recent Beads tasks weren't lost
- Review backup strategy if data was lost
Last Updated: 2026-01-25 Recovery Time Target: 2 hours