Troubleshooting

Comprehensive guide to diagnosing and resolving common Aegis operational issues.

Container Issues

Container Won't Start

Symptoms:

- Container exits immediately after start
- docker compose up fails
- Container status shows "Restarting"

Diagnosis:

# Check container status
docker ps -a | grep aegis

# View container logs
docker logs aegis-dashboard --tail 100

# Check for port conflicts
netstat -tlnp | grep 8080

# Inspect container configuration
docker inspect aegis-dashboard

Common Causes & Solutions:

1. Port Already in Use:

# Find process using port
sudo lsof -i :8080

# Kill process or change port in docker-compose.yml
docker compose down
# Edit docker-compose.yml to use different port
docker compose up -d

2. Missing Environment Variables:

# Check environment file (lists variable names only, so secrets stay off the terminal)
grep -v '^#' /home/agent/.secure/.env | cut -d= -f1

# Verify variables are loaded
docker exec aegis-dashboard env | grep POSTGRES

# Add missing variables to docker-compose.yml or .env

3. Database Connection Failed:

# Test database connectivity
psql -U agent -d aegis -c "SELECT 1;"

# Check PostgreSQL is running
systemctl status postgresql

# Verify connection settings in docker-compose.yml
docker exec aegis-dashboard python -c "
import os
print('POSTGRES_HOST:', os.getenv('POSTGRES_HOST'))
print('POSTGRES_PORT:', os.getenv('POSTGRES_PORT'))
"

4. Volume Mount Issues:

# Check volume exists
docker volume ls | grep falkordb

# Inspect volume
docker volume inspect falkordb_data

# Recreate volume if corrupted
# WARNING: this permanently deletes all data in the volume; back it up first
docker compose down
docker volume rm falkordb_data
docker compose up -d

Resolution Steps:

# 1. Check logs for the specific error
#    (do this first: docker compose down removes the container and its logs)
docker logs aegis-dashboard --tail 100

# 2. Stop all containers
cd /home/agent/projects/aegis-core && docker compose down

# 3. Fix the underlying issue

# 4. Rebuild and restart
cd /home/agent/projects/aegis-core && docker compose up -d --build

# 5. Verify health
docker ps
curl http://localhost:8080/health
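
A single curl right after a rebuild can race the application's startup. The following is a minimal sketch of a polling helper, assuming the health endpoint used in the steps above (http://localhost:8080/health). The names http_ok and wait_for_health, the attempt count, and the delay are illustrative and not part of Aegis.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP error statuses all land here.
        return False


def wait_for_health(
    check: Callable[[], bool],
    attempts: int = 10,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check() until it succeeds.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)  # no pause after the final attempt
    return 0
```

For example, wait_for_health(lambda: http_ok("http://localhost:8080/health")) returns the attempt on which the endpoint first answered, or 0 if it never did. The check and sleep arguments are injectable so the loop can be exercised without a running container.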

Container Keeps Restarting

Symptoms:

- Container restarts every few seconds/minutes
- Health checks failing
- Application crashes on startup

Diagnosis:

# Check restart count
docker inspect aegis-dashboard --format='{{.RestartCount}}'

# View crash logs
docker logs aegis-dashboard --tail 200

# Check health check status
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq

# Monitor container in real-time
docker logs -f aegis-dashboard
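
The separate inspect calls above can be condensed into one summary. This is a sketch that reads the JSON printed by docker inspect aegis-dashboard. The field names (RestartCount, State.Status, State.ExitCode, State.OOMKilled, State.Health) follow Docker's inspect output, while the function name summarize_inspect and the shape of the returned dictionary are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def summarize_inspect(inspect_json: str) -> Dict[str, Any]:
    """Condense `docker inspect <container>` output into crash-loop indicators."""
    container = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    state = container.get("State", {})
    health = state.get("Health") or {}  # missing when no healthcheck is defined
    return {
        "status": state.get("Status"),
        "restart_count": container.get("RestartCount", 0),
        "exit_code": state.get("ExitCode"),
        "oom_killed": state.get("OOMKilled", False),
        "health": health.get("Status", "none"),
        "failing_streak": health.get("FailingStreak", 0),
    }
```

An exit code of 137 together with oom_killed set to True points toward the memory limit rather than an application error, while a growing failing_streak with a running status points toward the health check itself.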

Common Causes & Solutions:

1. Health Check Failing:

# Test health check manually
docker exec aegis-dashboard python -c "
import urllib.request
try:
    urllib.request.urlopen('http://localhost:8080/health')
    print('✓ Health check passed')
except Exception as e:
    print('✗ Health check failed:', e)
"

# Temporarily disable health check
# Edit docker-compose.yml and comment out healthcheck section
docker compose up -d --force-recreate dashboard

2. Application Error on Startup:

# Check Python errors
docker logs aegis-dashboard 2>&1 | grep -i "error\|exception\|traceback"

# Run container interactively to debug
docker compose run --rm dashboard /bin/bash
# Then, inside the container shell, check that the app imports cleanly
python -c "from aegis.dashboard import app"

3. Memory Limit Exceeded:

# Check container memory usage
docker stats aegis-dashboard --no-stream

# Increase memory limit in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       memory: 4G

Resolution:

# Check Three-Strike Error Protocol
# Search for similar past errors
# (Conceptual - use error MCP tools if available)

# Record the error
# mcp__aegis__error_record(
#     error_type="ContainerCrashLoop",
#     context="Container restarting with health check failure",
#     strike_count=1
# )

# Fix root cause and restart
cd /home/agent/projects/aegis-core && docker compose restart dashboard

Slow Container Performance

Symptoms:

- API responses slow (>1s)
- High CPU/memory usage
- Container unresponsive

Diagnosis:

# Check resource usage
docker stats --no-stream

# Check container processes
docker top aegis-dashboard

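# Check application logs for slow operations (>1000 ms)
# Assumes entries contain duration_ms=<n> or duration_ms: <n>; "+ 0" forces a numeric comparison
docker logs aegis-dashboard 2>&1 | grep "duration_ms" | awk -F'duration_ms"?[=: ]*' '$2 + 0 > 1000'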

# Profile Python app (if needed)
docker exec aegis-dashboard python -m cProfile -o /tmp/profile.stats app.py
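
The log filter above can also be written in Python, which makes the threshold and the accepted formats explicit. This is a sketch under the assumption that log entries carry the duration as duration_ms=1234, duration_ms: 1234, or a JSON field "duration_ms": 1234. The actual Aegis log format is not shown in this guide, so adjust the pattern to match it. The name slow_lines is hypothetical.

```python
import re
from typing import Iterable, List

# Assumed shapes: duration_ms=1234, duration_ms: 1234, "duration_ms": 1234.5
DURATION_RE = re.compile(r'duration_ms"?\s*[:=]\s*(\d+(?:\.\d+)?)')


def slow_lines(lines: Iterable[str], threshold_ms: float = 1000.0) -> List[str]:
    """Return the log lines whose duration_ms value exceeds threshold_ms."""
    slow = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            slow.append(line.rstrip("\n"))
    return slow
```

Passing sys.stdin as the lines argument lets the function consume the output of docker logs aegis-dashboard 2>&1 from a pipe.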

Solutions:

1. Database Bottleneck:

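# Check slow queries (requires the pg_stat_statements extension)
# On PostgreSQL 13+, the columns are named total_exec_time and mean_exec_time
psql -U agent -d aegis -c "
  SELECT query, calls, total_time, mean_time
  FROM pg_stat_statements
  WHERE mean_time > 100
  ORDER BY mean_time DESC
  LIMIT 10;
"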

# Add missing indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_timestamp ON episodic_memory(timestamp DESC);"

# Vacuum database
psql -U agent -d aegis -c "VACUUM ANALYZE;"

2. Too Many Scheduled Jobs:

# Check running jobs
docker exec aegis-scheduler python -c "
from aegis.scheduler import scheduler
print([job.name for job in scheduler.scheduler.get_jobs()])
"

# Disable non-essential jobs temporarily
# Edit aegis/scheduler.py and comment out jobs
docker compose restart scheduler

3. Memory Leak:

# Monitor memory over time
watch -n 5 'docker stats aegis-dashboard --no-stream'

# Restart container to reclaim memory
docker compose restart dashboard

Database Issues

Connection Failures

Symptoms:

- "could not connect to server" errors
- Health check fails with database error
- Application can't reach PostgreSQL

Diagnosis:

# Test connection from host
psql -U agent -d aegis -c "SELECT 1;"

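# Test from container (requires the psql client in the image;
# without -h, psql would look for a local socket inside the container)
docker exec aegis-dashboard psql -U agent -h host.docker.internal -d aegis -c "SELECT 1;"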

# Check PostgreSQL status
systemctl status postgresql

# Check PostgreSQL logs
sudo tail -100 /var/log/postgresql/postgresql-*.log

# Verify port is listening
netstat -tlnp | grep 5432

Common Causes & Solutions:

1. PostgreSQL Not Running:

# Start PostgreSQL
sudo systemctl start postgresql

# Enable auto-start
sudo systemctl enable postgresql

# Check status
systemctl status postgresql

2. Connection Pool Exhausted:

# Check active connections
psql -U agent -d aegis -c "
  SELECT count(*) as connections,
         max_connections
  FROM pg_stat_activity,
       (SELECT setting::int as max_connections FROM pg_settings WHERE name='max_connections') mc
  GROUP BY max_connections;
"

# Kill idle connections
psql -U agent -d aegis -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < NOW() - INTERVAL '1 hour';
"

# Increase max_connections (if needed)
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql

3. Authentication Failed:

# Check pg_hba.conf
sudo cat /etc/postgresql/*/main/pg_hba.conf | grep -v "^#"

# Allow local connections
# Should have line: local   all   all   trust
# Or: host    all   all   127.0.0.1/32   trust

# Reload PostgreSQL
sudo systemctl reload postgresql

4. Docker Networking Issue:

# Check container can reach host
docker exec aegis-dashboard ping -c 3 host.docker.internal

# Verify extra_hosts in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'

# Recreate network
docker compose down
docker network prune -f
docker compose up -d

Slow Queries

Symptoms:

- API endpoints time out
- Database CPU usage high
- Long query execution times

Diagnosis:

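# Enable pg_stat_statements
# (the module must also be listed in shared_preload_libraries, which requires a PostgreSQL restart)
psql -U agent -d aegis -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"

# Find slow queries (mean time above 100 ms)
# On PostgreSQL 13+, the columns are named total_exec_time, mean_exec_time, and max_exec_time
psql -U agent -d aegis -c "
  SELECT
    query,
    calls,
    total_time,
    mean_time,
    max_time
  FROM pg_stat_statements
  WHERE mean_time > 100
  ORDER BY mean_time DESC
  LIMIT 20;
"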

# Check for locks
psql -U agent -d aegis -c "
  SELECT
    pid,
    usename,
    pg_blocking_pids(pid) as blocked_by,
    query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;
"

# Check active queries
psql -U agent -d aegis -c "
  SELECT
    pid,
    now() - query_start as duration,
    query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC;
"

Solutions:

1. Add Indexes:

# List high-cardinality columns as index candidates (a heuristic; it does not prove an index is missing)
psql -U agent -d aegis -c "
  SELECT
    schemaname,
    tablename,
    attname,
    n_distinct
  FROM pg_stats
  WHERE schemaname = 'public'
    AND n_distinct > 100
  ORDER BY n_distinct DESC;
"

# Create indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_event_type ON episodic_memory(event_type);"
psql -U agent -d aegis -c "CREATE INDEX idx_workflow_created_at ON workflow_runs(created_at DESC);"

2. Optimize Query:

# Use EXPLAIN ANALYZE to see query plan
psql -U agent -d aegis -c "
  EXPLAIN ANALYZE
  SELECT * FROM episodic_memory
  WHERE event_type = 'decision'
  ORDER BY timestamp DESC
  LIMIT 100;
"

# Rewrite inefficient queries in code

3. Kill Long-Running Query:

# Get query PID
psql -U agent -d aegis -c "
  SELECT pid, query_start, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND query_start < NOW() - INTERVAL '10 minutes';
"

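# Terminate query (replace 12345 with the pid returned above;
# pg_cancel_backend(pid) is the gentler option and only cancels the current query)
psql -U agent -d aegis -c "SELECT pg_terminate_backend(12345);"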

Database Corruption

Symptoms:

- "database is corrupted" errors
- Unexpected data loss
- Cannot connect to database

Diagnosis:

# Check that the database is reachable and report its size
psql -U agent -d aegis -c "
  SELECT datname, pg_database_size(datname)
  FROM pg_database
  WHERE datname = 'aegis';
"

# Report table sizes
psql -U agent -d aegis -c "
  SELECT tablename, pg_relation_size(tablename::regclass)
  FROM pg_tables
  WHERE schemaname = 'public';
"

# Read every table end to end; a read error here points to damaged data
pg_dump -U agent -d aegis > /dev/null

# Check for errors in logs
sudo tail -500 /var/log/postgresql/postgresql-*.log | grep -i "error\|corrupt"

Recovery Steps:

1. Try REINDEX:

# Reindex database
psql -U agent -d aegis -c "REINDEX DATABASE aegis;"

# If fails, reindex tables individually
psql -U agent -d aegis -c "REINDEX TABLE episodic_memory;"

2. Restore from Backup:

# Locate the latest backup first; abort before anything is dropped if none exists
LATEST_BACKUP=$(ls -t /home/agent/logs/backup/aegis_*.dump 2>/dev/null | head -1)
: "${LATEST_BACKUP:?no backup found in /home/agent/logs/backup}"

# Stop containers
cd /home/agent/projects/aegis-core && docker compose down

# Backup current database (just in case)
pg_dump -U agent -d aegis -F c -f /tmp/aegis_corrupt_$(date +%Y%m%d).dump

# Drop and recreate database
dropdb -U agent aegis
createdb -U agent aegis

# Restore from latest backup
pg_restore -U agent -d aegis "$LATEST_BACKUP"

# Restart containers
cd /home/agent/projects/aegis-core && docker compose up -d

# Verify data
psql -U agent -d aegis -c "SELECT count(*) FROM episodic_memory;"

3. Last Resort - Reinstall PostgreSQL:

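# Backup all databases (this may fail partway if the corruption is severe)
# The file name is stored once so the restore step uses the same path even if the date changes
DUMP_FILE=/tmp/pg_dumpall_$(date +%Y%m%d).sql
sudo -u postgres pg_dumpall > "$DUMP_FILE"

# Reinstall PostgreSQL (the pattern is quoted so the shell does not expand it)
sudo apt remove --purge postgresql 'postgresql-*'
sudo apt autoremove
sudo apt install postgresql

# Restore databases
sudo -u postgres psql -f "$DUMP_FILE"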

Certificate Issues

Certificate Expired

Symptoms:

- "certificate has expired" errors
- HTTPS not working
- Browser shows security warning

Diagnosis:

# Check certificate expiry
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates

# Check Traefik logs for cert renewal
docker logs traefik | grep -i "certificate\|acme"

# Review certificate issuance history on crt.sh to gauge Let's Encrypt rate-limit usage
curl -s "https://crt.sh/?q=aegisagent.ai&output=json" | jq -r '.[].not_after' | sort | uniq -c
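
The notBefore/notAfter lines printed by the openssl command above are easier to act on as a number of days. This is a small sketch that parses that output, assuming the usual openssl date format (for example notAfter=Jun  1 12:00:00 2025 GMT) and English month names. The function name days_until_expiry is illustrative.

```python
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(dates_output: str, now: Optional[datetime] = None) -> int:
    """Return whole days until notAfter in `openssl x509 -noout -dates` output.

    A negative result means the certificate has already expired.
    """
    now = now or datetime.now(timezone.utc)
    for line in dates_output.splitlines():
        if line.startswith("notAfter="):
            value = line.split("=", 1)[1].strip()
            # Drop the trailing zone name and attach UTC explicitly.
            if value.endswith(" GMT"):
                value = value[:-4]
            expires = datetime.strptime(value, "%b %d %H:%M:%S %Y")
            return (expires.replace(tzinfo=timezone.utc) - now).days
    raise ValueError("no notAfter line found")
```

A result below some chosen margin (30 days is a common choice, not an Aegis setting) is a reasonable trigger for checking the Traefik ACME logs before the certificate actually lapses.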

Solutions:

1. Renew Certificate Manually:

# Trigger ACME renewal (if using Traefik)
docker restart traefik

# Wait for renewal (check logs)
docker logs -f traefik | grep certificate

# Verify new certificate
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates

2. Certificate Rate Limited:

# Use staging Let's Encrypt temporarily
# (staging certificates are not trusted by browsers; use them only to test issuance)
# Edit Traefik configuration to use staging server
# certificatesResolvers.cf.acme.caServer: https://acme-staging-v02.api.letsencrypt.org/directory

# Wait for the rate limit to reset (7 days, i.e. 168 hours, for duplicate certificates)

3. Certificate Not Issued:

# Check DNS resolution
dig aegisagent.ai +short

# Check HTTP-01 challenge path is accessible
curl -v http://aegisagent.ai/.well-known/acme-challenge/test

# Check Cloudflare settings (if using CF)
# Ensure proxy is disabled for ACME challenges

Network Issues

Service Unreachable

Symptoms:

- Cannot access https://aegisagent.ai
- Timeouts when connecting
- DNS resolution fails

Diagnosis:

# Test DNS resolution
dig aegisagent.ai +short
nslookup aegisagent.ai

# Test connectivity to server
ping -c 5 157.180.63.15
traceroute aegisagent.ai

# Test ports are open
telnet aegisagent.ai 443
nc -zv aegisagent.ai 443

# Check from inside LXC
curl -v http://localhost:8080/health
curl -v https://aegisagent.ai/health

Common Causes & Solutions:

1. Traefik Not Running:

# Check Traefik on Dockerhost (10.10.10.10)
ssh dockerhost "docker ps | grep traefik"

# Check Traefik on Aegis LXC (10.10.10.103)
docker ps | grep traefik

# Restart Traefik
docker restart traefik

# Check Traefik dashboard (if enabled)
curl http://localhost:8080/dashboard/

2. Firewall Blocking:

# Check iptables rules
sudo iptables -L -n | grep 8080

# Check firewalld (if installed)
sudo firewall-cmd --list-all

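# Allow port through firewall
# (-I inserts the rule at the top so it precedes any DROP/REJECT rule; it does not persist across reboots)
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT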

3. DNS Not Propagated:

# Check DNS from different nameservers
dig @8.8.8.8 aegisagent.ai
dig @1.1.1.1 aegisagent.ai

# Check Cloudflare DNS settings
# Log into Cloudflare dashboard and verify A record points to 157.180.63.15

4. Traefik Configuration Error:

# Check Traefik labels
docker inspect aegis-dashboard | jq '.[0].Config.Labels' | grep traefik

# Check Traefik routing
ssh dockerhost "cat /srv/dockerdata/traefik/dynamic/aegis-passthrough.yml"

# Verify SNI passthrough configuration
# Should include aegisagent.ai in TCP router rule

Slow Network Performance

Symptoms:

- API requests slow despite low server load
- High latency
- Intermittent timeouts

Diagnosis:

# Test latency
ping -c 100 aegisagent.ai | tail -1

# Test bandwidth
curl -o /dev/null https://aegisagent.ai/static/large-file.bin

# Check network stats
docker exec aegis-dashboard netstat -s

# Monitor network traffic
docker stats --no-stream --format "table {{.Name}}\t{{.NetIO}}"

Solutions:

1. Network Congestion:

# Check host network usage
iftop -i eth0

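# Set the network MTU explicitly (if needed)
# A mismatched MTU can cause fragmentation and slow transfers; this setting does not limit bandwidth
# Add to docker-compose.yml:
# networks:
#   traefik_proxy:
#     driver: bridge
#     driver_opts:
#       com.docker.network.driver.mtu: 1500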

2. DNS Resolution Slow:

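# Use faster DNS
# (where systemd-resolved or a DHCP client manages /etc/resolv.conf, this change may be overwritten)
sudo tee /etc/resolv.conf <<EOF
nameserver 1.1.1.1
nameserver 8.8.8.8
EOF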

# Or in docker-compose.yml:
# dns:
#   - 1.1.1.1
#   - 8.8.8.8

Memory Issues

Out of Memory

Symptoms:

- Container killed by OOM killer
- Application crashes
- System becomes unresponsive

Diagnosis:

# Check system memory
free -h

# Check container memory limits (value is in bytes; 0 means no limit is set)
docker inspect aegis-dashboard --format='{{.HostConfig.Memory}}'

# Check OOM kills in logs
dmesg | grep -i "out of memory\|oom"

# Check which process is using memory
docker exec aegis-dashboard ps aux --sort=-%mem | head -10

# Monitor memory usage over time
docker stats aegis-dashboard
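
The dmesg output above can be long. This is a sketch for pulling out just the processes the kernel killed, assuming the common kernel message form "Killed process <pid> (<command>)". The exact wording can vary between kernel versions, and the function name oom_victims is hypothetical.

```python
import re
from typing import List, Tuple

# Matches kernel lines such as:
#   Out of memory: Killed process 4242 (python) total-vm:...
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_victims(dmesg_text: str) -> List[Tuple[int, str]]:
    """Return (pid, command) for every OOM kill found in dmesg output."""
    return [(int(pid), name) for pid, name in KILLED_RE.findall(dmesg_text)]
```

If the same command shows up repeatedly, raising the container limit or looking for a leak in that process is usually more productive than restarting it again.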

Solutions:

1. Increase Container Memory Limit:

# Edit docker-compose.yml
services:
  dashboard:
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G

2. Find Memory Leak:

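# Profile Python memory (requires objgraph in the container;
# this inspects a fresh interpreter, not the running application process)
docker exec aegis-dashboard python -c "
import gc
import objgraph
gc.collect()  # drop unreachable objects first so only live ones are counted
objgraph.show_most_common_types(limit=20)
"

# Use memory_profiler (requires the package; reports line by line for functions decorated with @profile)
docker exec aegis-dashboard python -m memory_profiler app.py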

3. Restart Container to Reclaim Memory:

cd /home/agent/projects/aegis-core && docker compose restart dashboard

4. Clear Caches:

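# Clear Python bytecode caches (frees disk space, not RAM)
find /home/agent/projects/aegis-core -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null

# Drop the kernel page cache (safe, but expect slower disk reads until the cache warms up again)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches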

Disk Space Issues

Disk Full

Symptoms:

- "No space left on device" errors
- Cannot write to disk
- Database writes fail

Diagnosis:

# Check disk usage
df -h

# Find largest directories
du -sh /home/agent/* | sort -rh | head -10
du -sh /var/lib/docker/* | sort -rh | head -10

# Find largest files
find /home/agent -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20

# Check Docker disk usage
docker system df -v

Emergency Cleanup:

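# Clean Docker
# WARNING: removes all stopped containers, unused images, and unused volumes;
# make sure any data volume you need is backed up or attached to a running container
docker system prune -a --volumes -f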

# Remove old backups
find /home/agent/logs/backup -name "*.dump" -mtime +7 -delete
find /home/agent/logs/backup -name "*.tar.gz" -mtime +7 -delete

# Truncate large logs
truncate -s 0 /home/agent/logs/*.log
for container in $(docker ps -q); do
    # docker ps -q prints short IDs, so ask Docker for the real log path
    log_file=$(docker inspect --format='{{.LogPath}}' "$container")
    [ -n "$log_file" ] && sudo truncate -s 0 "$log_file"
done

# Remove old journal entries
find /home/agent/memory/journal -name "*.md" -mtime +90 -delete

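# Vacuum database
# (VACUUM FULL rewrites each table, so it needs temporary free space and locks tables while it runs)
psql -U agent -d aegis -c "VACUUM FULL;"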

# Clean package caches
sudo apt clean
pip cache purge

Inode Exhaustion

Symptoms:

- "No space left on device" despite disk space available
- Cannot create new files

Diagnosis:

# Check inode usage
df -i

# Find directories with many files
for dir in /home/agent/*; do
    echo "$dir: $(find "$dir" -type f | wc -l) files"
done
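
The df -i table can be checked automatically. This is a sketch that flags filesystems whose inode usage is at or above a threshold, assuming the GNU df -i column layout (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on). The function name inode_hotspots and the 90% default are illustrative choices.

```python
from typing import List, Tuple


def inode_hotspots(df_output: str, threshold: int = 90) -> List[Tuple[str, int]]:
    """Return (mount point, IUse%) for filesystems at or above threshold."""
    hot = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # blank line, or "-" for filesystems without inode counts
        used = int(fields[4].rstrip("%"))
        if used >= threshold:
            hot.append((fields[5], used))
    return hot
```

Feeding it the text of df -i gives a short list of mount points to inspect with the directory loop above.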

Solutions:

# Clean up __pycache__ directories
find /home/agent -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null

# Remove old log files
find /home/agent/logs -name "*.log.*" -delete

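# Consolidate small files into one archive and remove the originals
# (compressing each file individually with gzip would not free any inodes)
cd /home/agent/memory/episodic && \
  find . -name "*.json" -print0 | \
  tar -czf "../episodic_$(date +%Y%m%d).tar.gz" --null -T - --remove-files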

Three-Strike Debug Protocol

Implementation

When encountering repeated failures, follow this protocol:

Strike 1: Retry with Modified Approach

# Search for similar errors
# mcp__aegis__error_search(query="ContainerRestartLoop")

# Record the error
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Dashboard container restarting, health check failing",
#     strike_count=1,
#     severity="warning"
# )

# Try alternative approach
# e.g., disable health check, check logs, test manually

Strike 2: Escalate to Local Reasoning

# Record escalation
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Second attempt failed, analyzing with local model",
#     strike_count=2,
#     severity="error"
# )

# Use local model for deep analysis
# Analyze from first principles
# Check similar past resolutions

Strike 3: STOP and Escalate

# Record critical error
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Third consecutive failure, halting attempts",
#     strike_count=3,
#     severity="critical"
# )

# Document in journal
echo "## Critical Error: ContainerRestartLoop" >> ~/memory/journal/$(date +%Y-%m-%d).md
echo "Three strikes reached. Awaiting human intervention." >> ~/memory/journal/$(date +%Y-%m-%d).md

# Alert to Discord
# mcp__discord__discord_send_message(
#     channel_id="1455049130614329508",
#     content="🚨 **Three-Strike Protocol Triggered**\n\nError: ContainerRestartLoop\nStrikes: 3\nStatus: HALTED - Awaiting Human Input"
# )

# STOP - Do not retry
# Wait for human intervention
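
The three strikes above amount to a small state machine. The sketch below keeps the strike count in memory and returns the severity and next action for each failure, using the severities from the steps above (warning, error, critical). StrikeTracker is a hypothetical helper for illustration only; it is not the Aegis MCP error tooling and does not call it.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Severity and next action per strike, taken from the protocol steps above.
SEVERITY = {1: "warning", 2: "error", 3: "critical"}
ACTION = {
    1: "retry with modified approach",
    2: "escalate to local reasoning",
    3: "stop and escalate to human",
}


@dataclass
class StrikeTracker:
    """Hypothetical in-memory strike counter for one error type."""

    error_type: str
    strikes: int = 0

    def record_failure(self) -> Dict[str, Any]:
        """Register one more failure and return what the protocol prescribes."""
        self.strikes = min(self.strikes + 1, 3)  # the count never exceeds 3
        return {
            "error_type": self.error_type,
            "strike_count": self.strikes,
            "severity": SEVERITY[self.strikes],
            "action": ACTION[self.strikes],
            "halt": self.strikes >= 3,
        }

    def reset(self) -> None:
        """Clear the strikes once the underlying issue is fixed."""
        self.strikes = 0
```

Once halt is True the caller should stop retrying; further calls keep reporting strike 3 rather than counting higher, which mirrors the "STOP - Do not retry" rule.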

Common Error Messages

"Health check failed: connection refused"

Cause: Application not listening on expected port

Fix:

# Check if port is listening inside container
docker exec aegis-dashboard netstat -tlnp | grep 8080

# Check application logs
docker logs aegis-dashboard | grep -i "listening\|started"

# Verify port binding in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].NetworkSettings.Ports'

"Database connection refused"

Cause: PostgreSQL not accessible from container

Fix:

# Test from container
docker exec aegis-dashboard psql -U agent -h host.docker.internal -d aegis -c "SELECT 1;"

# Verify extra_hosts
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'

# Check PostgreSQL allows connections
sudo cat /etc/postgresql/*/main/postgresql.conf | grep listen_addresses

"Permission denied" errors

Cause: File/directory permissions incorrect

Fix:

# Fix ownership
sudo chown -R agent:agent /home/agent/memory
sudo chown -R agent:agent /home/agent/logs

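# Fix permissions
chmod 755 /home/agent/memory
# find is used because ** only recurses when bash globstar is enabled
find /home/agent/memory -type f -name "*.md" -exec chmod 644 {} +
chmod 700 /home/agent/.secure
# find also covers dotfiles such as .env, which the * glob skips
find /home/agent/.secure -type f -exec chmod 600 {} +

"Too many open files"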

Cause: File descriptor limit reached

Fix:

# Check current limits
ulimit -n

# Increase limit temporarily
ulimit -n 4096

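# Increase limit permanently
echo "* soft nofile 4096" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 8192" | sudo tee -a /etc/security/limits.conf

# Restart session to apply

# limits.conf does not apply to containers; for a containerized process,
# set the limit on the service in docker-compose.yml instead:
# ulimits:
#   nofile:
#     soft: 4096
#     hard: 8192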

Best Practices

Check Logs First: Always start by reading container/application logs
Use Three-Strike Protocol: Don't retry indefinitely; escalate after 3 attempts
Document Solutions: Record error resolutions for future reference
Test in Isolation: Isolate the problem (test database separately, etc.)
Backup Before Changes: Always back up before making major changes
Monitor Resources: Keep an eye on CPU, memory, and disk usage
Use Health Checks: Ensure all containers have working health checks
Alert on Failures: Set up alerts for critical errors
Keep It Simple: Prefer simple solutions over complex ones
Learn from Errors: Use error tracking to prevent recurrence

Getting Help

Information to Gather

Before escalating to a human:

Error Message: Exact error message from logs
Container Status: docker ps -a
Resource Usage: docker stats --no-stream
Recent Logs: Last 100 lines from relevant container
Configuration: Relevant parts of docker-compose.yml
Recent Changes: What changed before the issue started
Reproducibility: Can you reproduce the issue consistently?

Escalation Checklist

Checked all logs (container, application, system)
Searched for similar errors (error_search MCP tool)
Tried Strike 1 solutions
Tried Strike 2 solutions (local reasoning)
Documented issue in journal
Sent alert to Discord #alerts
Gathered all diagnostic information
Waiting for human intervention

Related Documentation

Monitoring - Health checks and alerting
Logging - Log management and analysis
Maintenance - Routine maintenance tasks
Backups - Backup and recovery procedures
