Troubleshooting
Comprehensive guide to diagnosing and resolving common Aegis operational issues.
Container Issues
Container Won't Start
Symptoms:
- Container exits immediately after start
- docker compose up fails
- Container status shows "Restarting"
Diagnosis:
# Check container status
docker ps -a | grep aegis
# View container logs
docker logs aegis-dashboard --tail 100
# Check for port conflicts
netstat -tlnp | grep 8080
# Inspect container configuration
docker inspect aegis-dashboard
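The port-conflict check above can also be scripted. The following is a minimal sketch, assuming the dashboard port is 8080 as used throughout this guide; the helper name `port_in_use` is hypothetical and not part of Aegis.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


# Example (hypothetical usage): port_in_use(8080) before running docker compose up
```

A result of True before the container is started suggests another process already holds the port.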
Common Causes & Solutions:
1. Port Already in Use:
# Find process using port
sudo lsof -i :8080
# Kill process or change port in docker-compose.yml
docker compose down
# Edit docker-compose.yml to use different port
docker compose up -d
2. Missing Environment Variables:
# Check environment file
cat /home/agent/.secure/.env
# Verify variables are loaded
docker exec aegis-dashboard env | grep POSTGRES
# Add missing variables to docker-compose.yml or .env
3. Database Connection Failed:
# Test database connectivity
psql -U agent -d aegis -c "SELECT 1;"
# Check PostgreSQL is running
systemctl status postgresql
# Verify connection settings in docker-compose.yml
docker exec aegis-dashboard python -c "
import os
print('POSTGRES_HOST:', os.getenv('POSTGRES_HOST'))
print('POSTGRES_PORT:', os.getenv('POSTGRES_PORT'))
"
4. Volume Mount Issues:
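# Check volume exists (Compose may prefix the name with the project name)
docker volume ls | grep falkordb
# Inspect volume (use the exact name from the listing above)
docker volume inspect falkordb_data
# Recreate volume if corrupted (this permanently deletes the volume's data; back it up first)
docker compose down
docker volume rm falkordb_data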
docker compose up -d
Resolution Steps:
# 1. Stop all containers
cd /home/agent/projects/aegis-core && docker compose down
# 2. Check logs for specific error
docker logs aegis-dashboard
# 3. Fix the underlying issue
# 4. Rebuild and restart
cd /home/agent/projects/aegis-core && docker compose up -d --build
# 5. Verify health
docker ps
curl http://localhost:8080/health
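After a rebuild the container may need a few seconds before /health responds, so a single curl can fail spuriously. The following is a rough sketch of polling the endpoint instead. The function names, attempt count, and delay are illustrative assumptions, and the only thing taken from this guide is the URL http://localhost:8080/health.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar
        return 0


def wait_for_health(url, attempts=10, delay=3.0, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt + 1 != attempts:
            sleep(delay)
    return False


# Example (hypothetical usage): wait_for_health("http://localhost:8080/health")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running container.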
Container Keeps Restarting
Symptoms:
- Container restarts every few seconds/minutes
- Health checks failing
- Application crashes on startup
Diagnosis:
# Check restart count
docker inspect aegis-dashboard --format='{{.RestartCount}}'
# View crash logs
docker logs aegis-dashboard --tail 200
# Check health check status
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq
# Monitor container in real-time
docker logs -f aegis-dashboard
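The JSON printed by the health check inspection above can be condensed into one line. This sketch assumes the usual shape of Docker's health state (Status, FailingStreak, and a Log list whose entries carry ExitCode and Output) and assumes that a container without a health check prints null. The function name is hypothetical.

```python
import json


def summarize_health(health_json: str) -> str:
    """Condense the output of docker inspect --format='{{json .State.Health}}'."""
    health = json.loads(health_json)
    if not health:
        # Assumption: "null" means no health check is configured
        return "no health check configured"
    log = health.get("Log") or []
    last = log[-1] if log else {}
    output = (last.get("Output") or "").strip()
    return (
        f"status={health.get('Status', 'unknown')} "
        f"failing_streak={health.get('FailingStreak', 0)} "
        f"last_exit={last.get('ExitCode')} "
        f"last_output={output!r}"
    )
```

The last log entry is usually the most useful part, since it holds the output of the most recent probe.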
Common Causes & Solutions:
1. Health Check Failing:
# Test health check manually
docker exec aegis-dashboard python -c "
import urllib.request
try:
    urllib.request.urlopen('http://localhost:8080/health')
    print('✓ Health check passed')
except Exception as e:
    print('✗ Health check failed:', e)
"
# Temporarily disable health check
# Edit docker-compose.yml and comment out healthcheck section
docker compose up -d --force-recreate dashboard
2. Application Error on Startup:
# Check Python errors
docker logs aegis-dashboard 2>&1 | grep -i "error\|exception\|traceback"
# Run container interactively to debug
docker compose run --rm dashboard /bin/bash
# Then, inside the container shell, try the import directly
python -c "from aegis.dashboard import app"
3. Memory Limit Exceeded:
# Check container memory usage
docker stats aegis-dashboard --no-stream
# Increase memory limit in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       memory: 4G
Resolution:
# Check Three-Strike Error Protocol
# Search for similar past errors
# (Conceptual - use error MCP tools if available)
# Record the error
# mcp__aegis__error_record(
# error_type="ContainerCrashLoop",
# context="Container restarting with health check failure",
# strike_count=1
# )
# Fix root cause and restart
cd /home/agent/projects/aegis-core && docker compose restart dashboard
Slow Container Performance
Symptoms:
- API responses slow (>1s)
- High CPU/memory usage
- Container unresponsive
Diagnosis:
# Check resource usage
docker stats --no-stream
# Check container processes
docker top aegis-dashboard
# Check application logs for slow operations (duration_ms above 1000)
docker logs aegis-dashboard 2>&1 | grep "duration_ms" | awk -F'duration_ms[^0-9]*' '$2+0 > 1000'
# Profile Python app (if needed)
docker exec aegis-dashboard python -m cProfile -o /tmp/profile.stats app.py
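The slow-operation filter above can also be written in Python, which makes it easier to sort the results. This is a sketch only: the exact log format is not documented here, so the pattern assumes lines contain `duration_ms` followed by a number (for example `duration_ms=1500` or `"duration_ms": 1500`). The function name and threshold are illustrative.

```python
import re

# Matches duration_ms, any separator characters, then an integer or decimal
DURATION_RE = re.compile(r"duration_ms\W*(\d+(?:\.\d+)?)")


def slow_lines(lines, threshold_ms=1000.0):
    """Return (duration, line) pairs above the threshold, slowest first."""
    slow = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) > threshold_ms:
            slow.append((float(match.group(1)), line.rstrip("\n")))
    return sorted(slow, reverse=True)


# Example (hypothetical usage): pipe docker logs into a script that calls
# slow_lines(sys.stdin) and prints each pair.
```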
Solutions:
1. Database Bottleneck:
# Check slow queries (PostgreSQL 13+ column names; on 12 and earlier use total_time and mean_time)
psql -U agent -d aegis -c "
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;
"
# Add missing indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_timestamp ON episodic_memory(timestamp DESC);"
# Vacuum database
psql -U agent -d aegis -c "VACUUM ANALYZE;"
2. Too Many Scheduled Jobs:
# Check running jobs
docker exec aegis-scheduler python -c "
from aegis.scheduler import scheduler
print([job.name for job in scheduler.scheduler.get_jobs()])
"
# Disable non-essential jobs temporarily
# Edit aegis/scheduler.py and comment out jobs
docker compose restart scheduler
3. Memory Leak:
# Monitor memory over time
watch -n 5 'docker stats aegis-dashboard --no-stream'
# Restart container to reclaim memory
docker compose restart dashboard
Database Issues
Connection Failures
Symptoms:
- "could not connect to server" errors
- Health check fails with database error
- Application can't reach PostgreSQL
Diagnosis:
# Test connection from host
psql -U agent -d aegis -c "SELECT 1;"
# Test from container (PostgreSQL runs on the host, so pass the host explicitly)
docker exec aegis-dashboard psql -U agent -h host.docker.internal -d aegis -c "SELECT 1;"
# Check PostgreSQL status
systemctl status postgresql
# Check PostgreSQL logs
sudo tail -100 /var/log/postgresql/postgresql-*.log
# Verify port is listening
netstat -tlnp | grep 5432
Common Causes & Solutions:
1. PostgreSQL Not Running:
# Start PostgreSQL
sudo systemctl start postgresql
# Enable auto-start
sudo systemctl enable postgresql
# Check status
systemctl status postgresql
2. Connection Pool Exhausted:
# Check active connections
psql -U agent -d aegis -c "
SELECT count(*) as connections,
max_connections
FROM pg_stat_activity,
(SELECT setting::int as max_connections FROM pg_settings WHERE name='max_connections') mc
GROUP BY max_connections;
"
# Kill idle connections
psql -U agent -d aegis -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle' AND state_change < NOW() - INTERVAL '1 hour';
"
# Increase max_connections (if needed)
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql
3. Authentication Failed:
# Check pg_hba.conf
sudo cat /etc/postgresql/*/main/pg_hba.conf | grep -v "^#"
# Allow local connections
# Should have line: local all all trust
# Or: host all all 127.0.0.1/32 trust
# Reload PostgreSQL
sudo systemctl reload postgresql
4. Docker Networking Issue:
# Check container can reach host
docker exec aegis-dashboard ping -c 3 host.docker.internal
# Verify extra_hosts in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'
# Recreate network
docker compose down
docker network prune -f
docker compose up -d
Slow Queries
Symptoms:
- API endpoints timeout
- Database CPU usage high
- Long query execution times
Diagnosis:
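# Enable pg_stat_statements (the module must also be listed in shared_preload_libraries, which requires a restart)
psql -U agent -d aegis -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
# Find slow queries (PostgreSQL 13+ column names; on 12 and earlier use total_time, mean_time, max_time)
psql -U agent -d aegis -c "
SELECT
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
"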
# Check for locks
psql -U agent -d aegis -c "
SELECT
pid,
usename,
pg_blocking_pids(pid) as blocked_by,
query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"
# Check active queries
psql -U agent -d aegis -c "
SELECT
pid,
now() - query_start as duration,
query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC;
"
Solutions:
1. Add Indexes:
# List high-cardinality columns, which are candidates for indexes (pg_stats does not report which indexes are missing)
psql -U agent -d aegis -c "
SELECT
schemaname,
tablename,
attname,
n_distinct
FROM pg_stats
WHERE schemaname = 'public'
AND n_distinct > 100
ORDER BY n_distinct DESC;
"
# Create indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_event_type ON episodic_memory(event_type);"
psql -U agent -d aegis -c "CREATE INDEX idx_workflow_created_at ON workflow_runs(created_at DESC);"
2. Optimize Query:
# Use EXPLAIN ANALYZE to see query plan
psql -U agent -d aegis -c "
EXPLAIN ANALYZE
SELECT * FROM episodic_memory
WHERE event_type = 'decision'
ORDER BY timestamp DESC
LIMIT 100;
"
# Rewrite inefficient queries in code
3. Kill Long-Running Query:
# Get query PID
psql -U agent -d aegis -c "
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE state = 'active'
AND query_start < NOW() - INTERVAL '10 minutes';
"
# Terminate query (replace 12345 with a pid from the output above; pg_cancel_backend(pid) is a gentler first step)
psql -U agent -d aegis -c "SELECT pg_terminate_backend(12345);"
Database Corruption
Symptoms:
- "database is corrupted" errors
- Unexpected data loss
- Cannot connect to database
Diagnosis:
# Check that the database is readable and report its size (a basic sanity check, not a full integrity check)
psql -U agent -d aegis -c "
SELECT datname, pg_database_size(datname)
FROM pg_database
WHERE datname = 'aegis';
"
# Check that each table is readable and report its size
psql -U agent -d aegis -c "
SELECT tablename, pg_relation_size(tablename::regclass)
FROM pg_tables
WHERE schemaname = 'public';
"
# Check for errors in logs
sudo tail -500 /var/log/postgresql/postgresql-*.log | grep -i "error\|corrupt"
Recovery Steps:
1. Try REINDEX:
# Reindex database
psql -U agent -d aegis -c "REINDEX DATABASE aegis;"
# If fails, reindex tables individually
psql -U agent -d aegis -c "REINDEX TABLE episodic_memory;"
2. Restore from Backup:
# Stop containers
cd /home/agent/projects/aegis-core && docker compose down
# Backup current database (just in case)
pg_dump -U agent -d aegis -F c -f /tmp/aegis_corrupt_$(date +%Y%m%d).dump
# Drop and recreate database
dropdb -U agent aegis
createdb -U agent aegis
# Restore from latest backup
LATEST_BACKUP=$(ls -t /home/agent/logs/backup/aegis_*.dump | head -1)
pg_restore -U agent -d aegis "$LATEST_BACKUP"
# Restart containers
cd /home/agent/projects/aegis-core && docker compose up -d
# Verify data
psql -U agent -d aegis -c "SELECT count(*) FROM episodic_memory;"
3. Last Resort - Reinstall PostgreSQL:
# Backup all databases (store the path once so the restore uses the same file)
DUMP_FILE=/tmp/pg_dumpall_$(date +%Y%m%d).sql
sudo -u postgres pg_dumpall > "$DUMP_FILE"
# Reinstall PostgreSQL (--purge can remove the data directory, so confirm the dump is complete first)
sudo apt remove --purge postgresql 'postgresql-*'
sudo apt autoremove
sudo apt install postgresql
# Restore databases
sudo -u postgres psql -f "$DUMP_FILE"
Certificate Issues
Certificate Expired
Symptoms:
- "certificate has expired" errors
- HTTPS not working
- Browser shows security warning
Diagnosis:
# Check certificate expiry
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates
# Check Traefik logs for cert renewal
docker logs traefik | grep -i "certificate\|acme"
# List recent issuances from Certificate Transparency logs to estimate Let's Encrypt rate-limit usage
curl -s "https://crt.sh/?q=aegisagent.ai&output=json" | jq -r '.[].not_after' | sort | uniq -c
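The expiry check above prints notBefore= and notAfter= lines. The following sketch turns that output into days remaining, assuming the date format openssl normally prints (for example `Jan  5 09:34:43 2018 GMT`), which the standard library's `ssl.cert_time_to_seconds` parses. The function name is hypothetical.

```python
import re
import ssl
from datetime import datetime, timezone


def days_until_expiry(openssl_dates: str, now: datetime) -> float:
    """Return days between now and the notAfter date in openssl -dates output."""
    match = re.search(r"^notAfter=(.+)$", openssl_dates, re.MULTILINE)
    if match is None:
        raise ValueError("no notAfter= line found")
    # cert_time_to_seconds converts the certificate date to seconds since the epoch
    expires = ssl.cert_time_to_seconds(match.group(1).strip())
    return (expires - now.timestamp()) / 86400


# Example (hypothetical usage):
# days_until_expiry(output_of_openssl, datetime.now(timezone.utc))
```

A negative result means the certificate has already expired.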
Solutions:
1. Renew Certificate Manually:
# Trigger ACME renewal (if using Traefik)
docker restart traefik
# Wait for renewal (check logs)
docker logs -f traefik | grep certificate
# Verify new certificate
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates
2. Certificate Rate Limited:
# Use staging Let's Encrypt temporarily
# Edit Traefik configuration to use staging server
# certificatesResolvers.cf.acme.caServer: https://acme-staging-v02.api.letsencrypt.org/directory
# Wait for the rate limit to reset (the duplicate-certificate limit uses a rolling 7-day, i.e. 168-hour, window)
3. Certificate Not Issued:
# Check DNS resolution
dig aegisagent.ai +short
# Check HTTP-01 challenge path is accessible
curl -v http://aegisagent.ai/.well-known/acme-challenge/test
# Check Cloudflare settings (if using CF)
# Ensure proxy is disabled for ACME challenges
Network Issues
Service Unreachable
Symptoms:
- Cannot access https://aegisagent.ai
- Timeouts when connecting
- DNS resolution fails
Diagnosis:
# Test DNS resolution
dig aegisagent.ai +short
nslookup aegisagent.ai
# Test connectivity to server
ping -c 5 157.180.63.15
traceroute aegisagent.ai
# Test ports are open
telnet aegisagent.ai 443
nc -zv aegisagent.ai 443
# Check from inside LXC
curl -v http://localhost:8080/health
curl -v https://aegisagent.ai/health
Common Causes & Solutions:
1. Traefik Not Running:
# Check Traefik on Dockerhost (10.10.10.10)
ssh dockerhost "docker ps | grep traefik"
# Check Traefik on Aegis LXC (10.10.10.103)
docker ps | grep traefik
# Restart Traefik
docker restart traefik
# Check Traefik dashboard (if enabled)
curl http://localhost:8080/dashboard/
2. Firewall Blocking:
# Check iptables rules
sudo iptables -L -n | grep 8080
# Check firewalld (if installed)
sudo firewall-cmd --list-all
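# Allow port through firewall (-I inserts at the top of the chain; a rule appended with -A after an existing DROP or REJECT rule would never match)
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT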
3. DNS Not Propagated:
# Check DNS from different nameservers
dig @8.8.8.8 aegisagent.ai
dig @1.1.1.1 aegisagent.ai
# Check Cloudflare DNS settings
# Log into Cloudflare dashboard and verify A record points to 157.180.63.15
4. Traefik Configuration Error:
# Check Traefik labels
docker inspect aegis-dashboard | jq '.[0].Config.Labels' | grep traefik
# Check Traefik routing
ssh dockerhost "cat /srv/dockerdata/traefik/dynamic/aegis-passthrough.yml"
# Verify SNI passthrough configuration
# Should include aegisagent.ai in TCP router rule
Slow Network Performance
Symptoms:
- API requests slow despite low server load
- High latency
- Intermittent timeouts
Diagnosis:
# Test latency
ping -c 100 aegisagent.ai | tail -1
# Test bandwidth
curl -o /dev/null https://aegisagent.ai/static/large-file.bin
# Check network stats
docker exec aegis-dashboard netstat -s
# Monitor network traffic
docker stats --no-stream --format "table {{.Name}}\t{{.NetIO}}"
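The latency test above ends with a summary line such as `rtt min/avg/max/mdev = 11.2/12.5/15.75/1.25 ms`. The sketch below extracts the four numbers so they can be compared against a threshold. It assumes the Linux iputils summary format (other ping implementations label the fields differently but use the same slash-separated layout), and the function name is hypothetical.

```python
import re

# Four slash-separated numbers after "=", followed by "ms"
PING_RE = re.compile(r"=\s*([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def parse_ping_summary(line: str) -> dict:
    """Parse the final summary line of ping into millisecond values."""
    match = PING_RE.search(line)
    if match is None:
        raise ValueError("not a ping summary line: %r" % line)
    low, avg, high, dev = (float(value) for value in match.groups())
    return {"min_ms": low, "avg_ms": avg, "max_ms": high, "mdev_ms": dev}
```

A large gap between avg_ms and max_ms, or a high mdev_ms, points to intermittent latency rather than a uniformly slow path.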
Solutions:
1. Network Congestion:
# Check host network usage
iftop -i eth0
# Adjust the network MTU if fragmentation is suspected (this setting does not limit bandwidth)
# Add to docker-compose.yml:
# networks:
#   traefik_proxy:
#     driver: bridge
#     driver_opts:
#       com.docker.network.driver.mtu: 1500
2. DNS Resolution Slow:
# Use faster DNS (on hosts managed by systemd-resolved or DHCP, this file may be regenerated and the change lost)
sudo tee /etc/resolv.conf <<EOF
nameserver 1.1.1.1
nameserver 8.8.8.8
EOF
# Or in docker-compose.yml:
# dns:
#   - 1.1.1.1
#   - 8.8.8.8
Memory Issues
Out of Memory
Symptoms:
- Container killed by OOM killer
- Application crashes
- System becomes unresponsive
Diagnosis:
# Check system memory
free -h
# Check container memory limits
docker inspect aegis-dashboard --format='{{.HostConfig.Memory}}'
# Check OOM kills in logs
dmesg | grep -i "out of memory\|oom"
# Check which process is using memory
docker exec aegis-dashboard ps aux --sort=-%mem | head -10
# Monitor memory usage over time
docker stats aegis-dashboard
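The memory limit check above prints a raw byte count, and Docker reports 0 when no limit is set. This small sketch makes that value easier to read. The function name is hypothetical.

```python
def describe_memory_limit(limit_bytes: int) -> str:
    """Format the value of .HostConfig.Memory for humans."""
    if limit_bytes == 0:
        # Docker uses 0 to mean that no memory limit is configured
        return "unlimited"
    gib = limit_bytes / 1024**3
    if gib >= 1:
        return f"{gib:.2f} GiB"
    return f"{limit_bytes / 1024**2:.0f} MiB"
```

For example, a compose limit of 4G appears as 4294967296 in the inspect output, which this formats as 4.00 GiB.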
Solutions:
1. Increase Container Memory Limit:
# Edit docker-compose.yml
services:
  dashboard:
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G
2. Find Memory Leak:
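# Profile Python memory (objgraph must be installed; this starts a new interpreter,
# so run the same calls inside the application process to inspect its objects)
docker exec aegis-dashboard python -c "
import objgraph
objgraph.show_most_common_types(limit=20)
"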
# Use memory_profiler
docker exec aegis-dashboard python -m memory_profiler app.py
3. Restart Container to Reclaim Memory:
docker compose restart dashboard
4. Clear Caches:
# Clear Python caches
find /home/agent/projects/aegis-core -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null
# Clear system cache (safe)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
Disk Space Issues
Disk Full
Symptoms:
- "No space left on device" errors
- Cannot write to disk
- Database writes fail
Diagnosis:
# Check disk usage
df -h
# Find largest directories
du -sh /home/agent/* | sort -rh | head -10
du -sh /var/lib/docker/* | sort -rh | head -10
# Find largest files
find /home/agent -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20
# Check Docker disk usage
docker system df -v
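The disk checks above can be combined into one script when the same report is needed repeatedly. This is a minimal sketch using only the standard library. The function names are hypothetical, and /home/agent is simply the path used elsewhere in this guide.

```python
import os
import shutil


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def largest_files(root: str, limit: int = 10):
    """Return up to limit (size_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    sizes.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable; skip it
                continue
    return sorted(sizes, reverse=True)[:limit]


# Example (hypothetical usage):
# print(percent_used("/home/agent")); print(largest_files("/home/agent"))
```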
Emergency Cleanup:
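# Clean Docker (removes all unused images and networks; --volumes also deletes unused volumes,
# so confirm that data volumes are attached to a container or backed up first)
docker system prune -a --volumes -f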
# Remove old backups
find /home/agent/logs/backup -name "*.dump" -mtime +7 -delete
find /home/agent/logs/backup -name "*.tar.gz" -mtime +7 -delete
# Truncate large logs
truncate -s 0 /home/agent/logs/*.log
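# docker ps -q prints short IDs; the log directory is named after the full ID, hence --no-trunc
for container in $(docker ps -q --no-trunc); do
  log_file="/var/lib/docker/containers/$container/$container-json.log"
  sudo truncate -s 0 "$log_file" 2>/dev/null
done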
# Remove old journal entries
find /home/agent/memory/journal -name "*.md" -mtime +90 -delete
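# Vacuum database (VACUUM FULL rewrites each table, so it needs free space for the copy and locks the table while it runs)
psql -U agent -d aegis -c "VACUUM FULL;"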
# Clean package caches
sudo apt clean
pip cache purge
Inode Exhaustion
Symptoms:
- "No space left on device" despite disk space available
- Cannot create new files
Diagnosis:
# Check inode usage
df -i
# Find directories with many files
for dir in /home/agent/*; do
  echo "$dir: $(find "$dir" -type f | wc -l) files"
done
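The inode checks above can also be done from Python. This sketch reads inode totals with `os.statvfs` and counts files per directory. The function names are hypothetical, and some filesystems report zero total inodes, in which case the percentage is returned as None.

```python
import os


def inode_usage_percent(path: str = "/"):
    """Return the percentage of inodes in use, or None if the filesystem does not report totals."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None
    return 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files


def count_files(root: str) -> int:
    """Count regular directory entries (files) under root, recursively."""
    return sum(len(filenames) for _dir, _subdirs, filenames in os.walk(root))


# Example (hypothetical usage):
# for entry in os.scandir("/home/agent"):
#     if entry.is_dir():
#         print(entry.path, count_files(entry.path))
```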
Solutions:
# Clean up __pycache__ directories
find /home/agent -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null
# Remove old log files
find /home/agent/logs -name "*.log.*" -delete
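# Consolidate small files into one archive (compressing each file separately would not free any inodes;
# the archive path below is only an example)
find /home/agent/memory/episodic -name "*.json" -print0 | tar -czf /home/agent/memory/episodic-archive-$(date +%Y%m%d).tar.gz --null -T - --remove-files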
Three-Strike Debug Protocol
Implementation
When encountering repeated failures, follow this protocol:
Strike 1: Retry with Modified Approach
# Search for similar errors
# mcp__aegis__error_search(query="ContainerRestartLoop")
# Record the error
# mcp__aegis__error_record(
# error_type="ContainerRestartLoop",
# context="Dashboard container restarting, health check failing",
# strike_count=1,
# severity="warning"
# )
# Try alternative approach
# e.g., disable health check, check logs, test manually
Strike 2: Escalate to Local Reasoning
# Record escalation
# mcp__aegis__error_record(
# error_type="ContainerRestartLoop",
# context="Second attempt failed, analyzing with local model",
# strike_count=2,
# severity="error"
# )
# Use local model for deep analysis
# Analyze from first principles
# Check similar past resolutions
Strike 3: STOP and Escalate
# Record critical error
# mcp__aegis__error_record(
# error_type="ContainerRestartLoop",
# context="Third consecutive failure, halting attempts",
# strike_count=3,
# severity="critical"
# )
# Document in journal
echo "## Critical Error: ContainerRestartLoop" >> ~/memory/journal/$(date +%Y-%m-%d).md
echo "Three strikes reached. Awaiting human intervention." >> ~/memory/journal/$(date +%Y-%m-%d).md
# Alert to Discord
# mcp__discord__discord_send_message(
# channel_id="1455049130614329508",
# content="🚨 **Three-Strike Protocol Triggered**\n\nError: ContainerRestartLoop\nStrikes: 3\nStatus: HALTED - Awaiting Human Input"
# )
# STOP - Do not retry
# Wait for human intervention
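The escalation rules above can be sketched as a small state tracker. This is an illustration only, not the Aegis implementation: the class and method names are hypothetical, and the severity and action strings are copied from the three strikes described in this section. Recording to the error MCP tools and sending alerts are left out.

```python
from dataclasses import dataclass, field
from typing import Dict

MAX_STRIKES = 3
SEVERITY = {1: "warning", 2: "error", 3: "critical"}
ACTION = {
    1: "retry with a modified approach",
    2: "escalate to local reasoning",
    3: "stop and wait for human intervention",
}


@dataclass
class StrikeTracker:
    """Counts consecutive failures per error type and maps them to an action."""

    strikes: Dict[str, int] = field(default_factory=dict)

    def record(self, error_type: str) -> dict:
        # The count is capped so repeated calls after strike 3 stay halted
        count = min(self.strikes.get(error_type, 0) + 1, MAX_STRIKES)
        self.strikes[error_type] = count
        return {
            "error_type": error_type,
            "strike_count": count,
            "severity": SEVERITY[count],
            "action": ACTION[count],
            "halted": count >= MAX_STRIKES,
        }

    def resolve(self, error_type: str) -> None:
        """Clear the counter once the root cause is fixed."""
        self.strikes.pop(error_type, None)
```

Callers would check the halted flag before every retry and stop as soon as it is True.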
Common Error Messages
"Health check failed: connection refused"
Cause: Application not listening on expected port
Fix:
# Check if port is listening inside container
docker exec aegis-dashboard netstat -tlnp | grep 8080
# Check application logs
docker logs aegis-dashboard | grep -i "listening\|started"
# Verify port binding in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].NetworkSettings.Ports'
"Database connection refused"
Cause: PostgreSQL not accessible from container
Fix:
# Test from container
docker exec aegis-dashboard psql -U agent -h host.docker.internal -d aegis -c "SELECT 1;"
# Verify extra_hosts
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'
# Check PostgreSQL allows connections
sudo cat /etc/postgresql/*/main/postgresql.conf | grep listen_addresses
"Permission denied" errors
Cause: File/directory permissions incorrect
Fix:
# Fix ownership
sudo chown -R agent:agent /home/agent/memory
sudo chown -R agent:agent /home/agent/logs
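# Fix permissions (find avoids relying on the ** glob and also matches dotfiles such as .env)
chmod 755 /home/agent/memory
find /home/agent/memory -type f -name "*.md" -exec chmod 644 {} +
chmod 700 /home/agent/.secure
find /home/agent/.secure -type f -exec chmod 600 {} +
"Too many open files"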
Cause: File descriptor limit reached
Fix:
# Check current limits
ulimit -n
# Increase limit temporarily
ulimit -n 4096
# Increase limit permanently
echo "* soft nofile 4096" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 8192" | sudo tee -a /etc/security/limits.conf
# Restart session to apply
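The limits above apply to the shell that runs `ulimit`, so it can help to check what a Python process actually sees. This sketch reports the soft and hard descriptor limits and the number of descriptors currently open. It assumes a Linux system (it reads /proc/self/fd), and the function name is hypothetical.

```python
import os
import resource


def fd_report() -> dict:
    """Return the file descriptor limits and current usage for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Each entry in /proc/self/fd is one open descriptor (Linux only)
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return {"soft_limit": soft, "hard_limit": hard, "open_fds": open_fds}


# Example (hypothetical usage), run inside the container:
# docker exec aegis-dashboard python -c "..." with the function above
```

If open_fds stays close to soft_limit, the application is likely leaking descriptors rather than simply needing a higher limit.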
Best Practices
- Check Logs First: Always start by reading container/application logs
- Use Three-Strike Protocol: Don't retry indefinitely, escalate after 3 attempts
- Document Solutions: Record error resolutions for future reference
- Test in Isolation: Isolate the problem (test database separately, etc.)
- Backup Before Changes: Always backup before making major changes
- Monitor Resources: Keep an eye on CPU, memory, disk usage
- Use Health Checks: Ensure all containers have working health checks
- Alert on Failures: Set up alerts for critical errors
- Keep It Simple: Prefer simple solutions over complex ones
- Learn from Errors: Use error tracking to prevent recurrence
Getting Help
Information to Gather
Before escalating to a human:
- Error Message: Exact error message from logs
- Container Status: docker ps -a
- Resource Usage: docker stats --no-stream
- Recent Logs: Last 100 lines from relevant container
- Configuration: Relevant parts of docker-compose.yml
- Recent Changes: What changed before the issue started
- Reproducibility: Can you reproduce the issue consistently?
Escalation Checklist
- Checked all logs (container, application, system)
- Searched for similar errors (error_search MCP tool)
- Tried Strike 1 solutions
- Tried Strike 2 solutions (local reasoning)
- Documented issue in journal
- Sent alert to Discord #alerts
- Gathered all diagnostic information
- Waiting for human intervention
Related Documentation
- Monitoring - Health checks and alerting
- Logging - Log management and analysis
- Maintenance - Routine maintenance tasks
- Backups - Backup and recovery procedures