Troubleshooting

Comprehensive guide to diagnosing and resolving common Aegis operational issues.

Container Issues

Container Won't Start

Symptoms:

  • Container exits immediately after start
  • docker compose up fails
  • Container status shows "Restarting"

Diagnosis:

# Check container status
docker ps -a | grep aegis

# View container logs
docker logs aegis-dashboard --tail 100

# Check for port conflicts
netstat -tlnp | grep 8080

# Inspect container configuration
docker inspect aegis-dashboard

Common Causes & Solutions:

1. Port Already in Use:

# Find process using port
sudo lsof -i :8080

# Kill process or change port in docker-compose.yml
docker compose down
# Edit docker-compose.yml to use different port
docker compose up -d
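
A port check can also be scripted so that it runs before every restart. The sketch below is one possible approach and not part of Aegis: `port_in_use` is a hypothetical helper that reads `ss -tln` (or `netstat -tln`) output on stdin and succeeds when the given port appears as a listening local address.

```shell
# Succeed if a TCP port appears as a listening local address.
# Reads `ss -tln` or `netstat -tln` output on stdin; in both tools
# column 4 holds the local address in ADDRESS:PORT form.
port_in_use() {
    awk -v port="$1" '
        NR > 1 && $4 ~ (":" port "$") { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

For example, `ss -tln | port_in_use 8080 && echo "port 8080 is taken"` reports a conflict without root privileges, although it cannot name the owning process the way `lsof` does.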

2. Missing Environment Variables:

# Check environment file (list variable names only, so secrets are not echoed to the terminal)
cut -d= -f1 /home/agent/.secure/.env

# Verify variables are loaded
docker exec aegis-dashboard env | grep POSTGRES

# Add missing variables to docker-compose.yml or .env
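
To spot missing variables without reading the whole file, a small helper can compare the env file against a list of required names. This is a minimal sketch: `missing_env_vars` is a hypothetical function, and the set of required names depends on the deployment (only `POSTGRES_HOST` and `POSTGRES_PORT` appear in this guide).

```shell
# Print every required variable name that is missing or empty in an env file.
# Usage: missing_env_vars FILE VAR [VAR...]
missing_env_vars() {
    local file="$1"
    shift
    local var
    for var in "$@"; do
        # Accept "VAR=value" or "export VAR=value" with a non-empty value.
        if ! grep -Eq "^(export[[:space:]]+)?${var}=.+" "$file"; then
            echo "$var"
        fi
    done
}
```

Running `missing_env_vars /home/agent/.secure/.env POSTGRES_HOST POSTGRES_PORT` prints nothing when both variables are set, so empty output means the file is complete for that list.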

3. Database Connection Failed:

# Test database connectivity
psql -U agent -d aegis -c "SELECT 1;"

# Check PostgreSQL is running
systemctl status postgresql

# Verify connection settings in docker-compose.yml
docker exec aegis-dashboard python -c "
import os
print('POSTGRES_HOST:', os.getenv('POSTGRES_HOST'))
print('POSTGRES_PORT:', os.getenv('POSTGRES_PORT'))
"

4. Volume Mount Issues:

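# Check volume exists (Compose may prefix the name with the project name)
docker volume ls | grep falkordb

# Inspect volume
docker volume inspect falkordb_data

# Recreate volume if corrupted (this deletes all data in the volume; back it up first)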
docker compose down
docker volume rm falkordb_data
docker compose up -d

Resolution Steps:

# 1. Stop all containers
cd /home/agent/projects/aegis-core && docker compose down

# 2. Check logs for specific error
docker logs aegis-dashboard

# 3. Fix the underlying issue

# 4. Rebuild and restart
cd /home/agent/projects/aegis-core && docker compose up -d --build

# 5. Verify health
docker ps
curl http://localhost:8080/health
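
A freshly started container may need a few seconds before the health endpoint responds, so a single `curl` can fail even when the fix worked. The sketch below shows a generic retry wrapper; `retry` is a hypothetical helper name, and the attempt count and delay are arbitrary choices.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local attempts="$1"
    local delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry 10 3 curl -fsS http://localhost:8080/health` polls the endpoint up to ten times, three seconds apart, and returns non-zero if it never answers with a success status.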

Container Keeps Restarting

Symptoms:

  • Container restarts every few seconds/minutes
  • Health checks failing
  • Application crashes on startup

Diagnosis:

# Check restart count
docker inspect aegis-dashboard --format='{{.RestartCount}}'

# View crash logs
docker logs aegis-dashboard --tail 200

# Check health check status
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq

# Monitor container in real-time
docker logs -f aegis-dashboard

Common Causes & Solutions:

1. Health Check Failing:

# Test health check manually
docker exec aegis-dashboard python -c "
import urllib.request
try:
    urllib.request.urlopen('http://localhost:8080/health')
    print('✓ Health check passed')
except Exception as e:
    print('✗ Health check failed:', e)
"

# Temporarily disable health check
# Edit docker-compose.yml and comment out healthcheck section
docker compose up -d --force-recreate dashboard

2. Application Error on Startup:

# Check Python errors
docker logs aegis-dashboard 2>&1 | grep -i "error\|exception\|traceback"

# Run container interactively to debug
docker compose run --rm dashboard /bin/bash
# Then, inside the container shell:
python -c "from aegis.dashboard import app"

3. Memory Limit Exceeded:

# Check container memory usage
docker stats aegis-dashboard --no-stream

# Increase memory limit in docker-compose.yml
# deploy:
#   resources:
#     limits:
#       memory: 4G

Resolution:

# Check Three-Strike Error Protocol
# Search for similar past errors
# (Conceptual - use error MCP tools if available)

# Record the error
# mcp__aegis__error_record(
#     error_type="ContainerCrashLoop",
#     context="Container restarting with health check failure",
#     strike_count=1
# )

# Fix root cause and restart
cd /home/agent/projects/aegis-core && docker compose restart dashboard

Slow Container Performance

Symptoms:

  • API responses slow (>1s)
  • High CPU/memory usage
  • Container unresponsive

Diagnosis:

# Check resource usage
docker stats --no-stream

# Check container processes
docker top aegis-dashboard

# Check application logs for slow operations (duration_ms above 1000)
# "+ 0" forces a numeric comparison even when text follows the number
docker logs aegis-dashboard 2>&1 | awk -F'duration_ms"?[=: ]*' '/duration_ms/ && ($2 + 0) > 1000'

# Profile Python app (if needed)
docker exec aegis-dashboard python -m cProfile -o /tmp/profile.stats app.py
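
The slow-operation filter can be wrapped in a reusable function with a configurable threshold. This is a sketch under an assumption about the log format: it expects each line to carry the duration as `duration_ms=1234`, `duration_ms: 1234`, or JSON-style `"duration_ms": 1234`. The name `slow_requests` is hypothetical.

```shell
# Print log lines whose duration_ms value exceeds a threshold (default 1000).
# Reads log lines on stdin.
slow_requests() {
    awk -v limit="${1:-1000}" '
        match($0, /duration_ms"?[=: ]+[0-9]+(\.[0-9]+)?/) {
            field = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]+/, "", field)   # drop the key, keep the number
            if (field + 0 > limit + 0) print
        }
    '
}
```

Typical use would be `docker logs aegis-dashboard 2>&1 | slow_requests 1000`. Lines without a `duration_ms` field are skipped rather than treated as zero.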

Solutions:

1. Database Bottleneck:

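# Check slow queries (requires pg_stat_statements; on PostgreSQL 13+ the
# columns are named total_exec_time and mean_exec_time)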
psql -U agent -d aegis -c "
  SELECT query, calls, total_time, mean_time
  FROM pg_stat_statements
  WHERE mean_time > 100
  ORDER BY mean_time DESC
  LIMIT 10;
"

# Add missing indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_timestamp ON episodic_memory(timestamp DESC);"

# Vacuum database
psql -U agent -d aegis -c "VACUUM ANALYZE;"

2. Too Many Scheduled Jobs:

# Check running jobs
docker exec aegis-scheduler python -c "
from aegis.scheduler import scheduler
print([job.name for job in scheduler.scheduler.get_jobs()])
"

# Disable non-essential jobs temporarily
# Edit aegis/scheduler.py and comment out jobs
docker compose restart scheduler

3. Memory Leak:

# Monitor memory over time
watch -n 5 'docker stats aegis-dashboard --no-stream'

# Restart container to reclaim memory
docker compose restart dashboard

Database Issues

Connection Failures

Symptoms:

  • "could not connect to server" errors
  • Health check fails with database error
  • Application can't reach PostgreSQL

Diagnosis:

# Test connection from host
psql -U agent -d aegis -c "SELECT 1;"

# Test from container
docker exec aegis-dashboard psql -U agent -d aegis -c "SELECT 1;"

# Check PostgreSQL status
systemctl status postgresql

# Check PostgreSQL logs
sudo tail -100 /var/log/postgresql/postgresql-*.log

# Verify port is listening
netstat -tlnp | grep 5432

Common Causes & Solutions:

1. PostgreSQL Not Running:

# Start PostgreSQL
sudo systemctl start postgresql

# Enable auto-start
sudo systemctl enable postgresql

# Check status
systemctl status postgresql

2. Connection Pool Exhausted:

# Check active connections
psql -U agent -d aegis -c "
  SELECT count(*) as connections,
         max_connections
  FROM pg_stat_activity,
       (SELECT setting::int as max_connections FROM pg_settings WHERE name='max_connections') mc
  GROUP BY max_connections;
"

# Kill idle connections
psql -U agent -d aegis -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle' AND state_change < NOW() - INTERVAL '1 hour';
"

# Increase max_connections (if needed)
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql

3. Authentication Failed:

# Check pg_hba.conf
sudo cat /etc/postgresql/*/main/pg_hba.conf | grep -v "^#"

# Allow local connections
# Should have line: local   all   all   trust
# Or: host    all   all   127.0.0.1/32   trust
# Note: trust skips password checks; on shared hosts prefer peer or scram-sha-256

# Reload PostgreSQL
sudo systemctl reload postgresql

4. Docker Networking Issue:

# Check container can reach host
docker exec aegis-dashboard ping -c 3 host.docker.internal

# Verify extra_hosts in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'

# Recreate network
docker compose down
docker network prune -f
docker compose up -d

Slow Queries

Symptoms:

  • API endpoints timeout
  • Database CPU usage high
  • Long query execution times

Diagnosis:

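# Enable pg_stat_statements (the module must also be listed in
# shared_preload_libraries, which requires a PostgreSQL restart)
psql -U agent -d aegis -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"

# Find slow queries (on PostgreSQL 13+, use total_exec_time, mean_exec_time, and max_exec_time)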
psql -U agent -d aegis -c "
  SELECT
    query,
    calls,
    total_time,
    mean_time,
    max_time
  FROM pg_stat_statements
  WHERE mean_time > 100
  ORDER BY mean_time DESC
  LIMIT 20;
"

# Check for locks
psql -U agent -d aegis -c "
  SELECT
    pid,
    usename,
    pg_blocking_pids(pid) as blocked_by,
    query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;
"

# Check active queries
psql -U agent -d aegis -c "
  SELECT
    pid,
    now() - query_start as duration,
    query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC;
"

Solutions:

1. Add Indexes:

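# List high-cardinality columns as index candidates
# (n_distinct is negative when stored as a fraction of rows, so unique columns are not listed)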
psql -U agent -d aegis -c "
  SELECT
    schemaname,
    tablename,
    attname,
    n_distinct
  FROM pg_stats
  WHERE schemaname = 'public'
    AND n_distinct > 100
  ORDER BY n_distinct DESC;
"

# Create indexes
psql -U agent -d aegis -c "CREATE INDEX idx_episodic_event_type ON episodic_memory(event_type);"
psql -U agent -d aegis -c "CREATE INDEX idx_workflow_created_at ON workflow_runs(created_at DESC);"

2. Optimize Query:

# Use EXPLAIN ANALYZE to see query plan
psql -U agent -d aegis -c "
  EXPLAIN ANALYZE
  SELECT * FROM episodic_memory
  WHERE event_type = 'decision'
  ORDER BY timestamp DESC
  LIMIT 100;
"

# Rewrite inefficient queries in code

3. Kill Long-Running Query:

# Get query PID
psql -U agent -d aegis -c "
  SELECT pid, query_start, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND query_start < NOW() - INTERVAL '10 minutes';
"

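# Terminate query (replace 12345 with the pid returned above)
# pg_cancel_backend(pid) is gentler: it cancels the query but keeps the connection
psql -U agent -d aegis -c "SELECT pg_terminate_backend(12345);"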

Database Corruption

Symptoms:

  • "database is corrupted" errors
  • Unexpected data loss
  • Cannot connect to database

Diagnosis:

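# Confirm the database is readable and report its size
# (a successful size query does not prove the data is intact)
psql -U agent -d aegis -c "
  SELECT datname, pg_database_size(datname)
  FROM pg_database
  WHERE datname = 'aegis';
"

# Confirm each table is readable and report its size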
psql -U agent -d aegis -c "
  SELECT tablename, pg_relation_size(tablename::regclass)
  FROM pg_tables
  WHERE schemaname = 'public';
"

# Check for errors in logs
sudo tail -500 /var/log/postgresql/postgresql-*.log | grep -i "error\|corrupt"

Recovery Steps:

1. Try REINDEX:

# Reindex database
psql -U agent -d aegis -c "REINDEX DATABASE aegis;"

# If fails, reindex tables individually
psql -U agent -d aegis -c "REINDEX TABLE episodic_memory;"

2. Restore from Backup:

# Stop containers
cd /home/agent/projects/aegis-core && docker compose down

# Backup current database (just in case)
pg_dump -U agent -d aegis -F c -f /tmp/aegis_corrupt_$(date +%Y%m%d).dump

# Drop and recreate database
dropdb -U agent aegis
createdb -U agent aegis

# Restore from latest backup
LATEST_BACKUP=$(ls -t /home/agent/logs/backup/aegis_*.dump | head -1)
pg_restore -U agent -d aegis "$LATEST_BACKUP"

# Restart containers
cd /home/agent/projects/aegis-core && docker compose up -d

# Verify data
psql -U agent -d aegis -c "SELECT count(*) FROM episodic_memory;"
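
Because `dropdb` is irreversible, it can help to confirm that a backup exists and is readable before running the steps above. The sketch below picks the newest file matching a glob and fails when nothing matches; `newest_backup` is a hypothetical helper name.

```shell
# Print the most recently modified file among the arguments.
# Fails (non-zero exit, no output) if none of the paths exist.
newest_backup() {
    local newest
    newest=$(ls -t -- "$@" 2>/dev/null | head -n 1)
    [ -n "$newest" ] && echo "$newest"
}
```

A guard such as `backup=$(newest_backup /home/agent/logs/backup/aegis_*.dump) && pg_restore --list "$backup" > /dev/null` exits non-zero when no backup is found or its table of contents cannot be read. It does not validate every row in the archive, so treat it as a sanity check only.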

3. Last Resort - Reinstall PostgreSQL:

# Backup all databases (store the path once so the restore uses the same file)
DUMP_FILE=/tmp/pg_dumpall_$(date +%Y%m%d).sql
sudo -u postgres pg_dumpall > "$DUMP_FILE"

# Reinstall PostgreSQL (quote the glob so the shell does not expand it)
sudo apt remove --purge postgresql 'postgresql-*'
sudo apt autoremove
sudo apt install postgresql

# Restore databases
sudo -u postgres psql -f "$DUMP_FILE"

Certificate Issues

Certificate Expired

Symptoms:

  • "certificate has expired" errors
  • HTTPS not working
  • Browser shows security warning

Diagnosis:

# Check certificate expiry
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates

# Check Traefik logs for cert renewal
docker logs traefik | grep -i "certificate\|acme"

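# List expiry dates of issued certificates from Certificate Transparency logs
# (a rough view of recent issuance when checking Let's Encrypt rate limits)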
curl -s "https://crt.sh/?q=aegisagent.ai&output=json" | jq -r '.[].not_after' | sort | uniq -c
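
The `notAfter` date printed by `openssl` can be turned into a number of remaining days, which is easier to alert on. This is a minimal sketch that assumes GNU `date` (for the `-d` option); `cert_days_left` is a hypothetical helper name.

```shell
# Days remaining until a certificate expires.
# $1: date string as printed by `openssl x509 -noout -enddate`, without "notAfter="
# $2: reference time as a Unix timestamp (defaults to now)
cert_days_left() {
    local expiry_epoch now_epoch
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch="${2:-$(date -u +%s)}"
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example (requires network access):
# expiry=$(echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
# cert_days_left "$expiry"
```

A negative result means the certificate has already expired. The optional second argument exists mainly so the calculation can be checked against a fixed reference time.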

Solutions:

1. Renew Certificate Manually:

# Trigger ACME renewal (if using Traefik)
docker restart traefik

# Wait for renewal (check logs)
docker logs -f traefik | grep certificate

# Verify new certificate
echo | openssl s_client -connect aegisagent.ai:443 2>/dev/null | openssl x509 -noout -dates

2. Certificate Rate Limited:

# Use staging Let's Encrypt temporarily (staging certificates are not trusted by browsers)
# Edit Traefik configuration to use staging server
# certificatesResolvers.cf.acme.caServer: https://acme-staging-v02.api.letsencrypt.org/directory

# Wait for the rate limit to reset (the duplicate-certificate limit uses a 7-day, i.e. 168-hour, window)

3. Certificate Not Issued:

# Check DNS resolution
dig aegisagent.ai +short

# Check HTTP-01 challenge path is accessible
curl -v http://aegisagent.ai/.well-known/acme-challenge/test

# Check Cloudflare settings (if using CF)
# Ensure proxy is disabled for ACME challenges

Network Issues

Service Unreachable

Symptoms:

  • Cannot access https://aegisagent.ai
  • Timeouts when connecting
  • DNS resolution fails

Diagnosis:

# Test DNS resolution
dig aegisagent.ai +short
nslookup aegisagent.ai

# Test connectivity to server
ping -c 5 157.180.63.15
traceroute aegisagent.ai

# Test ports are open
telnet aegisagent.ai 443
nc -zv aegisagent.ai 443

# Check from inside LXC
curl -v http://localhost:8080/health
curl -v https://aegisagent.ai/health

Common Causes & Solutions:

1. Traefik Not Running:

# Check Traefik on Dockerhost (10.10.10.10)
ssh dockerhost "docker ps | grep traefik"

# Check Traefik on Aegis LXC (10.10.10.103)
docker ps | grep traefik

# Restart Traefik
docker restart traefik

# Check Traefik dashboard (if enabled)
curl http://localhost:8080/dashboard/

2. Firewall Blocking:

# Check iptables rules
sudo iptables -L -n | grep 8080

# Check firewalld (if installed)
sudo firewall-cmd --list-all

# Allow port through firewall
# (-I inserts ahead of existing rules; -A would append after any DROP/REJECT rule)
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

3. DNS Not Propagated:

# Check DNS from different nameservers
dig @8.8.8.8 aegisagent.ai
dig @1.1.1.1 aegisagent.ai

# Check Cloudflare DNS settings
# Log into Cloudflare dashboard and verify A record points to 157.180.63.15

4. Traefik Configuration Error:

# Check Traefik labels
docker inspect aegis-dashboard | jq '.[0].Config.Labels' | grep traefik

# Check Traefik routing
ssh dockerhost "cat /srv/dockerdata/traefik/dynamic/aegis-passthrough.yml"

# Verify SNI passthrough configuration
# Should include aegisagent.ai in TCP router rule

Slow Network Performance

Symptoms:

  • API requests slow despite low server load
  • High latency
  • Intermittent timeouts

Diagnosis:

# Test latency
ping -c 100 aegisagent.ai | tail -1

# Test bandwidth
curl -o /dev/null https://aegisagent.ai/static/large-file.bin

# Check network stats
docker exec aegis-dashboard netstat -s

# Monitor network traffic
docker stats --no-stream --format "table {{.Name}}\t{{.NetIO}}"

Solutions:

1. Network Congestion:

# Check host network usage
iftop -i eth0

# Set the network MTU explicitly (this tunes packet size; it does not limit bandwidth)
# Add to docker-compose.yml:
# networks:
#   traefik_proxy:
#     driver: bridge
#     driver_opts:
#       com.docker.network.driver.mtu: 1500

2. DNS Resolution Slow:

# Use faster DNS (the file may be overwritten if it is managed by systemd-resolved or DHCP)
sudo tee /etc/resolv.conf <<EOF
nameserver 1.1.1.1
nameserver 8.8.8.8
EOF

# Or in docker-compose.yml:
# dns:
#   - 1.1.1.1
#   - 8.8.8.8

Memory Issues

Out of Memory

Symptoms:

  • Container killed by OOM killer
  • Application crashes
  • System becomes unresponsive

Diagnosis:

# Check system memory
free -h

# Check container memory limits (value is in bytes; 0 means no limit)
docker inspect aegis-dashboard --format='{{.HostConfig.Memory}}'

# Check OOM kills in logs
dmesg | grep -i "out of memory\|oom"

# Check which process is using memory
docker exec aegis-dashboard ps aux --sort=-%mem | head -10

# Monitor memory usage over time
docker stats aegis-dashboard
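
To flag containers that are close to their memory limit, the `docker stats` output can be filtered by percentage. The sketch below assumes input lines of the form `NAME PERCENT%`, which `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'` produces; `high_memory_containers` is a hypothetical helper and the 80% default is an arbitrary choice.

```shell
# Print containers whose memory usage is at or above a threshold (default 80%).
# Reads "NAME PERCENT%" lines on stdin.
high_memory_containers() {
    awk -v limit="${1:-80}" '
        {
            pct = $2
            sub(/%$/, "", pct)
            if (pct + 0 >= limit + 0) print $1, pct "%"
        }
    '
}
```

For example, `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | high_memory_containers 80`. The percentage is relative to the container's limit, or to host memory when no limit is set.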

Solutions:

1. Increase Container Memory Limit:

# Edit docker-compose.yml
services:
  dashboard:
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 1G

2. Find Memory Leak:

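# Profile Python memory (requires objgraph in the image)
# Note: docker exec starts a new interpreter, so this reports objects in that
# process only. To inspect the live application, run the same calls inside it.
docker exec aegis-dashboard python -c "
import gc
import objgraph
gc.collect()
objgraph.show_most_common_types(limit=20)
"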

# Use memory_profiler
docker exec aegis-dashboard python -m memory_profiler app.py

3. Restart Container to Reclaim Memory:

cd /home/agent/projects/aegis-core && docker compose restart dashboard

4. Clear Caches:

# Clear Python caches
find /home/agent/projects/aegis-core -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null

# Clear system cache (safe)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

Disk Space Issues

Disk Full

Symptoms:

  • "No space left on device" errors
  • Cannot write to disk
  • Database writes fail

Diagnosis:

# Check disk usage
df -h

# Find largest directories
du -sh /home/agent/* | sort -rh | head -10
du -sh /var/lib/docker/* | sort -rh | head -10

# Find largest files
find /home/agent -type f -size +100M -exec ls -lh {} \; | sort -k5 -rh | head -20

# Check Docker disk usage
docker system df -v
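
The `df` check can be reduced to a list of filesystems that need attention. This is a minimal sketch: `over_threshold` is a hypothetical helper that reads POSIX-format `df` output, and the 90% default is an arbitrary choice. It assumes mount points without spaces.

```shell
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` output on stdin (column 5 is the percentage, column 6 the mount point).
over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

Use it as `df -P | over_threshold 90`. With GNU `df`, `df -iP | over_threshold 90` applies the same filter to inode usage, since the percentage stays in column 5.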

Emergency Cleanup:

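# Clean Docker (removes all stopped containers, unused images, and unused volumes;
# run it while the Aegis containers are up so their volumes stay in use)
docker system prune -a --volumes -f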

# Remove old backups
find /home/agent/logs/backup -name "*.dump" -mtime +7 -delete
find /home/agent/logs/backup -name "*.tar.gz" -mtime +7 -delete

# Truncate large logs
truncate -s 0 /home/agent/logs/*.log
for container in $(docker ps -q); do
    # Ask Docker for the log path: the directory uses the full ID, not the short one
    log_file=$(docker inspect --format='{{.LogPath}}' "$container")
    [ -n "$log_file" ] && sudo truncate -s 0 "$log_file" 2>/dev/null
done

# Remove old journal entries
find /home/agent/memory/journal -name "*.md" -mtime +90 -delete

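# Vacuum database (VACUUM FULL rewrites each table, so it needs free space and
# takes exclusive locks; run it only after the steps above have freed space)
psql -U agent -d aegis -c "VACUUM FULL;"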

# Clean package caches
sudo apt clean
pip cache purge

Inode Exhaustion

Symptoms:

  • "No space left on device" despite disk space available
  • Cannot create new files

Diagnosis:

# Check inode usage
df -i

# Find directories with many files
for dir in /home/agent/*; do
    echo "$dir: $(find "$dir" -type f | wc -l) files"
done

Solutions:

# Clean up __pycache__ directories
find /home/agent -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null

# Remove old log files
find /home/agent/logs -name "*.log.*" -delete

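# Consolidate small files into one archive (gzip alone keeps one inode per file)
# Confirm Aegis no longer reads these files before removing them
find /home/agent/memory/episodic -name "*.json" -print0 | tar -czf /home/agent/memory/episodic_$(date +%Y%m%d).tar.gz --null -T - --remove-files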

Three-Strike Debug Protocol

Implementation

When encountering repeated failures, follow this protocol:

Strike 1: Retry with Modified Approach

# Search for similar errors
# mcp__aegis__error_search(query="ContainerRestartLoop")

# Record the error
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Dashboard container restarting, health check failing",
#     strike_count=1,
#     severity="warning"
# )

# Try alternative approach
# e.g., disable health check, check logs, test manually

Strike 2: Escalate to Local Reasoning

# Record escalation
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Second attempt failed, analyzing with local model",
#     strike_count=2,
#     severity="error"
# )

# Use local model for deep analysis
# Analyze from first principles
# Check similar past resolutions

Strike 3: STOP and Escalate

# Record critical error
# mcp__aegis__error_record(
#     error_type="ContainerRestartLoop",
#     context="Third consecutive failure, halting attempts",
#     strike_count=3,
#     severity="critical"
# )

# Document in journal
echo "## Critical Error: ContainerRestartLoop" >> ~/memory/journal/$(date +%Y-%m-%d).md
echo "Three strikes reached. Awaiting human intervention." >> ~/memory/journal/$(date +%Y-%m-%d).md

# Alert to Discord
# mcp__discord__discord_send_message(
#     channel_id="1455049130614329508",
#     content="🚨 **Three-Strike Protocol Triggered**\n\nError: ContainerRestartLoop\nStrikes: 3\nStatus: HALTED - Awaiting Human Input"
# )

# STOP - Do not retry
# Wait for human intervention
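
The escalation rule itself is small enough to encode directly, which keeps automation from retrying past the third strike. The sketch below only maps a strike count to an action; `strike_action` and the action labels are hypothetical names, not part of the Aegis MCP tools.

```shell
# Map a strike count to the action the protocol prescribes.
strike_action() {
    case "$1" in
        1) echo "retry-with-modified-approach" ;;
        2) echo "escalate-to-local-reasoning" ;;
        3) echo "stop-and-alert-human" ;;
        *) echo "invalid strike count: $1" >&2; return 2 ;;
    esac
}
```

A calling script could branch on the result, for example `[ "$(strike_action "$strikes")" = "stop-and-alert-human" ] && exit 1`, so that the halt on strike 3 is enforced rather than left to convention.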

Common Error Messages

"Health check failed: connection refused"

Cause: Application not listening on expected port

Fix:

# Check if port is listening inside container
docker exec aegis-dashboard netstat -tlnp | grep 8080

# Check application logs
docker logs aegis-dashboard | grep -i "listening\|started"

# Verify port binding in docker-compose.yml
docker inspect aegis-dashboard | jq '.[0].NetworkSettings.Ports'

"Database connection refused"

Cause: PostgreSQL not accessible from container

Fix:

# Test from container
docker exec aegis-dashboard psql -U agent -h host.docker.internal -d aegis -c "SELECT 1;"

# Verify extra_hosts
docker inspect aegis-dashboard | jq '.[0].HostConfig.ExtraHosts'

# Check PostgreSQL allows connections
sudo cat /etc/postgresql/*/main/postgresql.conf | grep listen_addresses

"Permission denied" errors

Cause: File/directory permissions incorrect

Fix:

# Fix ownership
sudo chown -R agent:agent /home/agent/memory
sudo chown -R agent:agent /home/agent/logs

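# Fix permissions
chmod 755 /home/agent/memory
# Use find: ** only recurses when bash globstar is enabled
find /home/agent/memory -type f -name "*.md" -exec chmod 644 {} +
chmod 700 /home/agent/.secure
# Use find: a plain * glob skips dotfiles such as .env
find /home/agent/.secure -type f -exec chmod 600 {} +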

"Too many open files"

Cause: File descriptor limit reached

Fix:

# Check current limits
ulimit -n

# Increase limit temporarily
ulimit -n 4096

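# Increase limit permanently (applies to new login sessions on the host)
echo "* soft nofile 4096" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 8192" | sudo tee -a /etc/security/limits.conf

# Restart session to apply
# For containers, set ulimits (nofile soft/hard) on the service in docker-compose.yml instead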
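
Before raising limits, it is worth checking how close a process actually is to its limit. The sketch below reads that from `/proc`, so it is Linux-only; `fd_usage` is a hypothetical helper, and inspecting another user's process usually requires root.

```shell
# Print "OPEN SOFT_LIMIT" for a process, based on /proc (Linux only).
# Fails if the process does not exist or /proc is unavailable.
fd_usage() {
    local pid="$1"
    [ -d "/proc/$pid/fd" ] || return 1
    local open limit
    open=$(ls "/proc/$pid/fd" | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$open $limit"
}
```

For a container, the host PID of its main process is available from `docker inspect --format '{{.State.Pid}}' aegis-dashboard`, and that value can be passed to `fd_usage` when running as root. An open count that keeps growing toward the limit points to a descriptor leak, which a higher limit would only postpone.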

Best Practices

  1. Check Logs First: Always start by reading container/application logs
  2. Use Three-Strike Protocol: Don't retry indefinitely; escalate after 3 attempts
  3. Document Solutions: Record error resolutions for future reference
  4. Test in Isolation: Isolate the problem (test database separately, etc.)
  5. Backup Before Changes: Always backup before making major changes
  6. Monitor Resources: Keep an eye on CPU, memory, disk usage
  7. Use Health Checks: Ensure all containers have working health checks
  8. Alert on Failures: Set up alerts for critical errors
  9. Keep It Simple: Prefer simple solutions over complex ones
  10. Learn from Errors: Use error tracking to prevent recurrence

Getting Help

Information to Gather

Before escalating to a human:

  1. Error Message: Exact error message from logs
  2. Container Status: docker ps -a
  3. Resource Usage: docker stats --no-stream
  4. Recent Logs: Last 100 lines from relevant container
  5. Configuration: Relevant parts of docker-compose.yml
  6. Recent Changes: What changed before the issue started
  7. Reproducibility: Can you reproduce the issue consistently?

Escalation Checklist

  • Checked all logs (container, application, system)
  • Searched for similar errors (error_search MCP tool)
  • Tried Strike 1 solutions
  • Tried Strike 2 solutions (local reasoning)
  • Documented issue in journal
  • Sent alert to Discord #alerts
  • Gathered all diagnostic information
  • Waiting for human intervention