Monitoring¶
Comprehensive guide to Aegis system monitoring, health checks, and alerting.
Health Check Endpoints¶
Dashboard Health Check¶
Endpoint: /health
Description: Primary health check endpoint for the dashboard service. Tests database connectivity and returns service status.
Response Format:
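The response body is not reproduced here; a representative example ("status" returning "healthy" matches the UptimeRobot keyword check later in this page, the other fields are illustrative assumptions):
{
  "status": "healthy|unhealthy",
  "database": "connected",
  "timestamp": "2026-01-25T23:00:00Z"
}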
Usage:
# Check dashboard health
curl https://aegisagent.ai/health
# Check internal service
curl http://localhost:8080/health
Monitoring Script: /home/agent/projects/aegis-core/scripts/health_check.sh
# Run complete health check suite
./scripts/health_check.sh https://aegisagent.ai
# Check internal endpoints
./scripts/health_check.sh http://localhost:8080
API Status Endpoint¶
Endpoint: /api/status
Description: Returns detailed system status including phase, version, and operational state.
Response Format:
{
"status": "operational|degraded|down",
"phase": "1|2|3",
"version": "4.16",
"timestamp": "2026-01-25T23:00:00Z"
}
Rate Limit Status¶
Endpoint: /api/rate-limit
Description: Check current rate limit status and remaining requests.
Response Format:
{
"tier": "free|developer|pro|enterprise",
"limit_per_hour": 100,
"remaining": 87,
"reset_seconds": 1245,
"authenticated": true
}
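Clients can poll this endpoint before batching requests. A minimal sketch using the documented fields (the X-API-Key header name is an assumption, not a documented parameter):
# Check remaining quota before sending a batch of requests
import requests

def remaining_requests(base_url="https://aegisagent.ai", api_key=None):
    headers = {"X-API-Key": api_key} if api_key else {}
    resp = requests.get(f"{base_url}/api/rate-limit", headers=headers, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Fields documented above: tier, limit_per_hour, remaining, reset_seconds
    return data["remaining"], data["reset_seconds"]

remaining, reset_in = remaining_requests()
print(f"{remaining} requests left; window resets in {reset_in}s")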
Docker Container Health Checks¶
Container Configuration¶
All Aegis containers are configured with health checks in docker-compose.yml.
Dashboard Health Check:
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
interval: 30s
timeout: 10s
retries: 3
Playwright Health Check:
healthcheck:
test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:3000/health"]
interval: 30s
timeout: 10s
retries: 3
FalkorDB Health Check:
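The FalkorDB block is not shown above; a typical configuration (a sketch, assuming the container ships redis-cli) would be:
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 30s
  timeout: 10s
  retries: 3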
Check Container Status¶
# View all container health status
docker ps --format "table {{.Names}}\t{{.Status}}"
# Check specific container health
docker inspect aegis-dashboard --format='{{.State.Health.Status}}'
# View health check logs
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq
Scheduled Health Checks¶
The scheduler runs automated health checks every 5 minutes:
# System health check (every 5 minutes)
self.scheduler.add_job(
self.health_check,
IntervalTrigger(minutes=5),
id='health_check',
name='System Health Check',
)
# Docker container health check (every 5 minutes)
self.scheduler.add_job(
self.docker_health_check,
IntervalTrigger(minutes=5),
id='docker_health_check',
name='Docker Container Health Check',
)
Location: /home/agent/projects/aegis-core/aegis/scheduler.py
Resource Monitoring¶
CPU and Memory Monitoring¶
Check System Resources:
# Overall system stats
docker stats --no-stream
# Detailed container stats
docker stats aegis-dashboard --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Host system resources
htop
# Memory breakdown
free -h
df -h
Resource Limits¶
Memory Limits: 110 GB total available in the LXC container
Monitor Memory Usage:
# Check current memory usage
free -m | awk 'NR==2{printf "Memory Usage: %.2f%%\n", $3*100/$2}'
# Check disk usage
df -h /home/agent | awk 'NR==2{print "Disk Usage: " $5}'
# PostgreSQL memory
psql -U agent -d aegis -c "SELECT pg_size_pretty(pg_database_size('aegis'));"
Alert Thresholds¶
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Memory Usage | >80% | >90% | Alert to Discord #alerts |
| Disk Usage | >70% | >85% | Alert to Discord #alerts, run cleanup |
| CPU Usage | >80% | >95% | Alert to Discord #alerts, investigate |
| Container Restarts | >3/hour | >10/hour | Alert to Discord #alerts |
| Database Connections | >80% max | >95% max | Alert, kill idle connections |
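One way these thresholds could be wired into the 5-minute health job is sketched below (psutil is assumed to be available; the alert callback stands in for the Discord routing described later):
# Evaluate the warning/critical thresholds from the table above (values hard-coded for illustration)
import shutil
import psutil

THRESHOLDS = {
    "memory_percent": (80, 90),
    "disk_percent": (70, 85),
    "cpu_percent": (80, 95),
}

def check_thresholds(alert):
    disk = shutil.disk_usage("/home/agent")
    readings = {
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": disk.used / disk.total * 100,
        "cpu_percent": psutil.cpu_percent(interval=1),
    }
    for metric, value in readings.items():
        warn, crit = THRESHOLDS[metric]
        if value >= crit:
            alert(f"🚨 Critical: {metric} at {value:.1f}%")
        elif value >= warn:
            alert(f"⚠️ Warning: {metric} at {value:.1f}%")

# Example: route to stdout here; in Aegis this would post to Discord #alerts
check_thresholds(print)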
Scheduled Monitoring Jobs¶
The following monitoring jobs run automatically via the scheduler:
Health Checks (Every 5 Minutes)¶
System Health Check:
- Database connectivity
- API responsiveness
- Memory usage
- Disk space
Docker Health Check:
- Container status (running/stopped/unhealthy)
- Restart counts
- Health check status
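A condensed sketch of what the system health job can look like (connection parameters and the notify callback are illustrative; the real implementation lives in aegis/scheduler.py):
# Simplified health_check job: probe the database and the dashboard endpoint
import psycopg2
import requests

def health_check(notify=print):
    problems = []
    try:
        # Database connectivity (DSN values are illustrative)
        conn = psycopg2.connect(dbname="aegis", user="agent", host="localhost")
        conn.cursor().execute("SELECT 1;")
        conn.close()
    except Exception as exc:
        problems.append(f"database: {exc}")
    try:
        # API responsiveness
        requests.get("http://localhost:8080/health", timeout=10).raise_for_status()
    except Exception as exc:
        problems.append(f"dashboard: {exc}")
    if problems:
        notify("⚠️ Health check failed: " + "; ".join(problems))

health_check()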
Website Monitoring (Every 30 Minutes)¶
Endpoint Monitoring:
- https://aegisagent.ai
- https://intel.aegisagent.ai
- https://notebooks.aegisagent.ai
- https://code.aegisagent.ai
Configuration: /home/agent/projects/aegis-core/aegis/monitor/scheduler.py
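A minimal sketch of an endpoint monitor over these URLs (the alert callback is a placeholder, not the shipped implementation):
# Probe each public endpoint and report anything that is down or erroring
import requests

ENDPOINTS = [
    "https://aegisagent.ai",
    "https://intel.aegisagent.ai",
    "https://notebooks.aegisagent.ai",
    "https://code.aegisagent.ai",
]

def monitor_endpoints(alert=print, timeout=15):
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code >= 500:
                alert(f"⚠️ {url} returned HTTP {resp.status_code}")
        except requests.RequestException as exc:
            alert(f"🚨 {url} unreachable: {exc}")

monitor_endpoints()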
Infrastructure Anomaly Detection (Every Hour)¶
Metrics Tracked:
- CPU usage patterns
- Memory allocation trends
- Network traffic anomalies
- Error rate spikes
Configuration: /home/agent/projects/aegis-core/aegis/infra/anomaly/scheduler.py
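The detection method itself is not documented here; a rolling z-score check is one simple approach (an assumption, not the shipped algorithm):
# Flag a metric sample as anomalous when it deviates strongly from its recent history
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=24, threshold=3.0):
        self.history = deque(maxlen=window)  # e.g. the last 24 hourly samples
        self.threshold = threshold           # z-score cutoff

    def is_anomaly(self, value):
        if len(self.history) >= 3:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        else:
            anomalous = False  # not enough history yet
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for cpu in [22, 25, 21, 24, 23, 26, 22, 95]:  # sudden spike at the end
    if detector.is_anomaly(cpu):
        print(f"Anomaly: CPU at {cpu}%")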
Notification Flow¶
Alert Channels¶
Discord:
- #alerts (1455049130614329508): System alerts, errors, warnings
- #logs (1455049129313959937): Operational logs
- #journal (1455049131725816023): Daily journal entries
Telegram:
- Chat ID: 1275129801
- Time-sensitive alerts
- System health critical alerts
- Emergency escalations
WhatsApp:
- Number: +44 7441 443388
- Authorized users only
- Two-way command interface
Alert Routing¶
# Critical alerts → Discord #alerts + Telegram
# Warnings → Discord #alerts
# Info logs → Discord #logs
# Journal updates → Discord #journal
MCP Tools:
# Discord alert
mcp__discord__discord_send_message(
channel_id="1455049130614329508",
content="⚠️ High memory usage detected: 92%"
)
# Telegram alert
mcp__telegram__telegram_send_message(
chat_id="1275129801",
text="🚨 Critical: Database connection pool exhausted"
)
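A small routing helper that follows the mapping above; the discord_send and telegram_send callables stand in for the MCP tool calls shown earlier and are assumptions:
# Route an alert to the right channels based on severity (per the routing rules above)
DISCORD_ALERTS = "1455049130614329508"
DISCORD_LOGS = "1455049129313959937"
TELEGRAM_CHAT = "1275129801"

def route_alert(severity, message, discord_send, telegram_send):
    """severity: 'critical', 'warning', or 'info'."""
    if severity == "critical":
        discord_send(channel_id=DISCORD_ALERTS, content=f"🚨 {message}")
        telegram_send(chat_id=TELEGRAM_CHAT, text=f"🚨 {message}")
    elif severity == "warning":
        discord_send(channel_id=DISCORD_ALERTS, content=f"⚠️ {message}")
    else:
        discord_send(channel_id=DISCORD_LOGS, content=message)

# Example (callables are placeholders for the MCP tools above):
# route_alert("critical", "Database connection pool exhausted", mcp_discord_send, mcp_telegram_send)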
Monitoring Dashboard¶
System Metrics Endpoints¶
Task Statistics: /api/tasks
OODA Cycles: /api/ooda
Recent Memory: /api/memory/recent
{
"memories": [
{
"id": 1,
"event_type": "decision",
"summary": "Chose Docker Compose for deployment",
"importance": 8,
"timestamp": "2026-01-25T12:00:00Z"
}
]
}
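These endpoints can be polled from a script or cron job; a minimal sketch (only the /api/memory/recent response shape is documented above, the others are printed by status code only):
# Pull the dashboard metrics endpoints and summarize recent memories
import requests

BASE = "https://aegisagent.ai"

for path in ("/api/tasks", "/api/ooda", "/api/memory/recent"):
    resp = requests.get(BASE + path, timeout=10)
    print(path, resp.status_code)

recent = requests.get(BASE + "/api/memory/recent", timeout=10).json()
for memory in recent.get("memories", []):
    print(f'[{memory["importance"]}] {memory["event_type"]}: {memory["summary"]}')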
External Monitoring¶
UptimeRobot (recommended):
- Monitor: https://aegisagent.ai/health
- Interval: 5 minutes
- Alert: Email/SMS on downtime
Setup:
# Add to UptimeRobot:
# URL: https://aegisagent.ai/health
# Type: HTTP(s)
# Keyword: "healthy"
# Interval: 5 minutes
Performance Monitoring¶
Application Performance¶
Response Time Monitoring:
# Test API response time
time curl -s https://aegisagent.ai/health > /dev/null
# Benchmark endpoints
ab -n 100 -c 10 https://aegisagent.ai/health
Database Query Performance:
-- Slow query log (queries >100ms)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC
LIMIT 10;
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Log Analysis¶
Analyze Error Patterns:
# Count errors by type
grep ERROR /home/agent/logs/*.log | cut -d: -f3 | sort | uniq -c | sort -rn
# Recent errors
tail -n 100 /home/agent/logs/*.log | grep ERROR
# High-frequency events
cat /home/agent/logs/market-scheduler.log | grep -E "(ERROR|WARNING)" | wc -l
Troubleshooting Monitoring Issues¶
Health Check Failures¶
Symptom: /health endpoint returns unhealthy
Diagnosis:
# Check database connectivity
psql -U agent -d aegis -c "SELECT 1;"
# Check container logs
docker logs aegis-dashboard --tail 50
# Verify port binding
netstat -tlnp | grep 8080
Resolution:
# Restart container
cd /home/agent/projects/aegis-core && docker compose restart dashboard
# Rebuild if config changed
cd /home/agent/projects/aegis-core && docker compose up -d --build dashboard
Missing Metrics¶
Symptom: Monitoring data not appearing in logs
Diagnosis:
# Check scheduler status
docker logs aegis-scheduler --tail 50
# Verify scheduled jobs
docker exec aegis-scheduler python -c "from aegis.scheduler import scheduler; print([job.name for job in scheduler.get_jobs()])"
Resolution:
# Restart scheduler
cd /home/agent/projects/aegis-core && docker compose restart scheduler
# Verify jobs are running
docker logs aegis-scheduler -f
High Alert Volume¶
Symptom: Too many alerts being sent
Action:
1. Review alert thresholds in code
2. Implement rate limiting on alerts (see the sketch below)
3. Aggregate similar alerts
4. Increase alert threshold temporarily
# Adjust threshold in scheduler
self.scheduler.add_job(
self.health_check,
IntervalTrigger(minutes=10), # Increase interval
id='health_check',
)
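For steps 2 and 3, a minimal de-duplicating rate limiter sketch (the 15-minute cooldown is an assumption):
# Suppress repeats of the same alert within a cooldown window
import time

class AlertLimiter:
    def __init__(self, cooldown_seconds=900):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> timestamp of last send

    def should_send(self, key):
        now = time.time()
        if now - self.last_sent.get(key, 0) >= self.cooldown:
            self.last_sent[key] = now
            return True
        return False

limiter = AlertLimiter()
if limiter.should_send("memory_usage_high"):
    print("⚠️ High memory usage detected")  # in Aegis this would go to Discord #alerts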
Best Practices¶
- Regular Health Checks: Monitor all critical endpoints every 5 minutes
- Alert Hygiene: Keep alert channels focused (critical vs. info)
- Metric Retention: Store metrics for at least 30 days
- Dashboard Review: Check monitoring dashboard daily during morning routine
- Threshold Tuning: Adjust alert thresholds based on historical data
- External Monitoring: Use external service (UptimeRobot) as backup
- Log Aggregation: Centralize logs for easier analysis
Related Documentation¶
- Logging - Log management and analysis
- Troubleshooting - Common issues and solutions
- Maintenance - Routine maintenance tasks