Monitoring

Comprehensive guide to Aegis system monitoring, health checks, and alerting.

Health Check Endpoints

Dashboard Health Check

Endpoint: /health

Description: Primary health check endpoint for the dashboard service. Tests database connectivity and returns service status.

Response Format:

{
  "status": "healthy|unhealthy",
  "database": "connected|<error message>"
}

Usage:

# Check dashboard health
curl https://aegisagent.ai/health

# Check internal service
curl http://localhost:8080/health

Monitoring Script: /home/agent/projects/aegis-core/scripts/health_check.sh

# Run complete health check suite
./scripts/health_check.sh https://aegisagent.ai

# Check internal endpoints
./scripts/health_check.sh http://localhost:8080
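
If the script is unavailable, a minimal stand-in can be run directly from the shell. This is only a sketch of the same idea (curl the /health endpoint and look for a healthy status), not the contents of health_check.sh:

#!/usr/bin/env bash
# Minimal stand-in for health_check.sh (sketch only).
# Usage: ./quick_health.sh https://aegisagent.ai
BASE_URL="${1:-http://localhost:8080}"

# /health reports "healthy" when the database is reachable
if curl -fsS "$BASE_URL/health" | grep -q '"status": *"healthy"'; then
    echo "OK: $BASE_URL/health reports healthy"
else
    echo "FAIL: $BASE_URL/health did not report healthy"
    exit 1
fi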

API Status Endpoint

Endpoint: /api/status

Description: Returns detailed system status including phase, version, and operational state.

Response Format:

{
  "status": "operational|degraded|down",
  "phase": "1|2|3",
  "version": "4.16",
  "timestamp": "2026-01-25T23:00:00Z"
}
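
Usage (mirrors the /health examples above; jq is only used here to pretty-print and pick fields):

# Query overall system status
curl -s https://aegisagent.ai/api/status | jq

# Extract just the operational state
curl -s https://aegisagent.ai/api/status | jq -r '.status'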

Rate Limit Status

Endpoint: /api/rate-limit

Description: Check current rate limit status and remaining requests.

Response Format:

{
  "tier": "free|developer|pro|enterprise",
  "limit_per_hour": 100,
  "remaining": 87,
  "reset_seconds": 1245,
  "authenticated": true
}
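
Usage (shown without credentials; the authenticated field in the response presumably reflects whether any were supplied):

# Check the current tier and remaining requests
curl -s https://aegisagent.ai/api/rate-limit | jq '{tier, remaining, reset_seconds}'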

Docker Container Health Checks

Container Configuration

All Aegis containers are configured with health checks in docker-compose.yml.

Dashboard Health Check:

healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
  interval: 30s
  timeout: 10s
  retries: 3

Playwright Health Check:

healthcheck:
  test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:3000/health"]
  interval: 30s
  timeout: 10s
  retries: 3

FalkorDB Health Check:

healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 30s
  timeout: 10s
  retries: 3

Check Container Status

# View all container health status
docker ps --format "table {{.Names}}\t{{.Status}}"

# Check specific container health
docker inspect aegis-dashboard --format='{{.State.Health.Status}}'

# View health check logs
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq
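
To scan every container at once, a short loop over docker ps works (a sketch; it reports "no healthcheck" for containers that do not define one):

# Print the health status of every running container
for name in $(docker ps --format '{{.Names}}'); do
    status=$(docker inspect "$name" \
        --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}')
    echo "$name: $status"
done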

Scheduled Health Checks

The scheduler runs automated health checks every 5 minutes:

# System health check (every 5 minutes)
self.scheduler.add_job(
    self.health_check,
    IntervalTrigger(minutes=5),
    id='health_check',
    name='System Health Check',
)

# Docker container health check (every 5 minutes)
self.scheduler.add_job(
    self.docker_health_check,
    IntervalTrigger(minutes=5),
    id='docker_health_check',
    name='Docker Container Health Check',
)

Location: /home/agent/projects/aegis-core/aegis/scheduler.py

Resource Monitoring

CPU and Memory Monitoring

Check System Resources:

# Overall system stats
docker stats --no-stream

# Detailed container stats
docker stats aegis-dashboard --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Host system resources
htop

# Memory breakdown
free -h
df -h

Resource Limits

Memory Limits: 110 GB total available in the LXC container

Monitor Memory Usage:

# Check current memory usage
free -m | awk 'NR==2{printf "Memory Usage: %.2f%%\n", $3*100/$2}'

# Check disk usage
df -h /home/agent | awk 'NR==2{print "Disk Usage: " $5}'

# PostgreSQL memory
psql -U agent -d aegis -c "SELECT pg_size_pretty(pg_database_size('aegis'));"

Alert Thresholds

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| Memory Usage | >80% | >90% | Alert to Discord #alerts |
| Disk Usage | >70% | >85% | Alert to Discord #alerts, run cleanup |
| CPU Usage | >80% | >95% | Alert to Discord #alerts, investigate |
| Container Restarts | >3/hour | >10/hour | Alert to Discord #alerts |
| Database Connections | >80% of max | >95% of max | Alert, kill idle connections |
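
The memory and disk thresholds above can be checked from the shell with the same free/df commands shown earlier; the echo lines below are placeholders, since real alerts go out via the MCP tools described under Notification Flow (a sketch, not the production alerting code):

#!/usr/bin/env bash
# Warn when memory or disk usage crosses the thresholds in the table above.
MEM_PCT=$(free -m | awk 'NR==2{printf "%.0f", $3*100/$2}')
DISK_PCT=$(df -h /home/agent | awk 'NR==2{print $5}' | tr -d '%')

[ "$MEM_PCT" -ge 90 ] && echo "CRITICAL: memory at ${MEM_PCT}%"
[ "$MEM_PCT" -ge 80 ] && [ "$MEM_PCT" -lt 90 ] && echo "WARNING: memory at ${MEM_PCT}%"
[ "$DISK_PCT" -ge 85 ] && echo "CRITICAL: disk at ${DISK_PCT}%"
[ "$DISK_PCT" -ge 70 ] && [ "$DISK_PCT" -lt 85 ] && echo "WARNING: disk at ${DISK_PCT}%"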

Scheduled Monitoring Jobs

The following monitoring jobs run automatically via the scheduler:

Health Checks (Every 5 Minutes)

System Health Check:
- Database connectivity
- API responsiveness
- Memory usage
- Disk space

Docker Health Check:
- Container status (running/stopped/unhealthy)
- Restart counts
- Health check status

Website Monitoring (Every 30 Minutes)

Endpoint Monitoring:
- https://aegisagent.ai
- https://intel.aegisagent.ai
- https://notebooks.aegisagent.ai
- https://code.aegisagent.ai

Configuration: /home/agent/projects/aegis-core/aegis/monitor/scheduler.py
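
The same endpoints can be spot-checked manually from the shell (a sketch; the scheduled monitor itself lives in the configuration file above):

# Report the HTTP status code for each monitored site
for url in https://aegisagent.ai https://intel.aegisagent.ai \
           https://notebooks.aegisagent.ai https://code.aegisagent.ai; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    echo "$url -> $code"
done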

Infrastructure Anomaly Detection (Every Hour)

Metrics Tracked:
- CPU usage patterns
- Memory allocation trends
- Network traffic anomalies
- Error rate spikes

Configuration: /home/agent/projects/aegis-core/aegis/infra/anomaly/scheduler.py

Notification Flow

Alert Channels

Discord:
- #alerts (1455049130614329508): System alerts, errors, warnings
- #logs (1455049129313959937): Operational logs
- #journal (1455049131725816023): Daily journal entries

Telegram:
- Chat ID: 1275129801
- Time-sensitive alerts
- System health critical alerts
- Emergency escalations

WhatsApp:
- Number: +44 7441 443388
- Authorized users only
- Two-way command interface

Alert Routing

# Critical alerts → Discord #alerts + Telegram
# Warnings → Discord #alerts
# Info logs → Discord #logs
# Journal updates → Discord #journal

MCP Tools:

# Discord alert
mcp__discord__discord_send_message(
    channel_id="1455049130614329508",
    content="⚠️ High memory usage detected: 92%"
)

# Telegram alert
mcp__telegram__telegram_send_message(
    chat_id="1275129801",
    text="🚨 Critical: Database connection pool exhausted"
)

Monitoring Dashboard

System Metrics Endpoints

Task Statistics: /api/tasks

{
  "total": 1234,
  "pending": 5,
  "in_progress": 2,
  "completed": 1227
}

OODA Cycles: /api/ooda

{
  "success_rate": 0.87,
  "recent_cycles": 10
}

Recent Memory: /api/memory/recent

{
  "memories": [
    {
      "id": 1,
      "event_type": "decision",
      "summary": "Chose Docker Compose for deployment",
      "importance": 8,
      "timestamp": "2026-01-25T12:00:00Z"
    }
  ]
}
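
These endpoints can be polled with curl in the same way as /health; for example:

# Snapshot of the task backlog and OODA success rate
curl -s https://aegisagent.ai/api/tasks | jq '{pending, in_progress}'
curl -s https://aegisagent.ai/api/ooda | jq '.success_rate'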

External Monitoring

UptimeRobot (recommended):
- Monitor: https://aegisagent.ai/health
- Interval: 5 minutes
- Alert: Email/SMS on downtime

Setup:

# Add to UptimeRobot:
# URL: https://aegisagent.ai/health
# Type: HTTP(s)
# Keyword: "healthy"
# Interval: 5 minutes
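
The same keyword check can be reproduced locally to verify the endpoint before handing it to an external monitor (a sketch):

# Succeeds only when the /health response contains the keyword "healthy"
curl -fsS https://aegisagent.ai/health | grep -q healthy && echo "UP" || echo "DOWN"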

Performance Monitoring

Application Performance

Response Time Monitoring:

# Test API response time
time curl -s https://aegisagent.ai/health > /dev/null

# Benchmark endpoints
ab -n 100 -c 10 https://aegisagent.ai/health

Database Query Performance:

-- Slow queries (mean time > 100 ms); requires the pg_stat_statements extension.
-- On PostgreSQL 13+ the columns are named total_exec_time and mean_exec_time.
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC
LIMIT 10;

-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
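
To relate the connection count to the ">80% of max" threshold from the alert table, compare it against max_connections (a sketch using the same psql access as above):

# Percentage of max_connections currently in use
psql -U agent -d aegis -t -A -c \
  "SELECT round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
   FROM pg_stat_activity;"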

Log Analysis

Analyze Error Patterns:

# Count errors by type
grep ERROR /home/agent/logs/*.log | cut -d: -f3 | sort | uniq -c | sort -rn

# Recent errors
tail -n 100 /home/agent/logs/*.log | grep ERROR

# High-frequency events
grep -cE "(ERROR|WARNING)" /home/agent/logs/market-scheduler.log

Troubleshooting Monitoring Issues

Health Check Failures

Symptom: /health endpoint returns unhealthy

Diagnosis:

# Check database connectivity
psql -U agent -d aegis -c "SELECT 1;"

# Check container logs
docker logs aegis-dashboard --tail 50

# Verify port binding
netstat -tlnp | grep 8080

Resolution:

# Restart container
cd /home/agent/projects/aegis-core && docker compose restart dashboard

# Rebuild if config changed
cd /home/agent/projects/aegis-core && docker compose up -d --build dashboard

Missing Metrics

Symptom: Monitoring data not appearing in logs

Diagnosis:

# Check scheduler status
docker logs aegis-scheduler --tail 50

# Verify scheduled jobs
docker exec aegis-scheduler python -c "from aegis.scheduler import scheduler; print([job.name for job in scheduler.get_jobs()])"

Resolution:

# Restart scheduler
cd /home/agent/projects/aegis-core && docker compose restart scheduler

# Verify jobs are running
docker logs aegis-scheduler -f

High Alert Volume

Symptom: Too many alerts being sent

Action:
1. Review alert thresholds in code
2. Implement rate limiting on alerts (see the cooldown sketch below)
3. Aggregate similar alerts
4. Increase alert thresholds temporarily

# Increase the check interval in the scheduler
self.scheduler.add_job(
    self.health_check,
    IntervalTrigger(minutes=10),  # Increase interval
    id='health_check',
)
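
For alert rate limiting (item 2 in the list above), a simple cooldown file is often enough for shell-driven alerts. The helper name and 30-minute window below are illustrative only, and the echo stands in for the Discord/Telegram MCP calls:

#!/usr/bin/env bash
# send_alert_once: suppress duplicate alerts within a cooldown window (hypothetical helper).
COOLDOWN_SECONDS=1800  # 30 minutes between identical alerts

send_alert_once() {
    local key="$1" message="$2"
    local stamp="/tmp/alert_${key}.stamp"
    local now last
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null || echo 0)
    if [ $((now - last)) -ge "$COOLDOWN_SECONDS" ]; then
        echo "$now" > "$stamp"
        echo "ALERT: $message"   # replace with the Discord/Telegram MCP call
    fi
}

send_alert_once "memory_high" "High memory usage detected: 92%"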

Best Practices

  1. Regular Health Checks: Monitor all critical endpoints every 5 minutes
  2. Alert Hygiene: Keep alert channels focused (critical vs. info)
  3. Metric Retention: Store metrics for at least 30 days
  4. Dashboard Review: Check monitoring dashboard daily during morning routine
  5. Threshold Tuning: Adjust alert thresholds based on historical data
  6. External Monitoring: Use external service (UptimeRobot) as backup
  7. Log Aggregation: Centralize logs for easier analysis