Monitoring¶
Comprehensive guide to Aegis system monitoring, health checks, and alerting.
Health Check Endpoints¶
Dashboard Health Check¶
Endpoint: /health
Description: Primary health check endpoint for the dashboard service. Tests database connectivity and returns service status.
Response Format:
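The response body is not reproduced here; a representative example ("status" returning "healthy" matches the UptimeRobot keyword check later in this page, the other fields are illustrative assumptions):
{
  "status": "healthy|unhealthy",
  "database": "connected",
  "timestamp": "2026-01-25T23:00:00Z"
}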
Usage:
# Check dashboard health
curl https://aegisagent.ai/health
# Check internal service
curl http://localhost:8080/health
Monitoring Script: /home/agent/projects/aegis-core/scripts/health_check.sh
# Run complete health check suite
./scripts/health_check.sh https://aegisagent.ai
# Check internal endpoints
./scripts/health_check.sh http://localhost:8080
API Status Endpoint¶
Endpoint: /api/status
Description: Returns detailed system status including phase, version, and operational state.
Response Format:
{
"status": "operational|degraded|down",
"phase": "1|2|3",
"version": "4.16",
"timestamp": "2026-01-25T23:00:00Z"
}
Rate Limit Status¶
Endpoint: /api/rate-limit
Description: Check current rate limit status and remaining requests.
Response Format:
{
"tier": "free|developer|pro|enterprise",
"limit_per_hour": 100,
"remaining": 87,
"reset_seconds": 1245,
"authenticated": true
}
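Clients can poll this endpoint before batching requests. A minimal sketch using the documented fields (the X-API-Key header name is an assumption, not a documented parameter):
# Check remaining quota before sending a batch of requests
import requests

def remaining_requests(base_url="https://aegisagent.ai", api_key=None):
    headers = {"X-API-Key": api_key} if api_key else {}
    resp = requests.get(f"{base_url}/api/rate-limit", headers=headers, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Fields documented above: tier, limit_per_hour, remaining, reset_seconds
    return data["remaining"], data["reset_seconds"]

remaining, reset_in = remaining_requests()
print(f"{remaining} requests left; window resets in {reset_in}s")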
Docker Container Health Checks¶
Container Configuration¶
All Aegis containers are configured with health checks in docker-compose.yml.
Dashboard Health Check:
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"]
interval: 30s
timeout: 10s
retries: 3
Playwright Health Check:
healthcheck:
test: ["CMD", "wget", "-q", "-O", "-", "http://localhost:3000/health"]
interval: 30s
timeout: 10s
retries: 3
FalkorDB Health Check:
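The FalkorDB block is not shown above; a typical configuration (a sketch, assuming the container ships redis-cli) would be:
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 30s
  timeout: 10s
  retries: 3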
Check Container Status¶
# View all container health status
docker ps --format "table {{.Names}}\t{{.Status}}"
# Check specific container health
docker inspect aegis-dashboard --format='{{.State.Health.Status}}'
# View health check logs
docker inspect aegis-dashboard --format='{{json .State.Health}}' | jq
Scheduled Health Checks¶
The scheduler runs automated health checks every 5 minutes:
# System health check (every 5 minutes)
self.scheduler.add_job(
self.health_check,
IntervalTrigger(minutes=5),
id='health_check',
name='System Health Check',
)
# Docker container health check (every 5 minutes)
self.scheduler.add_job(
self.docker_health_check,
IntervalTrigger(minutes=5),
id='docker_health_check',
name='Docker Container Health Check',
)
Location: /home/agent/projects/aegis-core/aegis/scheduler.py
Resource Monitoring¶
CPU and Memory Monitoring¶
Check System Resources:
# Overall system stats
docker stats --no-stream
# Detailed container stats
docker stats aegis-dashboard --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Host system resources
htop
# Memory breakdown
free -h
df -h
Resource Limits¶
Memory Limits: 110 GB total available in the LXC container
Monitor Memory Usage:
# Check current memory usage
free -m | awk 'NR==2{printf "Memory Usage: %.2f%%\n", $3*100/$2}'
# Check disk usage
df -h /home/agent | awk 'NR==2{print "Disk Usage: " $5}'
# PostgreSQL memory
psql -U agent -d aegis -c "SELECT pg_size_pretty(pg_database_size('aegis'));"
Alert Thresholds¶
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Memory Usage | >80% | >90% | Alert to Discord #alerts |
| Disk Usage | >70% | >85% | Alert to Discord #alerts, run cleanup |
| CPU Usage | >80% | >95% | Alert to Discord #alerts, investigate |
| Container Restarts | >3/hour | >10/hour | Alert to Discord #alerts |
| Database Connections | >80% max | >95% max | Alert, kill idle connections |
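One way these thresholds could be wired into the 5-minute health job is sketched below (psutil is assumed to be available; the alert callback stands in for the Discord routing described later):
# Evaluate the warning/critical thresholds from the table above (values hard-coded for illustration)
import shutil
import psutil

THRESHOLDS = {
    "memory_percent": (80, 90),
    "disk_percent": (70, 85),
    "cpu_percent": (80, 95),
}

def check_thresholds(alert):
    disk = shutil.disk_usage("/home/agent")
    readings = {
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": disk.used / disk.total * 100,
        "cpu_percent": psutil.cpu_percent(interval=1),
    }
    for metric, value in readings.items():
        warn, crit = THRESHOLDS[metric]
        if value >= crit:
            alert(f"🚨 Critical: {metric} at {value:.1f}%")
        elif value >= warn:
            alert(f"⚠️ Warning: {metric} at {value:.1f}%")

# Example: route to stdout here; in Aegis this would post to Discord #alerts
check_thresholds(print)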
Scheduled Monitoring Jobs¶
The following monitoring jobs run automatically via the scheduler:
Health Checks (Every 5 Minutes)¶
System Health Check:
- Database connectivity
- API responsiveness
- Memory usage
- Disk space
Docker Health Check:
- Container status (running/stopped/unhealthy)
- Restart counts
- Health check status
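A condensed sketch of what the system health job can look like (connection parameters and the notify callback are illustrative; the real implementation lives in aegis/scheduler.py):
# Simplified health_check job: probe the database and the dashboard endpoint
import psycopg2
import requests

def health_check(notify=print):
    problems = []
    try:
        # Database connectivity (DSN values are illustrative)
        conn = psycopg2.connect(dbname="aegis", user="agent", host="localhost")
        conn.cursor().execute("SELECT 1;")
        conn.close()
    except Exception as exc:
        problems.append(f"database: {exc}")
    try:
        # API responsiveness
        requests.get("http://localhost:8080/health", timeout=10).raise_for_status()
    except Exception as exc:
        problems.append(f"dashboard: {exc}")
    if problems:
        notify("⚠️ Health check failed: " + "; ".join(problems))

health_check()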
Website Monitoring (Every 30 Minutes)¶
Endpoint Monitoring:
- https://aegisagent.ai
- https://intel.aegisagent.ai
- https://notebooks.aegisagent.ai
- https://code.aegisagent.ai
Configuration: /home/agent/projects/aegis-core/aegis/monitor/scheduler.py
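A minimal sketch of an endpoint monitor over these URLs (the alert callback is a placeholder, not the shipped implementation):
# Probe each public endpoint and report anything that is down or erroring
import requests

ENDPOINTS = [
    "https://aegisagent.ai",
    "https://intel.aegisagent.ai",
    "https://notebooks.aegisagent.ai",
    "https://code.aegisagent.ai",
]

def monitor_endpoints(alert=print, timeout=15):
    for url in ENDPOINTS:
        try:
            resp = requests.get(url, timeout=timeout)
            if resp.status_code >= 500:
                alert(f"⚠️ {url} returned HTTP {resp.status_code}")
        except requests.RequestException as exc:
            alert(f"🚨 {url} unreachable: {exc}")

monitor_endpoints()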
Infrastructure Anomaly Detection (Every Hour)¶
Metrics Tracked:
- CPU usage patterns
- Memory allocation trends
- Network traffic anomalies
- Error rate spikes
Configuration: /home/agent/projects/aegis-core/aegis/infra/anomaly/scheduler.py
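The detection method itself is not documented here; a rolling z-score check is one simple approach (an assumption, not the shipped algorithm):
# Flag a metric sample as anomalous when it deviates strongly from its recent history
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window=24, threshold=3.0):
        self.history = deque(maxlen=window)  # e.g. the last 24 hourly samples
        self.threshold = threshold           # z-score cutoff

    def is_anomaly(self, value):
        if len(self.history) >= 3:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        else:
            anomalous = False  # not enough history yet
        self.history.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for cpu in [22, 25, 21, 24, 23, 26, 22, 95]:  # sudden spike at the end
    if detector.is_anomaly(cpu):
        print(f"Anomaly: CPU at {cpu}%")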
Notification Flow¶
Alert Channels¶
Discord:
- #alerts (1455049130614329508): System alerts, errors, warnings
- #logs (1455049129313959937): Operational logs
- #journal (1455049131725816023): Daily journal entries
Telegram:
- Chat ID: 1275129801
- Time-sensitive alerts
- System health critical alerts
- Emergency escalations
WhatsApp:
- Number: +44 7441 443388
- Authorized users only
- Two-way command interface
Alert Routing¶
# Critical alerts → Discord #alerts + Telegram
# Warnings → Discord #alerts
# Info logs → Discord #logs
# Journal updates → Discord #journal
MCP Tools:
# Discord alert
mcp__discord__discord_send_message(
channel_id="1455049130614329508",
content="⚠️ High memory usage detected: 92%"
)
# Telegram alert
mcp__telegram__telegram_send_message(
chat_id="1275129801",
text="🚨 Critical: Database connection pool exhausted"
)
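A small routing helper that follows the mapping above; the discord_send and telegram_send callables stand in for the MCP tool calls shown earlier and are assumptions:
# Route an alert to the right channels based on severity (per the routing rules above)
DISCORD_ALERTS = "1455049130614329508"
DISCORD_LOGS = "1455049129313959937"
TELEGRAM_CHAT = "1275129801"

def route_alert(severity, message, discord_send, telegram_send):
    """severity: 'critical', 'warning', or 'info'."""
    if severity == "critical":
        discord_send(channel_id=DISCORD_ALERTS, content=f"🚨 {message}")
        telegram_send(chat_id=TELEGRAM_CHAT, text=f"🚨 {message}")
    elif severity == "warning":
        discord_send(channel_id=DISCORD_ALERTS, content=f"⚠️ {message}")
    else:
        discord_send(channel_id=DISCORD_LOGS, content=message)

# Example (callables are placeholders for the MCP tools above):
# route_alert("critical", "Database connection pool exhausted", mcp_discord_send, mcp_telegram_send)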
Monitoring Dashboard¶
System Metrics Endpoints¶
Task Statistics: /api/tasks
OODA Cycles: /api/ooda
Recent Memory: /api/memory/recent
{
"memories": [
{
"id": 1,
"event_type": "decision",
"summary": "Chose Docker Compose for deployment",
"importance": 8,
"timestamp": "2026-01-25T12:00:00Z"
}
]
}
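These endpoints can be polled from a script or cron job; a minimal sketch (only the /api/memory/recent response shape is documented above, the others are printed by status code only):
# Pull the dashboard metrics endpoints and summarize recent memories
import requests

BASE = "https://aegisagent.ai"

for path in ("/api/tasks", "/api/ooda", "/api/memory/recent"):
    resp = requests.get(BASE + path, timeout=10)
    print(path, resp.status_code)

recent = requests.get(BASE + "/api/memory/recent", timeout=10).json()
for memory in recent.get("memories", []):
    print(f'[{memory["importance"]}] {memory["event_type"]}: {memory["summary"]}')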
External Monitoring¶
UptimeRobot (recommended):
- Monitor: https://aegisagent.ai/health
- Interval: 5 minutes
- Alert: Email/SMS on downtime
Setup:
# Add to UptimeRobot:
# URL: https://aegisagent.ai/health
# Type: HTTP(s)
# Keyword: "healthy"
# Interval: 5 minutes
Performance Monitoring¶
Application Performance¶
Response Time Monitoring:
# Test API response time
time curl -s https://aegisagent.ai/health > /dev/null
# Benchmark endpoints
ab -n 100 -c 10 https://aegisagent.ai/health
Database Query Performance:
-- Slow query log (queries >100ms)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC
LIMIT 10;
-- Active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Log Analysis¶
Analyze Error Patterns:
# Count errors by type
grep ERROR /home/agent/logs/*.log | cut -d: -f3 | sort | uniq -c | sort -rn
# Recent errors
tail -n 100 /home/agent/logs/*.log | grep ERROR
# High-frequency events
cat /home/agent/logs/market-scheduler.log | grep -E "(ERROR|WARNING)" | wc -l
Troubleshooting Monitoring Issues¶
Health Check Failures¶
Symptom: /health endpoint returns unhealthy
Diagnosis:
# Check database connectivity
psql -U agent -d aegis -c "SELECT 1;"
# Check container logs
docker logs aegis-dashboard --tail 50
# Verify port binding
netstat -tlnp | grep 8080
Resolution:
# Restart container
cd /home/agent/projects/aegis-core && docker compose restart dashboard
# Rebuild if config changed
cd /home/agent/projects/aegis-core && docker compose up -d --build dashboard
Missing Metrics¶
Symptom: Monitoring data not appearing in logs
Diagnosis:
# Check scheduler status
docker logs aegis-scheduler --tail 50
# Verify scheduled jobs
docker exec aegis-scheduler python -c "from aegis.scheduler import scheduler; print([job.name for job in scheduler.get_jobs()])"
Resolution:
# Restart scheduler
cd /home/agent/projects/aegis-core && docker compose restart scheduler
# Verify jobs are running
docker logs aegis-scheduler -f
High Alert Volume¶
Symptom: Too many alerts being sent
Action:
1. Review alert thresholds in code
2. Implement rate limiting on alerts (see the sketch below)
3. Aggregate similar alerts
4. Increase alert threshold temporarily
# Adjust threshold in scheduler
self.scheduler.add_job(
self.health_check,
IntervalTrigger(minutes=10), # Increase interval
id='health_check',
)
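For steps 2 and 3, a minimal de-duplicating rate limiter sketch (the 15-minute cooldown is an assumption):
# Suppress repeats of the same alert within a cooldown window
import time

class AlertLimiter:
    def __init__(self, cooldown_seconds=900):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> timestamp of last send

    def should_send(self, key):
        now = time.time()
        if now - self.last_sent.get(key, 0) >= self.cooldown:
            self.last_sent[key] = now
            return True
        return False

limiter = AlertLimiter()
if limiter.should_send("memory_usage_high"):
    print("⚠️ High memory usage detected")  # in Aegis this would go to Discord #alerts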
Best Practices¶
- Regular Health Checks: Monitor all critical endpoints every 5 minutes
- Alert Hygiene: Keep alert channels focused (critical vs. info)
- Metric Retention: Store metrics for at least 30 days
- Dashboard Review: Check monitoring dashboard daily during morning routine
- Threshold Tuning: Adjust alert thresholds based on historical data
- External Monitoring: Use external service (UptimeRobot) as backup
- Log Aggregation: Centralize logs for easier analysis
Related Documentation¶
- Logging - Log management and analysis
- Troubleshooting - Common issues and solutions
- Maintenance - Routine maintenance tasks