Monitoring and Observability
Alita Robot provides comprehensive monitoring capabilities including health checks, Prometheus metrics, and resource monitoring.
Health Endpoint
The health endpoint provides real-time status of the bot and its dependencies.
Endpoint
```
GET /health
```

Response Format

```json
{
  "status": "healthy",
  "checks": {
    "database": true,
    "redis": true
  },
  "version": "1.0.0",
  "uptime": "24h30m15s"
}
```

Response Fields

| Field | Type | Description |
|---|---|---|
| status | string | Overall health: healthy or unhealthy |
| checks.database | boolean | PostgreSQL connection status |
| checks.redis | boolean | Redis connection status |
| version | string | Bot version |
| uptime | string | Time since bot started |
HTTP Status Codes
| Code | Meaning |
|---|---|
| 200 | All systems healthy |
| 503 | One or more checks failed |
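
The response fields and status codes above can be consumed by a small client. The following is a minimal Python sketch, not part of Alita Robot itself: the helper names `evaluate_health` and `fetch_health` are hypothetical, and it assumes that a 503 response carries the same JSON body as a 200 response. Note that `urllib` raises `HTTPError` for a 503, so the body has to be read from the exception.

```python
import json
import urllib.error
import urllib.request


def evaluate_health(status_code: int, body: str) -> dict:
    """Summarize a /health response using the documented fields."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    # Names of the dependency checks that reported false
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {
        "healthy": status_code == 200 and payload.get("status") == "healthy",
        "failed_checks": failed,
        "version": payload.get("version"),
        "uptime": payload.get("uptime"),
    }


def fetch_health(url: str = "http://localhost:8080/health",
                 timeout_s: float = 5.0) -> dict:
    """Fetch the endpoint and evaluate it (assumes a JSON body on 503 too)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return evaluate_health(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib treats 503 as an error; the response body is still readable
        return evaluate_health(err.code, err.read().decode("utf-8"))
```

Keeping the parsing in `evaluate_health` separate from the network call makes the logic easy to test without a running bot.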
Health Check Logic
The health check performs:
- Database check: Pings PostgreSQL with a 2-second timeout
- Redis check: Sets and gets a test key with a 2-second timeout
Both checks must pass for healthy status.
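
The rule described above (each check bounded by a timeout, all checks required for a healthy status) can be illustrated as follows. This is a Python sketch of the general pattern, not the bot's actual Go implementation; `run_checks` and the check callables passed to it are hypothetical stand-ins for the PostgreSQL ping and the Redis set/get.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]],
               timeout_s: float = 2.0) -> dict:
    """Run all checks concurrently; a timeout or exception counts as failure."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    try:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                # Wait up to timeout_s for this check's result
                results[name] = bool(future.result(timeout=timeout_s))
            except Exception:
                # Covers both a timed-out wait and an error raised by the check
                results[name] = False
    finally:
        # Do not block on checks that are still running after a timeout
        pool.shutdown(wait=False)
    healthy = bool(results) and all(results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}
```

A caller would map the returned status to HTTP 200 or 503, matching the status code table.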
Usage Examples
```bash
# Simple health check
curl http://localhost:8080/health

# Docker health check (built-in)
/alita_robot --health
```

```yaml
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```

Prometheus Metrics
The metrics endpoint exposes Prometheus-compatible metrics for monitoring.
Endpoint
```
GET /metrics
```

Prometheus Scrape Configuration

Add to your prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'alita-robot'
    static_configs:
      - targets: ['alita:8080']
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics
```

Docker Compose with Prometheus
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    depends_on:
      - alita

volumes:
  prometheus_data:
```

Grafana Dashboard
For visualization, add Grafana:

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

volumes:
  grafana_data:
```

Resource Monitoring
Alita Robot includes automatic resource monitoring to prevent resource exhaustion.
Configuration
```bash
# Maximum goroutines before triggering cleanup
# Default: 1000, Recommended: 1000-5000
RESOURCE_MAX_GOROUTINES=1000

# Maximum memory usage in MB before triggering cleanup
# Default: 500, Recommended: 500-2000
RESOURCE_MAX_MEMORY_MB=500

# Memory threshold for triggering garbage collection
# Default: 400, Recommended: 80% of RESOURCE_MAX_MEMORY_MB
RESOURCE_GC_THRESHOLD_MB=400
```

Auto-Remediation
When thresholds are exceeded, the system automatically:
- Triggers garbage collection
- Logs warnings about resource usage
- Takes corrective action if configured
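
To make the interaction between the three thresholds concrete, here is a rough Python sketch of how such a monitor might decide what to do. It is an illustration only: the real monitor runs inside the Go bot, and the names `ResourceLimits`, `load_limits`, and `plan_actions`, the action labels, and the use of strict `>` comparisons are all assumptions rather than documented behavior. Only the environment variable names and defaults come from the configuration above.

```python
import os
from dataclasses import dataclass


@dataclass
class ResourceLimits:
    max_goroutines: int = 1000
    max_memory_mb: int = 500
    gc_threshold_mb: int = 400


def load_limits(env=os.environ) -> ResourceLimits:
    """Read the documented variables, falling back to the documented defaults."""
    return ResourceLimits(
        max_goroutines=int(env.get("RESOURCE_MAX_GOROUTINES", 1000)),
        max_memory_mb=int(env.get("RESOURCE_MAX_MEMORY_MB", 500)),
        gc_threshold_mb=int(env.get("RESOURCE_GC_THRESHOLD_MB", 400)),
    )


def plan_actions(goroutines: int, memory_mb: float,
                 limits: ResourceLimits) -> list:
    """Return hypothetical action labels for the current resource usage."""
    actions = []
    if memory_mb > limits.gc_threshold_mb:
        actions.append("trigger_gc")       # GC threshold crossed first
    if memory_mb > limits.max_memory_mb:
        actions.append("warn_memory")      # hard memory limit exceeded
    if goroutines > limits.max_goroutines:
        actions.append("warn_goroutines")  # goroutine limit exceeded
    return actions
```

With the defaults, 450 MB of memory would trigger only garbage collection, because the GC threshold (400 MB) sits below the memory limit (500 MB).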
Activity Monitoring
Track group activity automatically:

```bash
# Days of inactivity before marking chat as inactive
# Default: 30, Range: 1-365
INACTIVITY_THRESHOLD_DAYS=30

# Hours between automatic activity checks
# Default: 1, Range: 1-24
ACTIVITY_CHECK_INTERVAL=1

# Enable automatic cleanup of inactive chats
# Default: true
ENABLE_AUTO_CLEANUP=true
```

Activity Metrics
The system tracks:
- DAG: Daily Active Groups
- WAG: Weekly Active Groups
- MAG: Monthly Active Groups
Groups are automatically:
- Marked inactive after the threshold period
- Reactivated when they become active again
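
As an illustration of how these counters relate to the inactivity threshold, the sketch below computes them from a map of chat IDs to last-activity timestamps. It is written in Python for brevity and is not the bot's implementation; the function name `activity_summary` is hypothetical, and the windows of 1, 7, and 30 days for DAG, WAG, and MAG are an assumption based on the metric names.

```python
from datetime import datetime, timedelta, timezone


def activity_summary(last_active: dict, now: datetime,
                     inactivity_threshold_days: int = 30) -> dict:
    """Count active groups per window and list chats past the threshold."""

    def active_within(days: int) -> int:
        cutoff = now - timedelta(days=days)
        return sum(1 for ts in last_active.values() if ts >= cutoff)

    inactive_cutoff = now - timedelta(days=inactivity_threshold_days)
    # Chats whose last activity is older than INACTIVITY_THRESHOLD_DAYS
    inactive = sorted(chat for chat, ts in last_active.items()
                      if ts < inactive_cutoff)
    return {
        "dag": active_within(1),   # assumed window: last 24 hours
        "wag": active_within(7),   # assumed window: last 7 days
        "mag": active_within(30),  # assumed window: last 30 days
        "inactive": inactive,
    }
```

A chat drops out of the `inactive` list as soon as its timestamp is refreshed, which corresponds to the reactivation behavior described above.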
Performance Monitoring
Enable Performance Tracking

```bash
# Enable automatic performance monitoring
ENABLE_PERFORMANCE_MONITORING=true

# Enable background statistics collection
ENABLE_BACKGROUND_STATS=true
```

Collected Metrics

- Message processing time
- Database query duration
- Cache hit/miss rates
- API response times
- Goroutine count
- Memory usage
Debug Mode
For detailed logging during development:

```bash
DEBUG=true
```

Debug mode:
- Increases log verbosity
- Disables background monitoring
- Shows detailed error stack traces
Log Analysis
Log Fields

Structured log entries include:

| Field | Description |
|---|---|
| update_id | Telegram update ID |
| error_type | Error type (e.g., *TelegramError) |
| file | Source file where error occurred |
| line | Line number |
| function | Function name |
Example Log Entry
```json
{
  "level": "error",
  "msg": "Handler error occurred: user blocked bot",
  "update_id": 123456789,
  "error_type": "*gotgbot.TelegramError",
  "file": "alita/modules/admin.go",
  "line": 45,
  "function": "handleAdminCommand",
  "time": "2024-03-15T10:30:00Z"
}
```

Log Levels

| Level | Description |
|---|---|
| DEBUG | Verbose debugging (DEBUG=true only) |
| INFO | Normal operation events |
| WARN | Expected issues (e.g., user blocked bot) |
| ERROR | Unexpected errors |
| FATAL | Critical errors that stop the bot |
| PANIC | Unrecoverable errors |
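
Because the entries are structured, they can be aggregated with a few lines of code. The sketch below groups error-level entries by `error_type` and by source location. It assumes the bot writes one JSON object per line, which this document does not confirm, and the function name `summarize_errors` is hypothetical.

```python
import json
from collections import Counter


def summarize_errors(lines) -> dict:
    """Count error-level entries by error_type and by file:line."""
    by_type = Counter()
    by_location = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # not a JSON log line
            continue
        if not isinstance(entry, dict):
            continue
        if str(entry.get("level", "")).lower() != "error":
            continue
        by_type[entry.get("error_type", "unknown")] += 1
        location = f'{entry.get("file", "?")}:{entry.get("line", "?")}'
        by_location[location] += 1
    return {
        "by_type": dict(by_type),
        "by_location": dict(by_location),
        "skipped": skipped,
    }
```

Passing an open log file object as `lines` processes it one line at a time, so large logs do not need to fit in memory.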
Alerting
Prometheus Alerting Rules

Create alerts.yml:

```yaml
groups:
  - name: alita-alerts
    rules:
      - alert: AlitaUnhealthy
        expr: up{job="alita-robot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Alita Robot is down"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 500000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"

      - alert: DatabaseConnectionFailed
        expr: alita_health_database == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection failed"
```
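
Before relying on these rules, it can help to confirm that the metrics they reference actually appear in a `/metrics` payload. The following Python sketch parses the Prometheus text exposition format in a simplified way and applies the same thresholds as the `HighMemoryUsage` and `DatabaseConnectionFailed` rules. The helper names are hypothetical, the parser ignores edge cases such as timestamps, and it does not model the `for:` duration. The `up` series is generated by Prometheus at scrape time, so it is not present in the payload and is not checked here.

```python
def parse_samples(text: str) -> dict:
    """Map each series (name plus optional labels) to its float value."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labeled series: the value follows the closing brace
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[series] = float(fields[0])
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return samples


def firing_conditions(samples: dict,
                      memory_limit_bytes: float = 500_000_000) -> list:
    """Return the rule names whose expressions hold for these samples."""
    firing = []
    if samples.get("process_resident_memory_bytes", 0.0) > memory_limit_bytes:
        firing.append("HighMemoryUsage")
    if samples.get("alita_health_database") == 0:
        firing.append("DatabaseConnectionFailed")
    return firing
```

If `alita_health_database` is missing from the parsed samples, the corresponding rule can never fire, which is worth verifying against a live scrape.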