Monitoring and Observability
Alita Robot provides comprehensive monitoring capabilities, including health checks, Prometheus metrics, Sentry error tracking, and resource monitoring.
Health Endpoint
The health endpoint provides real-time status of the bot and its dependencies.
Endpoint
GET /health

Response Format
{
  "status": "healthy",
  "checks": {
    "database": true,
    "redis": true
  },
  "version": "1.0.0",
  "uptime": "24h30m15s"
}
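For reference, the same response expressed as a Go struct (illustrative only; the type and field names below are not taken from the bot's source):

package health

// HealthStatus mirrors the JSON returned by GET /health.
// Type and field names are illustrative, not the bot's actual types.
type HealthStatus struct {
	Status  string          `json:"status"`  // "healthy" or "unhealthy"
	Checks  map[string]bool `json:"checks"`  // e.g. {"database": true, "redis": true}
	Version string          `json:"version"` // bot version
	Uptime  string          `json:"uptime"`  // time since the bot started
}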
Response Fields

| Field | Type | Description |
|---|---|---|
| status | string | Overall health: healthy or unhealthy |
| checks.database | boolean | PostgreSQL connection status |
| checks.redis | boolean | Redis connection status |
| version | string | Bot version |
| uptime | string | Time since bot started |
HTTP Status Codes
| Code | Meaning |
|---|---|
| 200 | All systems healthy |
| 503 | One or more checks failed |
Health Check Logic
The health check performs:
- Database check: Pings PostgreSQL with a 2-second timeout
- Redis check: Sets and gets a test key with a 2-second timeout
Both checks must pass for a healthy status; a sketch of this logic follows.
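A minimal sketch of the check in Go, assuming a database/sql handle and the go-redis client (the bot's actual implementation may differ in detail):

package health

import (
	"context"
	"database/sql"
	"time"

	"github.com/redis/go-redis/v9"
)

// checkHealth pings PostgreSQL and round-trips a test key through Redis,
// each with a 2-second timeout. Both must succeed for a "healthy" status.
func checkHealth(db *sql.DB, rdb *redis.Client) (dbOK, redisOK bool) {
	// Database check: ping PostgreSQL with a 2-second timeout.
	dbCtx, cancelDB := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelDB()
	dbOK = db.PingContext(dbCtx) == nil

	// Redis check: set and get a test key with a 2-second timeout.
	rCtx, cancelRedis := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelRedis()
	if rdb.Set(rCtx, "health:ping", "ok", time.Minute).Err() == nil {
		redisOK = rdb.Get(rCtx, "health:ping").Err() == nil
	}
	return dbOK, redisOK
}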
Usage Examples
# Simple health check
curl http://localhost:8080/health
# Docker health check (built-in)
/alita_robot --health
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

Prometheus Metrics
The metrics endpoint exposes Prometheus-compatible metrics for monitoring.
Endpoint
GET /metrics

Prometheus Scrape Configuration
Add to your prometheus.yml:
scrape_configs:
  - job_name: 'alita-robot'
    static_configs:
      - targets: ['alita:8080']
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: /metrics

Docker Compose with Prometheus
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    depends_on:
      - alita
volumes:
  prometheus_data:

Grafana Dashboard
For visualization, add Grafana:
services:
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
volumes:
  grafana_data:

Sentry Error Tracking
Sentry provides real-time error tracking and performance monitoring.
Configuration
# Enable Sentry integration
ENABLE_SENTRY=true
# Sentry DSN from your project settings
SENTRY_DSN=https://your-key@o123456.ingest.sentry.io/1234567
# Environment name (helps organize errors)
SENTRY_ENVIRONMENT=production
# Sample rate for error events (0.0-1.0)
# Default: 1.0 (100% of errors sent)
SENTRY_SAMPLE_RATE=1.0
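A sketch of how these variables could be wired into the sentry-go SDK; the environment variable names come from the configuration above, while the function and its error handling are illustrative:

package main

import (
	"log"
	"os"
	"strconv"

	"github.com/getsentry/sentry-go"
)

// initSentry reads the variables documented above and initializes the SDK.
// Illustrative only; the bot's actual wiring may differ.
func initSentry() {
	if os.Getenv("ENABLE_SENTRY") != "true" {
		return
	}
	rate, err := strconv.ParseFloat(os.Getenv("SENTRY_SAMPLE_RATE"), 64)
	if err != nil || rate <= 0 {
		rate = 1.0 // default: send 100% of errors
	}
	if err := sentry.Init(sentry.ClientOptions{
		Dsn:         os.Getenv("SENTRY_DSN"),
		Environment: os.Getenv("SENTRY_ENVIRONMENT"),
		SampleRate:  rate,
	}); err != nil {
		log.Printf("sentry init failed: %v", err)
	}
}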
Sentry Features

- Error Tracking: Automatically captures and reports errors
- Logrus Integration: Error, Fatal, and Panic log levels are sent to Sentry
- Context Enrichment: Errors include user ID, chat ID, and message context
- Filtering: Sensitive data (the bot token) is automatically redacted
- Expected Errors: Common errors such as “user blocked bot” are suppressed (a filtering sketch follows this list)
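The redaction and suppression above are the kind of logic a sentry-go BeforeSend hook can express. The sketch below is illustrative only: the BOT_TOKEN variable name and the matched error string are assumptions, not taken from the bot's source.

package main

import (
	"os"
	"strings"

	"github.com/getsentry/sentry-go"
)

// filterEvent sketches a BeforeSend hook: drop expected errors and redact
// the bot token. BOT_TOKEN and the matched error string are assumptions.
func filterEvent(event *sentry.Event, hint *sentry.EventHint) *sentry.Event {
	if hint != nil && hint.OriginalException != nil &&
		strings.Contains(hint.OriginalException.Error(), "bot was blocked by the user") {
		return nil // suppress expected errors entirely
	}
	event.Message = strings.ReplaceAll(event.Message, os.Getenv("BOT_TOKEN"), "[REDACTED]")
	return event
}

A function like this would be passed as BeforeSend in sentry.ClientOptions.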
Error Context
When an error occurs, Sentry receives the following context (a sketch of attaching it follows the list):
- Update ID
- User ID and username
- Chat ID and type
- Message ID and text
- Source file, line, and function
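A sketch of attaching this context with the sentry-go scope API; the helper and its parameters are hypothetical, and the source file, line, and function come from the stack trace Sentry captures automatically:

package main

import (
	"strconv"

	"github.com/getsentry/sentry-go"
)

// reportError is a hypothetical helper showing how the context listed above
// could be attached to an event.
func reportError(err error, updateID, userID int64, username string,
	chatID int64, chatType string, messageID int64, text string) {
	sentry.WithScope(func(scope *sentry.Scope) {
		scope.SetUser(sentry.User{ID: strconv.FormatInt(userID, 10), Username: username})
		scope.SetTag("chat_type", chatType)
		scope.SetContext("telegram", map[string]interface{}{
			"update_id":  updateID,
			"chat_id":    chatID,
			"message_id": messageID,
			"message":    text,
		})
		sentry.CaptureException(err)
	})
}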
Sentry Pricing
| Plan | Errors/Month | Cost |
|---|---|---|
| Free | 5,000 | $0 |
| Team | 50,000 | $29/month |
For high-volume bots, reduce SENTRY_SAMPLE_RATE to stay within limits; for example, a bot generating roughly 50,000 errors per month needs a sample rate of about 0.1 to stay at the Free plan's 5,000-error quota.
Resource Monitoring
Alita Robot includes automatic resource monitoring to prevent resource exhaustion.
Configuration
# Maximum goroutines before triggering cleanup
# Default: 1000, Recommended: 1000-5000
RESOURCE_MAX_GOROUTINES=1000
# Maximum memory usage in MB before triggering cleanup
# Default: 500, Recommended: 500-2000
RESOURCE_MAX_MEMORY_MB=500
# Memory threshold for triggering garbage collection
# Default: 400, Recommended: 80% of RESOURCE_MAX_MEMORY_MB
RESOURCE_GC_THRESHOLD_MB=400

Auto-Remediation
When thresholds are exceeded, the system automatically:
- Triggers garbage collection
- Logs warnings about resource usage
- Takes corrective action if configured (see the sketch after this list)
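A minimal sketch of such a check using Go's runtime package; the limits mirror the configuration above, while the function itself is illustrative:

package main

import (
	"log"
	"runtime"
)

// checkResources compares goroutine count and heap usage against the
// configured limits and forces a GC when the threshold is crossed.
// Sketch only; the bot's real monitor may take additional actions.
func checkResources(maxGoroutines int, maxMemoryMB, gcThresholdMB uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	usedMB := m.Alloc / 1024 / 1024

	if g := runtime.NumGoroutine(); g > maxGoroutines {
		log.Printf("warning: %d goroutines exceeds limit of %d", g, maxGoroutines)
	}
	if usedMB > gcThresholdMB {
		log.Printf("warning: %d MB in use, forcing garbage collection", usedMB)
		runtime.GC() // auto-remediation: reclaim memory before the hard limit
	}
	if usedMB > maxMemoryMB {
		log.Printf("warning: memory usage %d MB exceeds limit of %d MB", usedMB, maxMemoryMB)
	}
}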
Activity Monitoring
Track group activity automatically:
# Days of inactivity before marking chat as inactive
# Default: 30, Range: 1-365
INACTIVITY_THRESHOLD_DAYS=30
# Hours between automatic activity checks
# Default: 1, Range: 1-24
ACTIVITY_CHECK_INTERVAL=1
# Enable automatic cleanup of inactive chats
# Default: true
ENABLE_AUTO_CLEANUP=true

Activity Metrics
The system tracks:
- DAG: Daily Active Groups
- WAG: Weekly Active Groups
- MAG: Monthly Active Groups
Groups are automatically:
- Marked inactive after the threshold period
- Reactivated when they become active again (see the sketch after this list)
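A hedged sketch of a periodic inactivity sweep; the store interface and function names are hypothetical stand-ins for the bot's database layer:

package main

import (
	"log"
	"time"
)

// ChatActivityStore is a hypothetical stand-in for the bot's database layer.
type ChatActivityStore interface {
	MarkInactiveSince(cutoff time.Time) (int64, error)
}

// runActivityChecks marks chats inactive when their last activity is older
// than thresholdDays, sweeping every checkIntervalHours.
func runActivityChecks(store ChatActivityStore, thresholdDays, checkIntervalHours int) {
	ticker := time.NewTicker(time.Duration(checkIntervalHours) * time.Hour)
	defer ticker.Stop()
	for range ticker.C {
		cutoff := time.Now().AddDate(0, 0, -thresholdDays)
		n, err := store.MarkInactiveSince(cutoff)
		if err != nil {
			log.Printf("activity check failed: %v", err)
			continue
		}
		log.Printf("marked %d chats inactive (no activity since %s)", n, cutoff.Format(time.RFC3339))
	}
}

Reactivation would typically happen in the update path, by refreshing a chat's last-activity timestamp whenever a new message arrives.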
Performance Monitoring
Section titled “Performance Monitoring”Enable Performance Tracking
# Enable automatic performance monitoring
ENABLE_PERFORMANCE_MONITORING=true
# Enable background statistics collection
ENABLE_BACKGROUND_STATS=true

Collected Metrics
- Message processing time
- Database query duration
- Cache hit/miss rates
- API response times
- Goroutine count
- Memory usage
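As an illustration of collecting one of the metrics above, the sketch below times message processing with the Prometheus Go client; the metric name and label are assumptions, not the bot's actual metric names:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// processingTime is an illustrative histogram for message processing time;
// the bot's real metric names may differ.
var processingTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "alita_message_processing_seconds",
	Help:    "Time spent processing a single update.",
	Buckets: prometheus.DefBuckets,
}, []string{"handler"})

// observe records how long a handler took.
func observe(handler string, start time.Time) {
	processingTime.WithLabelValues(handler).Observe(time.Since(start).Seconds())
}

A handler could then record its own duration with defer observe("admin", time.Now()) on entry.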
Debug Mode
For detailed logging during development:
DEBUG=true

Debug mode:
- Increases log verbosity
- Disables background monitoring
- Shows detailed error stack traces
Log Analysis
Section titled “Log Analysis”Log Fields
Structured log entries include:
| Field | Description |
|---|---|
| update_id | Telegram update ID |
| error_type | Error type (e.g., *TelegramError) |
| file | Source file where error occurred |
| line | Line number |
| function | Function name |
Example Log Entry
{
  "level": "error",
  "msg": "Handler error occurred: user blocked bot",
  "update_id": 123456789,
  "error_type": "*gotgbot.TelegramError",
  "file": "alita/modules/admin.go",
  "line": 45,
  "function": "handleAdminCommand",
  "time": "2024-03-15T10:30:00Z"
}
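An entry like the one above can be produced with logrus structured fields; the sketch below is illustrative, with field names taken from the table above:

package main

import (
	"fmt"

	log "github.com/sirupsen/logrus"
)

// logHandlerError sketches how the structured fields above might be attached.
func logHandlerError(err error, updateID int64, file string, line int, function string) {
	log.WithFields(log.Fields{
		"update_id":  updateID,
		"error_type": fmt.Sprintf("%T", err),
		"file":       file,
		"line":       line,
		"function":   function,
	}).Errorf("Handler error occurred: %v", err)
}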
Log Levels

| Level | Description |
|---|---|
| DEBUG | Verbose debugging (DEBUG=true only) |
| INFO | Normal operation events |
| WARN | Expected issues (e.g., user blocked bot) |
| ERROR | Unexpected errors (sent to Sentry) |
| FATAL | Critical errors that stop the bot |
| PANIC | Unrecoverable errors |
Alerting
Section titled “Alerting”Prometheus Alerting Rules
Create alerts.yml:
groups:
  - name: alita-alerts
    rules:
      - alert: AlitaUnhealthy
        expr: up{job="alita-robot"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Alita Robot is down"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 500000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
      - alert: DatabaseConnectionFailed
        expr: alita_health_database == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database connection failed"

Sentry Alerting
Configure alerts in the Sentry dashboard:
- Go to Alerts > Create Alert
- Set conditions (error count, unique users affected)
- Configure notification channels (email, Slack, PagerDuty)