Monitoring provides visibility into system health and operations, and supplies the data needed to verify SLA compliance.

Observability stack

  • Prometheus for metrics collection
  • Grafana for dashboards and visualizations
  • Loki (or ELK) for centralized log aggregation

Key service metrics to track

  • Kafka consumer lag
  • WebSocket active connection count
  • Redis memory utilization and cache hit ratio
  • TiDB region health and TiKV store availability
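
Two of these metrics are derived from raw counters rather than read directly. The following is a minimal sketch of how they are commonly computed, not a description of this system's exporters. The function names are hypothetical, and treating a partition with no committed offset as fully unread is an assumption. The `keyspace_hits` and `keyspace_misses` inputs correspond to the fields of the same name in the Redis `INFO stats` output.

```python
from typing import Mapping


def consumer_lag(end_offsets: Mapping[int, int],
                 committed_offsets: Mapping[int, int]) -> int:
    """Total lag for one consumer group on one topic.

    Lag per partition is the log-end offset minus the committed offset.
    Assumption: a partition with no committed offset counts as fully unread.
    """
    total = 0
    for partition, end_offset in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two snapshots were taken at different times.
        total += max(end_offset - committed, 0)
    return total


def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of key lookups served from the cache (0.0 when idle)."""
    lookups = keyspace_hits + keyspace_misses
    return keyspace_hits / lookups if lookups else 0.0


if __name__ == "__main__":
    print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
    print(cache_hit_ratio(keyspace_hits=90, keyspace_misses=10))
```

In practice these values usually come from an exporter scraped by Prometheus; the sketch only shows the arithmetic behind the numbers on a dashboard.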

Alerting recommendations

  • Sustained CPU utilization above 80%
  • Database query latency exceeding 100 ms
  • Kafka consumer lag breaching defined thresholds
  • WebSocket connection drops or abnormal failure rate spikes

Tune thresholds based on workload characteristics and traffic patterns.
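
To illustrate what "sustained" means in the CPU rule, here is a minimal sketch of a check that fires only when every sample in a sliding window exceeds the threshold. The class name `SustainedThreshold` and the window of three samples are assumptions chosen for illustration; a production setup would more likely express this as an alerting rule with a hold duration in the monitoring system itself.

```python
from collections import deque
from typing import Deque


class SustainedThreshold:
    """Fires only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # A bounded deque drops the oldest sample automatically.
        self.samples: Deque[float] = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


if __name__ == "__main__":
    cpu_alert = SustainedThreshold(threshold=80.0, window=3)
    for sample in [85.0, 90.0, 70.0, 81.0, 82.0, 83.0]:
        print(sample, cpu_alert.observe(sample))
```

A single dip below the threshold resets the condition, which keeps short spikes from paging anyone. Widening the window trades detection speed for fewer false alarms.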