Monitoring ensures system health, operational visibility, and SLA compliance.
Observability stack
- Prometheus for metrics collection
- Grafana for dashboards and visualizations
- Loki (or ELK) for centralized log aggregation
Key service metrics to track
- Kafka consumer lag
- WebSocket active connection count
- Redis memory utilization and cache hit ratio
- TiDB region health and TiKV store availability
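Two of the metrics above are derived values rather than raw readings, so a minimal sketch of how they might be computed may help. It assumes consumer lag is the gap between a partition's latest offset and the consumer group's committed offset, and that the cache hit ratio comes from hit and miss counters such as the `keyspace_hits` and `keyspace_misses` fields Redis reports in `INFO stats`. The `PartitionOffsets` class and both function names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionOffsets:
    """Offsets for one Kafka partition (hypothetical container for this sketch)."""
    log_end_offset: int    # latest offset written to the partition
    committed_offset: int  # last offset committed by the consumer group


def consumer_lag(partitions: list[PartitionOffsets]) -> int:
    """Total lag: messages written but not yet committed, summed over partitions."""
    return sum(max(p.log_end_offset - p.committed_offset, 0) for p in partitions)


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


# Example values are made up for illustration.
offsets = [PartitionOffsets(1500, 1200), PartitionOffsets(900, 900)]
print(consumer_lag(offsets))       # 300
print(cache_hit_ratio(900, 100))   # 0.9
```

In practice an exporter usually publishes these values to Prometheus; the sketch only shows the arithmetic behind them.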
Alerting recommendations
- Sustained CPU utilization above 80%
- Database query latency exceeding 100 ms
- Kafka consumer lag breaching defined thresholds
- WebSocket connection drops or abnormal failure rate spikes
Treat these values as starting points, and tune each threshold to the workload's characteristics and traffic patterns.
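The "sustained" condition in the CPU recommendation can be sketched as a check that fires only when every recent sample exceeds the threshold, so that a single spike does not trigger an alert. This is an illustrative sketch, not a prescribed implementation: the `SustainedThresholdAlert` class is hypothetical, and the window of three samples is an assumption, since the recommendations above do not say how long "sustained" is.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        # A bounded deque drops the oldest sample once `window` is reached.
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record a sample and return True if the alert should fire."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# Assumed values: 80% CPU threshold from the list above, window of 3 samples.
cpu_alert = SustainedThresholdAlert(threshold=80.0, window=3)
readings = [85, 90, 70, 82, 88, 91]
print([cpu_alert.observe(r) for r in readings])
# [False, False, False, False, False, True]
```

The dip to 70 resets the streak, so the alert fires only once three consecutive readings are above 80. With Prometheus, the same idea is normally expressed through the `for` duration of an alerting rule rather than in application code.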