[Troubleshooting] Prometheus Alert Storm
One-Line Summary
Every server restart flooded Mattermost with 50-100 alerts, burying the ones that actually mattered.
Adding for clauses to filter transient spikes and inhibit_rules to suppress duplicate alerts cut the noise by 90%.
Symptoms
Every deployment or server restart dumped dozens of alerts into Mattermost at once. ApplicationDown, HighCPU, HighMemory, and HighResponseTime all fired together, yet the cause was a single server restart.
The real problem was alert fatigue. Alerts came so often that we started ignoring them, and even genuine failures got brushed off as "probably just more noise."
Environment
- Prometheus + Alertmanager + Grafana
- Mattermost webhook integration
- Single-server Docker Compose setup
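For context, here is a minimal sketch of what a monitoring stack like this typically looks like in Docker Compose; the image tags, ports, and file paths are assumptions, not the actual compose file.

```yaml
# Hypothetical sketch of the monitoring stack; image tags, ports,
# and volume paths are assumptions, not the original compose file.
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
    ports:
      - "9090:9090"
  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```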
Root Cause Analysis
Two issues compounded each other.
1. No for clause, so alerts fired instantly
The existing alert rules had no for clause.
Prometheus scrapes every 15 seconds, so a single threshold breach was enough to fire an alert.
CPU and memory spiking briefly during a server restart is normal: JVM warmup, connection pool initialization, and Kafka consumer rebalancing all happen at the same time. But every one of those spikes was being treated as a failure.
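The original rules aren't reproduced here, but a rule without a for clause looks roughly like the sketch below; the PromQL expression and labels are illustrative assumptions. With no for, a single scrape above the threshold fires immediately.

```yaml
# Illustrative "before" rule: no `for` clause, so one 15-second scrape
# above 80% CPU fires the alert immediately. Expression and labels are assumed.
groups:
  - name: example-before
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```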
2. No inhibit rules
When the server goes down, ApplicationDown (critical) fires. But with the server down, CPU and response time naturally look abnormal too, so HighCPU (warning), HighMemory (warning), and HighResponseTime (warning) fire alongside it. One root cause produced four alerts.
Solution
1. Duration filtering with the for clause
Critical conditions like a server or DB being down keep for: 1m for fast detection, while warnings use for: 5m so an anomaly has to persist before any alert goes out.
| Alert | for | Severity | Threshold |
|---|---|---|---|
| ApplicationDown | 1m | critical | up == 0 |
| HighErrorRate | 3m | critical | 5xx > 10% |
| HighResponseTime | 5m | warning | P95 > 2s |
| HighCPUUsage | 5m | warning | > 80% |
| HighMemoryUsage | 5m | warning | > 85% |
| MySQLDown | 1m | critical | up == 0 |
| KafkaConsumerLag | 5m | warning | > 1000 |
Post-restart CPU spikes usually settle within 1-2 minutes, so for: 5m filters out these transient anomalies.
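As a sketch, a few of the rules from the table might look like this with for added. The for values and thresholds follow the table; the PromQL expressions, job names, and metric names are assumptions, not the actual rule file.

```yaml
# Sketch of rules after adding `for`; thresholds follow the table above,
# but expressions, job names, and metric names are assumptions.
groups:
  - name: app-alerts
    rules:
      - alert: ApplicationDown
        expr: up{job="app"} == 0
        for: 1m        # critical: detect fast, but tolerate one missed scrape
        labels:
          severity: critical
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by (le) (rate(http_server_requests_seconds_bucket[5m]))) > 2
        for: 5m        # warning: only sustained slowness alerts
        labels:
          severity: warning
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m        # warning: a 1-2 minute restart spike never lasts 5m
        labels:
          severity: warning
```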
2. Separate Alertmanager routing
Critical alerts go out quickly with group_wait: 10s, while warnings are batched with group_wait: 2m.
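A minimal sketch of that routing, assuming alerts carry a severity label; the receiver names, grouping labels, and Mattermost webhook URLs are placeholders.

```yaml
# Sketch of severity-based routing; receiver names, grouping labels,
# and webhook URLs are placeholders, not the actual config.
route:
  receiver: mattermost-warning          # default
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: mattermost-critical
      group_wait: 10s                   # deliver critical alerts almost immediately
    - matchers:
        - severity="warning"
      receiver: mattermost-warning
      group_wait: 2m                    # batch warnings before sending

receivers:
  - name: mattermost-critical
    webhook_configs:
      - url: 'https://mattermost.example.com/hooks/critical-placeholder'
  - name: mattermost-warning
    webhook_configs:
      - url: 'https://mattermost.example.com/hooks/warning-placeholder'
```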
3. Inhibit rules
When ApplicationDown (critical) fires, HighCPU, HighMemory, and HighResponseTime (warning) on the same instance are automatically suppressed.
Before: ApplicationDown + HighCPU + HighMemory + HighResponseTime = 4 alerts. After: ApplicationDown only = 1 alert.
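The inhibit rule itself is short; this sketch assumes alerts carry severity and instance labels.

```yaml
# While ApplicationDown (critical) is firing, suppress warning-level alerts
# that share the same `instance` label. Assumes severity/instance labels exist.
inhibit_rules:
  - source_matchers:
      - alertname="ApplicationDown"
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']
```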
Complete Alert Rules (28 total)
Application (4): ApplicationDown (for: 1m, critical), HighResponseTime (for: 5m, warning), HighErrorRate (for: 3m, critical), HighJVMMemoryUsage (for: 5m, warning)
Infrastructure (3): HighCPUUsage (for: 5m, warning), HighMemoryUsage (for: 5m, warning), HighDiskUsage (for: 5m, warning)
Database (4): MySQLDown/RedisDown (for: 1m, critical), MySQLHighConnections (for: 5m, warning), RedisHighMemoryUsage (for: 5m, warning)
Kafka (2): KafkaDown (for: 1m, critical), KafkaConsumerLag (for: 5m, warning)
Container (3): ContainerRestartingFrequently (for: 0m, warning, immediate detection), ContainerHighCPU/MemoryUsage (for: 5m, warning)
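One more sketch, showing how for: 0m (alert immediately) contrasts with for: 5m on the container rules; the cAdvisor metric names and thresholds here are assumptions.

```yaml
# Sketch of the container rules; cAdvisor metric names and thresholds are assumptions.
groups:
  - name: container-alerts
    rules:
      - alert: ContainerRestartingFrequently
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        for: 0m        # restart loops should alert immediately
        labels:
          severity: warning
      - alert: ContainerHighCPUUsage
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
```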
Results
| Metric | Before | After |
|---|---|---|
| Alerts per deployment | 50-100 | 3-5 |
| Alert noise | High | 90% reduction |
| Critical response speed | Delayed by alert fatigue | Immediate |