Reduce alert spam by coalescing multiple alerts due to same root cause
Issue description

We got a bunch of spam during the recent ganeti outage in the lab (see Design-Doc link). Audit/work out some ways to reduce this spam. Specific alerts called out in the post-mortem are:
* UnhealthyShard
* ArchiverExportRateLow
* ArchiverSuccesRateLow
* Afe5XXResponsesHigh
May 25 2018
My theory/proposal for how to deal with shard-related alerts is below.
Step 1)
Create a traditional "Christmas tree" dashboard matrix for shards:
rows: shards
columns: shard services
The service metrics to track are:
scheduler tick
host scheduler tick
shard_client heartbeat
AFE RPC response latency/success
job_aborter processes
sysmon
The list above is somewhat hypothetical, based on current alerts. We might
add or remove items based on further analysis.
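
Purely as an illustration of the Step 1 matrix, here is a minimal Python sketch, not tied to any particular dashboard tool; the Health enum, the service column names, and the probe() callback are assumptions for illustration, not existing code.

from enum import Enum


class Health(Enum):
    GREEN = "green"
    RED = "red"


# Hypothetical column names, mirroring the service list above.
SHARD_SERVICES = [
    "scheduler_tick",
    "host_scheduler_tick",
    "shard_client_heartbeat",
    "afe_rpc_response",
    "job_aborter_processes",
    "sysmon",
]


def build_matrix(shards, probe):
    """Build the shard x service dashboard matrix.

    `probe(shard, service)` is an assumed callback that returns True when
    the underlying metric for that shard/service pair looks healthy.
    Returns a dict mapping (shard, service) -> Health.
    """
    return {
        (shard, service): Health.GREEN if probe(shard, service) else Health.RED
        for shard in shards
        for service in SHARD_SERVICES
    }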
Step 2)
Create a single alert that fires when any entry in the Christmas tree matrix
is red. The alert should report (at minimum) how many shards have redness.
The alert can re-fire at one-hour intervals, though it would be nice if we
either made the alert quieter on weekends, or possibly just reduced the
re-fire frequency to once every two hours.
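
A rough sketch of the Step 2 coalesced alert, built on the matrix sketch above. The maybe_fire_alert name, the notify() hook, and the longer weekend interval are assumptions; the exact quieting policy is still an open question, per the paragraph above.

import datetime

REALERT_INTERVAL = datetime.timedelta(hours=1)
WEEKEND_REALERT_INTERVAL = datetime.timedelta(hours=2)


def maybe_fire_alert(matrix, now, last_fired, notify):
    """Fire one coalesced alert when any matrix entry is red.

    `matrix` maps (shard, service) -> Health as in the Step 1 sketch,
    `notify(message)` is an assumed delivery hook (email, IRC, ...).
    Returns the timestamp of the most recent firing.
    """
    red_shards = sorted({shard for (shard, _service), health in matrix.items()
                         if health is Health.RED})
    if not red_shards:
        return last_fired

    # Re-fire hourly on weekdays, every other hour on weekends.
    interval = WEEKEND_REALERT_INTERVAL if now.weekday() >= 5 else REALERT_INTERVAL
    if last_fired is not None and now - last_fired < interval:
        return last_fired

    notify("%d shard(s) have red dashboard entries: %s"
           % (len(red_shards), ", ".join(red_shards)))
    return now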
Comment 1 by jrbarnette@chromium.org, May 25 2018

I've identified some other sources of spam that didn't make it into the post-mortem doc:
* ShardApachesLow
* JobAbortersLow
* SysmonMetricsMissing

There are at least three principal origins for the spam:
* When a single shard goes offline, multiple alerts can fire for that single outage.
* When a common service outage takes out multiple shards, every shard generates its own alerts.
* Many of the alerts are configured to re-alert once an hour, even on weekends.
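
One possible way to attack the second origin (a common outage fanning out into per-shard alerts) is to group raw alerts by name before notifying. The sketch below is illustrative only; the (alert_name, shard) input format and the three-shard threshold are assumptions, not anything from the existing alerting config.

import collections


def coalesce_alerts(raw_alerts, common_cause_threshold=3):
    """Collapse per-shard alerts into one summary per alert name.

    `raw_alerts` is an assumed iterable of (alert_name, shard) pairs.
    When the same alert fires on many shards at once, emit a single
    "likely common cause" line instead of one line per shard.
    """
    shards_by_alert = collections.defaultdict(set)
    for name, shard in raw_alerts:
        shards_by_alert[name].add(shard)

    summaries = []
    for name, shards in sorted(shards_by_alert.items()):
        if len(shards) >= common_cause_threshold:
            summaries.append("%s firing on %d shards (likely common cause)"
                             % (name, len(shards)))
        else:
            summaries.extend("%s firing on %s" % (name, shard)
                             for shard in sorted(shards))
    return summaries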