
Issue 846877

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Task
Design-Doc: https://docs.google.com/document/d/1_JdDR-EISy4e4VFMIe-nepOyXhz93232lOMOobf1pVg/edit#




Reduce alert spam by coalescing multiple alerts due to same root cause

Project Member Reported by cra...@chromium.org, May 25 2018

Issue description

We got a bunch of alert spam during the recent Ganeti outage in the lab (see the Design-Doc link).  Audit the alerts and work out some ways to reduce this spam.  The specific alerts called out in the post-mortem are:
 UnhealthyShard alert
 ArchiverExportRateLow
 ArchiverSuccesRateLow
 Afe5XXResponsesHigh

 
I've identified some other sources of spam that didn't make it into
the post-mortem doc:
    ShardApachesLow
    JobAbortersLow
    SysmonMetricsMissing

There are at least three principal origins for the spam:
  * When a single shard goes offline, multiple alerts can fire for that
    single outage.
  * When a common service outage takes out multiple shards, every shard
    generates its own alerts.
  * Many of the alerts are configured to re-alert once an hour, even on
    weekends.

My theory/proposal for how to deal with shard-related alerts is below.

Step 1)
Create a traditional "Christmas tree" dashboard matrix for shards:
    rows: shard
    columns: shard services

The service metrics to track are:
    scheduler tick
    host scheduler tick
    shard_client heartbeat
    AFE RPC response latency/success
    job_aborter processes
    sysmon

The list above is somewhat hypothetical, based on current alerts.  We might
add or remove items based on further analysis.
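
As a rough illustration of the matrix in Step 1, here is a minimal Python
sketch.  It is not tied to any real monitoring API: the service identifiers,
the build_matrix() and render() helpers, and the "red"/"green" states are all
hypothetical names chosen for this example, and the health data is passed in
by hand rather than read from real metrics.

```python
# Columns of the matrix, taken from the service list above.
# The identifier spellings are illustrative, not real metric names.
SERVICES = [
    "scheduler_tick",
    "host_scheduler_tick",
    "shard_client_heartbeat",
    "afe_rpc",
    "job_aborter",
    "sysmon",
]


def build_matrix(shards, unhealthy):
    """Return {shard: {service: "red" | "green"}}.

    `unhealthy` is a set of (shard, service) pairs currently failing.
    How that set is derived from real metrics is out of scope here.
    """
    return {
        shard: {
            svc: "red" if (shard, svc) in unhealthy else "green"
            for svc in SERVICES
        }
        for shard in shards
    }


def render(matrix):
    """Render one text row per shard, one R/G cell per service."""
    lines = []
    for shard in sorted(matrix):
        cells = " ".join(
            "R" if matrix[shard][svc] == "red" else "G" for svc in SERVICES
        )
        lines.append(f"{shard}: {cells}")
    return "\n".join(lines)
```

Keeping the matrix as a plain mapping from shard to per-service state means
the same structure can feed both a dashboard and any alert derived from it.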

Step 2)
Create a single alert that fires when any entry in the Christmas tree matrix
is red.  The alert should report (at minimum) how many shards have at least
one red entry.  The alert can re-fire at one-hour intervals, although it would
be nice to either make the alert quieter on weekends or simply reduce the
frequency to once every other hour.
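
The coalescing and re-fire behavior in Step 2 could look roughly like the
Python sketch below.  Everything here is an assumption made for illustration:
the matrix is a plain {shard: {service: "red" | "green"}} dict, summarize(),
refire_interval() and should_fire() are hypothetical helpers, and the policy
of one hour on weekdays and two hours on weekends is just one possible reading
of the re-fire options mentioned above.

```python
from datetime import datetime, timedelta


def summarize(matrix):
    """Collapse every red cell into one alert message, or None if all green."""
    red = {}
    for shard, cells in matrix.items():
        bad = sorted(svc for svc, state in cells.items() if state == "red")
        if bad:
            red[shard] = bad
    if not red:
        return None
    details = "; ".join(
        f"{shard} ({', '.join(svcs)})" for shard, svcs in sorted(red.items())
    )
    return f"{len(red)} of {len(matrix)} shards unhealthy: {details}"


def refire_interval(now):
    """Assumed policy: hourly on weekdays, every other hour on weekends."""
    is_weekend = now.weekday() >= 5  # 5 = Saturday, 6 = Sunday
    return timedelta(hours=2) if is_weekend else timedelta(hours=1)


def should_fire(now, last_fired):
    """Fire on first detection, then only after the re-fire interval."""
    if last_fired is None:
        return True
    return now - last_fired >= refire_interval(now)
```

With this shape, a common outage that takes out many shards still produces a
single message; only the shard count and the detail list grow.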

Components: Infra>Client>ChromeOS

Comment 4 by cra...@chromium.org, May 30 2018

Labels: cros-infra-pm-2018-05-21
Cc: akes...@chromium.org
Labels: -Chase-Pending
Status: Available (was: Accepted)
Summary: Reduce alert spam by coalescing multiple alerts due to same root cause (was: Reduce alert spam)
