
Issue 851694


Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




TooManyIdleDuts is spammy.

Project Member Reported by cra...@chromium.org, Jun 11 2018

Issue description

The tooManyIdleDuts alert was submitted in cl/195473550 on 5/4.  The alert has fired 18 times since then starting on 5/7.  I'd argue that this alert is broken:

https://groups.google.com/a/google.com/forum/#!searchin/chromeos-build-alerts/TooManyIdleDuts%7Csort:date

Can we disable the alert, send it to a -testing alert stream, or fix the thresholds?  The metric being alerted on seems to have a lot of variance, so just turning the number up probably isn't sufficient.

Richard, I'm assigning to you since you created the alert.


 
> The alert has fired 18 times since then starting on 5/7.  I'd argue that this alert is broken:

Some of the incidents were real outages:
  - http://shortn/_JogCLq9rqG - this happened because of the over-the-weekend
    outage on 5/20.
  - http://shortn/_WwmkRmwmhn - this happened because of an outage on the
    master, where fizz had no shard; see bug 848372.

The first four reported alerts were because the threshold was set to only 12.
That was more or less deliberate; I did it to ensure that the alert would
actually fire before adjusting the limit to something more forgiving (but,
apparently, not forgiving enough).

I note that the alert is probably spammy at least in part because of real
bugs.  There really are at least 3 dozen DUTs for which it is impossible to
schedule work.

I see a few options here:

= Raise the threshold
If we were to raise the threshold to 150, it would be enough to keep
the alert silent except for the single most serious event we've seen.

Pro:  Probably, that's enough to make the alert reasonably silent.
Con:  Setting the threshold to 150 wouldn't have spared us the alert on 5/21,
    which was at best tardy, since it was reporting on a 24-hour-old outage.
    Also, if we have a sustained level of just less than 150 DUTs idle,
    that could easily be enough to trigger suite timeouts on some pools
    without ever triggering the alert, meaning the alert would be almost
    useless.
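
For concreteness, this option amounts to nothing more than a fixed
comparison, roughly like the sketch below (the threshold value and the
names are illustrative, not the real alert config):

    # Sketch only: a plain fixed-threshold alert on the idle-DUT count.
    IDLE_DUT_THRESHOLD = 150

    def should_fire(idle_dut_count):
        """Fire whenever the sampled idle count exceeds the fixed threshold."""
        return idle_dut_count > IDLE_DUT_THRESHOLD

    # A sustained count of, say, 149 idle DUTs never fires, even though
    # that can be enough to starve some pools into suite timeouts.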

= Remove the alert
Pro:  This would eliminate the spam.
Con:  We'd have to find some other way to address the issue found at
    go/running_duts_postmortem.

= Make the alert smarter
We could change the alert to fire when there's a significant, sustained
change in the number of idle DUTs, rather than simply "alert above a
threshold".

Pro:  That would allow us to still address the problem cited in the
    post-mortem.
Con:  Finding a reliable, quantitative definition of "significant,
    sustained change" is easier said than done.  That is, we might
    still wind up with a spammy alert, and one that's harder to tune.
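
To make that concrete, here is one rough sketch of what a "significant,
sustained change" detector could look like; the window sizes and the delta
are made-up knobs, not anything tuned against real data:

    from collections import deque
    from statistics import median

    # Sketch only: fire when every sample in a short recent window sits
    # well above the median of a longer-term baseline window.
    BASELINE_SAMPLES = 288    # e.g. 24 hours of 5-minute samples
    RECENT_SAMPLES = 12       # e.g. the most recent hour
    SIGNIFICANT_DELTA = 30    # idle DUTs above baseline we'd call "real"

    baseline = deque(maxlen=BASELINE_SAMPLES)
    recent = deque(maxlen=RECENT_SAMPLES)

    def record_sample(idle_dut_count):
        """Feed one idle-DUT sample; return True when the alert should fire."""
        fire = False
        if len(baseline) == BASELINE_SAMPLES and len(recent) == RECENT_SAMPLES:
            fire = min(recent) > median(baseline) + SIGNIFICANT_DELTA
        baseline.append(idle_dut_count)
        recent.append(idle_dut_count)
        return fire

Even in this form, picking SIGNIFICANT_DELTA and the window sizes is exactly
the tuning problem described in the Con above.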

Labels: -Chase-Pending
It's clear that this alert is not prod-alerts ready.
Status: Assigned (was: Untriaged)
OK.  On reflection, I'm leaning towards deleting this alert altogether.
We have adequate information about DUTs getting stuck idle in the form
of the inventory dashboard:
    https://viceroy.corp.google.com/chromeos/lab_inventory_summary?duration=8d

Alerting on the status of that dashboard doesn't seem to be serving
a useful purpose.  If there's a problem with too many idle DUTs, it
will show up as either 1) aborts due to shortages, or 2) suites not
running due to "not enough DUTs" errors.  We should be alerting on
one (or both) of those conditions, not on the idle inventory.
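
Roughly, a replacement alert would key off those symptoms instead of the
inventory; a hedged sketch (the counter names and thresholds below are
hypothetical, not existing metrics):

    # Sketch only: alert on the symptoms of a DUT shortage rather than on
    # the idle inventory itself.
    ABORT_THRESHOLD = 5
    NOT_ENOUGH_DUTS_THRESHOLD = 5

    def should_fire(aborts_last_hour, not_enough_duts_errors_last_hour):
        """Fire on either symptom of suites actually being hurt."""
        return (aborts_last_hour >= ABORT_THRESHOLD or
                not_enough_duts_errors_last_hour >= NOT_ENOUGH_DUTS_THRESHOLD)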

Change is up at cl/201954555.

Status: Fixed (was: Assigned)
The CL is approved and submitted.  The spam should stop before
the day's end (even if you're on MDT).
