
Issue 845633

Starred by 2 users

Issue metadata

Status: Duplicate
Owner:
Closed: Aug 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug


Participants' hotlists:
chrome-client-infra-monitoring



Replace SwarmingPendingTimeHigh with a pool-wide alert for expired tasks

Project Member Reported by dpranke@chromium.org, May 22 2018

Issue description

The current SwarmingPendingTimeHigh alert is mostly useless as it is configured to be per <master, builder> tuple, but often the tasks are provisioned into a common pool shared by multiple builders. This means that we might have issues with high pending times across a bunch of builders that are, taken individually, below a useful threshold, but taken in aggregate, a real problem.

We should at least for now replace this with a task that monitors things "pool-wide" for some subset of defined common pool dimensions, e.g.:

- os=Ubuntu-14.04 , pool=Chrome
- os=Mac-10.13 , pool=Chrome
- os=Windows-10 , pool=Chrome
- os=Windows-7 , pool=Chrome

This list is meant to be illustrative, not definitive. The pools should reflect the common sets of dimensions used by bots in the CQ, and should exhaustively cover *every* chromium.* builder (except where exceptions are warranted) with as few combinations as possible (and sensible).
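To make the per-builder vs. pool-wide distinction concrete, here is a minimal sketch of grouping pending time by pool dimensions instead of by <master, builder>. The `PendingSample` record and its field names are hypothetical and chosen only for illustration; they are not taken from the real alert configuration or from any Swarming API.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class PendingSample(NamedTuple):
    """Hypothetical record: one task's pending time plus the dimensions it ran with."""
    master: str
    builder: str
    os: str
    pool: str
    pending_seconds: float


PoolKey = Tuple[str, str]  # (os, pool)


def aggregate_pending_by_pool(samples: Iterable[PendingSample]) -> Dict[PoolKey, float]:
    """Sum pending time per (os, pool), ignoring which builder scheduled the task."""
    totals: Dict[PoolKey, float] = defaultdict(float)
    for sample in samples:
        totals[(sample.os, sample.pool)] += sample.pending_seconds
    return dict(totals)


samples = [
    PendingSample("chromium.linux", "builder_a", "Ubuntu-14.04", "Chrome", 100.0),
    PendingSample("chromium.linux", "builder_b", "Ubuntu-14.04", "Chrome", 100.0),
    PendingSample("chromium.linux", "builder_c", "Ubuntu-14.04", "Chrome", 100.0),
    PendingSample("chromium.win", "builder_d", "Windows-10", "Chrome", 20.0),
]
pool_totals = aggregate_pending_by_pool(samples)
```

With these made-up numbers, each Ubuntu builder is at 100 seconds, which a per-builder threshold of, say, 250 seconds would never flag, while the pool as a whole is at 300 seconds.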

We should then have at least an alert configured to fire whenever tasks actually *expire*, because we should, in general, not hit this condition.
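A rough sketch of the "fire on any expired task" rule, evaluated per pool. The `TaskResult` record, the `"EXPIRED"` state label, and the function names are assumptions made for this example; the real alert would be expressed in whatever the monitoring system uses.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class TaskResult(NamedTuple):
    """Hypothetical record for a finished task."""
    task_id: str
    os: str
    pool: str
    state: str  # assumed label; "EXPIRED" means the task was never picked up by a bot


def expired_counts_by_pool(tasks: Iterable[TaskResult]) -> Counter:
    """Count expired tasks per (os, pool)."""
    return Counter((t.os, t.pool) for t in tasks if t.state == "EXPIRED")


def pools_to_alert(tasks: Iterable[TaskResult]) -> List[Tuple[str, str]]:
    """Any pool with at least one expired task should fire; the threshold is effectively zero."""
    return sorted(expired_counts_by_pool(tasks))


tasks = [
    TaskResult("t1", "Ubuntu-14.04", "Chrome", "COMPLETED"),
    TaskResult("t2", "Windows-10", "Chrome", "EXPIRED"),
    TaskResult("t3", "Windows-10", "Chrome", "EXPIRED"),
    TaskResult("t4", "Mac-10.13", "Chrome", "COMPLETED"),
]
alerting = pools_to_alert(tasks)
```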

We should probably also have a second alert configured to fire when we start approaching capacity, with some threshold (and mechanism) TBD. It's possible that we should alert on aggregate pending time in seconds per time period, or possibly aggregate work done in seconds per time period.
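One possible shape for the capacity alert, assuming the "aggregate pending seconds per time period" option. The threshold and mechanism are still TBD, so the `PendingEvent` record, the window boundaries, and `threshold_seconds` are all placeholders rather than proposed values.

```python
from typing import Iterable, NamedTuple


class PendingEvent(NamedTuple):
    """Hypothetical record for one task in a single pool."""
    started_ts: float       # epoch seconds at which the task left the pending state
    pending_seconds: float  # how long it waited before a bot picked it up


def pending_seconds_in_window(
    events: Iterable[PendingEvent], window_start: float, window_end: float
) -> float:
    """Aggregate pending time for tasks that started in [window_start, window_end)."""
    return sum(
        e.pending_seconds for e in events if window_start <= e.started_ts < window_end
    )


def approaching_capacity(
    events: Iterable[PendingEvent],
    window_start: float,
    window_end: float,
    threshold_seconds: float,
) -> bool:
    """True when aggregate pending time in the window reaches the (placeholder) threshold."""
    return pending_seconds_in_window(events, window_start, window_end) >= threshold_seconds


events = [
    PendingEvent(started_ts=10.0, pending_seconds=30.0),
    PendingEvent(started_ts=50.0, pending_seconds=40.0),
    PendingEvent(started_ts=120.0, pending_seconds=100.0),  # outside the window below
]
total = pending_seconds_in_window(events, 0.0, 60.0)
```

The same structure would work for "aggregate work done in seconds per time period" by summing run time instead of pending time.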

It's likely that there will be *some* builders that are resource-constrained (e.g. android devices where we only have a limited number of them) which should be exempted from this list. We should be able to (a) explicitly support that, and (b) ensure that those builders still have some kind of monitoring of their own.
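A sketch of how coverage and exemptions could be checked together: every builder is either matched by a monitored set of pool dimensions, explicitly exempted with its own alert, or reported as a gap. All builder names, dimension values, and the alert name below are made up for the example.

```python
from typing import Dict, List, Optional


def classify_builders(
    builder_dims: Dict[str, Dict[str, str]],
    monitored_pools: List[Dict[str, str]],
    exemptions: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Sort builders into covered / exempt / exempt_unmonitored / uncovered.

    exemptions maps an exempt builder to the name of its dedicated alert,
    or to None if it has no monitoring yet (which violates requirement (b)).
    """
    report: Dict[str, List[str]] = {
        "covered": [], "exempt": [], "exempt_unmonitored": [], "uncovered": [],
    }
    for builder, dims in sorted(builder_dims.items()):
        if builder in exemptions:
            key = "exempt" if exemptions[builder] else "exempt_unmonitored"
        elif any(pool.items() <= dims.items() for pool in monitored_pools):
            # A pool covers a builder when all of the pool's dimensions match.
            key = "covered"
        else:
            key = "uncovered"
        report[key].append(builder)
    return report


monitored_pools = [
    {"os": "Ubuntu-14.04", "pool": "Chrome"},
    {"os": "Windows-10", "pool": "Chrome"},
]
builder_dims = {
    "linux_rel": {"os": "Ubuntu-14.04", "pool": "Chrome"},
    "android_device": {"device_type": "hypothetical", "pool": "Chrome"},
    "mac_other": {"os": "Mac-10.12", "pool": "Chrome"},
}
exemptions = {"android_device": "hypothetical_device_pool_alert"}
report = classify_builders(builder_dims, monitored_pools, exemptions)
```

Anything that lands in `uncovered` or `exempt_unmonitored` is a builder with no alerting at all, which is the case the description wants to rule out.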
 
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
@sergeyberezin - as we discussed late last week, can you take at least the "expired tasks" part of this work?
Cc: hinoka@chromium.org jchin...@chromium.org
Sure, thanks for filing the bug! I'm thinking through the design in this doc: https://docs.google.com/document/d/1zPj_loUjFEs2Glrz2Lokl-JpVG5fsUbFcUrNeQ4UmPE/edit# (internal only).

Somewhat tangential, but I just saw spurious alerts for non-existing builders: bug 846785, bug 847144, bug 847254. This alert is not feeling good...
Upon further analysis in https://docs.google.com/document/d/1z-bHB7bU93QSA0YpYurAugJgN4ZtwPmdzUSHUrF-wt0/edit# this alert turned out to be dead, now that we have no CQ builds on buildbot.

This means it doesn't need fixing, it needs reimplementation. I'm inclined to do it as part of the larger monitoring effort that I'm working on.
Mergedinto: 873754
Status: Duplicate (was: Assigned)
This will be done as part of a larger effort; the relevant portion is tracked in issue 873754 - merging into it.
