Replace SwarmingPendingTimeHigh with a pool-wide alert for expired tasks
Issue description

The current SwarmingPendingTimeHigh alert is mostly useless: it is configured per <master, builder> tuple, but tasks are often provisioned into a common pool shared by multiple builders. This means we can have high pending times across a bunch of builders that are, taken individually, below a useful threshold but, taken in aggregate, a real problem.

We should, at least for now, replace this with an alert that monitors things "pool-wide" for some subset of defined common pool dimensions, e.g.:

- os=Ubuntu-14.04, pool=Chrome
- os=Mac-10.13, pool=Chrome
- os=Windows-10, pool=Chrome
- os=Windows-7, pool=Chrome

This list is meant to be illustrative, not definitive. The pools should reflect the common sets of dimensions used by bots in the CQ, and should exhaustively cover *every* chromium.* builder (except where exceptions are warranted) with as few combinations as possible (and sensible).

We should then have at least an alert configured to fire whenever tasks actually *expire*, because we should, in general, not hit this condition.

We should probably also have a second alert configured to fire when we start approaching capacity, with some threshold (and mechanism) TBD. It's possible that we should alert on aggregate pending time in seconds per time period, or possibly aggregate work done in seconds per time period.

It's likely that there will be *some* builders that are resource-constrained (e.g. Android devices, where we only have a limited number of them) and should be exempted from this list. We should be able to (a) explicitly support that, and (b) ensure that those builders still have some kind of monitoring.
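To make the proposal concrete, here is a rough sketch of what pool-wide aggregation could look like. It is an illustration under assumptions, not the actual alert implementation: the `Task` record, the `aggregate_by_pool` and `pools_to_alert` helpers, and the threshold value are all hypothetical, and the pool list is just the illustrative one from the description.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

PoolKey = Tuple[str, str]  # (os, pool)

# Illustrative pool list from the description; not definitive.
COMMON_POOLS: FrozenSet[PoolKey] = frozenset({
    ("Ubuntu-14.04", "Chrome"),
    ("Mac-10.13", "Chrome"),
    ("Windows-10", "Chrome"),
    ("Windows-7", "Chrome"),
})


@dataclass(frozen=True)
class Task:
    """Hypothetical task record for one time window (not a real Swarming type)."""
    builder: str
    os: str
    pool: str
    pending_seconds: float
    expired: bool


@dataclass
class PoolStats:
    expired_tasks: int = 0
    pending_seconds: float = 0.0


def aggregate_by_pool(
    tasks: Iterable[Task],
    exempt_builders: FrozenSet[str] = frozenset(),
) -> Dict[PoolKey, PoolStats]:
    """Sum expirations and pending time per (os, pool), ignoring builder names."""
    stats: Dict[PoolKey, PoolStats] = defaultdict(PoolStats)
    for task in tasks:
        # Resource-constrained builders are exempted here and would need
        # their own, separate monitoring.
        if task.builder in exempt_builders:
            continue
        key = (task.os, task.pool)
        if key not in COMMON_POOLS:
            continue
        if task.expired:
            stats[key].expired_tasks += 1
        stats[key].pending_seconds += task.pending_seconds
    return dict(stats)


def pools_to_alert(
    stats: Dict[PoolKey, PoolStats],
    pending_threshold_seconds: float,
) -> List[Tuple[PoolKey, str]]:
    """Flag any pool with an expired task, else any pool over the pending threshold."""
    alerts: List[Tuple[PoolKey, str]] = []
    for key in sorted(stats):
        if stats[key].expired_tasks > 0:
            alerts.append((key, "expired"))
        elif stats[key].pending_seconds > pending_threshold_seconds:
            alerts.append((key, "pending"))
    return alerts
```

With a per-window threshold of, say, 1000 seconds, three builders that each contribute 400 seconds of pending time to the same pool would trip the pool-wide check even though none of them would on its own, which is the failure mode the per-builder alert misses.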
May 22 2018
May 22 2018
Sure, thanks for filing the bug! I'm thinking through the design in this doc: https://docs.google.com/document/d/1zPj_loUjFEs2Glrz2Lokl-JpVG5fsUbFcUrNeQ4UmPE/edit# (internal only).
May 29 2018
Somewhat tangential, but I just saw spurious alerts for nonexistent builders: bug 846785, bug 847144, bug 847254. This alert is not feeling good...
Jun 7 2018
Upon further analysis in https://docs.google.com/document/d/1z-bHB7bU93QSA0YpYurAugJgN4ZtwPmdzUSHUrF-wt0/edit# this alert turned out to be dead, now that we have no CQ builds on buildbot. This means it doesn't need fixing; it needs reimplementation. I'm inclined to do that as part of the larger monitoring effort that I'm working on.
Aug 16
This will be done as part of a larger effort; the relevant portion is tracked in issue 873754 - merging into it.
Comment 1 by dpranke@chromium.org, May 22 2018
Status: Assigned (was: Untriaged)