We should have alerting for bots and builders being offline too long
Issue description

See bug 872704 for one possible motivating example.

It looks like we have two related open bugs:
bug 694611, which talks about BuilderOffline, but seems to have morphed into a ClusterFuzz-specific thing.
bug 647805, which talks about "a large number of machines" going offline.

It's possible that if those two bugs were fixed, and if we had good coverage for when a builder had a lot of pending builds, we'd have sufficient coverage to not need anything further. But I don't think we have any of those things, and that seems bad.

Filing against Infra>Platform for initial triage. You could argue that maybe there's some Infra>Client work here as well, but this seems like a core part of the platform quality of service.

Thoughts?
Aug 13
Aug 31
Aug 31
Assigning to sergey for now, feel free to re-assign or mark as Available.
Oct 22
Merging into issue 873754. We now have alerts for pending builds and for expired tasks, so the conditions described in #0 should now be caught by troopers.
Jan 8
Issue 694611 has been merged into this issue.
Comment 1 by mmoss@chromium.org, Aug 10