CQ monitoring failed to detect a badly failing test suite
Issue description

Recently, in bug 665693, a change landed that caused many machines (but not all of them) in the pool serving the linux_chromium_rel_ng builder to completely fail the "webkit_tests" step. The affected builds retried without the patch, saw the same failures, and moved on.

The root cause of the problem is being addressed, but I'm filing this bug because we found out about it only when a developer noticed the problem and filed a bug, not through our own monitoring of the CQ.

There are several different ways we potentially could've caught this. The main one would've been an increase in cycle time or pending job queues on the builder. However, those numbers might be too noisy or coarse by themselves. If we were monitoring the number of jobs that had steps that failed both with and without the patch, or the specific failure rates of particular steps, we might've noticed the failure more easily.

Can (or should) we make sure those things are monitored also?
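To make the "failed both with and without the patch" signal concrete, here is a minimal sketch in Python. It is not how the CQ records results; the `StepResult` record and its field names are hypothetical, and it assumes each build reports one result per step for the with-patch run and, when retried, one for the without-patch run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """Hypothetical record of one step run; not the real CQ schema."""
    build_id: int
    step: str          # e.g. "webkit_tests"
    with_patch: bool   # True for the with-patch run, False for the retry
    passed: bool


def steps_failing_both_ways(results):
    """Count, per step name, the builds where the step failed both
    with the patch and without it."""
    # (build_id, step) -> set of with_patch flags for which the step failed
    failed = {}
    for r in results:
        if not r.passed:
            failed.setdefault((r.build_id, r.step), set()).add(r.with_patch)

    counts = Counter()
    for (_build_id, step), flags in failed.items():
        if flags == {True, False}:
            counts[step] += 1
    return counts
```

A sustained rise in this count for a single step would point at the builder or the tree rather than at individual patches, which is the case described above.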
,
Nov 22 2016
Much of the monitoring infrastructure is still kinda opaque to me, so I'm not sure what "Monarch can't handle much variability in metric values, and we have a lot of different step types" means.
,
Nov 22 2016
Katie's absolutely right in #1. dpranke: see go/monarch-high-cardinality-metrics for details.
,
Nov 22 2016
We can do monitoring based on Dremel queries too. Data about builds and steps is available from the chrome_infra.completed_builds table [1]. I am not sure, however, how to do alerts based on Dremel queries; sergeyberezin@ may be able to give a pointer.

[1] https://plx.corp.google.com/#/table/chrome_infra::completed_builds
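As a rough illustration of what a query-based check could compute, the Python sketch below derives per-step failure rates from rows already fetched from such a table. The row keys `step_name` and `result`, the `"SUCCESS"` value, and the thresholds are all assumptions for illustration; the actual columns of chrome_infra.completed_builds are not shown in this thread.

```python
from collections import defaultdict


def step_failure_rates(rows):
    """Return {step name: fraction of runs that did not succeed}.

    Each row is a dict with hypothetical keys "step_name" and "result".
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        totals[row["step_name"]] += 1
        if row["result"] != "SUCCESS":
            failures[row["step_name"]] += 1
    return {step: failures[step] / totals[step] for step in totals}


def steps_over_threshold(rows, threshold=0.5, min_runs=10):
    """List steps whose failure rate is at least `threshold`, ignoring
    steps with fewer than `min_runs` runs to cut down on noise."""
    run_counts = defaultdict(int)
    for row in rows:
        run_counts[row["step_name"]] += 1
    rates = step_failure_rates(rows)
    return sorted(
        step for step, rate in rates.items()
        if run_counts[step] >= min_runs and rate >= threshold
    )
```

The `min_runs` guard is a design choice to avoid flagging rarely run steps on one or two failures; both defaults would need tuning against real data.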
,
Nov 22 2016
P.S. Another issue is that there is a 1-1.5 hour delay between the event and the moment the data arrives in the Dremel table. So we may get an alert, but it may arrive too late.
,
Nov 22 2016
Actually, Sergey has already commented on Dremel alerting earlier in https://bugs.chromium.org/p/chromium/issues/detail?id=659344#c3.
,
Jan 18 2017
,
Mar 29 2017
Removing Infra>Monitoring since this is a CQ-related alert modification. Please reserve Infra>Monitoring for monitoring (ts_mon and event_mon) bugs. Added Ops-AddMonitoring label to track monitoring-related tasks.
,
Aug 18 2017
,
Aug 20
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 20
It might still be relevant, especially the idea of monitoring steps that fail both with and without the patch (if we're still doing those). But it's likely still a P3.
,
Aug 20
Comment 1 by katthomas@chromium.org, Nov 22 2016