Issue 666895

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug



CQ monitoring failed to detect a badly failing test suite

Project Member Reported by dpranke@chromium.org, Nov 18 2016

Issue description

Recently, in bug 665693, a change landed that caused many (but not all) of the machines in the pool serving the linux_chromium_rel_ng builder to completely fail the "webkit_tests" step. The builds then retried without the patch, saw the same failures, and moved on.

The root cause of the problem is being addressed, but I'm filing this bug because we only found out when a developer noticed the problem and filed a bug, not through our own monitoring of the CQ.

There are several ways we could potentially have caught this. The main one would've been an increase in cycle time or in pending job queues on the builder.

However, those numbers might be too noisy or coarse by themselves. If we were monitoring the number of jobs that had steps that failed both with and without the patch, or the specific failure rates of particular steps, we might've noticed the failure more easily.

Can (or should) we make sure those things are monitored also?
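
To make the two proposed signals concrete, here is a minimal sketch of how they could be computed from a batch of completed builds. The record layout (a "steps" list with "name", "failed_with_patch", and "failed_without_patch" fields) is hypothetical and does not correspond to any existing CQ or buildbot schema; it only illustrates the arithmetic.

```python
from collections import Counter


def summarize_step_failures(builds):
    """Return (builds_failing_both_ways, per_step_failure_rate).

    `builds` is an iterable of dicts in a hypothetical layout:
      {"steps": [{"name": str,
                  "failed_with_patch": bool,
                  "failed_without_patch": bool (optional)}]}
    """
    runs = Counter()      # how many times each step ran with the patch
    failures = Counter()  # how many of those runs failed
    builds_failing_both_ways = 0

    for build in builds:
        failed_both = False
        for step in build["steps"]:
            name = step["name"]
            runs[name] += 1
            if step["failed_with_patch"]:
                failures[name] += 1
                # The retry without the patch only exists if the first run failed.
                if step.get("failed_without_patch", False):
                    failed_both = True
        if failed_both:
            builds_failing_both_ways += 1

    rates = {name: failures[name] / runs[name] for name in runs}
    return builds_failing_both_ways, rates
```

A sustained rise in either number for a step such as "webkit_tests" would be the kind of signal that cycle time alone might hide.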
 
I think the reason we haven't implemented more step-specific monitoring is because Monarch can't handle much variability in metric values, and we have a lot of different step types.

Swarming does some monitoring by step: https://cs.chromium.org/chromium/infra/luci/appengine/swarming/ts_mon_metrics.py?q=spec_name&l=46 so it looks like we could easily add some kind of result monitoring there. In fact, we might have that already: https://cs.chromium.org/chromium/infra/luci/appengine/swarming/ts_mon_metrics.py?q=spec_name&l=46


For non-swarming tasks, we could identify some common patterns we care most about, and add monitoring for steps that match those patterns. We currently do this for bot_update duration. (https://cs.chromium.org/chromium/build/scripts/master/status_logger.py?l=449)
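
As a rough illustration of the pattern-matching idea, the sketch below maps raw step names onto a small fixed set of categories, so the metric field only ever takes a bounded number of values. The pattern list and category names are assumptions chosen for the example (only bot_update and webkit_tests are mentioned in this bug), and nothing here calls the real ts_mon API or mirrors status_logger.py.

```python
import re
from collections import Counter

# Hypothetical allowlist: (category, pattern). Anything else becomes "other".
STEP_PATTERNS = [
    ("bot_update", re.compile(r"^bot_update$")),
    ("webkit_tests", re.compile(r"^webkit_tests\b")),
    ("compile", re.compile(r"^compile\b")),
]


def step_category(step_name):
    """Map a raw step name to one of a small, fixed set of categories."""
    for category, pattern in STEP_PATTERNS:
        if pattern.search(step_name):
            return category
    return "other"


def count_failures_by_category(failed_step_names):
    """Count failed steps per category; keys are limited to the allowlist plus 'other'."""
    return Counter(step_category(name) for name in failed_step_names)
```

Because unmatched steps collapse into "other", the number of distinct field values stays at the size of the allowlist plus one, no matter how many step types exist.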


Much of the monitoring infrastructure is still kinda opaque to me, and so I'm not sure what "Monarch can't handle much variability in metric values, and we have a lot of different step types" means?
Katie's absolutely right in #1.
dpranke: see go/monarch-high-cardinality-metrics for details.
Cc: sergeybe...@chromium.org
We can do monitoring based on Dremel queries too. Data about builds and steps is available from the chrome_infra.completed_builds table [1]. I am not sure, however, how to do alerts based on Dremel queries; sergeyberezin@ may be able to give a pointer.


[1] https://plx.corp.google.com/#/table/chrome_infra::completed_builds
P.S. Another issue is that there is a 1-1.5 hour delay between the event and the moment the data arrives in the Dremel table. So we may get an alert, but it may arrive too late.
Actually, Sergey has already commented on Dremel alerting earlier in https://bugs.chromium.org/p/chromium/issues/detail?id=659344#c3.
Cc: -andyb...@chromium.org

Comment 8 by efoo@chromium.org, Mar 29 2017

Components: -Infra>Monitoring
Labels: Ops-AddMonitoring
Removing Infra>Monitoring since this is a CQ related alert modification. Please reserve Infra>Monitoring for monitoring (ts_mon and event_mon) bugs. Added Ops-AddMonitoring label to track monitoring related tasks.
Components: -Infra>CQ
Project Member

Comment 10 by sheriffbot@chromium.org, Aug 20

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: -serg...@chromium.org -dsansome@chromium.org -katthomas@chromium.org
Labels: -Hotlist-Recharge-Cold
Status: Available (was: Untriaged)
It might still be relevant, especially the idea of monitoring steps that fail both with and without the patch (if we're still doing those). But it's likely still a P3.
Cc: -tandrii@chromium.org
