Issue 841573

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Alert on high TestLabFailure rate.

Project Member Reported by nxia@chromium.org, May 9 2018

Issue description

Aborted suite_job failures are translated into INFRA_FAILURE in builds. Instead of watching CQ/Canary for INFRA_FAILURE, we can get notifications from alerts when the suite_job abort rate gets high (probably alerting only on the cq and bvt pools?).
 
Labels: -Chase-Pending Chase
Owner: nxia@chromium.org
Status: Assigned (was: Untriaged)
A metric on "infra failures reported to chromite" seems like our side of the fence now, and worth alerting on.
A CL is in flight to increment a counter for this event.

Comment 3 by bugdroid1@chromium.org, May 23 2018


Comment 4 by bugdroid1@chromium.org, May 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/939d645d55c2580d3379f4494405b41547d5d4bb

commit 939d645d55c2580d3379f4494405b41547d5d4bb
Author: chromite-chromium-autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Wed May 23 04:22:07 2018

Roll src/third_party/chromite/ e6853c3d4..07dfd789c (3 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/e6853c3d4b19..07dfd789cb28

$ git log e6853c3d4..07dfd789c --date=short --no-merges --format='%ad %ae %s'
2018-05-21 xixuan chromeos_config: Add non-important nyan_blaze paladin.
2018-05-18 jkop metrics: Rename cumulative distribution metric
2018-05-18 nxia Report stage failures to monarch.

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:845314 ,chromium:None,chromium:841573


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: I722e52cb1b03cda40c98b4a2b21244a043106a69
Reviewed-on: https://chromium-review.googlesource.com/1069846
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#560936}
[modify] https://crrev.com/939d645d55c2580d3379f4494405b41547d5d4bb/DEPS

Comment 5 by nxia@chromium.org, May 24 2018

pcon graph: http://shortn/_orVwdxLV1P
nxia to determine threshold

Comment 8 by nxia@chromium.org, Jun 1 2018

alert on fraction > 0.4 http://shortn/_bDzOnOBsP3

Comment 9 by nxia@chromium.org, Jun 6 2018

Summary: Alert on high TestLabFailure rate. (was: Alert on high suite_job abort rate.)
The CL https://critique.corp.google.com/#review/198916855 is to raise an alert when TestLabFailure rate is high. TestLabFailure (http://shortn/_B25rNar9cI) includes SuiteTimedOut, BoardNotAvailable and SwarmingProxyFailure. 

The rate threshold may need tuning as more HWTests run at the same time. One improvement is to use lab_failure_fraction (hwtest_stage_lab_failure_count / hwtest_stage_completion_count). test_stages.HWTestStage can override _FinishBuildStageInCIDBAndMonarch and report HWTest completion status, with the failure type as one field (if status == 'fail').
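
A minimal sketch of the lab_failure_fraction idea, assuming the two counts have already been read from the monitoring system over the same time window. The function names and the handling of a window with zero completions are illustrative assumptions, not chromite or Monarch APIs; the 0.4 threshold is only a placeholder borrowed from comment 8.

```python
from typing import Optional

# Placeholder threshold, borrowed from the "fraction > 0.4" in comment 8.
LAB_FAILURE_FRACTION_THRESHOLD = 0.4


def lab_failure_fraction(lab_failure_count: int,
                         completion_count: int) -> Optional[float]:
    """Return lab failures / completions, or None if nothing completed."""
    if completion_count <= 0:
        # No completed HWTest stages in the window: the fraction is undefined.
        return None
    return lab_failure_count / completion_count


def should_alert(lab_failure_count: int,
                 completion_count: int,
                 threshold: float = LAB_FAILURE_FRACTION_THRESHOLD) -> bool:
    """Return True when the lab failure fraction is strictly above threshold."""
    fraction = lab_failure_fraction(lab_failure_count, completion_count)
    return fraction is not None and fraction > threshold
```

Unlike a raw rate, the fraction should not drift upward just because more HWTests are running concurrently, which is the tuning concern raised above.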

Comment 10 by nxia@chromium.org, Jun 8 2018

Rate alert was merged at https://critique.corp.google.com/#review/199812731

Comment 11 by nxia@chromium.org, Jun 8 2018

Owner: ----
Status: Available (was: Assigned)
see improvement idea at #9
Owner: cra...@chromium.org
Status: Assigned (was: Available)
For the record, this is the issue I was thinking should be revived as a way to alert on failures in need of deputy attention (and a way around the argument about the color that master-paladin should turn).

I have some opinions on how to best realize this so let me know if you want to chat.
Cc: akes...@chromium.org
Status: Fixed (was: Assigned)
Looks to me like this is done:
http://google3/configs/monitoring/chrome_infra/chromeos/autotest_alerts.py?l=916&rcl=201706083
akeshet to file follow-up
Labels: -Chase Chase-Pending
Status: Available (was: Fixed)
Reviving.

What we have is this metric http://shortn/_rfIBENm3tb which was added in this CL https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1065213 , and an alert LabFailureRateHighAlert that has never fired.

The metric itself counts the number of failures of any cbuildbot stage; it is difficult to get from there to an actionable test-infra failure rate. (are we interested in the fraction 


Instead we should work from the starting point just of HWTest stages run by cbuildbot that included an actual run_suite call. We should emit a counter metric with fields:
{build_config (string), important (bool), status (enum[pass, test_failure, lab_failure, ...])}

Then we can alert directly on the fraction or rate of HWTest stages from important=True builders terminating in lab_failure.
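
The proposal above can be sketched as follows. This is an in-memory stand-in, not the real metrics library: the class names, the enum members beyond the three statuses listed above, and the example config names are all hypothetical, and a real implementation would emit to the monitoring backend rather than a local Counter.

```python
import collections
import enum


class HWTestStatus(enum.Enum):
    """Terminal status of a HWTest stage (the list above is open-ended)."""
    PASS = 'pass'
    TEST_FAILURE = 'test_failure'
    LAB_FAILURE = 'lab_failure'


class HWTestStageCounter:
    """In-memory stand-in for a counter metric keyed by its fields."""

    def __init__(self):
        # Key: (build_config, important, status) -> number of stages.
        self._counts = collections.Counter()

    def increment(self, build_config, important, status):
        """Record one finished HWTest stage."""
        self._counts[(build_config, important, status)] += 1

    def lab_failure_fraction(self, important=True):
        """Fraction of stages ending in lab_failure, or None if none ran."""
        total = 0
        lab_failures = 0
        for (_, is_important, status), count in self._counts.items():
            if is_important != important:
                continue
            total += count
            if status is HWTestStatus.LAB_FAILURE:
                lab_failures += count
        return lab_failures / total if total else None


counter = HWTestStageCounter()
counter.increment('example-paladin', True, HWTestStatus.PASS)
counter.increment('example-paladin', True, HWTestStatus.LAB_FAILURE)
counter.increment('example-experimental', False, HWTestStatus.LAB_FAILURE)
important_fraction = counter.lab_failure_fraction(important=True)
```

Because status is a field on a single counter, the numerator and denominator of the fraction come from the same metric, so an alert can filter on important=True without joining two separately reported series.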


Owner: ----
Labels: -Chase-Pending Chase
Owner: yshaul@chromium.org
Status: Assigned (was: Available)
Progress?

Comment 22 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/2426fe64bffb099a84645fdfb326a813307248db

commit 2426fe64bffb099a84645fdfb326a813307248db
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Tue Sep 18 17:31:04 2018

failed_stage metric: Add important field

A build is considered important if the build_config is not
on the experimental list. This feature was requested by
Test Infra team for their monitoring.

BUG=chromium:841573
TEST=run_tests

Change-Id: I72f1693fd198f4f87b6fa41ef0868a98692212f9
Reviewed-on: https://chromium-review.googlesource.com/1222155
Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
Reviewed-by: Yaakov Shaul <yshaul@google.com>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/2426fe64bffb099a84645fdfb326a813307248db/cbuildbot/stages/generic_stages.py

Progress?
Re #22 that CL is flawed, and will lead to different branches of chromite sending different field sets for the same metric. Adding a field to an existing metric is not safe (has to be done very carefully, and concurrently on all possible clients, which is not possible to do with chromite). I'm suggesting a revert in https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1249662

To move this particular bug forward, you'll need to add a new metric, not modify an existing one. 
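
A toy model of the problem, assuming the backend expects a single field set per metric name; the real failure mode may differ. ToyMetricStore, FieldSetMismatch, and the metric and field names used here are all hypothetical and only illustrate why two chromite branches reporting the same metric with different fields conflict, while a new metric name does not.

```python
class FieldSetMismatch(Exception):
    """A metric name was reported with a different set of field names."""


class ToyMetricStore:
    """Toy store that pins each metric name to its first-seen field set."""

    def __init__(self):
        self._field_names = {}  # metric name -> frozenset of field names
        self._counts = {}       # (name, sorted field items) -> count

    @staticmethod
    def _key(name, fields):
        return (name, tuple(sorted(fields.items())))

    def increment(self, name, fields):
        """Add one to the counter, rejecting an unexpected field set."""
        field_names = frozenset(fields)
        expected = self._field_names.setdefault(name, field_names)
        if field_names != expected:
            raise FieldSetMismatch(
                '%s: got fields %s, expected %s'
                % (name, sorted(field_names), sorted(expected)))
        key = self._key(name, fields)
        self._counts[key] = self._counts.get(key, 0) + 1

    def count(self, name, fields):
        """Return the current count for this metric name and field values."""
        return self._counts.get(self._key(name, fields), 0)
```

In this model, an old branch reporting {'build_config', 'stage'} and a new branch reporting {'build_config', 'stage', 'important'} under the same name collide on the second increment, whereas reporting the extended field set under a new metric name succeeds.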

Also, as I suggest in comment #17, I think counting stages is going to be a noisy and difficult way to go about this. Instead:

"Instead we should work from the starting point just of HWTest stages run by cbuildbot that included an actual run_suite call. We should emit a counter metric with fields:
{build_config (string), important (bool), status (enum[pass, test_failure, lab_failure, ...])}

Then we can alert directly on the fraction or rate of HWTest stages from important=True builders terminating in lab_failure."

Comment 25 by bugdroid1@chromium.org, Sep 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a

commit 4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sat Sep 29 07:27:33 2018

Revert "failed_stage metric: Add important field"

This reverts commit 2426fe64bffb099a84645fdfb326a813307248db.

Reason for revert: Will lead to metric breakage due to incompatible
fields from different chromite branches.

BUG=chromium:841573
TEST=None

Original change's description:
> failed_stage metric: Add important field
>
> A build is considered important if the build_config is not
> on the experimental list. This feature was requested by
> Test Infra team for their monitoring.
>
> BUG=chromium:841573
> TEST=run_tests
>
> Change-Id: I72f1693fd198f4f87b6fa41ef0868a98692212f9
> Reviewed-on: https://chromium-review.googlesource.com/1222155
> Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Reviewed-by: Yaakov Shaul <yshaul@google.com>
> Reviewed-by: Mike Nichols <mikenichols@chromium.org>

Bug: chromium:841573
Change-Id: If015927769a520733f61827ee677ebfc7a719e88
Reviewed-on: https://chromium-review.googlesource.com/1249662
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a/cbuildbot/stages/generic_stages.py

Owner: akes...@chromium.org
Labels: -Chase
