Alert on high TestLabFailure rate.
Issue description
Aborted suite_job failures are translated into INFRA_FAILURE in builds. Instead of watching CQ/Canary for INFRA_FAILURE, we can get notifications from alerts when the suite_job abort rate gets high (probably alerting only on the cq and bvt pools?).
,
May 21 2018
A CL is in flight to increment a counter for this event.
,
May 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/f54c371563b2cae2b8da430f6c0ab09c635b17e2

commit f54c371563b2cae2b8da430f6c0ab09c635b17e2
Author: Ningning Xia <nxia@google.com>
Date: Wed May 23 01:45:47 2018

Report stage failures to monarch.

Report stage failures with known category and build_config to monarch. This will enable us to alert on high lab failure rate.

BUG=chromium:841573
TEST=run_tests

Change-Id: Id2e51fe5123421f39327e8809c2610c737917ebe
Reviewed-on: https://chromium-review.googlesource.com/1065213
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/lib/failures_lib_unittest.py
[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/cbuildbot/stages/generic_stages.py
[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/cbuildbot/builders/generic_builders.py
[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/lib/constants.py
[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/lib/failure_message_lib.py
[modify] https://crrev.com/f54c371563b2cae2b8da430f6c0ab09c635b17e2/lib/failures_lib.py
,
May 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/939d645d55c2580d3379f4494405b41547d5d4bb

commit 939d645d55c2580d3379f4494405b41547d5d4bb
Author: chromite-chromium-autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Wed May 23 04:22:07 2018

Roll src/third_party/chromite/ e6853c3d4..07dfd789c (3 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/e6853c3d4b19..07dfd789cb28

$ git log e6853c3d4..07dfd789c --date=short --no-merges --format='%ad %ae %s'
2018-05-21 xixuan chromeos_config: Add non-important nyan_blaze paladin.
2018-05-18 jkop metrics: Rename cumulative distribution metric
2018-05-18 nxia Report stage failures to monarch.

Created with:
roll-dep src/third_party/chromite
BUG= chromium:845314 ,chromium:None,chromium:841573

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary.

TBR=chrome-os-gardeners@chromium.org

Change-Id: I722e52cb1b03cda40c98b4a2b21244a043106a69
Reviewed-on: https://chromium-review.googlesource.com/1069846
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#560936}

[modify] https://crrev.com/939d645d55c2580d3379f4494405b41547d5d4bb/DEPS
,
May 24 2018
pcon graph: http://shortn/_orVwdxLV1P
,
May 29 2018
nxia to determine threshold
,
Jun 1 2018
Created a CL at https://critique.corp.google.com/#review/198916855 There was a high lab failure rate on 5/25 https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18701 http://shortn/_PljLjh6fCp
,
Jun 1 2018
alert on fraction > 0.4 http://shortn/_bDzOnOBsP3
,
Jun 6 2018
The CL https://critique.corp.google.com/#review/198916855 raises an alert when the TestLabFailure rate is high. TestLabFailure (http://shortn/_B25rNar9cI) includes SuiteTimedOut, BoardNotAvailable and SwarmingProxyFailure.

The rate threshold may need tuning as more HWTests run at the same time. One improvement is to use lab_failure_fraction (hwtest_stage_lab_failure_count / hwtest_stage_completion_count). test_stages.HWTestStage can override _FinishBuildStageInCIDBAndMonarch and report HWTest completion status with the failure type as one field (if status == 'fail').
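For illustration only, the lab_failure_fraction idea above could be sketched as follows. The function names are hypothetical and this is not the alert config itself; the 0.4 default is borrowed from the "alert on fraction > 0.4" comment earlier in this thread, and whether that threshold carries over to this fraction is an assumption.

```python
from typing import Optional

# Assumption: reuse the 0.4 threshold mentioned earlier in this thread.
LAB_FAILURE_FRACTION_THRESHOLD = 0.4


def lab_failure_fraction(lab_failure_count: int,
                         completion_count: int) -> Optional[float]:
    """Return lab failures / completions, or None if no stage completed."""
    if completion_count <= 0:
        # No completed HWTest stages: the fraction is undefined.
        return None
    return lab_failure_count / completion_count


def should_alert(lab_failure_count: int,
                 completion_count: int,
                 threshold: float = LAB_FAILURE_FRACTION_THRESHOLD) -> bool:
    """Return True when the lab failure fraction strictly exceeds threshold."""
    fraction = lab_failure_fraction(lab_failure_count, completion_count)
    return fraction is not None and fraction > threshold
```

Unlike a raw failure rate, the fraction stays comparable as the number of concurrent HWTest stages grows, which is the tuning concern raised above.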
,
Jun 8 2018
Rate alert was merged at https://critique.corp.google.com/#review/199812731
,
Jun 8 2018
see improvement idea at #9
,
Jun 11 2018
,
Jun 20 2018
For the record, this is the issue I was thinking should be revived as a way to alert on failures in need of deputy attention (and a way around the argument about the color that master-paladin should turn). I have some opinions on how to best realize this so let me know if you want to chat.
,
Jun 20 2018
,
Jun 25 2018
Looks to me like this is done: http://google3/configs/monitoring/chrome_infra/chromeos/autotest_alerts.py?l=916&rcl=201706083
,
Jun 25 2018
akeshet to file follow-up
,
Jul 20
Reviving.

What we have is this metric http://shortn/_rfIBENm3tb which was added in this CL https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1065213 , and an alert LabFailureRateHighAlert that has never fired.

The metric itself counts the number of failures of any cbuildbot stage; it is difficult to get from there to an actionable test-infra failure rate. (are we interested in the fraction

Instead we should work from the starting point just of HWTest stages run by cbuildbot that included an action run_suite call. We should emit a counter metric with fields:

{build_config (string), important (bool), status (enum[pass, test_failure, lab_failure, ...])}

Then we can alert directly on the fraction or rate of HWTest stages from important=True builders terminating in lab_failure.
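A rough sketch of the proposed counter, to make the field set concrete. It uses a plain in-memory Counter as a stand-in; the class and method names are hypothetical and this is not chromite's metrics library or the monitoring backend. The status enum lists only the three values named above.

```python
import collections
import enum


class HWTestStatus(enum.Enum):
    """Terminal status of one HWTest stage (only the values named above)."""
    PASS = 'pass'
    TEST_FAILURE = 'test_failure'
    LAB_FAILURE = 'lab_failure'


class HWTestCompletionCounter:
    """In-memory stand-in for a counter metric keyed by its fields."""

    def __init__(self):
        # Key: (build_config, important, status) -> number of stages.
        self._counts = collections.Counter()

    def increment(self, build_config, important, status):
        """Record one completed HWTest stage."""
        self._counts[(build_config, bool(important), status)] += 1

    def lab_failure_fraction(self, important=True):
        """Fraction of stages ending in lab_failure, or None if none ran."""
        total = 0
        lab_failures = 0
        for (_, is_important, status), count in self._counts.items():
            if is_important != important:
                continue
            total += count
            if status is HWTestStatus.LAB_FAILURE:
                lab_failures += count
        return lab_failures / total if total else None
```

For example, with made-up build_config names:

```python
counter = HWTestCompletionCounter()
counter.increment('example-paladin', True, HWTestStatus.PASS)
counter.increment('example-paladin', True, HWTestStatus.LAB_FAILURE)
counter.increment('example-experimental', False, HWTestStatus.LAB_FAILURE)
counter.lab_failure_fraction(important=True)  # only important builders count
```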
,
Jul 20
,
Jul 23
,
Sep 11
Progress?
,
Sep 13
The metric was partially addressed in CL https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1182404 , and the important field was added in CL https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1222155 .
,
Sep 18
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/2426fe64bffb099a84645fdfb326a813307248db

commit 2426fe64bffb099a84645fdfb326a813307248db
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Tue Sep 18 17:31:04 2018

failed_stage metric: Add important field

A build is considered important if the build_config is not on the experimental list. This feature was requested by Test Infra team for their monitoring.

BUG=chromium:841573
TEST=run_tests

Change-Id: I72f1693fd198f4f87b6fa41ef0868a98692212f9
Reviewed-on: https://chromium-review.googlesource.com/1222155
Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
Reviewed-by: Yaakov Shaul <yshaul@google.com>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/2426fe64bffb099a84645fdfb326a813307248db/cbuildbot/stages/generic_stages.py
,
Sep 24
Progress?
,
Sep 27
Re #22: that CL is flawed, and will lead to different branches of chromite sending different field sets for the same metric. Adding a field to an existing metric is not safe (it has to be done very carefully, and concurrently on all possible clients, which is not possible to do with chromite). I'm suggesting a revert in https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1249662

To move this particular bug forward, you'll need to add a new metric, not modify an existing one.

Also, as I suggest in comment #17, I think counting stages is going to be a noisy and difficult way to go about this. Instead:

"Instead we should work from the starting point just of HWTest stages run by cbuildbot that included an action run_suite call. We should emit a counter metric with fields: {build_config (string), important (bool), status (enum[pass, test_failure, lab_failure, ...])} Then we can alert directly on the fraction or rate of HWTest stages from important=True builders terminating in lab_failure."
,
Sep 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a

commit 4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sat Sep 29 07:27:33 2018

Revert "failed_stage metric: Add important field"

This reverts commit 2426fe64bffb099a84645fdfb326a813307248db.

Reason for revert: Will lead to metric breakage due to incompatible fields from different chromite branches.

BUG=chromium:841573
TEST=None

Original change's description:
> failed_stage metric: Add important field
>
> A build is considered important if the build_config is not
> on the experimental list. This feature was requested by
> Test Infra team for their monitoring.
>
> BUG=chromium:841573
> TEST=run_tests
>
> Change-Id: I72f1693fd198f4f87b6fa41ef0868a98692212f9
> Reviewed-on: https://chromium-review.googlesource.com/1222155
> Commit-Ready: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Tested-by: Dhanya Ganesh <dhanyaganesh@chromium.org>
> Reviewed-by: Yaakov Shaul <yshaul@google.com>
> Reviewed-by: Mike Nichols <mikenichols@chromium.org>

Bug: chromium:841573
Change-Id: If015927769a520733f61827ee677ebfc7a719e88
Reviewed-on: https://chromium-review.googlesource.com/1249662
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/4912a95b73c9e3191a9e6d5b2ce6b88a784d8e7a/cbuildbot/stages/generic_stages.py
,
Oct 1
,
Nov 19
Comment 1 by akes...@chromium.org, May 14 2018
Owner: nxia@chromium.org
Status: Assigned (was: Untriaged)