Autotest suites not progressing, then getting aborted
Issue description

Both of these test suites were happily running tests until they suddenly stopped, and got aborted about an hour later. Why?

Example 1:
build: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/5279
suite: http://cautotest-prod/afe/#tab_id=view_job&object_id=176648874
logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/176648874-chromeos-test/hostless/

Example 2:
build: https://luci-milo.appspot.com/buildbot/chromeos/quawks-paladin/2181
suite: http://cautotest-prod/afe/#tab_id=view_job&object_id=176742773
logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/176742773-chromeos-test/hostless/
Feb 14 2018
Looking at luciferlib, there are two action items here:
- job_reporter and job_executor need to agree on the behavior when num_tests_failed is missing. job_reporter currently passes -1 in this situation, and job_executor fails, as above.
- The missing count probably means that the pidfile contents for the job were missing or corrupted. That is the root cause here, which I haven't followed up on yet.
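To make the reporter/executor contract concrete, here is a minimal Go sketch of the tolerant behavior being proposed. The flag name, messages, and exit codes are hypothetical illustrations, not lucifer's actual code: job_reporter passes -1 when the count is unknown, and the executor treats that like any other failure instead of erroring out.

package main

import (
	"flag"
	"fmt"
	"os"
)

// testsFailed carries the count job_reporter hands to the executor:
// a real count when the job's pidfile was readable, or -1 when the
// pidfile contents were missing or corrupted and the count is unknown.
var testsFailed = flag.Int("testsfailed", 0, "number of failed tests, or -1 if unknown")

func main() {
	flag.Parse()
	// Reject only values below -1; -1 is a legal "unknown" marker.
	if *testsFailed < -1 {
		fmt.Fprintf(os.Stderr, "invalid testsfailed value: %d\n", *testsFailed)
		os.Exit(2)
	}
	// Treat -1 like any other non-zero value: the job is considered
	// failed, since we cannot show that all tests passed.
	if *testsFailed != 0 {
		fmt.Printf("job failed (testsFailed=%d)\n", *testsFailed)
		os.Exit(1)
	}
	fmt.Println("job passed")
}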
Feb 14 2018
Not worrying about the root cause for now. Basically, -1 means the same as non-zero; either we consider it failed, or we don't.
Feb 14 2018
https://chromium-review.googlesource.com/c/chromiumos/infra/lucifer/+/919681
Feb 14 2018
Issue 812130 has been merged into this issue.
Feb 14 2018
Apparently this is affecting the PFQ as well. +alemate
Feb 14 2018
Here https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-chrome-pfq/builds/2582 the HWTest [bvt-arc] stage was not started; it waited for 20000 seconds for nothing and timed out. I am not sure it is the same issue.
Feb 14 2018
The plague is actually spreading. https://uberchromegw.corp.google.com/i/chromeos/builders/reef-chrome-pfq/builds/1887 (reef had not failed for a week): the test was never started. Though it has some new information:

12:22:49: INFO: Refreshing access_token
12:37:22: INFO: RetriableHttp: attempt 1 receiving status 503, will retry
12:37:28: INFO: RetriableHttp: attempt 2 receiving status 503, will retry
12:37:34: INFO: RetriableHttp: attempt 3 receiving status 503, will retry
12:37:40: INFO: RetriableHttp: attempt 4 receiving status 503, will retry
12:37:46: INFO: RetriableHttp: attempt 5 receiving status 503, final attempt
12:37:47: WARNING: HttpsMonitor.send received status 503: { "error": { "code": 503, "message": "The service is currently unavailable.", "status": "UNAVAILABLE" } }
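For context on what those log lines show (not an explanation of this failure): a capped-attempts retry on HTTP 503 looks roughly like the Go sketch below. The function name, URL, and sleep interval are hypothetical stand-ins; the real ts-mon client and its backoff policy may differ.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// sendWithRetry mimics the "attempt N receiving status 503, will retry"
// pattern above: retry a capped number of times on 503, then give up.
func sendWithRetry(url string, attempts int) (*http.Response, error) {
	for i := 1; i <= attempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err // transport error: no retry in this sketch
		}
		if resp.StatusCode != http.StatusServiceUnavailable {
			return resp, nil
		}
		resp.Body.Close()
		if i < attempts {
			fmt.Printf("attempt %d receiving status 503, will retry\n", i)
			time.Sleep(6 * time.Second) // the log shows roughly 6s between attempts
		}
	}
	return nil, fmt.Errorf("received status 503 on final attempt %d", attempts)
}

func main() {
	if _, err := sendWithRetry("https://example.invalid/send", 5); err != nil {
		fmt.Println("WARNING:", err)
	}
}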
Feb 14 2018
veyron_minnie again: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-chrome-pfq/builds/2583

HWTest [sanity] (3 hrs, 1 min, 1 sec): Suite job: ABORT
Feb 14 2018
#11: I'm pretty sure those are logs from metrics (ts-mon). They're not related.
Feb 14 2018
This might also be causing the PaygenTestStable suites to fail on kevin, lars, and veyron_jerry:
https://luci-milo.appspot.com/buildbot/chromeos_release/kevin-release%20release-R64-10176.B/66
https://luci-milo.appspot.com/buildbot/chromeos_release/lars-release%20release-R64-10176.B/67
https://luci-milo.appspot.com/buildbot/chromeos_release/veyron_jerry-release%20release-R64-10176.B/66
+
Feb 15 2018
I think #12 is different; the test job for that never got scheduled to a DUT.

The lars-release issue in #14 is different: the tests ran to completion (but some of them failed). The suite job also ran to completion, but gets marked as aborted rather than completed because some job keyval was aborted.

The duplicate merged in #6 is the same issue. #11 is the same issue. #10 is different; the child jobs never got scheduled to DUTs, like #12.

Regarding this issue, the CL to make it not a timeout abort is at the CQ. Perhaps we may want to chump it. Note that the tests affected by this issue would have failed anyway; the issue obfuscates the failure and turns it into a timeout, but it isn't affecting passing runs (so the affected builds wouldn't have passed even once this issue is fixed). Thus, there is probably some root-causing still to be done.
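To illustrate the keyval behavior described above with a hedged Go sketch: the type and field names here are hypothetical simplifications, not Autotest's actual keyval schema, but the effect is the same; a single aborted job keyval is enough to flip an otherwise completed suite to Aborted.

package main

import "fmt"

// JobKeyval is a hypothetical, simplified view of a job keyval entry;
// the real Autotest keyval files carry many more fields.
type JobKeyval struct {
	JobID   int
	Aborted bool
}

// suiteStatus mirrors the behavior described above: even when every
// child job ran to completion, one aborted keyval marks the suite
// Aborted rather than Completed.
func suiteStatus(keyvals []JobKeyval) string {
	for _, kv := range keyvals {
		if kv.Aborted {
			return "Aborted"
		}
	}
	return "Completed"
}

func main() {
	kvs := []JobKeyval{{JobID: 1}, {JobID: 2, Aborted: true}}
	fmt.Println(suiteStatus(kvs)) // prints "Aborted"
}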
Feb 15 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/infra/lucifer/+/191d897a62b4d41999ed24542f766467bf198761

commit 191d897a62b4d41999ed24542f766467bf198761
Author: Allen Li <ayatane@google.com>
Date: Wed Feb 14 19:23:47 2018

Fix testsFailed to accept -1

BUG=chromium:812286
TEST=None

Change-Id: I237922c5d7f5ef8880a806994543de35a856b135
[modify] https://crrev.com/191d897a62b4d41999ed24542f766467bf198761/src/chromiumos/infra/lucifer/cmd/lucifer_run_job/main.go
Feb 15 2018
I'm seeing an infra issue while scheduling tests on boards daisy_skate and bob.
Feb 15 2018
^ Please include links or details, or start a new bug.