
Issue 812286

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug




Autotest suites not progressing, then getting aborted

Project Member Reported by pho...@chromium.org, Feb 14 2018

Issue description

Owner: ayatane@chromium.org
For example 1, the suite was just waiting around for one of the jobs to finish. That job seems to be in the Queued state on cautotest, but the shard shows a different story: It was aborted in the gathering phase:

http://cros-full-0010.mtv.corp.google.com/afe/#tab_id=view_job&object_id=176649049

Looking at the logs for this test: http://cros-full-0010.mtv.corp.google.com/results/176649049-chromeos-test/chromeos4-row9-rack11-host12/ 
(why are these logs not offloaded? This is one of the new shards. I suspect that gs_offloader isn't working correctly on it (also others?))

Looking inside job_reporter.log I see:

lucifer_run_job: 2018/02/14 00:25:21 Starting with args: [/opt/infra-tools/usr/bin/lucifer_run_job -abortsock /usr/local/autotest/leases/176649049.sock -resultsdir /usr/local/autotest/results/176649049-chromeos-test/chromeos4-row9-rack11-host12 -autotestdir /usr/local/autotest -watcherpath /opt/infra-tools/usr/bin/lucifer_watcher -x-need-gather -x-num-tests-failed -1 -x-autoserv-exit 1 -x-hosts chromeos4-row9-rack11-host12]
invalid value "-1" for flag -x-num-tests-failed: strconv.ParseUint: parsing "-1": invalid syntax
Usage of /opt/infra-tools/usr/bin/lucifer_run_job:
  -abortsock string
    	Abort socket (default "/nonexistent")
  -autotestdir string
    	Autotest directory (default "/usr/local/autotest")
  -hosts string
    	DUT hostnames, comma separated
  -resultsdir string
    	Results directory (default "/nonexistent")
  -watcherpath string
    	Path to lucifer_watcher binary (default "/usr/bin/lucifer_watcher")
  -x-autoserv-exit int
    	autoserv exit status (default 255)
  -x-hosts string
    	Deprecated, DUT hostnames, comma separated
  -x-need-gather
    	Run GatherLogs
  -x-need-starting
    	Handle STARTING portion of job
  -x-num-tests-failed uint
    	Number of tests failed
job_reporter: 2018-02-14 00:25:21,479:DEBUG:eventlib:run_event_command:80:Event command child with pid 23778 exited with 2
job_reporter: 2018-02-14 00:25:21,480:INFO:job_reporter:main:45:Exiting normally with: 2
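
For reference, the parse failure above is standard Go flag handling: a flag declared as uint cannot take a negative value, and the default flag set prints usage and exits the process with status 2 when parsing fails, which matches the "exited with 2" that job_reporter records. A minimal standalone sketch (not the actual lucifer_run_job source) that reproduces the same error when run with -x-num-tests-failed -1:

// Minimal sketch reproducing the failure mode in the log above.
// Illustrative only; lucifer_run_job's real flag set is much larger.
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Declared as uint, so flag.Parse rejects "-1" via strconv.ParseUint,
	// prints the usage text, and exits with status 2.
	numTestsFailed := flag.Uint("x-num-tests-failed", 0, "Number of tests failed")
	flag.Parse()
	fmt.Println("tests failed:", *numTestsFailed)
}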


----------------
So, job_executor failed, and parsing never ran.
I think this means that the suite considers the job hung (dynamic_suite depends on an updated TKO entry to know whether the job passed or failed, and that entry is only updated during parsing).
Looking at luciferlib, there are two action items (AIs) here:

- job_reporter / job_executor need to agree on the behavior when num_tests_failed is missing. job_reporter currently passes in -1 in this situation, and job_executor fails, as above.

- This probably means that the pidfile contents for the job were missing / corrupted. That is the root cause here, which I haven't followed up on.
Not worrying about the root cause for now:

Basically, -1 means the same as any non-zero value; we either consider the job failed or we don't.
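
For the first action item, here is a hedged sketch of one way the executor side could tolerate the -1 sentinel: declare the flag as a signed int and treat a negative value as "count unknown", i.e. the same as any non-zero value. This is only an illustration of the behavior discussed above, not the actual lucifer change; only the flag name is taken from the log.

// Sketch only: accept -1 (count unknown) instead of rejecting it at parse time.
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Signed int so job_reporter's -1 sentinel parses instead of killing
	// flag parsing; a negative value means "number of failed tests unknown".
	numTestsFailed := flag.Int("x-num-tests-failed", -1, "Number of tests failed (-1 if unknown)")
	flag.Parse()

	// Per the comment above, -1 is treated like any other non-zero value:
	// the job is considered failed (or at least not known to have passed).
	if *numTestsFailed != 0 {
		fmt.Println("treating job as failed; reported count:", *numTestsFailed)
	} else {
		fmt.Println("all tests passed")
	}
}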

Comment 6 by pho...@chromium.org, Feb 14 2018

Issue 812130 has been merged into this issue.

Comment 7 by pho...@chromium.org, Feb 14 2018

Labels: -Pri-1 Pri-0

Comment 8 by pho...@chromium.org, Feb 14 2018

Cc: alemate@chromium.org snanda@chromium.org
Components: Infra>Client>ChromeOS
Apparently this is affecting the PFQ as well. +alemate
Cc: weidongg@chromium.org
Here https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-chrome-pfq/builds/2582 HWTest [bvt-arc] was not started; it waited 20000 seconds for nothing and timed out. I am not sure it is the same issue.
The plague is actually spreading.
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-chrome-pfq/builds/1887 (reef hadn't failed in a week): the test was never started. Though it has some new information:

12:22:49: INFO: Refreshing access_token
12:37:22: INFO: RetriableHttp: attempt 1 receiving status 503, will retry
12:37:28: INFO: RetriableHttp: attempt 2 receiving status 503, will retry
12:37:34: INFO: RetriableHttp: attempt 3 receiving status 503, will retry
12:37:40: INFO: RetriableHttp: attempt 4 receiving status 503, will retry
12:37:46: INFO: RetriableHttp: attempt 5 receiving status 503, final attempt
12:37:47: WARNING: HttpsMonitor.send received status 503: {
  "error": {
    "code": 503,
    "message": "The service is currently unavailable.",
    "status": "UNAVAILABLE"
  }
}
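
For context (a later comment notes these lines are probably from the ts-mon metrics sender rather than from the suite itself), the RetriableHttp pattern in that log is just a bounded retry on HTTP 503. A rough sketch of the idea, with the endpoint and retry interval assumed rather than taken from the actual code:

// Rough sketch of a bounded retry on HTTP 503, mirroring the log above.
// Not the actual RetriableHttp implementation.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		// Placeholder endpoint; the real sender posts to a monitoring service.
		resp, err := http.Get("https://example.com/monitoring")
		if err == nil && resp.StatusCode != http.StatusServiceUnavailable {
			resp.Body.Close()
			log.Printf("attempt %d got status %d, done", attempt, resp.StatusCode)
			return
		}
		if resp != nil {
			resp.Body.Close()
		}
		if attempt < maxAttempts {
			log.Printf("attempt %d receiving status 503, will retry", attempt)
			time.Sleep(6 * time.Second) // the log above shows roughly 6s between attempts
		} else {
			log.Printf("attempt %d receiving status 503, final attempt", attempt)
		}
	}
}
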
veyron_minnie again:

https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_minnie-chrome-pfq/builds/2583 

HWTest [sanity] (3 hrs, 1 min, 1 sec)
stdout
Link to suite
[Test-Logs]: Suite job: ABORT
#11 : I'm pretty sure those are logs from metrics (ts-mon). They're not related.
 

Comment 15 by mqg@chromium.org, Feb 15 2018

Cc: mqg@chromium.org
I think #12 is different; the test job for that never got scheduled to a DUT.

The lars-release issue in #14 is different; the tests ran to completion (but some of them failed). The suite job also ran to completion, but gets marked as aborted rather than completed because some job keyval was aborted.

The duplicate merged in #6 is the same issue.

#11 is the same issue.

#10 is different, the child jobs never got scheduled to DUTs like #12.

Regarding this issue, the CL to make it not a timeout abort is in the CQ. We may want to chump it.

Note that the tests affected by this issue would have failed anyway; the issue obfuscates the failure and makes it look like a timeout, but it isn't affecting passing runs (so the affected builds wouldn't have passed even once this issue is fixed). Thus, there is probably some root-causing still to be done.

Project Member Comment 17 by bugdroid1@chromium.org, Feb 15 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/infra/lucifer/+/191d897a62b4d41999ed24542f766467bf198761

commit 191d897a62b4d41999ed24542f766467bf198761
Author: Allen Li <ayatane@google.com>
Date: Wed Feb 14 19:23:47 2018

Fix testsFailed to accept -1

BUG= chromium:812286 
TEST=None

Change-Id: I237922c5d7f5ef8880a806994543de35a856b135

[modify] https://crrev.com/191d897a62b4d41999ed24542f766467bf198761/src/chromiumos/infra/lucifer/cmd/lucifer_run_job/main.go

Cc: mkarkada@chromium.org bhthompson@chromium.org josa...@chromium.org dchan@chromium.org kbleicher@chromium.org
Status: Fixed (was: Started)
I'm seeing an infra issue while scheduling tests on boards daisy_skate and bob.
^ Please include links or details, or start a new bug.
