Multiple CQ failures in daisy_skate-paladin
Reported by jrbarnette@chromium.org, Dec 8 2016
Issue description
In the past few days, a notable number of runs of daisy_skate-paladin
have failed with a distinct, unexplained syndrome. Here are the runs
with the failures, most recent first:
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8417
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8414
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8413
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8410
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8409
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8402
Only the CQ shows this failure (not the canaries, not the PFQ), and it's
almost (but not quite) restricted to daisy_skate. There are also two
failures in stumpy-paladin:
https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/26844
https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/26843
The key features of the syndrome:
* Multiple test failures that pass on retry.
* At least one DUT reporting provision failure.
The provision failure is also quite distinctive. Here's the
relevant error from "status.log":
FAIL ---- verify.cros timestamp=1481090149 localtime=Dec 06 21:55:49 command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_L3wc6sssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos4-row2-rack8-host16 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::verify|verify_software|upstart_status] -> ssh_run(status
system-services | grep start/running)\";fi; status system-services | grep
start/running"
Exit status: 1
Duration: 0.115790128708

Comment 1 by jrbarnette@chromium.org, Dec 8 2016
More data about what might or might not cause the problem:
* The symptom is restricted to the CQ. It doesn't appear in the PFQ or the
  canaries. That suggests the problem is a bad CL in the CQ runs, and not a
  problem with code in the tree.
* The symptom on daisy_skate has occurred on at least three different hosts.
  As noted, there are also failures on stumpy. So, the problem is probably
  not restricted to specific DUTs only.
Although the pattern of failures suggests a CL with a bug in it, my first
attempt to find a single CL unique to all the failed runs came up empty.

Comment 2, Dec 8 2016
N.B. Provision failures are reported as "infrastructure failures", which
means that the CQ won't blame any CLs and will retry all CLs in the set.
To stop constantly cycling on the same problem, we've throttled the tree.

Comment 3, Dec 8 2016
More data to contribute to the head-scratching:
Provision runs the following steps:
* Download and apply an AU.
* Perform sanity checks, including a check that system-services
started.
* Perform host verification on the DUT. That also includes
a check that system-services started.
The failed provision tasks seem to indicate that the first check passed
(the AU proper reported no failures), but the second check failed (that's
the failure in the status.log file). It's (essentially) not possible for
system-services to stop once it has started, so it doesn't make sense that
the second check could fail when the first check passed, yet that seems to
be what happened.
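For anyone who wants to poke at this, the second check can be reproduced by
hand; this is just the command from the status.log excerpt above, stripped of
the autotest ssh options (substitute whichever DUT you're looking at for the
hostname from the failed job):
  # Re-run the verify step's check directly; exit status 1 reproduces the failure.
  ssh root@chromeos4-row2-rack8-host16 'status system-services | grep start/running'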

Comment 4, Dec 8 2016
All the failures have this CL in common: https://chromium-review.googlesource.com/#/c/409017/

Comment 5, Dec 8 2016
Re c#4: That's also the kind of CL that might cause this kind of symptom.
mka@ - If you think this CL is causing the problem, you can close this bug.
If you think the CL isn't the problem and you can make the case, then we can
reassign it.

Comment 6, Dec 8 2016
If I interpret the logs correctly, all the HWTests failed due to problems
with servo:
[Test-Logs]: provision: FAIL: Servo initialize timed out., command execution error, completed successfully
[Test-Logs]: provision: FAIL: Servo board is unconfigured; should be stumpy, command execution error, completed successfully
I don't think that the software on the DUT can cause this kind of problem.
Looks more like a problem on the host side to me.

Comment 7, Dec 8 2016
Sorry, I should have explained it earlier. The messages about servo
are a red herring. Roughly, they're caused by bug 670469. More
importantly, the reported servo failures were repaired during
provisioning, so although they show up in the report, they're
not the cause of the failure.
I've attached the full status.log from one of the failed provision
jobs; you can look through it to see more. The key is this failure
noted in the original description:
FAIL ---- verify.cros timestamp=1481090149 localtime=Dec 06 21:55:49 command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_L3wc6sssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos4-row2-rack8-host16 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::verify|verify_software|upstart_status] -> ssh_run(status
system-services | grep start/running)\";fi; status system-services | grep
start/running"
Exit status: 1
Duration: 0.115790128708
*That's* the failure that caused provisioning to fail. That symptom
means that "system-services" wasn't running at the time the host was
finally checked. Generally, that can only happen if Chrome fails to
start at boot.
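If someone does get their hands on an affected DUT, the quickest way to check
the "Chrome failed to start" theory is to look at the UI logs right after a
failed provision. The paths below are the usual Chrome OS locations; adjust
if the image differs:
  # Chrome / session_manager output from the most recent UI start attempt.
  tail -n 100 /var/log/ui/ui.LATEST
  # Boot-time syslog entries from session_manager around the failure window.
  grep session_manager /var/log/messages | tail -n 50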

Comment 8, Dec 8 2016
Thanks for providing more context! I'm not sure how to interpret the logs;
it seems there is no straightforward way to see which system services are
running. The check for system-services is logged, but without additional
information:
2016-12-06T00:21:35.139641+00:00 NOTICE ag[7117]: autotest server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)
Will try to get a unit to have a look.

Comment 9, Dec 8 2016
"system-services" is a single upstart job that is triggered
when Chrome reports reaching the login screen. Once the job
starts, it doesn't stop until shutdown. The command that's
run simply checks that that job is running, meaning it merely
checks that Chrome reported login screen.
There's some technical detail about the boot flow generally,
and about system-services in particular here:
https://www.chromium.org/chromium-os/chromiumos-design-docs/boot-design
That document _might_ help, but you'll probably be more enlightened
by watching an actual device boot. You may need to boot
more than once; the history of the failures suggests that whatever goes
wrong is intermittent.
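If it helps, here's roughly what I'd look at on the DUT itself; these are
stock Upstart commands plus standard Chrome OS paths, nothing specific to our
infrastructure:
  # The job definition shows the "start on" condition tying system-services
  # to Chrome's login-prompt-visible signal.
  cat /etc/init/system-services.conf
  # List every Upstart job and its state; system-services should show
  # "start/running" once Chrome has put up the login screen.
  initctl list
  # After each boot, check whether/when the signal was actually emitted
  # (it may land in /var/log/messages or under /var/log/ui/, depending on
  # how the image routes session_manager's logging).
  grep -r "login-prompt-visible" /var/log/messages /var/log/ui/ 2>/dev/null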

Comment 10, Dec 9 2016
Thanks, it seems that the following trace would indicate that
system-services is "running":
INFO session_manager[7529]: [INFO:upstart_signal_emitter.cc(30)] Emitting login-prompt-visible Upstart signal
I agree that it is probably more helpful to look at an actual device; I'm
trying to get hold of one.

Comment 11, Dec 9 2016
> it seems that the following trace would indicate that system-services is "running":
>
> INFO session_manager[7529]: [INFO:upstart_signal_emitter.cc(30)] Emitting login-prompt-visible Upstart signal
Technically, that doesn't prove that "system-services is running", only that
"if system-services wasn't running, it wasn't Chrome's fault".
I'll note that the logs of the failure are somewhat odd; see especially c#3,
above. The only theory I've got for that symptom is that the DUT spontaneously
rebooted after the AU successfully completed, and that immediately after that
reboot, system-services failed to start, or was (very, very) slow to start.
I don't much care for that theory...
I note also that this problem showed consistent test failures that passed on
retry. It should be possible to find the original failed test jobs and look
at how they failed.
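One way to test the reboot half of that theory, for whoever ends up with a
failing DUT in hand: the kernel's boot_id changes on every boot, so comparing
it across the window between the AU and the verify step would show an extra
reboot. A rough sketch ($DUT is just a placeholder for the host under test):
  # Record the boot_id right after the post-AU reboot ...
  ssh root@$DUT cat /proc/sys/kernel/random/boot_id > boot_id.after_au
  # ... then compare it around the time verify would run; any difference
  # means the DUT rebooted again in between.
  ssh root@$DUT cat /proc/sys/kernel/random/boot_id | diff - boot_id.after_au
  # Uptime is a cruder version of the same check.
  ssh root@$DUT uptime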

Comment 12, Dec 10 2016
This symptom has stopped since we blocked the suspect CL. I've got every reason to believe that we've pegged the cause, and that there's no bug in the tree and no ongoing outage. Is there any reason we shouldn't close this?

Comment 13, Dec 10 2016
I agree that it looks like the CL caused the problem (still waiting to get a
daisy_skate for further investigation). Closing the issue.