Status: Verified
Closed: Dec 2016
OS: Chrome
Pri: 1
Type: Bug

Multiple CQ failures in daisy_skate-paladin
Project Member Reported by jrbarnette@chromium.org, Dec 8 2016
In the past few days, a notable number of runs of daisy_skate-paladin
have failed with a distinct, unexplained syndrome.  Here are the runs
with the failures, from most recent to least:
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8417
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8414
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8413
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8410
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8409
    https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/8402

Only the CQ shows this failure (not the canaries, not the PFQ), and it's
almost (but not quite) restricted to daisy_skate.  There are two
failures in stumpy-paladin:
    https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/26844
    https://uberchromegw.corp.google.com/i/chromeos/builders/stumpy-paladin/builds/26843

The key features of the syndrome:
  * Multiple test failures that pass on retry.
  * At least one DUT reporting provision failure.

The provision failure is also quite distinctive.  Here's the
relevant error from "status.log":
	FAIL	----	verify.cros	timestamp=1481090149	localtime=Dec 06 21:55:49	command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_L3wc6sssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos4-row2-rack8-host16 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::verify|verify_software|upstart_status] -> ssh_run(status
      system-services | grep start/running)\";fi; status system-services | grep
      start/running"
  Exit status: 1
  Duration: 0.115790128708

 
More data about what might or might not cause the problem:
  * The symptom is restricted to the CQ.  It doesn't appear in the
    PFQ or the canaries.  That suggests the problem is a bad CL
    in the CQ runs, and not a problem with code in the tree.
  * The symptom on daisy_skate has occurred on at least three different
    hosts.  As noted, there are also failures on stumpy.  So, the
    problem is probably not restricted to specific DUTs.

Although the pattern of failures suggests a CL with a bug in it,
my first attempt to find a single CL unique to all the failed runs
came up empty.
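For what it's worth, the cross-check itself is mechanical once the list
of CLs picked up by each failed run is dumped to a text file, one CL
number per line (the file names below are hypothetical):

    # Keep only the CLs that appear in every failed run's list.
    # comm -12 prints the lines common to two sorted inputs, so chaining it
    # over all the runs leaves the CLs shared by every one of them.
    sort -u run8402.txt > common.txt
    for f in run8409.txt run8410.txt run8413.txt run8414.txt run8417.txt; do
        sort -u "$f" | comm -12 common.txt - > next.txt && mv next.txt common.txt
    done
    cat common.txt    # empty on the first attempt described above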

N.B.  Provision failures are reported as "infrastructure failures", which
means that the CQ won't blame any CLs, and will retry all CLs in the set.
To stop constantly cycling on the same problem, we've throttled the tree.
More data to contribute to the head-scratching:

Provision runs the following steps:
  * Download and apply an AU.
  * Perform sanity checks, including a check that system-services
    started.
  * Perform host verification on the DUT.  That also includes
    a check that system-services started.

The failed provision tasks seem to indicate that the first check passed
(the AU proper reported no failures), but the second check failed
(that's the failure in the status.log file).  It's (essentially) not
possible for system-services to stop once it has started, so it doesn't
make sense that the second check could fail when the first check passed,
yet that seems to be what happened.
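For reference, stripped of autotest's ssh wrapper options and logging,
the check that both steps rely on boils down to something like this
(a sketch; the DUT name is taken from the status.log excerpt above):

    DUT=chromeos4-row2-rack8-host16
    # upstart's "status" prints e.g. "system-services start/running, process 1234";
    # grep's exit status becomes the check's result, so a DUT where Chrome never
    # reached the login screen makes this command exit 1, as seen in status.log.
    ssh -o ConnectTimeout=30 root@"$DUT" \
        "status system-services | grep start/running"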

All of the failed runs include this CL:

https://chromium-review.googlesource.com/#/c/409017/
Owner: mka@chromium.org
Status: Assigned

Re c#4:  That's also the kind of CL that might cause this kind of
symptom.

mka@ - If you think this CL is causing the problem, you can close
this bug.  If you think the CL isn't the problem and you can make
the case, then we can reassign it.

Comment 6 by mka@chromium.org, Dec 8 2016
Owner: jrbarnette@chromium.org
If I interpret the logs correctly, all the HWTests failed due to problems with servo:

[Test-Logs]: provision: FAIL: Servo initialize timed out., command execution error, completed successfully

[Test-Logs]: provision: FAIL: Servo board is unconfigured; should be stumpy, command execution error, completed successfully

I don't think that the software on the DUT can cause this kind of problem. Looks more like a problem on the host side to me.
Owner: mka@chromium.org
Sorry, I should have explained it earlier.  The messages about servo
are a red herring.  Roughly, they're caused by bug 670469.  More
importantly, the reported servo failures were repaired during
provisioning, so although they show up in the report, they're
not the cause of the failure.

I've attached the full status.log from one of the failed provision
jobs; you can look through it to see more.  The key is this failure
noted in the original description:
	FAIL	----	verify.cros	timestamp=1481090149	localtime=Dec 06 21:55:49	command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_L3wc6sssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos4-row2-rack8-host16 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::verify|verify_software|upstart_status] -> ssh_run(status
      system-services | grep start/running)\";fi; status system-services | grep
      start/running"
  Exit status: 1
  Duration: 0.115790128708

*That's* the failure that caused provisioning to fail.  That symptom
means that "system-services" wasn't running at the time the host was
finally checked.  Generally, that can only happen if Chrome fails to
start at boot.
Attachment: status.log (2.6 KB)
Comment 8 by mka@chromium.org, Dec 8 2016
Thanks for providing more context!

Not sure how to interpret the logs; it seems there is no straightforward way to see which system services are running. The check for system-services is logged, but without additional information:

2016-12-06T00:21:35.139641+00:00 NOTICE ag[7117]: autotest server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)

Will try to get a unit to have a look.
"system-services" is a single upstart job that is triggered
when Chrome reports reaching the login screen.  Once the job
starts, it doesn't stop until shutdown.  The command that's
run simply checks that that job is running, meaning it merely
checks that Chrome reported reaching the login screen.
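
A quick way to see that chain on a DUT (a sketch, assuming a standard
test image with a root shell; the exact log file may differ):

    # 1. Chrome draws the login screen and session_manager emits the
    #    login-prompt-visible upstart signal; the emission shows up in syslog.
    grep -i login-prompt-visible /var/log/messages
    # 2. The system-services job is triggered off that report; once started it
    #    stays up until shutdown, so on a healthy boot this prints "start/running".
    status system-services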

There's some technical detail about the boot flow generally,
and about system-services in particular here:
    https://www.chromium.org/chromium-os/chromiumos-design-docs/boot-design

That document _might_ help, but you'll probably learn more by
watching an actual device boot.  You may need to boot
more than once; the history of the failures suggests that whatever goes
wrong is intermittent.

Comment 10 by mka@chromium.org, Dec 9 2016
Thanks,

It seems that the following trace would indicate that system-services is "running":

INFO session_manager[7529]: [INFO:upstart_signal_emitter.cc(30)] Emitting login-prompt-visible Upstart signal

I agree that it is probably more helpful to look at an actual device, I'm trying to get hold of one.
> it seems that the following trace would indicate that system-services is "running":
>
> INFO session_manager[7529]: [INFO:upstart_signal_emitter.cc(30)] Emitting login-prompt-visible Upstart signal

Technically, that doesn't prove that "system-services is running", only
that "if system-services wasn't running, it wasn't Chrome's fault".

I'll note that the logs of the failure are somewhat odd, see
especially c#3, above.  The only theory I've got for that symptom
is that the DUT spontaneously rebooted after AU successfully completed,
and that immediately after that reboot, system-services failed to start,
or was (very, very) slow to start.  I don't much care for that theory...
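
If it helps, the spontaneous-reboot half of that theory is at least
checkable after the fact by comparing kernel boot IDs at the two
provision checks; a sketch, assuming root ssh to the DUT (the host name
is the one from the log above, and wiring the two reads into the right
places is left open):

    DUT=chromeos4-row2-rack8-host16
    # /proc/sys/kernel/random/boot_id is regenerated on every boot, so two reads
    # that differ prove the DUT rebooted somewhere in between.
    ID1=$(ssh root@"$DUT" cat /proc/sys/kernel/random/boot_id)   # at the post-AU sanity check
    # ... time passes; host verification runs ...
    ID2=$(ssh root@"$DUT" cat /proc/sys/kernel/random/boot_id)   # at the final verify step
    [ "$ID1" != "$ID2" ] && echo "DUT rebooted between the two checks"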

I note also that this problem showed consistent test failures that
passed on retry.  It should be possible to find the original failed
test jobs, and look at how they failed.

This symptom has stopped since we blocked the suspect CL.

I've got every reason to believe that we've pegged the cause,
and that there's no bug in the tree and no ongoing outage.

Is there any reason we shouldn't close this?

Comment 13 by mka@chromium.org, Dec 10 2016
Status: Verified
I agree that it looks like the CL caused the problem (still waiting to get a daisy-skate for further investigation).

Closing the issue.