Starred by 3 users
Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

Blocking:
issue 735121



Change non-critical-path upstart services to "started system-services" from "starting"
Project Member Reported by seobrien@chromium.org, Sep 15
wizpig-paladin has failed three builds with this error starting Sep 14 14:21

https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/908

In autoserv.DEBUG I see the message "Failed: servod service is taking calls"

Looks like servo_repair is failing.
 
Blocking: 735121
Labels: -Pri-3 Pri-2
Owner: shuqianz@chromium.org
09/15 07:06:21.430 INFO |        servo_host:0458| servo host has the following duts: ['chromeos6-row2-rack20-host7', 'chromeos6-row2-rack20-host9', 'chromeos6-row2-rack20-host3', 'chromeos6-row2-rack20-host1', 'chromeos6-row2-rack20-host13', 'chromeos6-row2-rack20-host5', 'chromeos6-row2-rack20-host11', 'chromeos6-row2-rack20-host15', 'chromeos6-row2-rack20-host17', 'chromeos6-row2-rack20-host19', 'chromeos6-row2-rack20-host21']
09/15 07:06:21.431 INFO |        servo_host:0460| servo host has multiple duts, scheduling synchronized reboot
09/15 07:06:25.143 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/devserver/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/.git/objects/27 for control files.
09/15 07:06:25.149 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/devserver/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/spool for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/ui for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/power_manager for control files.
09/15 07:07:37.874 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/recover_duts for control files.
09/15 07:07:37.874 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915030432 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915031922 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020519 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915055907 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915065443 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915060351 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915022632 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915064111 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915063709 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020115 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915051208 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915015715 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915015318 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915023948 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915032321 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915054743 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915021826 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915031459 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915022230 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915030045 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915063035 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915065036 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915050818 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020928 for control files.
09/15 07:07:37.880 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-

Found a lot of these errors in the provision test log. It looks like the server itself has some issues.
Owner: jrbarnette@chromium.org
Richard, can you take a look at this servo issue?
The log doesn't make clear which server it is searching for the control files on: servo? shard? drone? devserver?
I searched on the wizpig shard, which is chromeos-server104.mtv, and didn't find a /usr/local/autotest/results/142149346-kkanchi directory there.
The servo messages are a red herring:  This is a provisioning
failure.  Part of provisioning checks and reports servo status,
which is why the messages showed up, but the failure was caused
by this event:

	FAIL	----	verify.cros	timestamp=1505485222	localtime=Sep 15 07:20:22	command execution error
  * Command: 
      /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_EtDhB6ssh-
      master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
      -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
      ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
      chromeos6-row2-rack20-host7 "export LIBC_FATAL_STDERR_=1; if type
      \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
      \"server[stack::upstart_status|run|wrapper] -> ssh_run(status system-
      services | grep start/running)\";fi; status system-services | grep
      start/running"
  Exit status: 1
  Duration: 0.0340459346771
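The grep in that command is the crux; a minimal local simulation (status strings hard-coded here, no DUT involved) shows how a job stuck in "start/starting" makes the check fail:

```shell
# Minimal simulation of the verifier's check. The real verifier runs
# `status system-services | grep start/running` on the DUT over ssh; the
# two status strings below are hard-coded for illustration.
check() { printf '%s\n' "$1" | grep start/running > /dev/null; }

check "system-services start/running"  && echo "check passes"
check "system-services start/starting" || echo "check fails: exit status 1"
```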

so, the server is down, which failed the check?
> so, the server is down, which failed the check?

The failure in the "system-services" check indicates
that chrome didn't start.  This symptom is essentially
the same as bug 693597.

Note that this symptom is _really_ hard to explain.  The
best explanation is the following sequence (taken from c#8
in the bug above):
 1) Provisioning installed the new build.
 2) Provisioning confirmed that Chrome was working.
 3) The DUT crashed (or at least, rebooted spontaneously).
 4) Chrome failed to start.
 5) Final provisioning verification tested and concluded
    that Chrome had crashed at boot.

This explanation really strains credulity, but I don't have
a more plausible theory.  If we can get logs from the DUT,
they might reveal more.

I checked logs on the DUT.  Specifically, I looked at /var/log/messages.
I found these lines:
    2017-09-15T14:19:15.436469+00:00 INFO session_manager[1385]: [INFO:upstart_signal_emitter.cc(36)] Emitting login-prompt-visible Upstart signal
[ ... ]
    2017-09-15T14:19:17.825996+00:00 INFO sshd[2137]: Server listening on 0.0.0.0 port 2222.

[ ... ]
    2017-09-15T14:20:21.912197+00:00 NOTICE ag[3049]: autotest server[stack::upstart_status|run|wrapper] -> ssh_run(status system-services | grep start/running)

There were no reboots in between those two messages.  That
blows away the working theory above because the
login-prompt-visible event would force system-services to
start.  Also, the fact that `sshd` was working within three
seconds of the event indicates that servers were starting.

Next I tried an experiment:  I rebooted the target DUT, and
checked the system-services job myself.  This is what I saw:
    $ ssh chromeos6-row2-rack20-host7 status system-services
    ssh: connect to host chromeos6-row2-rack20-host7 port 22: Connection timed out
    $ ssh chromeos6-row2-rack20-host7 status system-services
    system-services start/starting

That explains the failure of the verifier; "start/starting" isn't
"start/running".  The "start/starting" status still persisted when
I checked a second time, about a minute or so later.

At this point, we have two problems:
 1) On its face, the simple check for "start/running" is wrong.
 2) Something is holding up finishing boot.  That could be wrong, too.

Regarding the check for "start/running" vs. "start/starting":
if we allow "start/starting" to be valid, we're assuming one of
two things:
  * Once started, eventually, the status will change to "running".
  * If the status never changes to "running", that's OK.
I don't believe that either of those assumptions is safe, but I
don't know what to do about it.

> Regarding the check for "start/running" vs. "start/starting":
> if we allow "start/starting" to be valid, we're assuming one of
> two things:

Really, any of three assumptions might justify a more forgiving
check:
  * Once started, eventually, the status will change to "running".
  * If it takes a long time to change from "starting" to "running",
    that's OK.
  * If the status never changes to "running", that's OK.

... and none of those assumptions would seem to be safe on its face.
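For concreteness, a check built on the first two assumptions would amount to a bounded poll, roughly like this sketch (the file read is a stand-in for the remote `status` query; shown only to make the trade-off concrete, not as a recommendation):

```shell
# Hypothetical "forgiving" check: poll the job status for a bounded time
# before giving up, instead of failing on the first non-running reading.
poll_running() {
    for _ in 1 2 3; do
        status=$(cat "$1")          # stand-in for: ssh $dut status system-services
        case "$status" in
            *start/running*) return 0 ;;
        esac
        sleep 1
    done
    return 1
}
```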

Cc: -seobrien@chromium.org
Components: -Infra>Client>ChromeOS OS>Systems
Owner: adlr@chromium.org
Status: Assigned
Looking it over, I'm pretty sure Autotest shouldn't be assuming either
that 'system-services' will always reach 'running' or that it doesn't
matter how long it takes.  I also don't think it's appropriate to add
a wait loop on the existing check.

The reason this is happening is that some jobs say "start on starting
system-services" instead of saying "start on started system-services".
One of those jobs is failing to start (or taking a long time), which
causes the symptom.

So... This is a product bug.  We need to eliminate all inappropriate
uses of 'start on starting system-services'.  "Appropriate" use would
be only for jobs for which failure to start could be considered a
system failure.  Right now, the only jobs I know of that meet the
criterion are "update-engine" and "failsafe".
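Applied to a hypothetical job file, the change described above looks like this (job name and comments are illustrative, not an actual Chrome OS job):

```
# /etc/init/example-job.conf -- hypothetical non-critical-path job

# Before: the job is launched by the "starting system-services" event, so
# system-services remains in start/starting until this job has started.
#   start on starting system-services

# After: the job starts only once system-services has emitted "started",
# keeping it off the boot-critical path.
start on started system-services
stop on stopping system-services
```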

Not sure who should own this.  Passing to adlr@ in the hopes he can
triage this appropriately.

Summary: Post-provision check for "system-services" being "start/running" can fail (was: wizpig-paladin failure - Servo initialize timed out)
Let's give this a more descriptive summary.

Project Member Comment 14 by bugdroid1@chromium.org, Sep 27
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6

commit e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Sep 27 01:24:12 2017

[autotest] Make CrosHost.upstart_status() more debuggable.

Change the `run()` invocation in `CrosHost.upstart_status()` so that
stdout of the 'status' command will be logged.  This will make it
easier to see what's wrong in cases where the call returns false
unexpectedly.

BUG=chromium:768110,chromium:765686
TEST=manually invoke the function from a Python CLI.

Change-Id: Iaeaf97e2a73802b62bc111ad717e13ed627317bc
Reviewed-on: https://chromium-review.googlesource.com/685798
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6/server/hosts/cros_host.py

Cc: nxia@chromium.org groeck@chromium.org fukino@chromium.org tetsui@chromium.org jrbarnette@chromium.org xiaochu@chromium.org
Issue 775610 has been merged into this issue.
Labels: -Pri-2 Pri-1
This is continuing to affect CQ runs, causing intermittent failures.

adlr@ any updates on this bug ?
Issue 776120 has been merged into this issue.
Owner: xiaochu@chromium.org
sheriffs@, can you find an owner for this bug?
Summary: wizpig-paladin Provision failed: Post-provision check for "system-services" being "start/running" can fail (was: Post-provision check for "system-services" being "start/running" can fail)
Labels: OS-Chrome
latest failure:
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1265

Looks like a USB issue again. Found this in crashinfo: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/150401041-chromeos-test/chromeos6-row2-rack20-host22/crashinfo.chromeos6-row2-rack20-host22/

2017-10-20T12:32:13.784413+00:00 ERR kernel: [   17.223202] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784414+00:00 ERR kernel: [   32.477632] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784416+00:00 INFO kernel: [   32.732732] usb 1-5: new full-speed USB device number 7 using xhci_hcd
2017-10-20T12:32:13.784417+00:00 ERR kernel: [   47.887083] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784424+00:00 WARNING kernel: [   62.398259] udevd[144]: seq 994 '/devices/pci0000:00/0000:00:14.0/usb1/1-4' is taking a long time
2017-10-20T12:32:13.784426+00:00 ERR kernel: [   63.141819] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784428+00:00 INFO kernel: [   63.396586] usb 1-5: new full-speed USB device number 8 using xhci_hcd
2017-10-20T12:32:13.784429+00:00 ERR kernel: [   68.411187] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-10-20T12:32:13.784431+00:00 WARNING kernel: [   68.411311] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784432+00:00 WARNING kernel: [   73.627135] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784438+00:00 ERR kernel: [   73.827703] usb 1-5: device not accepting address 8, error -62
2017-10-20T12:32:13.784441+00:00 INFO kernel: [   73.982222] usb 1-5: new full-speed USB device number 9 using xhci_hcd
2017-10-20T12:32:13.784442+00:00 ERR kernel: [   78.986835] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-10-20T12:32:13.784444+00:00 WARNING kernel: [   78.986955] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784445+00:00 WARNING kernel: [   84.203127] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784447+00:00 ERR kernel: [   84.404623] usb 1-5: device not accepting address 9, error -62
2017-10-20T12:32:13.784448+00:00 ERR kernel: [   84.404828] usb usb1-port5: unable to enumerate USB device
2017-10-20T12:32:13.785678+00:00 INFO kernel: [   84.564354] uvcvideo: Found UVC 1.00 device Chromebook HD Camera (05e3:f11a)
2017-10-20T12:32:13.785701+00:00 INFO kernel: [   84.564477] platform vpd: Driver vpd requests probe deferral
2017-10-20T12:32:13.785703+00:00 INFO kernel: [   84.570773] usbcore: registered new interface driver uvcvideo
2017-10-20T12:32:13.785704+00:00 INFO kernel: [   84.570787] USB Video Class driver (1.1.1)
2017-10-20T12:32:13.785706+00:00 INFO kernel: [   84.571073] platform vpd: Driver vpd requests probe deferral
Looking at the original failure reported at https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/908

I found similar USB errors:

2017-09-15T14:19:02.651129+00:00 ERR kernel: [   17.130976] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651130+00:00 ERR kernel: [   32.385932] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651132+00:00 INFO kernel: [   32.640312] usb 1-5: new full-speed USB device number 6 using xhci_hcd
2017-09-15T14:19:02.651133+00:00 ERR kernel: [   47.795839] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651140+00:00 WARNING kernel: [   62.226084] udevd[143]: seq 994 '/devices/pci0000:00/0000:00:14.0/usb1/1-4' is taking a long time
2017-09-15T14:19:02.651142+00:00 ERR kernel: [   63.050781] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651144+00:00 INFO kernel: [   63.306401] usb 1-5: new full-speed USB device number 7 using xhci_hcd
2017-09-15T14:19:02.651145+00:00 ERR kernel: [   68.313962] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-09-15T14:19:02.651147+00:00 WARNING kernel: [   68.314079] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651148+00:00 WARNING kernel: [   73.530154] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651149+00:00 ERR kernel: [   73.730939] usb 1-5: device not accepting address 7, error -62
2017-09-15T14:19:02.651155+00:00 INFO kernel: [   73.884964] usb 1-5: new full-speed USB device number 8 using xhci_hcd
2017-09-15T14:19:02.651158+00:00 ERR kernel: [   78.889656] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-09-15T14:19:02.651160+00:00 WARNING kernel: [   78.889776] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651161+00:00 WARNING kernel: [   84.105695] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651163+00:00 ERR kernel: [   84.306519] usb 1-5: device not accepting address 8, error -62
2017-09-15T14:19:02.651164+00:00 ERR kernel: [   84.306715] usb usb1-port5: unable to enumerate USB device
xiaochu@ - This bug is not about USB errors.  It's about the
system-services job being slow to start.

If "system-services is slow to start" is because of the USB
problem, it's two bugs:  One bug to fix the USB problem, and
one bug to fix the upstart code that causes system-services
to be slow.

Note that the upstart fix required is already explained in c#12,
above.

I had submitted crbug.com/776780 for the usb problem since it always showed up on one DUT (6-2-20-22). Supposedly that was fixed by taking the DUT out of service. This bug was filed against 6-2-20-7. That system was taken out of service around 9-18 after it failed completely (it had already failed at 8-1, the start date of my dut-status scan; the question here is why it had not been taken out earlier). That almost looks like a systemic HW failure on wizpig, though with only two confirmed samples that may be a premature statement. If it happens again, it might be worthwhile to analyze the failed HW in more detail.

Anyway, I agree, the USB failure needs to be handled separately.

I wasn't sure what the fix should be regarding #12... what by 'system-services' do you mean? Can you please give more clear instructions regarding 'upstart code'? The builder was green since #1265, can I assume  crbug.com/776780  fixed it?
#26: AFAICS, no,  crbug.com/776780  did not fix _this_ problem. First, as I mentioned in #25, this problem was submitted against another DUT which was since removed from service. Second, while there is some evidence that the USB problem may result in the problem described here being observed, it doesn't mean that solving the USB problem (by replacing the affected DUTs) will solve the system-services problem. The system-services problem is still there, it is just no longer observed until another DUT experiences the USB problem (or some other problem triggers it).
In other words, if event A triggers problem B, problem B doesn't go away by removing event A from the picture.

> [ ... ] what by 'system-services' do you mean? Can
> you please give more clear instructions regarding 'upstart code'?

Upstart is the package providing /sbin/init in Chrome OS
It manages the start up of services during boot; it fills
the same niche as systemd.  'system-services' is the name
of an upstart job that's specific to the Chrome OS boot flow.

There's more detail about the design, including a detailed
explanation of 'system-services' here:
    https://www.chromium.org/chromium-os/chromiumos-design-docs/boot-design

> [ ... ] The builder was green since #1265, can I assume  crbug.com/776780 
> fixed it?

This problem only occurs intermittently.  Moreover, sometimes the failures
are retried and ignored.  The fact that some builder was green once tells
us very little about whether the bug is fixed.  This particular bug is
identified by particular features in the source code, so we'll know it's
fixed when we search the code and find that that feature (jobs that use
"start on starting system-services") is gone.

Cc: jinsong@chromium.org akahuang@chromium.org mruthven@chromium.org
Owner: jinsong@chromium.org
Recapping this bug, so that the requirement here doesn't get lost:

This bug is assigned to the sheriff to find a permanent owner to
fix the problem.  It's not the job of the sheriff to fix the problem,
or even particularly to understand it.  Only reassign to a proper owner.

The problem is that the 'system-services' job can be too slow to start.
Autotest in the test lab treats the slowness as a failure in the build
under test; justification for that treatment is in c#9, c#10 and c#12.

The fix is to find all upstart jobs that use "start on starting system-services",
and change them to use "start on started system-services" instead.
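The mechanical part of that fix is a one-line substitution per job file, roughly as below (the file is created just for the demo; each real job still needs individual review, since update-engine and failsafe legitimately keep the old form):

```shell
# Rewrite the start condition in one job file. A temp file stands in for a
# real upstart .conf here.
f=$(mktemp)
printf 'start on starting system-services\nstop on stopping system-services\n' > "$f"
sed -i 's/start on starting system-services/start on started system-services/' "$f"
head -n 1 "$f"   # now: start on started system-services
```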

deputy comment: wizpig-paladin is currently marked experimental citing this bug. The ask in #30 is a non-trivial solution to the root cause of this problem. But surely all these "incorrect" upstart dependencies weren't added last week.
So, we need to find what happened to push wizpig over the edge so that it started failing the CQ frequently.

If we can fix that problem, we can at least mark wizpig important again.
Owner: mruthven@chromium.org
I scrubbed the last 30 builds or so of wizpig-paladin. The baseline is not legit -- since wizpig is experimental, its builds are getting aborted before they complete, so there could have been more failures than I saw.

I found three failures, and they seem to be the same as this bug claims, although I haven't dug too much, so low confidence rating:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151698270-chromeos-test/chromeos2-row8-rack8-host12
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151579000-chromeos-test/chromeos2-row8-rack8-host12/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151486510-chromeos-test/chromeos2-row8-rack8-host12/

So, this issue remains active.
wizpig has been experimental for 1 week+. I'm going to mark it experimental via chromeos_config so we can track it better (tree status is for quick fixes).

For anyone who cares about the strago platform: without wizpig, strago doesn't have bvt-inline coverage.

+sheriffs: This still needs an owner.
Project Member Comment 33 by bugdroid1@chromium.org, Oct 27
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/efb85837f82850d32dc3f5a55438181988169054

commit efb85837f82850d32dc3f5a55438181988169054
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Oct 27 23:14:06 2017

chromeos_config: mark wizpig, reef, reef-uni experimental

The tree status is intended to mark things as experimental for the short
term. These boards have been continuously experimental for over a week.
Moving the flag to config.

BUG=chromium:765686
TEST=None

Change-Id: I70a67b3885cf11a9443f5b808aa99dcd7a795e62
Reviewed-on: https://chromium-review.googlesource.com/742101
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/efb85837f82850d32dc3f5a55438181988169054/cbuildbot/config_dump.json
[modify] https://crrev.com/efb85837f82850d32dc3f5a55438181988169054/cbuildbot/chromeos_config.py

Owner: teravest@chromium.org
Project Member Comment 35 by bugdroid1@chromium.org, Nov 1
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/65fdfb83af6ab313b5104a557d053e82266adfa3

commit 65fdfb83af6ab313b5104a557d053e82266adfa3
Author: Jason Clinton <jclinton@chromium.org>
Date: Wed Nov 01 21:06:28 2017

Revert "chromeos_config: mark wizpig, reef, reef-uni experimental"

This reverts commit efb85837f82850d32dc3f5a55438181988169054.

Reason for revert: reef and reef-uni are not broken at ToT: Snappy/Alan/Bigdaddy are going to DVT/Factory this week and need to have new regressions checked in because we don't have coverage. The one breakage in reef-uni recently is another example of the USB dongle problem.

Original change's description:
> chromeos_config: mark wizpig, reef, reef-uni experimental
>
> The tree status is intended to mark things as experimental for the short
> term. These boards have been continuously experimental for over a week.
> Moving the flag to config.
>
> BUG=chromium:765686
> TEST=None
>
> Change-Id: I70a67b3885cf11a9443f5b808aa99dcd7a795e62
> Reviewed-on: https://chromium-review.googlesource.com/742101
> Commit-Ready: Aviv Keshet <akeshet@chromium.org>
> Tested-by: Aviv Keshet <akeshet@chromium.org>
> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

Bug: chromium:765686
Change-Id: I0481f984728fac5011a5a590605d67efc39cafc7
Reviewed-on: https://chromium-review.googlesource.com/749169
Commit-Ready: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Simon Glass <sjg@chromium.org>
Reviewed-by: C Shapiro <shapiroc@google.com>

[modify] https://crrev.com/65fdfb83af6ab313b5104a557d053e82266adfa3/cbuildbot/config_dump.json
[modify] https://crrev.com/65fdfb83af6ab313b5104a557d053e82266adfa3/cbuildbot/chromeos_config.py

Project Member Comment 36 by bugdroid1@chromium.org, Nov 1
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/625dec89058d5e3df1ddd782a1ae19546e83188f

commit 625dec89058d5e3df1ddd782a1ae19546e83188f
Author: Justin TerAvest <teravest@chromium.org>
Date: Wed Nov 01 21:06:30 2017

bluez: Start on "started system-services"

"start on starting system-services" should be reserved for jobs for
which failure would be considered a system failure. Jobs
inappropriately using "starting system services" impact our
provisioning flow.

BUG=chromium:765686
TEST=None

Change-Id: Ibc5b28c75e9ef2a508d0c4f3f219adf5b45f5229
Reviewed-on: https://chromium-review.googlesource.com/748967
Commit-Ready: Justin TerAvest <teravest@chromium.org>
Tested-by: Justin TerAvest <teravest@chromium.org>
Reviewed-by: Simon Glass <sjg@chromium.org>

[modify] https://crrev.com/625dec89058d5e3df1ddd782a1ae19546e83188f/net-wireless/bluez/files/bluez-upstart.conf

Project Member Comment 37 by bugdroid1@chromium.org, Nov 3
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/d34fe79d0af8d5df0768dd005309af7441436a4d

commit d34fe79d0af8d5df0768dd005309af7441436a4d
Author: Justin TerAvest <teravest@chromium.org>
Date: Fri Nov 03 08:53:36 2017

chromeos-machine-id-regen: Fix start dependency.

"start on starting system-services" should be reserved for jobs
whose failure would be considered a system failure. Jobs
inappropriately using "starting system-services" impact our
provisioning flow.

BUG=chromium:765686
TEST=None

Change-Id: Ic2e7633d153a2cefc7eb11e092317ae6d9f951db
Reviewed-on: https://chromium-review.googlesource.com/752024
Commit-Ready: Justin TerAvest <teravest@chromium.org>
Tested-by: Justin TerAvest <teravest@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[rename] https://crrev.com/d34fe79d0af8d5df0768dd005309af7441436a4d/chromeos-base/chromeos-machine-id-regen/chromeos-machine-id-regen-0.0.1-r4.ebuild
[modify] https://crrev.com/d34fe79d0af8d5df0768dd005309af7441436a4d/chromeos-base/chromeos-machine-id-regen/files/cros-machine-id-regen-periodic.conf

Richard,

I started to change the upstart scripts, but stopped to think this through a bit more and realized that if update_engine itself is slow to start, reordering the jobs won't fix the problem.

Taking a couple of historical examples:
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1230
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1240

From there, we can pull the update_engine.log entry from crashinfo:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149593081-chromeos-test/chromeos6-row2-rack20-host22/crashinfo.chromeos6-row2-rack20-host22/var/log/

[1017/040203:INFO:evaluation_context.cc(218)] Waiting for timeout in 11h59m59.999768s
[1017/040248:INFO:update_attempter.cc(1150)] Marking booted slot as good.
[1017/040258:INFO:subprocess.cc(156)] Subprocess output:
Cloning firmware A to B...
flashrom v0.9.9  : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
flashrom v0.9.9  : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
Calibrating delay loop... OK.
Error accessing low megabyte, 0x100000 bytes at 0x0
/dev/mem mmap failed: Resource temporarily unavailable
Failed getting access to coreboot low tables.
Mapping BYT IBASE at 0xfed08000, unaligned size 0x200.
Mapping BYT SBASE at 0xfed01000, unaligned size 0x200.
Reading flash... SUCCESS
flashrom v0.9.9  : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
flashrom v0.9.9  : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
Calibrating delay loop... OK.
Error accessing low megabyte, 0x100000 bytes at 0x0
/dev/mem mmap failed: Resource temporarily unavailable
Failed getting access to coreboot low tables.
Mapping BYT IBASE at 0xfed08000, unaligned size 0x200.
Mapping BYT SBASE at 0xfed01000, unaligned size 0x200.
Erasing and writing flash chip... SUCCESS
INFO: /usr/sbin/chromeos-setgoodfirmware: Active vboot1 firmware set as good firmware.

It looks like flashrom took about 55 seconds to update the firmware for some reason. I don't know what a reasonable timeout is here (12 hours seems long), but autotest should account for some delay from update_engine before the system reaches "running".

Am I understanding this correctly?
Cc: vapier@chromium.org
+vapier to check my understanding in #38
Glancing at the autotest code, there don't seem to be any retries/timeouts for the OS check currently. It seems like we should add some there.

server/hosts/cros_host.py:verify_software does a single:
        if not self.upstart_status('system-services'):
            raise error.AutoservError('Chrome failed to reach login. '
                                      'System services not running.')

I would start with 5 minutes. If the current code frequently passes without any timeout, then 5 minutes seems like plenty of time.

I also agree with the recent work to change services from "starting" to "started" where it makes sense, since "starting" dependencies are critical-path delays.
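A minimal sketch of the retry/timeout the comment above suggests, assuming a host object with an `upstart_status` method behaving like the autotest CrosHost method quoted earlier (the function name and parameters here are hypothetical, not the actual CL):

```python
import time

def wait_for_system_services(host, timeout_s=300, poll_s=5):
    """Poll until the 'system-services' upstart job is start/running.

    Returns True once host.upstart_status('system-services') reports
    the job running, or False if timeout_s elapses first. Using a
    5-minute default per the suggestion in this bug.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if host.upstart_status('system-services'):
            return True
        time.sleep(poll_s)
    return False
```

verify_software could then raise AutoservError only when this returns False, instead of failing on a single immediate check.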
Labels: -Pri-1 Pri-3
Summary: Change non-critical-path upstart services to "started system-services" from "starting" (was: wizpig-paladin Provision failed: Post-provision check for "system-services" being "start/running" can fail)
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/756829 should fix the fundamental issue here w/ update_engine.

wizpig-paladin has succeeded for the last 12 builds.
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin

Since there's a bunch of context in this bug, instead of closing it I'll leave it open and drop the priority.