Change non-critical-path upstart services to "started system-services" from "starting"
Issue description:
wizpig-paladin has failed three builds with this error, starting Sep 14 14:21: https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/908
In autoserv.DEBUG I see the message "Failed: servod service is taking calls". It looks like servo_repair is failing.
Sep 15 2017
09/15 07:06:21.430 INFO | servo_host:0458| servo host has the following duts: ['chromeos6-row2-rack20-host7', 'chromeos6-row2-rack20-host9', 'chromeos6-row2-rack20-host3', 'chromeos6-row2-rack20-host1', 'chromeos6-row2-rack20-host13', 'chromeos6-row2-rack20-host5', 'chromeos6-row2-rack20-host11', 'chromeos6-row2-rack20-host15', 'chromeos6-row2-rack20-host17', 'chromeos6-row2-rack20-host19', 'chromeos6-row2-rack20-host21']
09/15 07:06:21.431 INFO | servo_host:0460| servo host has multiple duts, scheduling synchronized reboot
09/15 07:06:25.143 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/devserver/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/.git/objects/27 for control files.
09/15 07:06:25.149 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/site-packages/devserver/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib/venv/devserver_lib for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/spool for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/ui for control files.
09/15 07:07:37.873 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/power_manager for control files.
09/15 07:07:37.874 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/recover_duts for control files.
09/15 07:07:37.874 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915030432 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915031922 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020519 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915055907 for control files.
09/15 07:07:37.875 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915065443 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915060351 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915022632 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915064111 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915063709 for control files.
09/15 07:07:37.876 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020115 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915051208 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915015715 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915015318 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915023948 for control files.
09/15 07:07:37.877 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915032321 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915054743 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915021826 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915031459 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915022230 for control files.
09/15 07:07:37.878 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915030045 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915063035 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915065036 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915050818 for control files.
09/15 07:07:37.879 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-host3/cheets_CTS_N.CtsDevicePolicyManagerTestCases/sysinfo/var/log/metrics/shutdown.20170915020928 for control files.
09/15 07:07:37.880 ERROR|control_file_gette:0214| Unable to search directory /usr/local/autotest/results/142149346-kkanchi/chromeos2-row4-rack1-

Found a lot of these errors in the provision test log. It looks like the server itself has some issues.
Sep 15 2017
Richard, can you take a look at this servo issue?
Sep 15 2017
The log is not clear on which server it is searching for the control files: servo? shard? drone? devserver?
Sep 15 2017
I searched on the wizpig shard, which is chromeos-server104.mtv. I didn't find that /usr/local/autotest/results/142149346-kkanchi repo.
Sep 15 2017
The servo messages are a red herring: This is a provisioning
failure. Part of provisioning checks and reports servo status,
which is why the messages showed up, but the failure was caused
by this event:
FAIL ---- verify.cros timestamp=1505485222 localtime=Sep 15 07:20:22 command execution error
* Command:
/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_EtDhB6ssh-
master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
-o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
chromeos6-row2-rack20-host7 "export LIBC_FATAL_STDERR_=1; if type
\"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
\"server[stack::upstart_status|run|wrapper] -> ssh_run(status system-
services | grep start/running)\";fi; status system-services | grep
start/running"
Exit status: 1
Duration: 0.0340459346771
Sep 15 2017
So, the server is down, and that's what failed the check?
,
Sep 15 2017
> So, the server is down, and that's what failed the check?

The failure in the "system-services" check indicates that Chrome didn't start. This symptom is essentially the same as bug 693597. Note that this symptom is _really_ hard to explain. The best explanation is the following sequence (taken from c#8 in the bug above):
1) Provisioning installed the new build.
2) Provisioning confirmed that Chrome was working.
3) The DUT crashed (or at least, rebooted spontaneously).
4) Chrome failed to start.
5) Final provisioning verification tested and concluded that Chrome had crashed at boot.
This explanation really strains credulity, but I don't have a more plausible theory. If we can get logs from the DUT, they might reveal more.
Sep 15 2017
I checked logs on the DUT. Specifically, I looked at /var/log/messages.
I found these lines:
2017-09-15T14:19:15.436469+00:00 INFO session_manager[1385]: [INFO:upstart_signal_emitter.cc(36)] Emitting login-prompt-visible Upstart signal
[ ... ]
2017-09-15T14:19:17.825996+00:00 INFO sshd[2137]: Server listening on 0.0.0.0 port 2222.
[ ... ]
2017-09-15T14:20:21.912197+00:00 NOTICE ag[3049]: autotest server[stack::upstart_status|run|wrapper] -> ssh_run(status system-services | grep start/running)
There were no reboots in between those two messages. That
blows away the working theory above because the
login-prompt-visible event would force system-services to
start. Also, the fact that `sshd` was working within three
seconds of the event indicates that servers were starting.
Next I tried an experiment: I rebooted the target DUT, and
checked the system-services job myself. This is what I saw:
$ ssh chromeos6-row2-rack20-host7 status system-services
ssh: connect to host chromeos6-row2-rack20-host7 port 22: Connection timed out
$ ssh chromeos6-row2-rack20-host7 status system-services
system-services start/starting
That explains the failure of the verifier; "start/starting" isn't
"start/running". The "start/starting" status still persisted when
I checked a second time, about a minute or so later.
At this point, we have two problems:
1) On its face, the simple check for "start/running" is wrong.
2) Something is holding up finishing boot. That could be wrong, too.
Regarding the check for "start/running" vs. "start/starting":
if we allow "start/starting" to be valid, we're assuming one of
two things:
* Once started, eventually, the status will change to "running".
* If the status never changes to "running", that's OK.
I don't believe that either of those assumptions is safe, but I
don't know what to do about it.
Sep 18 2017
> Regarding the check for "start/running" vs. "start/starting":
> if we allow "start/starting" to be valid, we're assuming one of
> two things:
Really, any of three assumptions might justify a more forgiving
check:
* Once started, eventually, the status will change to "running".
* If it takes a long time to change from "starting" to "running",
that's OK.
* If the status never changes to "running", that's OK.
... and none of those assumptions would seem to be safe on its face.
Sep 20 2017
Looking it over, I'm pretty sure Autotest shouldn't be assuming either that 'system-services' will always reach 'running', or that it doesn't matter how long it takes. I also don't think it's appropriate to add a wait loop on the existing check.

The reason this is happening is that some jobs say "start on starting system-services" instead of "start on started system-services". One of those jobs is failing to start (or taking a long time), which causes the symptom.

So... this is a product bug. We need to eliminate all inappropriate uses of 'start on starting system-services'. "Appropriate" use would be only for jobs whose failure to start could be considered a system failure. Right now, the only jobs I know of that meet that criterion are "update-engine" and "failsafe".

Not sure who should own this. Passing to adlr@ in the hopes he can triage this appropriately.
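To make the distinction concrete, here is a sketch of the kind of one-line change involved. The job name and contents are hypothetical, for illustration only:

    # /etc/init/example-daemon.conf (hypothetical upstart job)
    #
    # Wrong for a non-critical job: it gates the "starting system-services"
    # event, so a slow or hung start stalls system-services itself:
    #     start on starting system-services
    #
    # Right: wait until system-services has fully started:
    start on started system-services
    stop on stopping system-services
    respawn
    exec /usr/sbin/example-daemon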
Sep 20 2017
Let's give this a more descriptive summary.
Sep 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6

commit e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Wed Sep 27 01:24:12 2017

[autotest] Make CrosHost.upstart_status() more debuggable.

Change the `run()` invocation in `CrosHost.upstart_status()` so that stdout of the 'status' command will be logged. This will make it easier to see what's wrong in cases where the call returns false unexpectedly.

BUG=chromium:768110, chromium:765686
TEST=manually invoke the function from a Python CLI.

Change-Id: Iaeaf97e2a73802b62bc111ad717e13ed627317bc
Reviewed-on: https://chromium-review.googlesource.com/685798
Commit-Ready: Richard Barnette <jrbarnette@chromium.org>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/e204dc5fe5bde85ad91b16ad5a2f577d4ebd84a6/server/hosts/cros_host.py
Oct 17 2017
Issue 775610 has been merged into this issue.
Oct 17 2017
This is continuing to affect CQ runs, causing intermittent failures.
Oct 18 2017
Another failure happened yesterday: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1230
This broke CQ again today: https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1240
Oct 18 2017
adlr@, any updates on this bug?
Oct 18 2017
Issue 776120 has been merged into this issue.
Oct 20 2017
sheriffs@, can you find an owner for this bug?
Oct 20 2017
Latest failure: https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/1265
Looks like a USB issue again. Found this in crashinfo:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/150401041-chromeos-test/chromeos6-row2-rack20-host22/crashinfo.chromeos6-row2-rack20-host22/
2017-10-20T12:32:13.784413+00:00 ERR kernel: [ 17.223202] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784414+00:00 ERR kernel: [ 32.477632] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784416+00:00 INFO kernel: [ 32.732732] usb 1-5: new full-speed USB device number 7 using xhci_hcd
2017-10-20T12:32:13.784417+00:00 ERR kernel: [ 47.887083] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784424+00:00 WARNING kernel: [ 62.398259] udevd[144]: seq 994 '/devices/pci0000:00/0000:00:14.0/usb1/1-4' is taking a long time
2017-10-20T12:32:13.784426+00:00 ERR kernel: [ 63.141819] usb 1-5: device descriptor read/64, error -110
2017-10-20T12:32:13.784428+00:00 INFO kernel: [ 63.396586] usb 1-5: new full-speed USB device number 8 using xhci_hcd
2017-10-20T12:32:13.784429+00:00 ERR kernel: [ 68.411187] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-10-20T12:32:13.784431+00:00 WARNING kernel: [ 68.411311] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784432+00:00 WARNING kernel: [ 73.627135] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784438+00:00 ERR kernel: [ 73.827703] usb 1-5: device not accepting address 8, error -62
2017-10-20T12:32:13.784441+00:00 INFO kernel: [ 73.982222] usb 1-5: new full-speed USB device number 9 using xhci_hcd
2017-10-20T12:32:13.784442+00:00 ERR kernel: [ 78.986835] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-10-20T12:32:13.784444+00:00 WARNING kernel: [ 78.986955] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784445+00:00 WARNING kernel: [ 84.203127] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-10-20T12:32:13.784447+00:00 ERR kernel: [ 84.404623] usb 1-5: device not accepting address 9, error -62
2017-10-20T12:32:13.784448+00:00 ERR kernel: [ 84.404828] usb usb1-port5: unable to enumerate USB device
2017-10-20T12:32:13.785678+00:00 INFO kernel: [ 84.564354] uvcvideo: Found UVC 1.00 device Chromebook HD Camera (05e3:f11a)
2017-10-20T12:32:13.785701+00:00 INFO kernel: [ 84.564477] platform vpd: Driver vpd requests probe deferral
2017-10-20T12:32:13.785703+00:00 INFO kernel: [ 84.570773] usbcore: registered new interface driver uvcvideo
2017-10-20T12:32:13.785704+00:00 INFO kernel: [ 84.570787] USB Video Class driver (1.1.1)
2017-10-20T12:32:13.785706+00:00 INFO kernel: [ 84.571073] platform vpd: Driver vpd requests probe deferral
Oct 20 2017
Looking at the original failure reported, https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin/builds/908, I found similar USB errors:
2017-09-15T14:19:02.651129+00:00 ERR kernel: [ 17.130976] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651130+00:00 ERR kernel: [ 32.385932] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651132+00:00 INFO kernel: [ 32.640312] usb 1-5: new full-speed USB device number 6 using xhci_hcd
2017-09-15T14:19:02.651133+00:00 ERR kernel: [ 47.795839] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651140+00:00 WARNING kernel: [ 62.226084] udevd[143]: seq 994 '/devices/pci0000:00/0000:00:14.0/usb1/1-4' is taking a long time
2017-09-15T14:19:02.651142+00:00 ERR kernel: [ 63.050781] usb 1-5: device descriptor read/64, error -110
2017-09-15T14:19:02.651144+00:00 INFO kernel: [ 63.306401] usb 1-5: new full-speed USB device number 7 using xhci_hcd
2017-09-15T14:19:02.651145+00:00 ERR kernel: [ 68.313962] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-09-15T14:19:02.651147+00:00 WARNING kernel: [ 68.314079] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651148+00:00 WARNING kernel: [ 73.530154] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651149+00:00 ERR kernel: [ 73.730939] usb 1-5: device not accepting address 7, error -62
2017-09-15T14:19:02.651155+00:00 INFO kernel: [ 73.884964] usb 1-5: new full-speed USB device number 8 using xhci_hcd
2017-09-15T14:19:02.651158+00:00 ERR kernel: [ 78.889656] xhci_hcd 0000:00:14.0: Command completion event does not match command
2017-09-15T14:19:02.651160+00:00 WARNING kernel: [ 78.889776] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651161+00:00 WARNING kernel: [ 84.105695] xhci_hcd 0000:00:14.0: Timeout while waiting for setup device command
2017-09-15T14:19:02.651163+00:00 ERR kernel: [ 84.306519] usb 1-5: device not accepting address 8, error -62
2017-09-15T14:19:02.651164+00:00 ERR kernel: [ 84.306715] usb usb1-port5: unable to enumerate USB device
Oct 21 2017
xiaochu@ - This bug is not about USB errors. It's about the system-services job being slow to start. If "system-services is slow to start" is because of the USB problem, it's two bugs: One bug to fix the USB problem, and one bug to fix the upstart code that causes system-services to be slow. Note that the upstart fix required is already explained in c#12, above.
Oct 21 2017
I had submitted crbug.com/776780 for the USB problem since it always showed up on one DUT (6-2-20-22). Supposedly that was fixed by taking the DUT out of service.

This bug was filed against 6-2-20-7. That system was taken out of service around 9-18 after it failed completely (it had already failed by 8-1, the start date of my dut-status scan; the question here is why it had not been taken out earlier).

That almost looks like a systemic HW failure on wizpig, though with only two confirmed samples that may be a premature statement. If it happens again, it might be worthwhile to analyze the failed HW in more detail. Anyway, I agree, the USB failure needs to be handled separately.
Oct 21 2017
I wasn't sure what the fix should be regarding #12... What do you mean by 'system-services'? Can you please give clearer instructions regarding the 'upstart code'? The builder has been green since #1265; can I assume crbug.com/776780 fixed it?
Oct 21 2017
#26: AFAICS, no, crbug.com/776780 did not fix _this_ problem.

First, as I mentioned in #25, this problem was reported against another DUT, which has since been removed from service.

Second, while there is some evidence that the USB problem may result in the problem described here being observed, that doesn't mean that solving the USB problem (by replacing the affected DUTs) will solve the system-services problem. The system-services problem is still there; it is just no longer observed until another DUT experiences the USB problem (or some other problem triggers it). In other words, if event A triggers problem B, problem B doesn't go away by removing event A from the picture.
Oct 23 2017
> [ ... ] what by 'system-services' do you mean? Can
> you please give more clear instructions regarding 'upstart code'?
Upstart is the package providing /sbin/init in Chrome OS.
It manages the startup of services during boot; it fills
the same niche as systemd. 'system-services' is the name
of an upstart job that's specific to the Chrome OS boot flow.
There's more detail about the design, including a detailed
explanation of 'system-services' here:
https://www.chromium.org/chromium-os/chromiumos-design-docs/boot-design
> [ ... ] The builder was green since #1265, can I assume crbug.com/776780
> fixed it?
This problem only occurs intermittently. Moreover, sometimes the failures
are retried and ignored. The fact that some builder was green once tells
us very little about whether the bug is fixed. This particular bug is
identified by particular features in the source code, so we'll know it's
fixed when we search the code and find that that feature (jobs that use
"start on starting system-services") is gone.
Oct 23 2017
Recapping this bug, so that the requirement here doesn't get lost:

This bug is assigned to the sheriff to find a permanent owner to fix the problem. It's not the job of the sheriff to fix the problem, or even particularly to understand it. Only reassign to a proper owner.

The problem is that the 'system-services' job can be too slow to start. Autotest in the test lab treats the slowness as a failure in the build under test; justification for that treatment is in c#9, c#10 and c#12.

The fix is to find all upstart jobs that use "start on starting system-services" and change them to use "start on started system-services" instead, e.g. with a search like the one sketched below.
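For illustration, one way to enumerate the offenders; the paths here are assumptions (installed jobs live in /etc/init on a DUT, while the source .conf files are spread across packages):

    # On a DUT: list installed upstart jobs gated on the "starting" event.
    grep -rl 'start on starting system-services' /etc/init
    # Across a source checkout (path is an assumption):
    grep -rl --include='*.conf' 'start on starting system-services' ~/chromiumos/src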
Oct 23 2017
Deputy comment: wizpig-paladin is currently marked experimental, citing this bug. The ask in #30 is a non-trivial solution to the root cause of this problem. But surely all these "incorrect" upstart dependencies weren't added last week. So, we need to find what happened to push wizpig over the edge so that it started failing the CQ frequently. If we can fix that problem, we can at least mark wizpig important again.
Oct 27 2017
I scrubbed the last 30 builds or so of wizpig-paladin. The baseline is not legit -- since wizpig is experimental, its builds are getting aborted before they complete, so there could have been more failures than I saw. I found three failures, and they seem to be the same as this bug claims, although I haven't dug too much, so low confidence rating:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151698270-chromeos-test/chromeos2-row8-rack8-host12
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151579000-chromeos-test/chromeos2-row8-rack8-host12/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/151486510-chromeos-test/chromeos2-row8-rack8-host12/
So, this issue remains active. wizpig has been experimental for 1 week+. I'm going to mark it experimental via chromeos_config so we can track it better (tree status is for quick fixes).
Who cares about the strago platform? Without wizpig, strago doesn't have bvt-inline coverage.
+sheriffs: This still needs an owner.
Oct 27 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/efb85837f82850d32dc3f5a55438181988169054

commit efb85837f82850d32dc3f5a55438181988169054
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Oct 27 23:14:06 2017

chromeos_config: mark wizpig, reef, reef-uni experimental

The tree status is intended to mark things as experimental for the short term. These boards have been continuously experimental for over a week. Moving the flag to config.

BUG=chromium:765686
TEST=None

Change-Id: I70a67b3885cf11a9443f5b808aa99dcd7a795e62
Reviewed-on: https://chromium-review.googlesource.com/742101
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/efb85837f82850d32dc3f5a55438181988169054/cbuildbot/config_dump.json
[modify] https://crrev.com/efb85837f82850d32dc3f5a55438181988169054/cbuildbot/chromeos_config.py
Nov 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/65fdfb83af6ab313b5104a557d053e82266adfa3

commit 65fdfb83af6ab313b5104a557d053e82266adfa3
Author: Jason Clinton <jclinton@chromium.org>
Date: Wed Nov 01 21:06:28 2017

Revert "chromeos_config: mark wizpig, reef, reef-uni experimental"

This reverts commit efb85837f82850d32dc3f5a55438181988169054.

Reason for revert: reef and reef-uni are not broken at ToT: Snappy/Alan/Bigdaddy are going to DVT/Factory this week and need to have new regressions checked in because we don't have coverage. The one breakage in reef-uni recently is another example of the USB dongle problem.

Original change's description:
> chromeos_config: mark wizpig, reef, reef-uni experimental
>
> The tree status is intended to mark things as experimental for the short
> term. These boards have been continuously experimental for over a week.
> Moving the flag to config.
>
> BUG=chromium:765686
> TEST=None
>
> Change-Id: I70a67b3885cf11a9443f5b808aa99dcd7a795e62
> Reviewed-on: https://chromium-review.googlesource.com/742101
> Commit-Ready: Aviv Keshet <akeshet@chromium.org>
> Tested-by: Aviv Keshet <akeshet@chromium.org>
> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

Bug: chromium:765686
Change-Id: I0481f984728fac5011a5a590605d67efc39cafc7
Reviewed-on: https://chromium-review.googlesource.com/749169
Commit-Ready: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Simon Glass <sjg@chromium.org>
Reviewed-by: C Shapiro <shapiroc@google.com>

[modify] https://crrev.com/65fdfb83af6ab313b5104a557d053e82266adfa3/cbuildbot/config_dump.json
[modify] https://crrev.com/65fdfb83af6ab313b5104a557d053e82266adfa3/cbuildbot/chromeos_config.py
Nov 1 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/625dec89058d5e3df1ddd782a1ae19546e83188f

commit 625dec89058d5e3df1ddd782a1ae19546e83188f
Author: Justin TerAvest <teravest@chromium.org>
Date: Wed Nov 01 21:06:30 2017

bluez: Start on "started system-services"

"start on starting system-services" should be reserved for jobs for which failure would be considered a system failure. Jobs inappropriately using "starting system services" impact our provisioning flow.

BUG=chromium:765686
TEST=None

Change-Id: Ibc5b28c75e9ef2a508d0c4f3f219adf5b45f5229
Reviewed-on: https://chromium-review.googlesource.com/748967
Commit-Ready: Justin TerAvest <teravest@chromium.org>
Tested-by: Justin TerAvest <teravest@chromium.org>
Reviewed-by: Simon Glass <sjg@chromium.org>

[modify] https://crrev.com/625dec89058d5e3df1ddd782a1ae19546e83188f/net-wireless/bluez/files/bluez-upstart.conf
Nov 3 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/d34fe79d0af8d5df0768dd005309af7441436a4d

commit d34fe79d0af8d5df0768dd005309af7441436a4d
Author: Justin TerAvest <teravest@chromium.org>
Date: Fri Nov 03 08:53:36 2017

chromeos-machine-id-regen: Fix start dependency.

"start on starting system-services" should be reserved for jobs for which failure would be considered a system failure. Jobs inappropriately using "starting system services" impact our provisioning flow.

BUG=chromium:765686
TEST=None

Change-Id: Ic2e7633d153a2cefc7eb11e092317ae6d9f951db
Reviewed-on: https://chromium-review.googlesource.com/752024
Commit-Ready: Justin TerAvest <teravest@chromium.org>
Tested-by: Justin TerAvest <teravest@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[rename] https://crrev.com/d34fe79d0af8d5df0768dd005309af7441436a4d/chromeos-base/chromeos-machine-id-regen/chromeos-machine-id-regen-0.0.1-r4.ebuild
[modify] https://crrev.com/d34fe79d0af8d5df0768dd005309af7441436a4d/chromeos-base/chromeos-machine-id-regen/files/cros-machine-id-regen-periodic.conf
Nov 3 2017
Richard, I started to change the upstart scripts, but stopped to think this through a bit more, and realized that if update_engine is slow to start, shuffling things around won't fix the problem.

Taking a couple of historical examples:
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1230
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/1240

From there, we can pull the update_engine.log entry from crashinfo:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/149593081-chromeos-test/chromeos6-row2-rack20-host22/crashinfo.chromeos6-row2-rack20-host22/var/log/

[1017/040203:INFO:evaluation_context.cc(218)] Waiting for timeout in 11h59m59.999768s
[1017/040248:INFO:update_attempter.cc(1150)] Marking booted slot as good.
[1017/040258:INFO:subprocess.cc(156)] Subprocess output: Cloning firmware A to B...
flashrom v0.9.9 : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
flashrom v0.9.9 : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
Calibrating delay loop... OK.
Error accessing low megabyte, 0x100000 bytes at 0x0
/dev/mem mmap failed: Resource temporarily unavailable
Failed getting access to coreboot low tables.
Mapping BYT IBASE at 0xfed08000, unaligned size 0x200.
Mapping BYT SBASE at 0xfed01000, unaligned size 0x200.
Reading flash... SUCCESS
flashrom v0.9.9 : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
flashrom v0.9.9 : 312b318 : Oct 12 2017 07:09:35 UTC on Linux 3.18.0-16233-g0589562febc4 (x86_64)
Calibrating delay loop... OK.
Error accessing low megabyte, 0x100000 bytes at 0x0
/dev/mem mmap failed: Resource temporarily unavailable
Failed getting access to coreboot low tables.
Mapping BYT IBASE at 0xfed08000, unaligned size 0x200.
Mapping BYT SBASE at 0xfed01000, unaligned size 0x200.
Erasing and writing flash chip... SUCCESS
INFO: /usr/sbin/chromeos-setgoodfirmware: Active vboot1 firmware set as good firmware.

It looks like it took flashrom about 55 seconds to update the firmware for some reason. I don't know what a reasonable timeout is here (12 hours seems long), but it seems like autotest should take into account that there could be some delay in getting to "running" coming from update_engine here. Am I understanding this correctly?
Nov 3 2017
+vapier to check my understanding in #38
Nov 4 2017
Glancing at the autotest code, there doesn't seem to be any retry/timeout for the OS check currently. Seems like we should add some there.
server/hosts/cros_host.py:verify_software does a single:

    if not self.upstart_status('system-services'):
        raise error.AutoservError('Chrome failed to reach login. '
                                  'System services not running.')

I would start with 5 minutes. If the current code frequently passes w/out any timeout, then 5 min seems like plenty of time; a bounded retry might look like the sketch below.
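For illustration only (this is not the actual CL): a minimal sketch of such a bounded retry around the existing check. The helper name, the 300-second timeout (the 5 minutes suggested above), and the 10-second poll interval are assumptions.

    import time

    from autotest_lib.client.common_lib import error

    def wait_for_system_services(host, timeout_s=300, poll_s=10):
        """Poll upstart until system-services reports start/running.

        `host` is assumed to provide the CrosHost.upstart_status()
        method quoted above; on timeout, raise the same AutoservError
        as the single-shot check.
        """
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            if host.upstart_status('system-services'):
                return
            time.sleep(poll_s)
        raise error.AutoservError('Chrome failed to reach login. '
                                  'System services not running.')

Whether such a wait should apply only on provisioning paths, rather than everywhere this check runs, is a separate question raised later in this bug.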
I also agree with the recent work to change services from "starting" to "started" where it makes sense, as those are critical-path delays.
Nov 13 2017
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/756829 should fix the fundamental issue here w/ update_engine. wizpig-paladin has succeeded for the last 12 builds:
https://uberchromegw.corp.google.com/i/chromeos/builders/wizpig-paladin
Since there's a bunch of context in this bug, instead of closing it, I'll leave it open and drop the priority.
May 17 2018
Oh, <sigh> I just re-discovered this...

> https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/756829
> should fix the fundamental issue here w/ update_engine.

I'm not entirely thrilled with this solution... From comment #12:

> [ ... ] I also don't think it's appropriate to add
> a wait loop on the existing check.

The fix applied does the thing I was hoping we wouldn't do. :-(

It seems the change works, but I see two problems. First, this (more or less) layers timeouts on timeouts: there's already a wait for update-engine to get started inside the provisioning code, and this adds another wait later in the flow. It also adds the wait into code flows that have nothing to do with provisioning, where it's very likely just wasted time.

Second, the change is papering over a bug somewhere, and that bug is emphatically not "update-engine can be slow to start". The record on this bug is and always was that something other than update-engine was slowing down boot. There's no evidence that that underlying cause has ever been fixed.

So, put more simply: whatever was making post-Chrome startup slow before is probably still happening, and hasn't been explained.