"status system-services | grep start/running" failed on nyan_kitty paladin |
Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/765/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio

platform_ToolchainOptions was reported as timing out, but it actually failed:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101755753-chromeos-test/chromeos4-row13-rack8-host4/debug/

02/17 05:24:14.597 DEBUG| ssh_host:0284| Running (ssh) 'df -PB 1000000 /mnt/stateful_partition/encrypted | tail -1'
02/17 05:24:15.098 DEBUG| base_utils:0280| [stdout] /dev/mapper/encstateful 3233 8 3209 1% /mnt/stateful_partition/encrypted
02/17 05:24:15.103 INFO | base_classes:0359| Found 3.209 GB >= 0.1 GB of space under /mnt/stateful_partition/encrypted on machine chromeos4-row13-rack8-host4
02/17 05:24:15.104 DEBUG| ssh_host:0284| Running (ssh) 'status system-services | grep start/running'
02/17 05:24:15.644 ERROR| repair:0332| Failed: Legacy host verification checks

I am going to take a look at the CLs, but it's hard to imagine that a CL caused this to happen on this board only.
Feb 18 2017
This happened again:

https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/769
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101853605-chromeos-test/chromeos4-row13-rack8-host4/debug/

in the graphics_Idle test:

02/17 17:03:14.596 DEBUG| base_utils:0280| [stdout] /dev/mapper/encstateful 3233 8 3208 1% /mnt/stateful_partition/encrypted
02/17 17:03:14.600 INFO | base_classes:0359| Found 3.208 GB >= 0.1 GB of space under /mnt/stateful_partition/encrypted on machine chromeos4-row13-rack8-host4
02/17 17:03:14.601 DEBUG| ssh_host:0284| Running (ssh) 'status system-services | grep start/running'
02/17 17:03:15.075 ERROR| repair:0332| Failed: Legacy host verification checks
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 55, in verify
    host.verify_software()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1449, in verify_software
    if not self.upstart_status('system-services'):
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1413, in upstart_status
    service_name).stdout.strip() != ''
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 295, in run
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 252, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_pmjMnxssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row13-rack8-host4 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)\";fi; status system-services | grep start/running"
Exit status: 1
Duration: 0.355021953583
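For reference, the frames above suggest roughly the following shape for the failing code path. This is a hedged reconstruction from the traceback, not the verbatim autotest source; the key property is that ssh_host.run() raises AutoservRunError on any nonzero exit status, so upstart_status() either returns True or raises, and it never cleanly returns False when the link drops mid-check:

# Reconstructed sketch of the call chain in the traceback; not the
# verbatim autotest source. Runnable stand-alone for illustration.

class AutoservRunError(Exception):
    """Stand-in for autotest.client.common_lib.error.AutoservRunError."""


class Result(object):
    def __init__(self, exit_status, stdout):
        self.exit_status = exit_status
        self.stdout = stdout


def ssh_run(command, exit_status=1, stdout=''):
    # Mirrors ssh_host._run() (frame at ssh_host.py:252): any nonzero
    # exit raises. Exit 1 could mean "grep matched nothing" on the DUT
    # *or* that ssh itself died; the exception looks identical.
    if exit_status != 0:
        raise AutoservRunError('command execution error')
    return Result(exit_status, stdout)


def upstart_status(service_name):
    # Mirrors cros_host.py:1413 in the traceback.
    return ssh_run('status %s | grep start/running'
                   % service_name).stdout.strip() != ''


try:
    upstart_status('system-services')  # simulated exit status 1, as logged
except AutoservRunError as e:
    print('verify fails here, just as in the log: %s' % e)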
Feb 18 2017
Wait... I found the syslog here:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host4/2546348-repair/20171702170323/chromeos4-row13-rack8-host4/before_repair/log/

and there is something suspicious:

2017-02-18T01:03:11.893953+00:00 NOTICE ag[2540]: autotest server[stack::verify_software|get_board|parse_cmd_output] -> ssh_run(cat /etc/lsb-release)
2017-02-18T01:03:12.386516+00:00 NOTICE ag[2543]: autotest server[stack::verify|verify_software|check_inodes] -> ssh_run(df -Pi /mnt/stateful_partition | tail -1)
2017-02-18T01:03:12.882757+00:00 NOTICE ag[2547]: autotest server[stack::verify|verify_software|check_diskspace] -> ssh_run(df -PB 1000000 /mnt/stateful_partition | tail -1)
2017-02-18T01:03:13.374367+00:00 NOTICE ag[2551]: autotest server[stack::verify|verify_software|path_exists] -> ssh_run(test -e "/mnt/stateful_partition/encrypted")
2017-02-18T01:03:13.859397+00:00 NOTICE ag[2553]: autotest server[stack::verify|verify_software|check_diskspace] -> ssh_run(df -PB 1000000 /mnt/stateful_partition/encrypted | tail -1)
2017-02-18T01:03:13.961379+00:00 ERR kernel: [ 116.865489] Bluetooth: FW failed to be active in time!
2017-02-18T01:03:13.961418+00:00 ERR kernel: [ 116.865548] Bluetooth: Downloading firmware failed!
2017-02-18T01:03:14.082368+00:00 INFO kernel: [ 116.986149] tegra-hda 70030000.hda: codec_mask = 0x8
2017-02-18T01:03:14.082396+00:00 INFO kernel: [ 116.986308] tegra-hda 70030000.hda: Enable delay in RIRB handling
2017-02-18T01:03:14.111100+00:00 INFO kernel: [ 117.020867] uvcvideo: Found UVC 1.00 device HD WebCam (064e:c403)
2017-02-18T01:03:14.121083+00:00 INFO kernel: [ 117.029531] usbcore: registered new interface driver uvcvideo
2017-02-18T01:03:14.121111+00:00 INFO kernel: [ 117.029551] USB Video Class driver (1.1.1)
2017-02-18T01:03:14.336525+00:00 NOTICE ag[2581]: autotest server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)
2017-02-18T01:03:14.541056+00:00 INFO kernel: [ 117.452659] smsc95xx v1.0.4
2017-02-18T01:03:14.611123+00:00 INFO kernel: [ 117.516135] smsc95xx 1-1.1:1.0 eth1: register 'smsc95xx' at usb-xhci-hcd.0.auto-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:10:02:5f
2017-02-18T01:03:14.611151+00:00 INFO kernel: [ 117.516251] usbcore: registered new interface driver smsc95xx
2017-02-18T01:03:14.701205+00:00 INFO kernel: [ 117.604730] smsc95xx 1-1.1:1.0 eth1: hardware isn't capable of remote wakeup
2017-02-18T01:03:14.701238+00:00 INFO kernel: [ 117.604917] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
2017-02-18T01:03:14.721176+00:00 DEBUG kernel: [ 117.623605] of_get_named_gpio_flags exited with status 71
2017-02-18T01:03:14.721241+00:00 DEBUG kernel: [ 117.623621] of_get_named_gpio_flags exited with status 143
2017-02-18T01:03:14.751114+00:00 INFO kernel: [ 117.653162] max98090 0-0010: MAX98090 REVID=0x43
2017-02-18T01:03:14.751146+00:00 ERR kernel: [ 117.655877] max98090 0-0010: No platform data
2017-02-18T01:03:14.751151+00:00 INFO kernel: [ 117.658722] tegra-snd-max98090 sound.8: HiFi <-> 70301100.i2s mapping ok
2017-02-18T01:03:14.761064+00:00 WARNING kernel: [ 117.669020] udevd[175]: Process '/usr/sbin/alsactl restore 0' failed with exit code 99.
2017-02-18T01:03:14.801069+00:00 INFO kernel: [ 117.709937] input: NVIDIA Tegra Venice2 Mic Jack as /devices/soc0/sound.8/sound/card1/input1
2017-02-18T01:03:14.801108+00:00 INFO kernel: [ 117.710179] input: NVIDIA Tegra Venice2 Headphone Jack as /devices/soc0/sound.8/sound/card1/input2
2017-02-18T01:03:14.821757+00:00 WARNING kernel: [ 117.730121] udevd[1469]: Process '/usr/sbin/alsactl restore 1' failed with exit code 99.
2017-02-18T01:03:15.027648+00:00 INFO permission_broker[901]: ProcessPath(/dev/bus/usb/001/008)

It looks like the network connection may have gone down during execution of the last ssh command, which would have then returned a failure. But maybe the caller is not checking the actual status value, just whether there is an error or not... sigh...
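If that hypothesis is right, one possible hardening would be to separate ssh's own failures from the remote command's exit code: openssh reserves exit status 255 for its own errors, while grep exits 1 when it merely finds no match. A minimal sketch of such a check (hypothetical helper, not existing autotest code):

import subprocess

SSH_TRANSPORT_ERROR = 255  # openssh uses 255 for its own failures


def check_upstart_service(host, service):
    """Hypothetical hardened check: returns True/False for the service
    state, and raises only when the ssh transport itself failed."""
    proc = subprocess.run(
        ['ssh', '-o', 'BatchMode=yes', '-o', 'ConnectTimeout=30',
         'root@%s' % host,
         'status %s | grep start/running' % service],
        capture_output=True, text=True)
    if proc.returncode == SSH_TRANSPORT_ERROR:
        # The connection dropped, as hypothesized above; retry or mark
        # the DUT unreachable instead of "service not running".
        raise RuntimeError('ssh transport failure talking to %s' % host)
    # Exit 0 with matching output => running; exit 1 from grep (no
    # match) => the service is genuinely not in start/running.
    return proc.returncode == 0 and 'start/running' in proc.stdout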
Feb 21 2017
Pass to next deputy
Feb 21 2017
The complaint about "system-services" generally indicates that Chrome crashed at boot. Looking at history, I see that this particular DUT has encountered this failure on and off for a week or more. I'm searching to see if any other DUTs are seeing it.
Mar 2 2017
Following up on c#6, I note that provisioning checks that Chrome started without crashing before the failure was reported. That means the failure here probably went like this:
1) Provisioning installed the new build.
2) Provisioning tested, and confirmed that Chrome was working.
3) The DUT crashed (or at least, rebooted spontaneously).
4) Chrome failed to start.
5) Final provisioning verification tested and concluded that Chrome had crashed at boot.
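One way to confirm step 3 on future occurrences would be to compare the kernel's boot_id across the provision/verify boundary, since /proc/sys/kernel/random/boot_id is regenerated on every boot. A minimal sketch under that assumption (hypothetical helper names, not the actual provisioning code):

import subprocess


def get_boot_id(host):
    """Read the kernel's per-boot UUID; it changes on every reboot."""
    return subprocess.check_output(
        ['ssh', 'root@%s' % host,
         'cat /proc/sys/kernel/random/boot_id'],
        text=True).strip()


def verify_no_spontaneous_reboot(host, boot_id_after_provision):
    # Hypothetical check: if boot_id changed between the end of
    # provisioning (step 2) and final verification (step 5), the DUT
    # rebooted on its own (step 3), and the Chrome-crash signature
    # belongs to the new boot rather than to the provisioned build.
    current = get_boot_id(host)
    if current != boot_id_after_provision:
        raise RuntimeError(
            '%s rebooted spontaneously (boot_id %s -> %s)'
            % (host, boot_id_after_provision, current))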
Jun 21 2017
This hasn't happened in a while, and the debug logs are gradually vanishing.
Comment 1 by semenzato@chromium.org, Feb 17 2017

1. let's find out why this was reported as a timeout when the test actually failed
2. there's some confusion in a database somewhere.

$ dut-status -f -d 6 chromeos4-row13-rack8-host4
chromeos4-row13-rack8-host4
    2017-02-17 08:16:35  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543576-reset/
    2017-02-17 08:14:54  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797539-chromeos-test/
    2017-02-17 08:14:19  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543543-reset/
    2017-02-17 08:12:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797548-chromeos-test/
    2017-02-17 08:12:09  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543522-reset/
    2017-02-17 08:10:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797522-chromeos-test/
    2017-02-17 08:00:07  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543461-provision/
    2017-02-17 05:34:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101756042-chromeos-test/
    2017-02-17 05:33:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542150-reset/
    2017-02-17 05:31:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755952-chromeos-test/
    2017-02-17 05:30:51  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542108-reset/
    2017-02-17 05:29:27  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755886-chromeos-test/
    2017-02-17 05:28:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542078-reset/
    2017-02-17 05:26:37  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755877-chromeos-test/
    2017-02-17 05:25:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542048-reset/
    2017-02-17 05:24:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542041-repair/
    2017-02-17 05:13:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2541896-provision/
    2017-02-17 03:21:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722750-chromeos-test/
    2017-02-17 03:20:28  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540840-reset/
    2017-02-17 03:18:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722725-chromeos-test/
    2017-02-17 03:18:01  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540794-reset/
    2017-02-17 03:16:45  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722717-chromeos-test/
    2017-02-17 03:16:05  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540767-reset/
    2017-02-17 03:14:22  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722713-chromeos-test/
    2017-02-17 03:13:41  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540744-reset/
    2017-02-17 03:11:59  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722677-chromeos-test/
    2017-02-17 03:01:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540695-provision/

The above job 101755753 is not in this list. Maybe this is related to the timeout? Is it possible that multiple autoserv instances were mistakenly trying to use the same DUT?
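As a quick way to answer that last question from the history alone, one could parse listings like the one above and check which jobs a DUT actually recorded. A rough sketch (the line format is inferred from the output above, so treat it as an assumption):

import re

# Matches entries in the dut-status listing above:
#   "<timestamp>  <OK|-->  <logs URL>"
LINE_RE = re.compile(
    r'^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(OK|--)\s+(\S+)$')


def find_job(history_text, job_id):
    """Return the (timestamp, status, url) entries mentioning job_id."""
    hits = []
    for line in history_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and str(job_id) in m.group(3):
            hits.append(m.groups())
    return hits

# e.g. find_job(listing, 101755753) returning [] is exactly what this
# comment observed: the failing job never made it into the DUT's history.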