Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Starred by 1 user
Status: WontFix
Owner:
Closed: Jun 21
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 695529



Sign in to add a comment
"status system-services | grep start/running" failed on nyan_kitty paladin
Project Member Reported by semenzato@chromium.org, Feb 17 2017 Back to list
https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/765/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio

platform_ToolchainOptions was reported as timing out, but it actually failed:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101755753-chromeos-test/chromeos4-row13-rack8-host4/debug/

02/17 05:24:14.597 DEBUG|          ssh_host:0284| Running (ssh) 'df -PB 1000000 /mnt/stateful_partition/encrypted | tail -1'
02/17 05:24:15.098 DEBUG|        base_utils:0280| [stdout] /dev/mapper/encstateful           3233     8      3209       1% /mnt/stateful_partition/encrypted
02/17 05:24:15.103 INFO |      base_classes:0359| Found 3.209 GB >= 0.1 GB of space under /mnt/stateful_partition/encrypted on machine chromeos4-row13-rack8-host4
02/17 05:24:15.104 DEBUG|          ssh_host:0284| Running (ssh) 'status system-services | grep start/running'
02/17 05:24:15.644 ERROR|            repair:0332| Failed: Legacy host verification checks

I am going to take a look at the CLs but it's hard to imagine that a CL caused this to happen on this board only.

 
1. let's find out why this was reported as a timeout when the test actually failed

2. there's some confusion in a database somewhere.

$ dut-status -f -d 6 chromeos4-row13-rack8-host4
chromeos4-row13-rack8-host4
    2017-02-17 08:16:35  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543576-reset/
    2017-02-17 08:14:54  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797539-chromeos-test/
    2017-02-17 08:14:19  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543543-reset/
    2017-02-17 08:12:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797548-chromeos-test/
    2017-02-17 08:12:09  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543522-reset/
    2017-02-17 08:10:13  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101797522-chromeos-test/
    2017-02-17 08:00:07  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2543461-provision/
    2017-02-17 05:34:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101756042-chromeos-test/
    2017-02-17 05:33:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542150-reset/
    2017-02-17 05:31:35  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755952-chromeos-test/
    2017-02-17 05:30:51  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542108-reset/
    2017-02-17 05:29:27  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755886-chromeos-test/
    2017-02-17 05:28:45  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542078-reset/
    2017-02-17 05:26:37  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101755877-chromeos-test/
    2017-02-17 05:25:36  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542048-reset/
    2017-02-17 05:24:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2542041-repair/
    2017-02-17 05:13:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2541896-provision/
    2017-02-17 03:21:09  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722750-chromeos-test/
    2017-02-17 03:20:28  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540840-reset/
    2017-02-17 03:18:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722725-chromeos-test/
    2017-02-17 03:18:01  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540794-reset/
    2017-02-17 03:16:45  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722717-chromeos-test/
    2017-02-17 03:16:05  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540767-reset/
    2017-02-17 03:14:22  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722713-chromeos-test/
    2017-02-17 03:13:41  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540744-reset/
    2017-02-17 03:11:59  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/101722677-chromeos-test/
    2017-02-17 03:01:46  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host4/2540695-provision/

The above job 101755753 is not in this list.  Maybe this is related to the timeout?

Is it possible that multiple autoserv instances were mistakenly trying to use the same DUT?

Labels: -Pri-2 Pri-1
This happened again:

https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/769

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/101853605-chromeos-test/chromeos4-row13-rack8-host4/debug/

in the graphics_Idle test:

02/17 17:03:14.596 DEBUG|        base_utils:0280| [stdout] /dev/mapper/encstateful           3233     8      3208       1% /mnt/stateful_partition/encrypted
02/17 17:03:14.600 INFO |      base_classes:0359| Found 3.208 GB >= 0.1 GB of space under /mnt/stateful_partition/encrypted on machine chromeos4-row13-rack8-host4
02/17 17:03:14.601 DEBUG|          ssh_host:0284| Running (ssh) 'status system-services | grep start/running'
02/17 17:03:15.075 ERROR|            repair:0332| Failed: Legacy host verification checks
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/hosts/repair.py", line 329, in _verify_host
    self.verify(host)
  File "/usr/local/autotest/server/hosts/repair.py", line 55, in verify
    host.verify_software()
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1449, in verify_software
    if not self.upstart_status('system-services'):
  File "/usr/local/autotest/server/hosts/cros_host.py", line 1413, in upstart_status
    service_name).stdout.strip() != ''
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 295, in run
    ssh_failure_retry_ok)
  File "/usr/local/autotest/server/hosts/ssh_host.py", line 252, in _run
    raise error.AutoservRunError("command execution error", result)
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_pmjMnxssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos4-row13-rack8-host4 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::verify|verify_software|upstart_status] -> ssh_run(status
    system-services | grep start/running)\";fi; status system-services | grep
    start/running"
Exit status: 1
Duration: 0.355021953583

Wait... I found the syslog here:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host4/2546348-repair/20171702170323/chromeos4-row13-rack8-host4/before_repair/log/

and there is something suspicious:

2017-02-18T01:03:11.893953+00:00 NOTICE ag[2540]: autotest server[stack::verify_software|get_board|parse_cmd_output] -> ssh_run(cat /etc/lsb-release)
2017-02-18T01:03:12.386516+00:00 NOTICE ag[2543]: autotest server[stack::verify|verify_software|check_inodes] -> ssh_run(df -Pi /mnt/stateful_partition | tail -1)
2017-02-18T01:03:12.882757+00:00 NOTICE ag[2547]: autotest server[stack::verify|verify_software|check_diskspace] -> ssh_run(df -PB 1000000 /mnt/stateful_partition | tail -1)
2017-02-18T01:03:13.374367+00:00 NOTICE ag[2551]: autotest server[stack::verify|verify_software|path_exists] -> ssh_run(test -e "/mnt/stateful_partition/encrypted")
2017-02-18T01:03:13.859397+00:00 NOTICE ag[2553]: autotest server[stack::verify|verify_software|check_diskspace] -> ssh_run(df -PB 1000000 /mnt/stateful_partition/encrypted | tail -1)
2017-02-18T01:03:13.961379+00:00 ERR kernel: [  116.865489] Bluetooth: FW failed to be active in time!
2017-02-18T01:03:13.961418+00:00 ERR kernel: [  116.865548] Bluetooth: Downloading firmware failed!
2017-02-18T01:03:14.082368+00:00 INFO kernel: [  116.986149] tegra-hda 70030000.hda: codec_mask = 0x8
2017-02-18T01:03:14.082396+00:00 INFO kernel: [  116.986308] tegra-hda 70030000.hda: Enable delay in RIRB handling
2017-02-18T01:03:14.111100+00:00 INFO kernel: [  117.020867] uvcvideo: Found UVC 1.00 device HD WebCam (064e:c403)
2017-02-18T01:03:14.121083+00:00 INFO kernel: [  117.029531] usbcore: registered new interface driver uvcvideo
2017-02-18T01:03:14.121111+00:00 INFO kernel: [  117.029551] USB Video Class driver (1.1.1)
2017-02-18T01:03:14.336525+00:00 NOTICE ag[2581]: autotest server[stack::verify|verify_software|upstart_status] -> ssh_run(status system-services | grep start/running)
2017-02-18T01:03:14.541056+00:00 INFO kernel: [  117.452659] smsc95xx v1.0.4
2017-02-18T01:03:14.611123+00:00 INFO kernel: [  117.516135] smsc95xx 1-1.1:1.0 eth1: register 'smsc95xx' at usb-xhci-hcd.0.auto-1.1, smsc95xx USB 2.0 Ethernet, 94:eb:2c:10:02:5f
2017-02-18T01:03:14.611151+00:00 INFO kernel: [  117.516251] usbcore: registered new interface driver smsc95xx
2017-02-18T01:03:14.701205+00:00 INFO kernel: [  117.604730] smsc95xx 1-1.1:1.0 eth1: hardware isn't capable of remote wakeup
2017-02-18T01:03:14.701238+00:00 INFO kernel: [  117.604917] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
2017-02-18T01:03:14.721176+00:00 DEBUG kernel: [  117.623605] of_get_named_gpio_flags exited with status 71
2017-02-18T01:03:14.721241+00:00 DEBUG kernel: [  117.623621] of_get_named_gpio_flags exited with status 143
2017-02-18T01:03:14.751114+00:00 INFO kernel: [  117.653162] max98090 0-0010: MAX98090 REVID=0x43
2017-02-18T01:03:14.751146+00:00 ERR kernel: [  117.655877] max98090 0-0010: No platform data
2017-02-18T01:03:14.751151+00:00 INFO kernel: [  117.658722] tegra-snd-max98090 sound.8:  HiFi <-> 70301100.i2s mapping ok
2017-02-18T01:03:14.761064+00:00 WARNING kernel: [  117.669020] udevd[175]: Process '/usr/sbin/alsactl restore 0' failed with exit code 99.
2017-02-18T01:03:14.801069+00:00 INFO kernel: [  117.709937] input: NVIDIA Tegra Venice2 Mic Jack as /devices/soc0/sound.8/sound/card1/input1
2017-02-18T01:03:14.801108+00:00 INFO kernel: [  117.710179] input: NVIDIA Tegra Venice2 Headphone Jack as /devices/soc0/sound.8/sound/card1/input2
2017-02-18T01:03:14.821757+00:00 WARNING kernel: [  117.730121] udevd[1469]: Process '/usr/sbin/alsactl restore 1' failed with exit code 99.
2017-02-18T01:03:15.027648+00:00 INFO permission_broker[901]: ProcessPath(/dev/bus/usb/001/008)

It looks like the network connection may have gone down during execution of the last ssh command---which would have then returned a failure, but maybe the caller is not checking the actual status value, just whether there is an error or not... sigh...
Owner: jrbarnette@chromium.org
Pass to next deputy
Comment 5 by aut...@google.com, Feb 21 2017
Labels: -current-issue
The complaint about "system-services" generally indicates that chrome
crashed at boot.

Looking at history, I see that this particular DUT has encountered
this failure on an off for a week or more.  I'm searching to see if
any other DUTs are seeing it.

Blocking: 695529
Following up on c#6, I note that provisioning checks that
Chrome started without crashing before the failure was
reported.  That means the failure here probably went like
this:
 1) Provisioning installed the new build.
 2) Provisioning tested, and confirmed that Chrome was
    working.
 3) The DUT crashed (or at least, rebooted spontaneously.
 4) Chrome failed to start.
 5) Final provisioning verification tested and concluded
    that Chrome had crashed at boot.

Status: WontFix
This hasn't happened in a while, and logs for debug are gradually
vanishing.

Sign in to add a comment