(I apologize for the weak attempt at humor in the summary, but one has to work hard at staying motivated. I will try to control myself.)
Executive summary: during provision, a DUT appears to be unreachable for about 4 minutes, then it comes back up, and logs are collected for failure analysis. However, none of the system logs covers the critical time, so we don't know if the device hung and crashed, or what else.
Case at hand:
https://uberchromegw.corp.google.com/i/chromeos/builders/kip-paladin/builds/248
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/91776597-chromeos-test/chromeos4-row2-rack13-host17/debug/
I have seen this in other cases too, and it is very time-consuming to go through all possibly relevant logs.
Here's the story:
DUT is c4-r2-r13-h17 and is alive and well at 1539.
12/20 15:39:15.725 DEBUG| ssh_host:0177| Running (ssh) 'true'
12/20 15:39:15.878 DEBUG| abstract_ssh:0577| Host chromeos4-row2-rack13-host17 is now up
Devserver 100.115.219.134 is selected.
2/20 15:39:49.882 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
12/20 15:39:51.305 DEBUG| cros_host:0737| Resolved devserver for auto-update: http://100.115.219.134:8082
Then autoserv tells devserver to make a curl-rpc to itself to start the update on the DUT. (This seems overcomplicated---autoserv should tell the DUT directly.) The RPC is asynchronous.
12/20 15:39:51.306 DEBUG| dev_server:1847| Start CrOS auto-update for host chromeos4-row2-rack13-host17 at 1 time(s).
12/20 15:39:51.306 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/cros_au?build_name=kip-paladin/R57-9106.0.0-rc2&async=True&force_update=True&host_name=chromeos4-row2-rack13-host17&full_update=False"''
12/20 15:39:52.603 INFO | dev_server:1667| Received response from devserver for cros_au call: '[true, 16823]'
12/20 15:39:52.604 DEBUG| dev_server:1782| start process 16823 for auto_update in devserver
Then autoserv starts checking update progress.
12/20 15:39:52.604 DEBUG| dev_server:1688| Check the progress for auto-update process 16823
12/20 15:39:52.604 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=16823&host_name=chromeos4-row2-rack13-host17&full_update=False"''
12/20 15:39:53.656 DEBUG| monitors:0191| ts_mon: sending 13 metrics to PubSub
12/20 15:39:53.869 DEBUG| dev_server:1723| Current CrOS auto-update status: CrOS update is just started.
Progress is happening because a little later the auto-update status changes.
12/20 15:40:28.684 DEBUG| dev_server:1723| Current CrOS auto-update status: transfer rootfs update package
...
That goes on for a while, but 5 minutes later we get a failure, and start cleaning up. As part of the cleanup we attempt to collect logs from the DUT, but that fails:
12/20 15:45:33.649 DEBUG| dev_server:1785| Failed to trigger auto-update process on devserver
12/20 15:45:33.650 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/handler_cleanup?pid=16823&host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:34.844 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/collect_cros_au_log?pid=16823&host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:36.061 DEBUG| dev_server:1617| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row2-rack13-host17/357834-provision/20162012153838/autoupdate_logs/CrOS_update_chromeos4-row2-rack13-host17_16823.log
12/20 15:45:36.062 DEBUG| dev_server:1884| Exception raised on auto_update attempt #1:
Traceback (most recent call last):
File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 227, in TriggerAU
self._WriteAUStatus(cros_update_progress.FINISHED)
File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 542, in __exit__
self.device.Cleanup()
File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 665, in Cleanup
self.BaseRunCommand(cmd, **kwargs)
File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 830, in BaseRunCommand
return self.GetAgent().RemoteSh(cmd, **kwargs)
File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 340, in RemoteSh
raise SSHConnectionError(e.result.error)
SSHConnectionError: ssh: connect to host chromeos4-row2-rack13-host17 port 22: Connection timed out
Then we try the provision again, but this time using the IP address instead of the DUT name.
12/20 15:45:36.062 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/kill_au_proc?host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:57.390 DEBUG| dev_server:1902| AU failed, trying IP instead of hostname: 100.115.193.135
12/20 15:45:57.391 DEBUG| dev_server:1847| Start CrOS auto-update for host chromeos4-row2-rack13-host17 at 2 time(s).
12/20 15:45:57.391 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/cros_au?build_name=kip-paladin/R57-9106.0.0-rc2&async=True&force_update=True&host_name=100.115.193.135&full_update=False"''
12/20 15:45:58.603 INFO | dev_server:1667| Received response from devserver for cros_au call: '[true, 27713]'
12/20 15:45:58.604 DEBUG| dev_server:1782| start process 27713 for auto_update in devserver
12/20 15:45:58.604 DEBUG| dev_server:1688| Check the progress for auto-update process 27713
12/20 15:45:58.604 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:45:59.781 DEBUG| dev_server:1723| Current CrOS auto-update status: CrOS update is just started.
12/20 15:46:09.810 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:46:11.034 DEBUG| dev_server:1723| Current CrOS auto-update status: CrOS update is just started.
12/20 15:46:21.056 DEBUG| base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:46:22.282 DEBUG| dev_server:1785| Failed to trigger auto-update process on devserver
This fails again, but never comes out of the "CrOS update is just started" state. This suggests that the devserver is no longer able to contact the DUT. Nevertheless we try to collect logs again, and fail, at least twice, which seems to confirm that the DUT is down. After considerable barfing, autotest contacts the DUT directly:
12/20 15:49:25.621 INFO | crashcollect:0154| Collecting crash information...
12/20 15:49:25.628 DEBUG| ssh_host:0177| Running (ssh) 'python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)''
12/20 15:49:25.638 INFO | abstract_ssh:0788| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_iX1jpossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack13-host17'
12/20 15:49:25.638 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_iX1jpossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack13-host17'
12/20 15:49:26.517 INFO | site_crashcollect:0212| There are no orphaned crashes; deleting /usr/local/autotest/results/hosts/chromeos4-row2-rack13-host17/357834-provision/20162012153838/crashinfo.chromeos4-row2-rack13-host17
12/20 15:49:26.522 DEBUG| ssh_host:0177| Running (ssh) 'true'
12/20 15:49:26.524 DEBUG| monitors:0191| ts_mon: sending 16 metrics to PubSub
12/20 15:49:26.675 INFO | crashcollect:0206| chromeos4-row2-rack13-host17 already up, collecting crash info
The DUT looks fine now. So we'd like to know what happened to the DUT between 1545 and 1549.
We look at sysinfo.tgz in the bucket. console-ramoops shows no kernel crashes. /var/log/messages ends at 1544.
Bus is arriving. More later.
Comment 1 by semenzato@chromium.org
, Dec 21 2016