repair should recover DUT logs between rebooting and powerwashing
Issue description
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row3-rack9-host2/433415-repair/20171702062819/debug/ autotest.DEBUG
From my reading of this log, and from the fact that the job bucket doesn't have DUT logs, I am guessing that after many attempts to connect to the DUT, the servo reboots it and then immediately powerwashes it, without collecting the logs. This makes it impossible to diagnose the reason for the DUT being unreachable. I am not sure exactly how the reboot is done, but we should try the triple Alt+VolumeUp+X first, so console-ramoops will have a kernel stack if the kernel is hung. If that doesn't work, then an EC reset will do.
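For illustration, here is a rough sketch of the reboot order I have in mind. This is only a sketch: the helpers press_sysrq_x, wait_for_ssh and ec_reset are placeholders, not the actual servo/Autotest API.

```python
# Hypothetical sketch only -- press_sysrq_x, wait_for_ssh and ec_reset are
# placeholder helpers, not real servo/Autotest calls.
import logging


def revive_dut_preserving_ramoops(servo, host, ssh_timeout=120):
    """Prefer a reboot that leaves a kernel stack in console-ramoops."""
    # Alt+VolumeUp+X is supposed to make the kernel panic, so the console
    # log (including the stack of a hung kernel) survives the reboot in
    # /sys/fs/pstore/console-ramoops.
    logging.info('Trying Alt+VolumeUp+X so console-ramoops keeps a stack.')
    press_sysrq_x(servo)
    if wait_for_ssh(host, timeout=ssh_timeout):
        return 'recovered via sysrq-x; check console-ramoops for the stack'

    # Fall back to a hard EC reset, which loses the in-memory evidence.
    logging.info('sysrq-x did not bring the DUT back; doing an EC reset.')
    ec_reset(servo)
    wait_for_ssh(host, timeout=ssh_timeout)
    return 'recovered via EC reset; ramoops may not have a useful stack'
```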
Feb 17 2017
Feb 22 2017
Good fixit candidate.
Feb 22 2017
Does "fixit" mean it won't get fixed for a while? Because this is pretty important for figuring out how many provisioning failures are caused by bad ethernet dongles.
Feb 23 2017
I am not even sure that this bug is correct. Are these logs already collected, but I am not looking in the right place? If that's the case, maybe the directory containing them needs to be named in a suggestive way, maybe ending in "_dut_logs".
Mar 2 2017
In cases where we recover an offline DUT by power cycling it, we will gather logs. In other cases, especially cases where the DUT was online at the start of repair, we will not gather logs. We should gather logs according to a useful and predictable rule.
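To make the rule concrete, something along these lines is what I would expect. This is just a sketch, not the current repair code: the action tags and the collect_dut_logs helper are made-up names; host.is_up() is the usual Autotest host check.

```python
# Sketch of one possible rule, not the current repair code.  The action tags
# and collect_dut_logs are hypothetical; host.is_up() is the usual Autotest
# reachability check.

DESTRUCTIVE_ACTIONS = {'powerwash', 'usb', 'firmware'}


def maybe_collect_logs(host, action_tag, results_dir):
    """Gather /var/log before any repair action that can destroy it."""
    if action_tag not in DESTRUCTIVE_ACTIONS:
        return
    if not host.is_up():
        # Nothing we can do for an unreachable DUT except note the fact.
        return
    collect_dut_logs(host, results_dir)
```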
Mar 2 2017
Here's what happened in this case. There was a provision attempt just before the repair:
2017-02-17 06:28:22 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433415-repair/
2017-02-17 06:23:48 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433401-provision/
In the provision job, the DUT is already offline from the very first attempt:
02/17 06:23:52.879 DEBUG| ssh_host:0284| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/17 06:23:52.951 INFO | abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:52.951 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:58.014 INFO | abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:25:01.341 ERROR| base_utils:0280| [stderr] ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out
In the repair job, first we contact the servo, which responds fine. Then we try again to contact the DUT:
02/17 06:28:41.761 DEBUG| ssh_host:0284| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/17 06:28:41.768 INFO | abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_7GyF8yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:28:41.768 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_7GyF8yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:28:46.832 INFO | abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:29:50.169 ERROR| base_utils:0280| [stderr] ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out
We keep retrying like this for over ten minutes:
02/17 06:40:49.138 INFO | abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_EizXOossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:40:49.138 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_EizXOossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:40:54.205 INFO | abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:41:11.375 DEBUG| abstract_ssh:0587| Host chromeos4-row3-rack9-host2 is now up
02/17 06:41:11.375 INFO | repair:0327| Verifying this condition: host is available via ssh
02/17 06:41:11.375 DEBUG| ssh_host:0284| Running (ssh) 'true'
02/17 06:41:11.385 INFO | abstract_ssh:0795| Master ssh connection to chromeos4-row3-rack9-host2 is down.
Wait, what's that "Host ... is now up" right after the ssh connection timeout????
Anyway, probably just poor logging, because the DUT did come up. But then:
02/17 06:41:11.385 DEBUG| abstract_ssh:0756| Nuking master_ssh_job.
02/17 06:41:12.391 DEBUG| abstract_ssh:0762| Cleaning master_ssh_tempdir.
02/17 06:41:12.392 INFO | abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_YXZEBWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:41:12.392 DEBUG| base_utils:0185| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_YXZEBWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:41:13.786 INFO | server_job:0184| GOOD ---- verify.ssh timestamp=1487342473 localtime=Feb 17 06:41:13
02/17 06:41:13.787 INFO | repair:0327| Verifying this condition: The DUT is plugged in to AC power
02/17 06:41:13.787 DEBUG| ssh_host:0284| Running (ssh) 'power_supply_info'
02/17 06:41:14.170 DEBUG| base_utils:0280| [stdout] Device: Line Power
02/17 06:41:14.171 DEBUG| base_utils:0280| [stdout] path: /sys/class/power_supply/AC
02/17 06:41:14.171 DEBUG| base_utils:0280| [stdout] online: yes
...
02/17 06:41:14.175 INFO | repair:0327| Verifying this condition: The most recent AU attempt on this DUT succeeded
02/17 06:41:14.175 DEBUG| ssh_host:0284| Running (ssh) 'test -f /var/tmp/provision_failed'
02/17 06:41:14.528 ERROR| repair:0332| Failed: The most recent AU attempt on this DUT succeeded
So at this point we know that the latest provision failed. Is that not a good enough reason to save the syslog? Apparently it is a good enough reason to run a powerwash:
02/17 06:41:28.383 INFO | repair:0105| Attempting this repair action: Powerwash and then re-install the stable build via AU
02/17 06:41:28.384 DEBUG| repair:0106| Repairing because these triggers failed:
02/17 06:41:28.384 DEBUG| repair:0108| The most recent AU attempt on this DUT succeeded
02/17 06:41:28.384 INFO | server_job:0184| START ---- repair.powerwash timestamp=1487342488 localtime=Feb 17 06:41:28
Mar 2 2017
So my point is: since a powerwash destroys the syslog, is that not a good enough reason to save it first?
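Concretely, something as small as this before the powerwash step would be enough. A sketch only: host.get_file() is the standard Autotest way to pull files off a DUT, but where exactly this would hook into the powerwash repair action is an assumption on my part.

```python
# Sketch only: the hook point into the powerwash repair action is assumed,
# but host.get_file() is the normal Autotest call for copying files off a DUT.
import logging


def save_syslog_before_powerwash(host, results_dir):
    """Copy logs the powerwash is about to destroy into the repair results."""
    try:
        # /var/log lives on the stateful partition, which powerwash wipes.
        host.get_file('/var/log/', results_dir, delete_dest=False)
    except Exception:
        logging.warning('Could not save DUT logs before powerwash; '
                        'continuing with the repair anyway.', exc_info=True)
```

If pulling all of /var/log is too much load, limiting it to /var/log/messages and the update_engine logs would still cover the provisioning failures we care about.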
Mar 22 2017
Basically I think the behavior described in comment #6 is not the desirable one, but whether we change it also depends on the extra load caused by always collecting logs before a powerwash.
Mar 31 2018
Still relevant, or relevant to new project?
Comment 1 by shuqianz@chromium.org, Feb 17 2017