Issue 693683


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




repair should recover DUT logs between rebooting and powerwashing

Project Member Reported by semenzato@chromium.org, Feb 17 2017

Issue description

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row3-rack9-host2/433415-repair/20171702062819/debug/

autotest.DEBUG

From my reading of this log, and the fact that the job bucket doesn't have DUT logs, I am guessing that after many attempts to connect to the DUT, the servo reboots it and then immediately powerwashes it, without collecting the logs. This makes it impossible to diagnose why the DUT was unreachable.

In fact I am not sure how the reboot is done, but we should try the triple Alt+VolumeUp+X first, so that console-ramoops will contain a kernel stack if the kernel is hung.  If that doesn't work, an EC reset will do.
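A minimal sketch of that escalation order, assuming a servo object that can emulate the keyboard and issue an EC reset. The servo method names (press_alt_volup_x, ec_reset) are placeholders, not real Autotest servo APIs; host.wait_up is assumed to behave like the usual Autotest host method. The point is the ordering, not the exact calls:

    import logging
    import time


    def recover_hung_dut(servo, host, ssh_timeout=60):
        """Try a panic-inducing key combo first, then fall back to EC reset.

        The key combo makes a hung kernel dump its stack into
        console-ramoops, so it can be inspected after reboot.  A bare
        EC reset would lose that information.
        """
        # 1. Emulate the triple Alt+VolumeUp+X through the servo
        #    (placeholder method name) to panic and reboot the kernel.
        servo.press_alt_volup_x(presses=3)
        if host.wait_up(timeout=ssh_timeout):
            logging.info('DUT rebooted via keyboard panic; ramoops preserved')
            return True

        # 2. The kernel did not respond to the keyboard; hard reset via the EC
        #    (placeholder method name).
        logging.warning('Keyboard panic failed; falling back to EC reset')
        servo.ec_reset()
        time.sleep(5)
        return host.wait_up(timeout=ssh_timeout)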

 
Cc: jrbarnette@chromium.org
Components: Infra>Client>ChromeOS
Labels: OS-Chrome
Owner: ----
Status: Available (was: Untriaged)

Comment 3 by aut...@google.com, Feb 22 2017

Labels: Hotlist-Fixit
Good fixit candidate. 
Does "fixit" mean it won't get fixed for a while?  This is pretty important for figuring out how many provisioning failures are caused by bad Ethernet dongles.
I am not even sure that this bug is correct.  Are these logs already collected and I am just not looking in the right place?

If that's the case, maybe the directory containing them needs a more suggestive name, for example one ending in "_dut_logs".

In cases where we recover an offline DUT by power cycling it, we will gather logs.  In other cases, especially cases where the DUT was online at the start of repair, we will not gather logs.

We should gather logs according to a useful and predictable rule.
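One possible rule, sketched below purely for illustration (the function and parameter names are hypothetical, not existing Autotest code): gather logs whenever the DUT was offline when repair started, and always before any repair action that destroys on-DUT state.

    def should_gather_logs(dut_was_offline_at_start, action_is_destructive):
        """A predictable rule: gather logs if the DUT was down when repair
        started, or if the next repair action will destroy on-DUT state
        (powerwash, re-install), regardless of how the DUT came back up."""
        return dut_was_offline_at_start or action_is_destructive

    # e.g. before running the powerwash action:
    # if should_gather_logs(dut_was_offline_at_start=True, action_is_destructive=True):
    #     ... copy /var/log off the DUT into the results directory ...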

Here's what happens here.  There is a provision attempt before the repair.

 
    2017-02-17 06:28:22  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433415-repair/
    2017-02-17 06:23:48  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row3-rack9-host2/433401-provision/

In the provision job, the DUT has been offline since the first attempt:

02/17 06:23:52.879 DEBUG|          ssh_host:0284| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/17 06:23:52.951 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:52.951 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_ajPK9Gssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:23:58.014 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:25:01.341 ERROR|        base_utils:0280| [stderr] ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out

In the repair job, first we contact the servo, which responds fine.  Then we try again to contact the DUT:

02/17 06:28:41.761 DEBUG|          ssh_host:0284| Running (ssh) 'test ! -e /var/log/messages || cp -f /var/log/messages /var/tmp/messages.autotest_start'
02/17 06:28:41.768 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_7GyF8yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:28:41.768 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_7GyF8yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:28:46.832 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:29:50.169 ERROR|        base_utils:0280| [stderr] ssh: connect to host chromeos4-row3-rack9-host2 port 22: Connection timed out

We keep trying for a ridiculous amount of time:

02/17 06:40:49.138 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_EizXOossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:40:49.138 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_EizXOossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:40:54.205 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
02/17 06:41:11.375 DEBUG|      abstract_ssh:0587| Host chromeos4-row3-rack9-host2 is now up
02/17 06:41:11.375 INFO |            repair:0327| Verifying this condition: host is available via ssh
02/17 06:41:11.375 DEBUG|          ssh_host:0284| Running (ssh) 'true'
02/17 06:41:11.385 INFO |      abstract_ssh:0795| Master ssh connection to chromeos4-row3-rack9-host2 is down.

Wait, what's that "Host ... is now up" right after the ssh connection timeout????

Anyway, this is probably just poor logging, because the DUT did eventually come up.  But then:

02/17 06:41:11.385 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
02/17 06:41:12.391 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
02/17 06:41:12.392 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_YXZEBWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:41:12.392 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_YXZEBWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row3-rack9-host2'
02/17 06:41:13.786 INFO |        server_job:0184| 		GOOD	----	verify.ssh	timestamp=1487342473	localtime=Feb 17 06:41:13	
02/17 06:41:13.787 INFO |            repair:0327| Verifying this condition: The DUT is plugged in to AC power
02/17 06:41:13.787 DEBUG|          ssh_host:0284| Running (ssh) 'power_supply_info'
02/17 06:41:14.170 DEBUG|        base_utils:0280| [stdout] Device: Line Power
02/17 06:41:14.171 DEBUG|        base_utils:0280| [stdout]   path:                    /sys/class/power_supply/AC
02/17 06:41:14.171 DEBUG|        base_utils:0280| [stdout]   online:                  yes

...

02/17 06:41:14.175 INFO |            repair:0327| Verifying this condition: The most recent AU attempt on this DUT succeeded
02/17 06:41:14.175 DEBUG|          ssh_host:0284| Running (ssh) 'test -f /var/tmp/provision_failed'
02/17 06:41:14.528 ERROR|            repair:0332| Failed: The most recent AU attempt on this DUT succeeded

So here we know that the latest provision failed.  Is this not a good enough reason to save the syslog?  It is apparently a good enough reason to run a powerwash:

02/17 06:41:28.383 INFO |            repair:0105| Attempting this repair action: Powerwash and then re-install the stable build via AU
02/17 06:41:28.384 DEBUG|            repair:0106| Repairing because these triggers failed:
02/17 06:41:28.384 DEBUG|            repair:0108|     The most recent AU attempt on this DUT succeeded
02/17 06:41:28.384 INFO |        server_job:0184| 	START	----	repair.powerwash	timestamp=1487342488	localtime=Feb 17 06:41:28	

So my point is: since a powerwash destroys the syslog, is that not a good enough reason to save it first?
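As a hedged sketch of what "save it first" could look like: before the powerwash action runs, do a best-effort copy of the interesting /var/log files into the job's results directory, so they end up in the chromeos-autotest-results bucket. The wrapper function, the file list, and the destination layout below are illustrative only; host.get_file is assumed to behave like the usual Autotest host method.

    import logging
    import os


    def save_logs_then_powerwash(host, results_dir, powerwash_action):
        """Collect DUT logs, then run the (destructive) powerwash action."""
        dest = os.path.join(results_dir, 'pre_powerwash_logs')
        os.makedirs(dest, exist_ok=True)
        for path in ('/var/log/messages', '/var/log/eventlog.txt'):
            try:
                host.get_file(path, dest)
            except Exception:
                # Log collection is best-effort; never block the repair on it.
                logging.exception('Failed to collect %s before powerwash', path)
        powerwash_action(host)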

Basically I think that the behavior described in #6 is not the most desirable one, but changing it also depends on the extra load caused by always collecting the logs before a powerwash.

Labels: Logging-improv
Owner: jrbarnette@chromium.org
Status: Assigned (was: Available)
Still relevant, or relevant to new project?
