New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 676219 link

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

kitchen sink not included in test logs

Project Member Reported by semenzato@chromium.org, Dec 21 2016

Issue description

(I apologize for the weak attempt at humor in the summary, but one has to work hard at staying motivated.  I will try to control myself.)

Executive summary: during provision, a DUT appears to be unreachable for about 4 minutes, then it comes back up, and logs are collected for failure analysis.  However, none of the system logs covers the critical time, so we don't know if the device hung and crashed, or what else.

Case at hand:

https://uberchromegw.corp.google.com/i/chromeos/builders/kip-paladin/builds/248

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/91776597-chromeos-test/chromeos4-row2-rack13-host17/debug/

I have seen this in other cases too, and it is very time-consuming to go through all possibly relevant logs.

Here's the story:

DUT is c4-r2-r13-h17 and is alive and well at 1539.

12/20 15:39:15.725 DEBUG|          ssh_host:0177| Running (ssh) 'true'
12/20 15:39:15.878 DEBUG|      abstract_ssh:0577| Host chromeos4-row2-rack13-host17 is now up

Devserver 100.115.219.134 is selected.

2/20 15:39:49.882 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/check_health?"''
12/20 15:39:51.305 DEBUG|         cros_host:0737| Resolved devserver for auto-update: http://100.115.219.134:8082

Then autoserv tells devserver to make a curl-rpc to itself to start the update on the DUT.  (This seems overcomplicated---autoserv should tell the DUT directly.)  The RPC is asynchronous.

12/20 15:39:51.306 DEBUG|        dev_server:1847| Start CrOS auto-update for host chromeos4-row2-rack13-host17 at 1 time(s).
12/20 15:39:51.306 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/cros_au?build_name=kip-paladin/R57-9106.0.0-rc2&async=True&force_update=True&host_name=chromeos4-row2-rack13-host17&full_update=False"''
12/20 15:39:52.603 INFO |        dev_server:1667| Received response from devserver for cros_au call: '[true, 16823]'
12/20 15:39:52.604 DEBUG|        dev_server:1782| start process 16823 for auto_update in devserver

Then autoserv starts checking update progress.

12/20 15:39:52.604 DEBUG|        dev_server:1688| Check the progress for auto-update process 16823
12/20 15:39:52.604 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=16823&host_name=chromeos4-row2-rack13-host17&full_update=False"''
12/20 15:39:53.656 DEBUG|          monitors:0191| ts_mon: sending 13 metrics to PubSub
12/20 15:39:53.869 DEBUG|        dev_server:1723| Current CrOS auto-update status: CrOS update is just started.

Progress is happening because a little later the auto-update status changes.

12/20 15:40:28.684 DEBUG|        dev_server:1723| Current CrOS auto-update status: transfer rootfs update package
...

That goes on for a while, but 5 minutes later we get a failure, and start cleaning up.  As part of the cleanup we attempt to collect logs from the DUT, but that fails:

12/20 15:45:33.649 DEBUG|        dev_server:1785| Failed to trigger auto-update process on devserver
12/20 15:45:33.650 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/handler_cleanup?pid=16823&host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:34.844 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/collect_cros_au_log?pid=16823&host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:36.061 DEBUG|        dev_server:1617| Saving auto-update logs into /usr/local/autotest/results/hosts/chromeos4-row2-rack13-host17/357834-provision/20162012153838/autoupdate_logs/CrOS_update_chromeos4-row2-rack13-host17_16823.log
12/20 15:45:36.062 DEBUG|        dev_server:1884| Exception raised on auto_update attempt #1:
 Traceback (most recent call last):
  File "/home/chromeos-test/chromiumos/src/platform/dev/cros_update.py", line 227, in TriggerAU
    self._WriteAUStatus(cros_update_progress.FINISHED)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 542, in __exit__
    self.device.Cleanup()
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 665, in Cleanup
    self.BaseRunCommand(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 830, in BaseRunCommand
    return self.GetAgent().RemoteSh(cmd, **kwargs)
  File "/home/chromeos-test/chromiumos/chromite/lib/remote_access.py", line 340, in RemoteSh
    raise SSHConnectionError(e.result.error)
SSHConnectionError: ssh: connect to host chromeos4-row2-rack13-host17 port 22: Connection timed out

Then we try the provision again, but this time using the IP address instead of the DUT name.

12/20 15:45:36.062 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/kill_au_proc?host_name=chromeos4-row2-rack13-host17"''
12/20 15:45:57.390 DEBUG|        dev_server:1902| AU failed, trying IP instead of hostname: 100.115.193.135
12/20 15:45:57.391 DEBUG|        dev_server:1847| Start CrOS auto-update for host chromeos4-row2-rack13-host17 at 2 time(s).
12/20 15:45:57.391 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/cros_au?build_name=kip-paladin/R57-9106.0.0-rc2&async=True&force_update=True&host_name=100.115.193.135&full_update=False"''
12/20 15:45:58.603 INFO |        dev_server:1667| Received response from devserver for cros_au call: '[true, 27713]'
12/20 15:45:58.604 DEBUG|        dev_server:1782| start process 27713 for auto_update in devserver
12/20 15:45:58.604 DEBUG|        dev_server:1688| Check the progress for auto-update process 27713
12/20 15:45:58.604 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:45:59.781 DEBUG|        dev_server:1723| Current CrOS auto-update status: CrOS update is just started.
12/20 15:46:09.810 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:46:11.034 DEBUG|        dev_server:1723| Current CrOS auto-update status: CrOS update is just started.
12/20 15:46:21.056 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.134 'curl "http://100.115.219.134:8082/get_au_status?build_name=kip-paladin/R57-9106.0.0-rc2&force_update=True&pid=27713&host_name=100.115.193.135&full_update=False"''
12/20 15:46:22.282 DEBUG|        dev_server:1785| Failed to trigger auto-update process on devserver

This fails again, but never comes out of the "CrOS update is just started" state.  This suggests that the devserver is no longer able to contact the DUT.  Nevertheless we try to collect logs again, and fail, at least twice, which seems to confirm that the DUT is down.  After considerable barfing, autotest contacts the DUT directly:

12/20 15:49:25.621 INFO |      crashcollect:0154| Collecting crash information...
12/20 15:49:25.628 DEBUG|          ssh_host:0177| Running (ssh) 'python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)''
12/20 15:49:25.638 INFO |      abstract_ssh:0788| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_iX1jpossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack13-host17'
12/20 15:49:25.638 DEBUG|        base_utils:0185| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_iX1jpossh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row2-rack13-host17'
12/20 15:49:26.517 INFO | site_crashcollect:0212| There are no orphaned crashes; deleting /usr/local/autotest/results/hosts/chromeos4-row2-rack13-host17/357834-provision/20162012153838/crashinfo.chromeos4-row2-rack13-host17
12/20 15:49:26.522 DEBUG|          ssh_host:0177| Running (ssh) 'true'
12/20 15:49:26.524 DEBUG|          monitors:0191| ts_mon: sending 16 metrics to PubSub
12/20 15:49:26.675 INFO |      crashcollect:0206| chromeos4-row2-rack13-host17 already up, collecting crash info

The DUT looks fine now.  So we'd like to know what happened to the DUT between 1545 and 1549.

We look at sysinfo.tgz in the bucket.  console-ramoops shows no kernel crashes.  /var/log/messages ends at 1544.

Bus is arriving.  More later.





 
Cc: vpalatin@chromium.org
So we look at other logs.

autoupdate_logs.tgz

this has two logs from outside the DUT (probably from the drone), which confirm what we already knew (DUT is unreachable)

crashinfo-<DUT>.tgz

This contains console-ramoops, /var/log/messages etc apparently from the drone side (Linux version 3.10.18 (chrome-bot@build107-m2)).

provision_AutoUpdate.tgz

These look the same as autoserv.DEBUG etc.

So there is no log from the DUT after DUT recovery.  How do I know that the DUT recovered?  Because I checked, and it was up and running (but its syslog didn't go far back enough).

Did we recover this later, by power-cycling it via the servo?  Do we collect logs then, and where can we find them?


Status: Archived (was: Untriaged)
This bug has not been touched in over a year.  It is probably no longer relevant.

Sign in to add a comment