Surface exactly when SSH connection failed during provision
Issue description
We reboot the DUT a few times during provision, and we often fail to SSH into the DUT after such a reboot. Exactly when this SSH failure happens is critical -- if it happens before any update, it is likely to be an infra issue. If it happens after the rootfs update, it is more likely to be a problem with the image.
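The triage rule above can be written down as a small lookup. This is only a sketch: the stage names and the `likely_cause` helper are hypothetical and do not exist in autotest or chromite; only the two cases (before any update, after rootfs update) come from the description.

```python
# Hypothetical stage names; only these two cases are stated in the description.
LIKELY_CAUSE = {
    'before_update': 'infra',        # SSH lost before anything was written to the DUT
    'after_rootfs_update': 'image',  # SSH lost after booting into the new rootfs
}


def likely_cause(stage):
    """Return the likely owner of an SSH failure seen at the given stage."""
    # Any stage not covered by the description is left unclassified.
    return LIKELY_CAUSE.get(stage, 'unknown')
```

A stage that the description does not cover maps to 'unknown' rather than to a guess.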
Currently, all these failures look identical in status.log:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 806, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 121, in run_once
    with_cheets=with_cheets)
  File "/usr/local/autotest/server/afe_utils.py", line 124, in machine_install_and_update_labels
    *args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 815, in machine_install_by_devserver
    force_original=force_original)
  File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2355, in auto_update
    error_msg % (host_name, real_error))
DevServerException: CrOS auto-update failed for host chromeos2-row7-rack6-host19: 0) SSHConnectionError: ssh: connect to host chromeos2-row7-rack6-host19 port 22: Connection timed out
, 1) SSHConnectionError: ssh: connect to host 100.115.230.65 port 22: Connection timed out
Add a failure message that says exactly when the SSH connection failed, so that it is easier to classify these failure types.
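One way to attach the "when" to the error is to wrap each reboot step so that an SSH failure is re-raised with the stage name. The following is a minimal Python 3 sketch, not the actual chromite or autotest code: `ProvisionSSHError`, `ssh_stage`, `reboot_and_wait` and `provision` are hypothetical names, and `SSHConnectionError` here is a local stand-in for the error shown in the traceback.

```python
import contextlib


class SSHConnectionError(Exception):
    """Local stand-in for the SSH error seen in the traceback."""


class ProvisionSSHError(Exception):
    """SSH failure annotated with the provision stage (hypothetical name)."""

    def __init__(self, stage, cause):
        super().__init__('SSH failed %s: %s' % (stage, cause))
        self.stage = stage
        self.cause = cause


@contextlib.contextmanager
def ssh_stage(stage):
    """Re-raise an SSHConnectionError from the wrapped step with its stage."""
    try:
        yield
    except SSHConnectionError as e:
        raise ProvisionSSHError(stage, e) from e


def reboot_and_wait(host, fail=False):
    # Placeholder for the real reboot-and-reconnect step; `fail` simulates
    # the DUT not coming back on the network.
    if fail:
        raise SSHConnectionError(
            'ssh: connect to host %s port 22: Connection timed out' % host)


def provision(host, fail_at=None):
    # Each reboot is wrapped separately so the error names the stage.
    with ssh_stage('before any update'):
        reboot_and_wait(host, fail=(fail_at == 'pre'))
    # ... the rootfs update would run here ...
    with ssh_stage('after rootfs update'):
        reboot_and_wait(host, fail=(fail_at == 'post'))
```

With this shape, a failure in the second reboot produces a message beginning "SSH failed after rootfs update: ..." instead of a bare connection timeout, and the original error stays available as the exception's cause.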
Oct 24 2017
Oct 24 2017
Oct 27 2017
Justification: Scope is correct for Chase -- go find all the places in the actual provision code where SSH can time out, and create new exceptions that tell us what happened to the DUT and when.
Impact: As the deputy this week, I spent 75%+ of my time looking at provision failures. This would have cut that time by at least 25%. It will also surface the correct error all the way back to the builders, giving users more than "DevServerException" and "SSH timed out" -- both of which they ignore, not knowing where to dig deeper. So yes, this is about reporting; but I claim that this reporting is central to being able to deputy.
Oct 30 2017
Nov 6 2017
CL in flight.
Nov 7 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/24b04ea4a0891030da74cbd3bd17d1d644a1ff74

commit 24b04ea4a0891030da74cbd3bd17d1d644a1ff74
Author: Xixuan Wu <xixuan@chromium.org>
Date: Tue Nov 07 01:26:21 2017

    auto_updater: Add error logging for different reboots.

    BUG= chromium:777923
    TEST=cros flash & ds.auto_update()

    Change-Id: I427c58a63b601e476dc0703d041c53e0ad922569
    Reviewed-on: https://chromium-review.googlesource.com/753535
    Commit-Ready: Xixuan Wu <xixuan@chromium.org>
    Tested-by: Xixuan Wu <xixuan@chromium.org>
    Reviewed-by: David Haddock <dhaddock@chromium.org>

[modify] https://crrev.com/24b04ea4a0891030da74cbd3bd17d1d644a1ff74/lib/remote_access.py
[modify] https://crrev.com/24b04ea4a0891030da74cbd3bd17d1d644a1ff74/lib/auto_updater.py
Nov 13 2017
Comment 1 by pprabhu@chromium.org, Oct 24 2017