New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 696668 link

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 695529



Sign in to add a comment

ssh connection failure after post-stateful-update reboot

Project Member Reported by akes...@chromium.org, Feb 27 2017

Issue description

Example build: https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/1756

Test: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=103257581

Provision job: http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row9-rack11-host22/707665-provision/

Symptom:

2017/02/24 18:40:16.669 DEBUG|      auto_updater:1044| Start post check for stateful update...
2017/02/24 18:40:16.669 INFO |     remote_access:0395| Rebooting chromeos4-row9-rack11-host22...
2017/02/24 18:40:16.670 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmplammUR/testing_rsa root@chromeos4-row9-rack11-host22 -- cat /proc/sys/kernel/random/boot_id
2017/02/24 18:40:16.921 DEBUG|    cros_build_lib:0613| (stdout):
ea9e9887-7813-42bb-8810-525136095258

2017/02/24 18:40:16.922 DEBUG|    cros_build_lib:0615| (stderr):
Warning: Permanently added 'chromeos4-row9-rack11-host22,100.115.200.155' (RSA) to the list of known hosts.

2017/02/24 18:40:16.922 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmplammUR/testing_rsa root@chromeos4-row9-rack11-host22 -- reboot
2017/02/24 18:40:22.202 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmplammUR/testing_rsa root@chromeos4-row9-rack11-host22 -- cat /proc/sys/kernel/random/boot_id
2017/02/24 18:40:27.272 DEBUG|    cros_build_lib:0615| (stderr):

<many retries later...>

ssh: connect to host chromeos4-row9-rack11-host22 port 22: Connection timed out

2017/02/24 18:48:15.824 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=2' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=2' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmplammUR/testing_rsa root@chromeos4-row9-rack11-host22 -- cat /proc/sys/kernel/random/boot_id
2017/02/24 18:48:20.894 DEBUG|    cros_build_lib:0615| (stderr):
ssh: connect to host chromeos4-row9-rack11-host22 port 22: Connection timed out

2017/02/24 18:48:20.895 ERROR|    cros_build_lib:0661| Reboot has not completed after 480 seconds; giving up.
2017/02/24 18:48:20.895 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmplammUR/testing_rsa root@chromeos4-row9-rack11-host22 -- rm -rf /mnt/stateful_partition/unencrypted/preserve/cros-update/tmp.A1azBJgH9E
2017/02/24 18:50:24.061 ERROR|     remote_access:0856| Error connecting to device chromeos4-row9-rack11-host22
2017/02/24 18:50:24.061 DEBUG|       cros_update:0253| Error happens in CrOS auto-update: SSHConnectionError('ssh: connect to host chromeos4-row9-rack11-host22 port 22: Connection timed out\r\n',)
 
Note, both of the above failures were same rack. Maybe same devserver? chingcodes@ will verify

Comment 3 by nxia@google.com, Feb 27 2017

ssh connection failed after rootfs update for veyron_speedy-paladin.


https://luci-milo.appspot.com/buildbot/chromeos/veyron_speedy-paladin/4506

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=103618475

2017/02/26 14:09:01.599 DEBUG|      auto_updater:1055| Start pre-setup for rootfs update...
2017/02/26 14:09:01.599 INFO |     remote_access:0395| Rebooting chromeos4-row4-rack11-host6...
2017/02/26 14:09:01.600 DEBUG|    cros_build_lib:0564| RunCommand: ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpClRd_4/testing_rsa root@chromeos4-row4-rack11-host6 -- cat /proc/sys/kernel/random/boot_id
devservers are diffrent
Another instance:

https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/1769#
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row9-rack11-host22/711281-provision

Per jrbarnette's hint, checked repair log at 
http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row9-rack11-host22/711345-repair/

which seems to indicate no problems (other than servod not running?)

START	----	repair	timestamp=1488087142	localtime=Feb 25 21:32:22	
	GOOD	----	verify.ssh	timestamp=1488087143	localtime=Feb 25 21:32:23	
	GOOD	----	verify.brd_config	timestamp=1488087143	localtime=Feb 25 21:32:23	
	GOOD	----	verify.ser_config	timestamp=1488087144	localtime=Feb 25 21:32:24	
	GOOD	----	verify.job	timestamp=1488087144	localtime=Feb 25 21:32:24	
	FAIL	----	verify.servod	timestamp=1488087175	localtime=Feb 25 21:32:55	Servo initialize timed out.
	START	----	repair.restart	timestamp=1488087175	localtime=Feb 25 21:32:55	
		GOOD	----	verify.ssh	timestamp=1488087197	localtime=Feb 25 21:33:17	
		GOOD	----	verify.brd_config	timestamp=1488087197	localtime=Feb 25 21:33:17	
		GOOD	----	verify.ser_config	timestamp=1488087197	localtime=Feb 25 21:33:17	
		GOOD	----	verify.job	timestamp=1488087198	localtime=Feb 25 21:33:18	
		GOOD	----	verify.servod	timestamp=1488087203	localtime=Feb 25 21:33:23	
		GOOD	----	verify.pwr_button	timestamp=1488087203	localtime=Feb 25 21:33:23	
		GOOD	----	verify.lid_open	timestamp=1488087203	localtime=Feb 25 21:33:23	
	END GOOD	----	repair.restart	timestamp=1488087203	localtime=Feb 25 21:33:23	
	GOOD	----	verify.update	timestamp=1488087207	localtime=Feb 25 21:33:27	
	GOOD	----	verify.PASS	timestamp=1488087207	localtime=Feb 25 21:33:27	
	GOOD	----	verify.ssh	timestamp=1488087213	localtime=Feb 25 21:33:33	
	GOOD	----	verify.power	timestamp=1488087214	localtime=Feb 25 21:33:34	
	GOOD	----	verify.fwstatus	timestamp=1488087214	localtime=Feb 25 21:33:34	
	GOOD	----	verify.good_au	timestamp=1488087214	localtime=Feb 25 21:33:34	
	GOOD	----	verify.writable	timestamp=1488087214	localtime=Feb 25 21:33:34	
	GOOD	----	verify.tpm	timestamp=1488087215	localtime=Feb 25 21:33:35	
	GOOD	----	verify.ext4	timestamp=1488087215	localtime=Feb 25 21:33:35	
	GOOD	----	verify.rwfw	timestamp=1488087217	localtime=Feb 25 21:33:37	
	GOOD	----	verify.python	timestamp=1488087217	localtime=Feb 25 21:33:37	
	GOOD	----	verify.cros	timestamp=1488087220	localtime=Feb 25 21:33:40	
	GOOD	----	verify.PASS	timestamp=1488087220	localtime=Feb 25 21:33:40	
	INFO	----	repair	timestamp=1488087220	localtime=Feb 25 21:33:40	Can't repair label 'pool:cq'.
	INFO	----	repair	timestamp=1488087220	localtime=Feb 25 21:33:40	Can't repair label 'arc'.
	INFO	----	repair	timestamp=1488087220	localtime=Feb 25 21:33:40	Can't repair label 'board:veyron_minnie'.
	INFO	----	repair	timestamp=1488087220	localtime=Feb 25 21:33:40	Can't repair label 'cros-version:veyron_minnie-paladin/R58-9318.0.0-rc1'.
END GOOD	----	repair	timestamp=1488087220	localtime=Feb 25 21:33:40	chromeos4-row9-rack11-host22 repaired successf


So we believe this falls under the "ssh failed mysteriously, maybe due to network load" category?
Build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13764
Provision logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host13/737004-provision/20172502190114/debug/
Repair logs: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row11-rack11-host13/737072-repair/20172502193357/

provision status.log:

START	----	provision	timestamp=1488078083	localtime=Feb 25 19:01:23	
	GOOD	----	verify.ssh	timestamp=1488078086	localtime=Feb 25 19:01:26	
	GOOD	----	verify.update	timestamp=1488078091	localtime=Feb 25 19:01:31	
	GOOD	----	verify.brd_config	timestamp=1488078091	localtime=Feb 25 19:01:31	
	GOOD	----	verify.ser_config	timestamp=1488078091	localtime=Feb 25 19:01:31	
	GOOD	----	verify.job	timestamp=1488078092	localtime=Feb 25 19:01:32	
	GOOD	----	verify.servod	timestamp=1488078096	localtime=Feb 25 19:01:36	
	GOOD	----	verify.pwr_button	timestamp=1488078096	localtime=Feb 25 19:01:36	
	GOOD	----	verify.lid_open	timestamp=1488078096	localtime=Feb 25 19:01:36	
	GOOD	----	verify.PASS	timestamp=1488078096	localtime=Feb 25 19:01:36	
	START	provision_AutoUpdate	provision_AutoUpdate	timestamp=1488078096	localtime=Feb 25 19:01:36	
		START	----	----	timestamp=1488078117	localtime=Feb 25 19:01:57	
			GOOD	----	sysinfo.before	timestamp=1488078117	localtime=Feb 25 19:01:57	
		END GOOD	----	----	timestamp=1488078117	localtime=Feb 25 19:01:57	
		FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1488080024	localtime=Feb 25 19:33:44	Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row11-rack11-host13: SSHConnectionError: ssh: connect to host chromeos4-row11-rack11-host13 port 22: Connection timed out
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 113, in run_once
      force_full_update=force)
    File "/usr/local/autotest/server/afe_utils.py", line 232, in machine_install_and_update_labels
      *args, **dargs)
    File "/usr/local/autotest/server/hosts/cros_host.py", line 728, in machine_install_by_devserver
      full_update=force_full_update)
    File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2013, in auto_update
      raise DevServerException(error_msg % (host_name, error_list[0]))
  DevServerException: CrOS auto-update failed for host chromeos4-row11-rack11-host13: SSHConnectionError: ssh: connect to host chromeos4-row11-rack11-host13 port 22: Connection timed out
	END FAIL	provision_AutoUpdate	provision_AutoUpdate	timestamp=1488080024	localtime=Feb 25 19:33:44	
END FAIL	----	provision	timestamp=1488080024	localtime=Feb 25 19:33:44	
INFO	----	----	timestamp=1488080024	job_abort_reason=	localtime=Feb 25 19:33:44	

repair status.log:

START	----	repair	timestamp=1488080050	localtime=Feb 25 19:34:10	
	GOOD	----	verify.ssh	timestamp=1488080051	localtime=Feb 25 19:34:11	
	GOOD	----	verify.brd_config	timestamp=1488080051	localtime=Feb 25 19:34:11	
	GOOD	----	verify.ser_config	timestamp=1488080052	localtime=Feb 25 19:34:12	
	GOOD	----	verify.job	timestamp=1488080052	localtime=Feb 25 19:34:12	
	GOOD	----	verify.servod	timestamp=1488080057	localtime=Feb 25 19:34:17	
	GOOD	----	verify.pwr_button	timestamp=1488080057	localtime=Feb 25 19:34:17	
	GOOD	----	verify.lid_open	timestamp=1488080057	localtime=Feb 25 19:34:17	
	GOOD	----	verify.update	timestamp=1488080061	localtime=Feb 25 19:34:21	
	GOOD	----	verify.PASS	timestamp=1488080061	localtime=Feb 25 19:34:21	
	FAIL	----	verify.ssh	timestamp=1488080763	localtime=Feb 25 19:46:03	No answer to ping from chromeos4-row11-rack11-host13
	START	----	repair.rpm	timestamp=1488080763	localtime=Feb 25 19:46:03	
		FAIL	----	repair.rpm	timestamp=1488081006	localtime=Feb 25 19:50:06	chromeos4-row11-rack11-host13 is still offline after powercycling
	END FAIL	----	repair.rpm	timestamp=1488081006	localtime=Feb 25 19:50:06	
	START	----	repair.sysrq	timestamp=1488081006	localtime=Feb 25 19:50:06	
		FAIL	----	repair.sysrq	timestamp=1488081283	localtime=Feb 25 19:54:43	chromeos4-row11-rack11-host13 is still offline after reset.
	END FAIL	----	repair.sysrq	timestamp=1488081283	localtime=Feb 25 19:54:43	
	START	----	repair.servoreset	timestamp=1488081283	localtime=Feb 25 19:54:43	
		GOOD	----	verify.ssh	timestamp=1488081312	localtime=Feb 25 19:55:12	
	END GOOD	----	repair.servoreset	timestamp=1488081312	localtime=Feb 25 19:55:12	
	GOOD	----	verify.fwstatus	timestamp=1488081312	localtime=Feb 25 19:55:12	
	FAIL	----	verify.good_au	timestamp=1488081313	localtime=Feb 25 19:55:13	Last AU on this DUT failed
	START	----	repair.firmware	timestamp=1488081313	localtime=Feb 25 19:55:13	
		FAIL	----	repair.firmware	timestamp=1488081313	localtime=Feb 25 19:55:13	Firmware repair is not applicable to host chromeos4-row11-rack11-host13.
	END FAIL	----	repair.firmware	timestamp=1488081313	localtime=Feb 25 19:55:13	
	GOOD	----	verify.writable	timestamp=1488081313	localtime=Feb 25 19:55:13	
	GOOD	----	verify.tpm	timestamp=1488081314	localtime=Feb 25 19:55:14	
	GOOD	----	verify.ext4	timestamp=1488081315	localtime=Feb 25 19:55:15	
	GOOD	----	verify.power	timestamp=1488081315	localtime=Feb 25 19:55:15	
	GOOD	----	verify.rwfw	timestamp=1488081326	localtime=Feb 25 19:55:26	
	GOOD	----	verify.python	timestamp=1488081326	localtime=Feb 25 19:55:26	
	GOOD	----	verify.cros	timestamp=1488081339	localtime=Feb 25 19:55:39	
	START	----	repair.powerwash	timestamp=1488081339	localtime=Feb 25 19:55:39	
		START	----	reboot	timestamp=1488081339	localtime=Feb 25 19:55:39	
			GOOD	----	reboot.start	timestamp=1488081339	localtime=Feb 25 19:55:39	
			GOOD	----	reboot.verify	timestamp=1488081470	localtime=Feb 25 19:57:50	
		END GOOD	----	reboot	kernel=3.18.0-13924-gccf6bf1	localtime=Feb 25 19:57:51	timestamp=1488081471	
		GOOD	----	verify.ssh	timestamp=1488082318	localtime=Feb 25 20:11:58	
		GOOD	----	verify.tpm	timestamp=1488082319	localtime=Feb 25 20:11:59	
		GOOD	----	verify.good_au	timestamp=1488082319	localtime=Feb 25 20:11:59	
		GOOD	----	verify.ext4	timestamp=1488082320	localtime=Feb 25 20:12:00	
		GOOD	----	verify.power	timestamp=1488082320	localtime=Feb 25 20:12:00	
		GOOD	----	verify.rwfw	timestamp=1488082321	localtime=Feb 25 20:12:01	
		GOOD	----	verify.python	timestamp=1488082321	localtime=Feb 25 20:12:01	
		GOOD	----	verify.cros	timestamp=1488082324	localtime=Feb 25 20:12:04	
	END GOOD	----	repair.powerwash	timestamp=1488082324	localtime=Feb 25 20:12:04	
	GOOD	----	verify.PASS	timestamp=1488082324	localtime=Feb 25 20:12:04	
	INFO	----	repair	timestamp=1488082324	localtime=Feb 25 20:12:04	Can't repair label 'pool:cq'.
	INFO	----	repair	timestamp=1488082324	localtime=Feb 25 20:12:04	Can't repair label 'arc'.
	INFO	----	repair	timestamp=1488082324	localtime=Feb 25 20:12:04	Can't repair label 'board:cyan'.
	INFO	----	repair	timestamp=1488082324	localtime=Feb 25 20:12:04	Can't repair label 'cros-version:cyan-paladin/R58-9317.0.0-rc4'.
END GOOD	----	repair	timestamp=1488082324	localtime=Feb 25 20:12:04	chromeos4-row11-rack11-host13 repaired successfully

Blocking: 695529
I have an SSH timeout with a successful verify afterward.

Test: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=103596819
Provision results: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack11-host7/2702035-provision/20172602084612/debug/
Repair results: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack11-host7/2702276-repair/20172602092128/

Repair status.log:

START	----	repair	timestamp=1488129694	localtime=Feb 26 09:21:34	
	GOOD	----	verify.ssh	timestamp=1488129696	localtime=Feb 26 09:21:36	
	GOOD	----	verify.brd_config	timestamp=1488129697	localtime=Feb 26 09:21:37	
	GOOD	----	verify.ser_config	timestamp=1488129697	localtime=Feb 26 09:21:37	
	GOOD	----	verify.job	timestamp=1488129698	localtime=Feb 26 09:21:38	
	GOOD	----	verify.servod	timestamp=1488129708	localtime=Feb 26 09:21:48	
	GOOD	----	verify.pwr_button	timestamp=1488129708	localtime=Feb 26 09:21:48	
	GOOD	----	verify.lid_open	timestamp=1488129708	localtime=Feb 26 09:21:48	
	GOOD	----	verify.update	timestamp=1488129711	localtime=Feb 26 09:21:51	
	GOOD	----	verify.PASS	timestamp=1488129711	localtime=Feb 26 09:21:51	
	GOOD	----	verify.ssh	timestamp=1488129719	localtime=Feb 26 09:21:59	
	GOOD	----	verify.power	timestamp=1488129720	localtime=Feb 26 09:22:00	
	GOOD	----	verify.fwstatus	timestamp=1488129720	localtime=Feb 26 09:22:00	
	GOOD	----	verify.good_au	timestamp=1488129720	localtime=Feb 26 09:22:00	
	GOOD	----	verify.writable	timestamp=1488129721	localtime=Feb 26 09:22:01	
	GOOD	----	verify.tpm	timestamp=1488129721	localtime=Feb 26 09:22:01	
	GOOD	----	verify.ext4	timestamp=1488129722	localtime=Feb 26 09:22:02	
	GOOD	----	verify.rwfw	timestamp=1488129722	localtime=Feb 26 09:22:02	
	GOOD	----	verify.python	timestamp=1488129723	localtime=Feb 26 09:22:03	
	GOOD	----	verify.cros	timestamp=1488129727	localtime=Feb 26 09:22:07	
	GOOD	----	verify.PASS	timestamp=1488129727	localtime=Feb 26 09:22:07	
	INFO	----	repair	timestamp=1488129727	localtime=Feb 26 09:22:07	Can't repair label 'pool:cq'.
	INFO	----	repair	timestamp=1488129727	localtime=Feb 26 09:22:07	Can't repair label 'arc'.
	INFO	----	repair	timestamp=1488129727	localtime=Feb 26 09:22:07	Can't repair label 'board:veyron_speedy'.
	INFO	----	repair	timestamp=1488129727	localtime=Feb 26 09:22:07	Can't repair label 'cros-version:veyron_speedy-paladin/R58-9319.0.0-rc3'.
END GOOD	----	repair	timestamp=1488129727	localtime=Feb 26 09:22:07	chromeos4-row4-rack11-host7 repaired successfully

This one timed out during transfer of rootfs

http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=103767055
Labels: -Pri-2 Pri-0
Owner: xixuan@chromium.org
This or related ssh-failures-from-devserver to DUT-that-has-no-repair-problems is happening very often. Latest example:

    2017-02-28 07:56:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack10-host5/2719962-repair/
    2017-02-28 07:20:59  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack10-host5/2719700-provision/

Labels: -Pri-0 Pri-1
Labels: -current-issue
Status: WontFix (was: Untriaged)
Again, probably due to devserver overload. Mark it as wontfix.

Sign in to add a comment