New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 883543 link

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

hourly ssh timeouts to autotestd_monitor changing status of long running client tests

Project Member Reported by ihf@chromium.org, Sep 12

Issue description

Since 2018-09-06 we started to have a lot of ssh disconnect warnings, which should have been passes.

https://stainless.corp.google.com/search?view=list&first_date=2018-08-16&last_date=2018-09-12&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&status=NOSTATUS&status=NOT_RUN&reason=%5EAutotest+client+was+disconnected%3A+Timeout+encountered%3A&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=false&exclude_non_production=false

Notice this never happened for any job starting before 2018-09-04 and affects all kind of long running tests (and on all branches now from R68-R71). There was a push to prod on 09/04, but while there was a change to tko parser fa654f81f I don't think that would have caused this. Notice also that the tests are happily executing, so it isn't that ganeti fell asleep. Yet everything points to an infra change causing this.

Sample:
	grunt	R71-11057.0.0	graphics_dEQP.vk-master	WARN	Autotest client was disconnected: Timeout encountered: /usr/bin/ssh
https://stainless.corp.google.com/search?view=list&first_date=2018-08-16&last_date=2018-09-12&test=%5Egraphics%5C_dEQP%5C.vk%5C-master%24&board=grunt&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&status=NOSTATUS&status=NOT_RUN&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=false&exclude_non_production=false

According to "status" the test ran fine
START	----	----	timestamp=1536718484	localtime=Sep 11 19:14:44	
	START	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536718484	localtime=Sep 11 19:14:44	
		GOOD	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536730299	localtime=Sep 11 22:31:39	completed successfully
	END GOOD	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536730299	localtime=Sep 11 22:31:39	
END GOOD	----	----	timestamp=1536730313	localtime=Sep 11 22:31:53	


It looks like once an hour the ssh connection to autotestd_monitor dies and gets recovered. This then gets picked up by the tko parser.
I guess one could make tko ignore these warnings (or not raise them in the first place), but I did not see the warnings before 09/04 so something else must have changed.


09/11 19:14:45.203 INFO |          autotest:1340| Entered autotestd_monitor.
09/11 19:14:45.203 INFO |          autotest:1340| Finished launching tail subprocesses.
09/11 19:14:45.203 INFO |          autotest:1340| Finished waiting on autotestd to start.
09/11 19:14:45.381 INFO |        server_job:0216| START	----	----	timestamp=1536718484	localtime=Sep 11 19:14:44	
09/11 19:14:45.477 INFO |        server_job:0216| 	START	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536718484	localtime=Sep 11 19:14:44	
09/11 20:14:44.587 WARNI|             utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0"
09/11 20:14:44.631 INFO |        server_job:0216| 		WARN	----	----	timestamp=1536722084	localtime=Sep 11 20:14:44	Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0"
09/11 20:14:45.646 INFO |          autotest:1340| 	localtime=Sep 11 19:14:44	
09/11 20:14:45.646 INFO |          autotest:1340| Entered autotestd_monitor.
09/11 20:14:45.646 INFO |          autotest:1340| Finished launching tail subprocesses.
09/11 20:14:45.647 INFO |          autotest:1340| Finished waiting on autotestd to start.
09/11 21:14:44.230 WARNI|             utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279"
09/11 21:14:45.242 INFO |        server_job:0216| 		WARN	----	----	timestamp=1536725685	localtime=Sep 11 21:14:45	Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279"
09/11 21:14:46.259 INFO |          autotest:1340| ting on autotestd to start.
09/11 21:14:46.260 INFO |          autotest:1340| Entered autotestd_monitor.
09/11 21:14:46.260 INFO |          autotest:1340| Finished launching tail subprocesses.
09/11 21:14:46.260 INFO |          autotest:1340| Finished waiting on autotestd to start.
09/11 22:14:45.234 WARNI|             utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384"
09/11 22:14:46.246 INFO |        server_job:0216| 		WARN	----	----	timestamp=1536729286	localtime=Sep 11 22:14:46	Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384"
09/11 22:14:47.269 INFO |          autotest:1340| ting on autotestd to start.
09/11 22:14:47.269 INFO |          autotest:1340| Entered autotestd_monitor.
09/11 22:14:47.269 INFO |          autotest:1340| Finished launching tail subprocesses.
09/11 22:14:47.270 INFO |          autotest:1340| Finished waiting on autotestd to start.
09/11 22:31:40.426 INFO |        server_job:0216| 		GOOD	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536730299	localtime=Sep 11 22:31:39	completed successfully
09/11 22:31:40.486 INFO |        server_job:0216| 	END GOOD	graphics_dEQP.vk-master	graphics_dEQP.vk-master	timestamp=1536730299	localtime=Sep 11 22:31:39	
09/11 22:31:54.596 INFO |        server_job:0216| END GOOD	----	----	timestamp=1536730313	localtime=Sep 11 22:31:53	
 
Cc: kinaba@chromium.org
Cc: tbroch@chromium.org
Summary: hourly ssh timeouts to autotestd_monitor changing status of long running client tests (was: hourly ssh timeouts to autotestd_monitor changing status of long running tests)
+Todd fyi

Just wanted to add that this affects only client tests and many long running power tests.

Sign in to add a comment