Since 2018-09-06 we started to have a lot of ssh disconnect warnings, which should have been passes.
https://stainless.corp.google.com/search?view=list&first_date=2018-08-16&last_date=2018-09-12&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&status=NOSTATUS&status=NOT_RUN&reason=%5EAutotest+client+was+disconnected%3A+Timeout+encountered%3A&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=false&exclude_non_production=false
Notice this never happened for any job starting before 2018-09-04 and affects all kind of long running tests (and on all branches now from R68-R71). There was a push to prod on 09/04, but while there was a change to tko parser fa654f81f I don't think that would have caused this. Notice also that the tests are happily executing, so it isn't that ganeti fell asleep. Yet everything points to an infra change causing this.
Sample:
grunt R71-11057.0.0 graphics_dEQP.vk-master WARN Autotest client was disconnected: Timeout encountered: /usr/bin/ssh
https://stainless.corp.google.com/search?view=list&first_date=2018-08-16&last_date=2018-09-12&test=%5Egraphics%5C_dEQP%5C.vk%5C-master%24&board=grunt&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&status=NOSTATUS&status=NOT_RUN&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=false&exclude_non_production=false
According to "status" the test ran fine
START ---- ---- timestamp=1536718484 localtime=Sep 11 19:14:44
START graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536718484 localtime=Sep 11 19:14:44
GOOD graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536730299 localtime=Sep 11 22:31:39 completed successfully
END GOOD graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536730299 localtime=Sep 11 22:31:39
END GOOD ---- ---- timestamp=1536730313 localtime=Sep 11 22:31:53
It looks like once an hour the ssh connection to autotestd_monitor dies and gets recovered. This then gets picked up by the tko parser.
I guess one could make tko ignore these warnings (or not raise them in the first place), but I did not see the warnings before 09/04 so something else must have changed.
09/11 19:14:45.203 INFO | autotest:1340| Entered autotestd_monitor.
09/11 19:14:45.203 INFO | autotest:1340| Finished launching tail subprocesses.
09/11 19:14:45.203 INFO | autotest:1340| Finished waiting on autotestd to start.
09/11 19:14:45.381 INFO | server_job:0216| START ---- ---- timestamp=1536718484 localtime=Sep 11 19:14:44
09/11 19:14:45.477 INFO | server_job:0216| START graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536718484 localtime=Sep 11 19:14:44
09/11 20:14:44.587 WARNI| utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0"
09/11 20:14:44.631 INFO | server_job:0216| WARN ---- ---- timestamp=1536722084 localtime=Sep 11 20:14:44 Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 0 0"
09/11 20:14:45.646 INFO | autotest:1340| localtime=Sep 11 19:14:44
09/11 20:14:45.646 INFO | autotest:1340| Entered autotestd_monitor.
09/11 20:14:45.646 INFO | autotest:1340| Finished launching tail subprocesses.
09/11 20:14:45.647 INFO | autotest:1340| Finished waiting on autotestd to start.
09/11 21:14:44.230 WARNI| utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279"
09/11 21:14:45.242 INFO | server_job:0216| WARN ---- ---- timestamp=1536725685 localtime=Sep 11 21:14:45 Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 18840252 279"
09/11 21:14:46.259 INFO | autotest:1340| ting on autotestd to start.
09/11 21:14:46.260 INFO | autotest:1340| Entered autotestd_monitor.
09/11 21:14:46.260 INFO | autotest:1340| Finished launching tail subprocesses.
09/11 21:14:46.260 INFO | autotest:1340| Finished waiting on autotestd to start.
09/11 22:14:45.234 WARNI| utils:0938| run process timeout (3599) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384"
09/11 22:14:46.246 INFO | server_job:0216| WARN ---- ---- timestamp=1536729286 localtime=Sep 11 22:14:46 Autotest client was disconnected: Timeout encountered: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_lsu9D0ssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 22 chromeos2-row3-rack8-host15 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::execute_section|_execute_daemon|run] -> ssh_run(/usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384)\";fi; /usr/local/autotest/bin/autotestd_monitor /tmp/autoserv-8zbVDc 25511996 384"
09/11 22:14:47.269 INFO | autotest:1340| ting on autotestd to start.
09/11 22:14:47.269 INFO | autotest:1340| Entered autotestd_monitor.
09/11 22:14:47.269 INFO | autotest:1340| Finished launching tail subprocesses.
09/11 22:14:47.270 INFO | autotest:1340| Finished waiting on autotestd to start.
09/11 22:31:40.426 INFO | server_job:0216| GOOD graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536730299 localtime=Sep 11 22:31:39 completed successfully
09/11 22:31:40.486 INFO | server_job:0216| END GOOD graphics_dEQP.vk-master graphics_dEQP.vk-master timestamp=1536730299 localtime=Sep 11 22:31:39
09/11 22:31:54.596 INFO | server_job:0216| END GOOD ---- ---- timestamp=1536730313 localtime=Sep 11 22:31:53
Comment 1 by ihf@chromium.org
, Sep 12