Issue 766196

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 736119
issue 766330

Blocking:
issue 766342


sporadic, widespread ssh failures in lab

Project Member Reported by akes...@chromium.org, Sep 18 2017

Issue description

provision failure log: https://storage.cloud.google.com/chromeos-autotest-results/hosts/chromeos4-row1-rack12-host19/104525-provision/20171709125847/provision_AutoUpdate/debug/provision_AutoUpdate.DEBUG?_ga=2.212263831.-2000744542.1478018485

Key text:

09/17 13:01:39.522 INFO |     ssh_multiplex:0107| Timed out waiting for master-ssh connection to be established.
09/17 13:02:42.772 ERROR|             utils:0280| [stderr] ssh: connect to host chromeos4-row1-rack12-host19 port 22: Connection timed out
09/17 13:02:42.778 DEBUG|              test:0389| Test failed due to ('ssh timed out', * Command: 
    /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_y74xSnssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22
    chromeos4-row1-rack12-host19 "export LIBC_FATAL_STDERR_=1; if type
    \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\"
    \"server[stack::_get_lsb_release_content|run|wrapper] -> ssh_run(cat
    \\\"/etc/lsb-release\\\")\";fi; cat \"/etc/lsb-release\""
Exit status: 255
Duration: 63.1428010464
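
A minimal sketch for pulling the ssh timeout lines out of an autotest DEBUG log like the excerpt above. The line layout is inferred only from the three log lines quoted here, the marker strings are the ones visible in that excerpt, and the function name and the `year` parameter are hypothetical (the log timestamps carry no year).

```python
import re
from datetime import datetime

# Layout inferred from the excerpt: "MM/DD HH:MM:SS.mmm LEVEL| module:line| message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+)\s*\|\s*(?P<module>[\w.]+):(?P<lineno>\d+)\|\s?(?P<msg>.*)$"
)

# Substrings taken from the quoted log; an assumption, not an exhaustive list.
SSH_TIMEOUT_MARKERS = (
    "Timed out waiting for master-ssh",
    "Connection timed out",
    "ssh timed out",
)


def find_ssh_timeouts(lines, year=2017):
    """Return (timestamp, level, module, message) for lines that look like ssh timeouts."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # continuation lines (e.g. the wrapped ssh command) are skipped
        msg = match["msg"]
        if any(marker in msg for marker in SSH_TIMEOUT_MARKERS):
            ts = datetime.strptime(f"{year}/{match['ts']}", "%Y/%m/%d %H:%M:%S.%f")
            hits.append((ts, match["level"], match["module"], msg))
    return hits
```

Comparing the timestamps of consecutive hits gives a rough idea of how long each failed connection attempt took.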
Another possible instance, this time on reef-uni paladin: https://luci-milo.appspot.com/buildbot/chromeos/reef-uni-paladin/401
Blockedon: 766330
Cc: puneetster@chromium.org jrbarnette@chromium.org xixuan@chromium.org amstan@chromium.org
Labels: -Pri-2 Pri-1
Summary: sporadic, widespread ssh failures in lab (was: kip provision: ssh failure in initial ssh; device is back later)
Upgrading, because this seems to be affecting numerous boards, and may also present as Issue 712682
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
I worry this is due to transient network (over)load, but my devserver metrics for tracking network io are not currently trustworthy.
I saw one instance of tcpdump running on chromeos4-devserver4 and killed it: "tcpdump tcp[tcpflags] & tcp-syn != 0 and ( ( tcp[tcpflags] & tcp-ack == 0 and src host 100.115.219.132 ) or ( ( tcp[tcpflags] & tcp-ack != 0 or tcp[tcpflags] & tcp-rst != 0) and dst host 100.115.219.132 ) ) ". No idea if this was contributing to the tcp slowdown. I will spot check some other devservers to see if it is running elsewhere.
Cc: semenzato@chromium.org
Blocking: 766342
Cc: haoweiw@chromium.org johndhong@chromium.org
Trying to see if I can make sense of the cacti graphs, which sample lab traffic at the network switch level, e.g. http://chromeos-monitor.hot.corp.google.com/cacti/graph_view.php?action=tree&tree_id=3&leaf_id=39

+johndhong +haowei: have there been any network changes recently, or do you have guidance on possible network overload, in particular in chromeos4 and possibly in chromeos2?
I have not done my network improvements yet.

I did notice a network traffic burst around 10am today

Perhaps other SSH related work affecting the lab?
https://bugs.chromium.org/p/chromium/issues/detail?id=712682#c38


Blockedon: 736119
Issue 764789 has been merged into this issue.
Another example from https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6680

dut-status of the affected DUT around that time:

    2017-09-22 07:38:20  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/144034917-chromeos-test/
    2017-09-22 07:24:32  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4393211-provision/
    2017-09-22 05:46:13  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4392553-repair/
    2017-09-22 05:17:32  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4392333-provision/
    2017-09-22 03:33:19  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/144002179-chromeos-test/


The provision failed because the device became unreachable by ssh partway through. However, the repair job that followed was able to ssh into the DUT just fine.
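
A rough sketch of how this pattern (a provision that did not succeed, immediately followed by a clean repair) could be picked out of dut-status output like the excerpt above. The line layout and the task names in the URLs are inferred from that excerpt only; treating any provision status other than "OK" as "did not succeed" is an assumption, and both function names are hypothetical.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt: "<date> <time>  <status> <log url>"
ENTRY = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<status>\S+)\s+(?P<url>\S+)\s*$"
)
# Special-task log directories in the excerpt end in "<id>-<task>/".
TASK = re.compile(r"/\d+-(?P<task>provision|repair)/?$")


def parse_dut_status(text):
    """Parse dut-status lines into dicts, sorted oldest first."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match is None:
            continue
        task = TASK.search(match["url"])
        entries.append({
            "time": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
            "status": match["status"],
            # Anything that is not a provision or repair is lumped together as "other".
            "task": task["task"] if task else "other",
            "url": match["url"],
        })
    entries.sort(key=lambda entry: entry["time"])
    return entries


def provision_then_clean_repair(entries):
    """Return (provision, repair) pairs where a non-OK provision precedes an OK repair."""
    pairs = []
    for prev, nxt in zip(entries, entries[1:]):
        if (prev["task"] == "provision" and prev["status"] != "OK"
                and nxt["task"] == "repair" and nxt["status"] == "OK"):
            pairs.append((prev, nxt))
    return pairs
```

A DUT that shows up in many such pairs is reachable again by the time repair runs, which points at a transient connectivity problem rather than a dead device.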
Last week, the SSHConnection error mostly happened on either the veyron_mighty or the reef paladin. This week it still seems to be these two builders. Could a board-specific bad CL in ToT have caused this issue?

This is the first SSHConnection error on veyron_mighty this month, on 9/15:
https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6619

This is the first reef SSHConnection error, on 9/7:
https://luci-milo.appspot.com/buildbot/chromeos/reef-paladin/3546

Maybe we can check whether a suspicious CL was merged before 9/7?
Labels: -Pri-1 Pri-2
Owner: xixuan@chromium.org
Passing on to the current deputy. Not sure if this is still ongoing, though. It might have been another symptom of the DNS flakiness that was hopefully addressed by turning off |unbound| in Issue 767713.
Status: Verified (was: Assigned)