sporadic, widespread ssh failures in lab
Issue description

May be similar to Issue 696679.

Build in question: https://luci-milo.appspot.com/buildbot/chromeos/kip-paladin/2969

akeshet@akeshet:~$ dut-status -f chromeos4-row1-rack12-host19
[snip]
2017-09-17 14:36:25  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host19/104912-provision/
2017-09-17 13:07:35  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host19/104531-repair/
2017-09-17 12:58:48  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row1-rack12-host19/104525-provision/

Filing to monitor the situation, see if this recurs.
Comment 1 by akes...@chromium.org, Sep 18 2017
Another possible instance, this time on reef-uni paladin: https://luci-milo.appspot.com/buildbot/chromeos/reef-uni-paladin/401
Sep 18 2017
Upgrading, because this seems to be affecting numerous boards, and it may also present as Issue 712682.
Sep 18 2017
I worry this is due to transient network (over)load, but my devserver metrics for tracking network I/O are not currently trustworthy.
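While the metrics pipeline is untrusted, throughput can be cross-checked directly on a devserver. A minimal sketch, assuming shell access to the box and that the uplink interface is eth0 (both assumptions; adjust as needed):

  # Sample /proc/net/dev twice, 10s apart, and report rough bytes/sec.
  IFACE=eth0
  sample() { awk -v i="$IFACE" '$0 ~ "^ *" i ":" { sub(/.*:/, ""); print $1, $9 }' /proc/net/dev; }
  read rx1 tx1 < <(sample)
  sleep 10
  read rx2 tx2 < <(sample)
  echo "rx: $(( (rx2 - rx1) / 10 )) B/s  tx: $(( (tx2 - tx1) / 10 )) B/s"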
Sep 18 2017
I saw one instance of tcpdump running on chromeos4-devserver4 and killed it:

  "tcpdump tcp[tcpflags] & tcp-syn != 0 and ( ( tcp[tcpflags] & tcp-ack == 0 and src host 100.115.219.132 ) or ( ( tcp[tcpflags] & tcp-ack != 0 or tcp[tcpflags] & tcp-rst != 0) and dst host 100.115.219.132 ) )"

No idea whether this is contributing to the TCP slowdown. Will spot-check some other devservers to see whether it is running elsewhere.
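A minimal sketch for that spot check; the devserver hostname pattern and range below are assumptions, substitute the real list:

  # Look for stray tcpdump processes on each chromeos4 devserver.
  for host in chromeos4-devserver{1..8}; do
    echo "== $host =="
    ssh -o ConnectTimeout=5 "$host" 'pgrep -af tcpdump || echo "no tcpdump running"'
  done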
Sep 18 2017
Trying to see whether I can make sense of the Cacti graphs, which sample lab traffic at the network switch level, e.g. http://chromeos-monitor.hot.corp.google.com/cacti/graph_view.php?action=tree&tree_id=3&leaf_id=39

+johndhong +haowei: any network changes recently, or guidance on possible network overload, in particular in chromeos4 and possibly in chromeos2?
Sep 18 2017
I have not made my network improvements yet. I did notice a network traffic burst around 10am today. Perhaps other SSH-related work is affecting the lab? https://bugs.chromium.org/p/chromium/issues/detail?id=712682#c38
Sep 20 2017
Issue 764789 has been merged into this issue.
Sep 22 2017
Another example from https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6680

dut-status of the affected DUT around that time:
2017-09-22 07:38:20  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/144034917-chromeos-test/
2017-09-22 07:24:32  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4393211-provision/
2017-09-22 05:46:13  OK  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4392553-repair/
2017-09-22 05:17:32  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4392333-provision/
2017-09-22 03:33:19  --  http://cautotest/tko/retrieve_logs.cgi?job=/results/144002179-chromeos-test/

Provision failed due to the device being unreachable by ssh partway through the provision. However, the following repair job was able to ssh into the DUT just fine.
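If the drop really is transient, it may be worth bracketing the outage window by polling the DUT while a provision is running. A minimal sketch, assuming the usual root/test-key ssh setup for DUTs and using the hostname above only as an example:

  # Log a timestamped line each time an ssh attempt to the DUT fails.
  DUT=chromeos4-row6-rack11-host2
  while true; do
    if ! ssh -o ConnectTimeout=5 -o BatchMode=yes root@"$DUT" true 2>/dev/null; then
      echo "$(date '+%F %T') ssh to $DUT failed"
    fi
    sleep 10
  done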
Sep 22 2017
Last week, most of the SSHConnection errors happened on either the veyron_mighty or the reef paladin. It seems this week it is still these two builders. Could a board-specific bad CL in ToT have caused this issue?

The first SSHConnection error on veyron_mighty this month was on 9/15: https://luci-milo.appspot.com/buildbot/chromeos/veyron_mighty-paladin/6619
The first reef SSHConnection error was on 9/7: https://luci-milo.appspot.com/buildbot/chromeos/reef-paladin/3546

Maybe we can check whether there was a suspicious CL merged before 9/7?
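One way to scan for such a CL is to list what landed in the window just before 9/7. A minimal sketch; the repository path is an assumption, run it in whichever repo is suspected (board overlay, autotest, chromite, etc.):

  # List non-merge commits landed in the week before the first reef failure.
  cd ~/chromiumos/src/third_party/autotest/files   # assumed path
  git log --since=2017-09-01 --until=2017-09-07 --oneline --no-merges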
Sep 25 2017
Passing on to the current deputy. Not sure whether this is still ongoing, though. It might have been another symptom of the DNS flakiness that was hopefully addressed by turning off |unbound| in Issue 767713.