
Issue 767713

Starred by 3 users

Issue metadata

Status: Fixed
Owner: ----
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 768534

Blocking:
issue 712682




DNS on devservers is widely flakey

Project Member Reported by akes...@chromium.org, Sep 22 2017

Issue description

Blocking: 712682
chromeos-test@chromeos4-devserver4:~$ while ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpispwXC/testing_rsa root@chromeos4-row12-rack11-host3 -- echo Hello from DUT; do sleep 5; done
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
ssh: Could not resolve hostname chromeos4-row12-rack11-host3: Name or service not known

chromeos-test@chromeos4-devserver4:~$ ping chromeos4-row12-rack11-host3
ping: unknown host chromeos4-row12-rack11-host3
chromeos-test@chromeos4-devserver4:~$ host chromeos4-row12-rack11-host3
Host chromeos4-row12-rack11-host3 not found: 3(NXDOMAIN)
chromeos-test@chromeos4-devserver4:~$ host chromeos4-row12-rack11-host3
Host chromeos4-row12-rack11-host3 not found: 3(NXDOMAIN)
chromeos-test@chromeos4-devserver4:~$ ping 100.115.203.117

chromeos-test@chromeos4-devserver4:~$ ping 100.115.203.117
PING 100.115.203.117 (100.115.203.117) 56(84) bytes of data.
64 bytes from 100.115.203.117: icmp_seq=1 ttl=63 time=1.06 ms
64 bytes from 100.115.203.117: icmp_seq=2 ttl=63 time=0.703 ms
^C
--- 100.115.203.117 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.703/0.883/1.063/0.180 ms

I think we should disable |unbound| or at least turn on its logging (I can't find any service logs for it).
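
For anyone who wants to watch for the failure without going through ssh, here is a minimal sketch of a lookup probe. `probe_dns` and `RESOLVE_CMD` are hypothetical names made up for this example and are not part of any lab tooling. It assumes `getent` is available on the devserver, and it only records lookup failures with a timestamp; it does not tell us whether |unbound| is the cause.

```shell
# Sketch only: RESOLVE_CMD and probe_dns are hypothetical names, not lab tooling.
# "getent hosts" resolves through NSS (/etc/nsswitch.conf), unlike "host",
# which queries the configured DNS servers directly.
RESOLVE_CMD=${RESOLVE_CMD:-"getent hosts"}

# probe_dns HOST [ATTEMPTS] [INTERVAL_SECONDS]
# Logs each failed lookup with a UTC timestamp on stderr, prints a summary
# on stdout, and returns non-zero if any lookup failed.
probe_dns() (
  host=$1
  attempts=${2:-10}
  interval=${3:-5}
  failures=0
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ! $RESOLVE_CMD "$host" >/dev/null 2>&1; then
      failures=$((failures + 1))
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) lookup failed for $host (attempt $i)" >&2
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  echo "$failures/$attempts lookups failed for $host"
  [ "$failures" -eq 0 ]
)
```

Example: `probe_dns chromeos4-row12-rack11-host3 100 5` would try 100 lookups, 5 seconds apart, and print how many failed.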

Comment 6 by pho...@chromium.org, Sep 22 2017

Owner: pho...@chromium.org
Status: Started (was: Untriaged)
Agreed, I’ll upload a CL to disable it. 

Comment 7 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/39b4c126023b43704358396e0ed106bdf28bb610

commit 39b4c126023b43704358396e0ed106bdf28bb610
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Sep 22 17:28:50 2017

Comment 8 by pho...@chromium.org, Sep 22 2017

Owner: ----
Status: Unconfirmed (was: Started)
Alright, let's see if that fixes it.
Cc: davidri...@chromium.org
Can you explicitly test and determine instead of doing a "wait and see"?
This is a good metric to examine: http://shortn/_cpcFjMYWR6

DNS failures in ssh spiking on the half-hour.
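
One rough way to check the half-hour pattern on a single server is to bucket failure log lines by minute of the hour. This is a sketch under assumptions: `failures_by_minute` is a hypothetical helper, and it expects each input line to start with an ISO-8601 timestamp such as 2017-09-22T17:30:05Z, which is not necessarily the format of any existing lab log.

```shell
# Count failure log lines per minute-of-hour to expose periodic spikes.
# Assumes each input line starts with an ISO-8601 timestamp such as
# 2017-09-22T17:30:05Z (a hypothetical log format, not one from lab tooling).
failures_by_minute() {
  # Characters 15-16 of the timestamp are the minute field.
  awk '{ minute = substr($1, 15, 2); count[minute]++ }
       END { for (m in count) print m, count[m] }' | sort
}
```

Feeding it a failure log on stdin prints one "minute count" pair per line; a cluster around 30 (or 00) would be consistent with the spike described above.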

The current massive spike is related to Issue 768004
http://shortn/_QUPa6KJ1C9 is even better, split by server

Comment 13 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/2ae59b52d73f349a1f6578f8a80bede80cef3574

commit 2ae59b52d73f349a1f6578f8a80bede80cef3574
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Sep 22 22:23:17 2017

Actually, #12 does not include ssh calls from devservers, so that metric is not necessarily relevant to this issue.

Comment 15 by ihf@chromium.org, Sep 23 2017

Cc: ihf@chromium.org
Labels: -Pri-0 Chase-Pending Pri-1
Similarly, the DUT that I spot checked in OP didn't have this problem in its most recent run.

I suspect disabling unbound fixed this. Downgrading to P1.

Chase-Pending because we should come up with suitable follow-up instrumentation.
Cc: nsylvain@chromium.org
Status: Available (was: Unconfirmed)
Per nsylvain@:

Some of our jobs are failing with dns-related issues as well: https://atp.googleplex.com/test_runs/10303388

ServerNotFoundError: Unable to find the server at accounts.google.com Powered by CherryPy 3.2.2 Will return from run_suite with status: INFRA_FAILURE
^ I don't know where that is coming from, but I'm not convinced it is from a job on a devserver.
Re c#10: Has anybody tried to explicitly verify and determine what is happening instead of just backing out some change that may or may not be responsible and then checking some potentially incomplete metrics?

Our dev servers were impacted as well:

android1758-infra-devserver4	100.107.126.136
android1758-infra-devserver5	100.107.126.137
android1758-infra-devserver7	100.107.126.159
android1758-infra-devserver8	100.107.126.160
android1758-infra-devserver9	100.107.126.161
android1758-infra-devserver10	100.107.126.163
android1758-infra-devserver11	100.107.126.164
android1758-infra-devserver12	100.107.126.165
android1758-infra-devserver13	100.107.126.175
android1758-infra-devserver14	100.107.126.174

chromeos9-infra-devserver4	100.115.99.247
chromeos9-infra-devserver5	100.115.99.246
chromeos9-infra-devserver6	100.115.99.236
chromeos9-infra-devserver7	100.115.99.244

chromeos1-infra-devserver4	172.27.215.246
chromeos1-infra-devserver5	172.27.215.245
chromeos1-infra-devserver6	172.27.215.243
chromeos1-infra-devserver7	172.27.215.242
chromeos1-dev-infra-devserver	100.107.156.243
chromeos1-dev-infra-devserver1	100.107.156.241

chromeos3-infra-devserver	172.22.39.161
chromeos3-infra-devserver1	172.22.39.162
chromeos3-infra-devserver2	172.22.39.163
chromeos3-infra-devserver3	172.22.39.164

so we could not ping www.google.com

after making these changes
We have temporarily fixed this by editing
/etc/resolvconf/resolv.conf.d/head
to add:
nameserver 8.8.8.8
nameserver 8.8.4.4
Then:
sudo resolvconf -u

DNS is then working, but this is only a temporary fix.
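
If this workaround has to be repeated across the servers listed above, a small idempotent helper avoids adding the same nameserver line twice. This is only a sketch: `ensure_nameserver` is a hypothetical function name, and the real commands are left commented out because they need root and change the host's resolver configuration.

```shell
# Append "nameserver <ip>" to a resolvconf head file unless that exact
# line is already present. ensure_nameserver is a hypothetical helper.
ensure_nameserver() {
  head_file=$1
  ip=$2
  line="nameserver $ip"
  # -x matches the whole line, -F treats the pattern as a fixed string.
  if ! grep -qxF "$line" "$head_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$head_file"
  fi
}

# Usage on a devserver (needs root; commented out so the sketch has no side effects):
# ensure_nameserver /etc/resolvconf/resolv.conf.d/head 8.8.8.8
# ensure_nameserver /etc/resolvconf/resolv.conf.d/head 8.8.4.4
# resolvconf -u
```

Running it a second time leaves the file unchanged, so it is safe to re-run.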

Labels: -Chase-Pending
Blockedon: 768534
I spot-checked the first server on that list and didn't see |unbound|, so I don't know what was responsible. Those devservers aren't under the same puppet control as the rest of the lab servers (yet), so we don't control their configuration either.
Status: Fixed (was: Available)
I'm going to call this fixed, at least from the disabling-|unbound| point of view.

Other remaining flake that manifests as DNS lookup failures (but might actually be due to something else) is tracked elsewhere, for instance in Issue 712682.
