DNS on devservers is widely flaky
Issue description

We don't have a good metric for this, but I'm seeing provision jobs fail left and right in their first hostname-based attempt and succeed only on their IP-based attempt. Examples:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row12-rack11-host3/1452949-provision/20172109131343/autoupdate_logs/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row12-rack11-host3/1452949-provision/20172109131343/autoupdate_logs/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row12-rack11-host3/1446038-provision/20172009231805/autoupdate_logs/

Those were the first 3 provision jobs I randomly selected in a single DUT's history, and they all showed this problem.
,
Sep 22 2017
chromeos-test@chromeos4-devserver4:~$ while ssh -p 22 '-oConnectionAttempts=4' '-oUserKnownHostsFile=/dev/null' '-oProtocol=2' '-oConnectTimeout=30' '-oServerAliveCountMax=3' '-oStrictHostKeyChecking=no' '-oServerAliveInterval=10' '-oNumberOfPasswordPrompts=0' '-oIdentitiesOnly=yes' -i /tmp/ssh-tmpispwXC/testing_rsa root@chromeos4-row12-rack11-host3 -- echo Hello from DUT; do sleep 5; done
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Warning: Permanently added 'chromeos4-row12-rack11-host3,100.115.203.117' (ED25519) to the list of known hosts.
Hello from DUT
ssh: Could not resolve hostname chromeos4-row12-rack11-host3: Name or service not known
,
Sep 22 2017
chromeos-test@chromeos4-devserver4:~$ ping chromeos4-row12-rack11-host3
ping: unknown host chromeos4-row12-rack11-host3
chromeos-test@chromeos4-devserver4:~$ host chromeos4-row12-rack11-host3
Host chromeos4-row12-rack11-host3 not found: 3(NXDOMAIN)
chromeos-test@chromeos4-devserver4:~$ host chromeos4-row12-rack11-host3
Host chromeos4-row12-rack11-host3 not found: 3(NXDOMAIN)
chromeos-test@chromeos4-devserver4:~$ ping 100.115.203.117
,
Sep 22 2017
chromeos-test@chromeos4-devserver4:~$ ping 100.115.203.117
PING 100.115.203.117 (100.115.203.117) 56(84) bytes of data.
64 bytes from 100.115.203.117: icmp_seq=1 ttl=63 time=1.06 ms
64 bytes from 100.115.203.117: icmp_seq=2 ttl=63 time=0.703 ms
^C
--- 100.115.203.117 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.703/0.883/1.063/0.180 ms
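The manual checks above (repeated lookups of one hostname, with the IP still reachable) could be automated roughly as follows. This is only a sketch: the function name `probe_dns`, its parameters, and the 5-second default interval (borrowed from the `sleep 5` in the ssh loop) are assumptions for illustration, not part of any existing lab tooling.

```python
import socket
import time
from typing import Callable, List, Tuple


def probe_dns(hostname: str,
              attempts: int = 20,
              interval_s: float = 5.0,
              resolve: Callable[[str], str] = socket.gethostbyname,
              sleep: Callable[[float], None] = time.sleep,
              ) -> List[Tuple[int, str]]:
    """Resolve `hostname` repeatedly and report which attempts failed.

    Returns a list of (attempt_index, error_text) pairs, one per failure.
    `resolve` and `sleep` are injectable so the loop can be exercised
    without touching the network (hypothetical design choice).
    """
    failures: List[Tuple[int, str]] = []
    for i in range(attempts):
        try:
            resolve(hostname)
        except OSError as err:  # socket.gaierror is a subclass of OSError
            failures.append((i, str(err)))
        if i + 1 < attempts:
            sleep(interval_s)
    return failures
```

Run on a devserver against a DUT hostname, a non-empty result would indicate intermittent resolution failures like the NXDOMAIN seen above, independent of ssh.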
,
Sep 22 2017
I think we should disable |unbound| or at least turn on its logging (I can't find any service logs for it).
,
Sep 22 2017
Agreed, I’ll upload a CL to disable it.
,
Sep 22 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/39b4c126023b43704358396e0ed106bdf28bb610
commit 39b4c126023b43704358396e0ed106bdf28bb610
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Sep 22 17:28:50 2017
,
Sep 22 2017
alright, let's see if that fixes it.
,
Sep 22 2017
,
Sep 22 2017
Can you explicitly test and determine instead of doing a "wait and see"?
,
Sep 22 2017
This is a good metric to examine: http://shortn/_cpcFjMYWR6
It shows DNS failures in ssh spiking on the half-hour. The current massive spike is related to Issue 768004.
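To check whether failures really cluster at a particular minute of the hour, the failure timestamps could be bucketed by minute. A minimal sketch, assuming the timestamps have already been extracted from logs or the metric backend; the function name `failures_by_minute` is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable


def failures_by_minute(timestamps: Iterable[datetime]) -> Counter:
    """Histogram of failure timestamps keyed by minute-of-hour (0-59)."""
    return Counter(ts.minute for ts in timestamps)
```

A pronounced peak at minute 30 in the resulting histogram would be consistent with the half-hour spikes; a flat histogram would suggest the failures are not periodic.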
,
Sep 22 2017
http://shortn/_QUPa6KJ1C9 is even better, split by server
,
Sep 22 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/2ae59b52d73f349a1f6578f8a80bede80cef3574
commit 2ae59b52d73f349a1f6578f8a80bede80cef3574
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Sep 22 22:23:17 2017
,
Sep 22 2017
Actually, #12 does not include ssh calls from devservers, so that metric is not necessarily relevant to this issue.
,
Sep 23 2017
,
Sep 23 2017
Maybe disabling unbound helped. Spot checking some random recent provision jobs, none of them had a DNS failure:

http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row6-rack11-host2/4395977-provision/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos2-row8-rack6-host11/1612810-provision
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row5-rack12-host7/4352770-provision/20172209165210/autoupdate_logs/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row4-rack12-host19/4395996-provision/20172209170440/autoupdate_logs/
,
Sep 23 2017
Similarly, the DUT that I spot checked in OP didn't have this problem in its most recent run. I suspect disabling unbound fixed this. Downgrading to P1. Chase-Pending because we should come up with suitable follow-up instrumentation.
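One possible shape for that follow-up instrumentation is to count ssh resolution errors per host in the provision logs instead of spot checking by hand. The sketch below matches the exact ssh error text seen earlier in this thread ("ssh: Could not resolve hostname ..."); the function name `count_dns_failures` and the idea of feeding it raw log lines are assumptions, not an existing autotest API.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the ssh client error shown earlier in this issue, e.g.
# "ssh: Could not resolve hostname chromeos4-row12-rack11-host3: Name or service not known"
_RESOLVE_ERR = re.compile(r"ssh: Could not resolve hostname ([^\s:]+):")


def count_dns_failures(log_lines: Iterable[str]) -> Counter:
    """Count ssh hostname-resolution failures per target host."""
    counts: Counter = Counter()
    for line in log_lines:
        match = _RESOLVE_ERR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The per-host counts could then be exported as a metric, which would give a before/after comparison for changes like disabling |unbound|.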
,
Sep 25 2017
Per nsylvain@:

Some of our jobs are failing with DNS-related issues as well: https://atp.googleplex.com/test_runs/10303388

ServerNotFoundError: Unable to find the server at accounts.google.com
Powered by CherryPy 3.2.2
Will return from run_suite with status: INFRA_FAILURE
,
Sep 25 2017
^ I don't know where that is coming from, but I'm not convinced it is from a job on a devserver.
,
Sep 25 2017
Re c#10: Has anybody tried to explicitly verify and determine what is happening instead of just backing out some change that may or may not be responsible and then checking some potentially incomplete metrics?
,
Sep 25 2017
Our dev servers were impacted as well:

android1758-infra-devserver4 100.107.126.136
android1758-infra-devserver5 100.107.126.137
android1758-infra-devserver7 100.107.126.159
android1758-infra-devserver8 100.107.126.160
android1758-infra-devserver9 100.107.126.161
android1758-infra-devserver10 100.107.126.163
android1758-infra-devserver11 100.107.126.164
android1758-infra-devserver12 100.107.126.165
android1758-infra-devserver13 100.107.126.175
android1758-infra-devserver14 100.107.126.174
chromeos9-infra-devserver4 100.115.99.247
chromeos9-infra-devserver5 100.115.99.246
chromeos9-infra-devserver6 100.115.99.236
chromeos9-infra-devserver7 100.115.99.244
chromeos1-infra-devserver4 172.27.215.246
chromeos1-infra-devserver5 172.27.215.245
chromeos1-infra-devserver6 172.27.215.243
chromeos1-infra-devserver7 172.27.215.242
chromeos1-dev-infra-devserver 100.107.156.243
chromeos1-dev-infra-devserver1 100.107.156.241
chromeos3-infra-devserver 172.22.39.161
chromeos3-infra-devserver1 172.22.39.162
chromeos3-infra-devserver2 172.22.39.163
chromeos3-infra-devserver3 172.22.39.164

We could not ping www.google.com from them after these changes. We have temporarily fixed it by adding the following to /etc/resolvconf/resolv.conf.d/head:

nameserver 8.8.8.8
nameserver 8.8.4.4

Then running:

sudo resolvconf -u

DNS is then working, but this is only a temporary fix.
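To confirm which resolvers a devserver actually ends up with after a workaround like this, the generated resolv.conf can be parsed. This is a rough sketch only: `parse_nameservers` is a hypothetical helper, and reading /etc/resolv.conf in the usage note is an assumption about where the generated file lives on these hosts.

```python
from typing import List


def parse_nameservers(resolv_conf_text: str) -> List[str]:
    """Return the nameserver addresses listed in resolv.conf-style text, in order."""
    servers: List[str] = []
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line[0] in "#;":
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers
```

For example, `parse_nameservers(open("/etc/resolv.conf").read())` on an affected devserver would show whether 8.8.8.8 and 8.8.4.4 are present and in what order relative to any other resolvers.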
,
Sep 25 2017
,
Sep 25 2017
,
Sep 25 2017
I spot checked the first on that list and didn't see |unbound|, so I don't know what was responsible. Those devservers aren't under the same puppet control as the rest of the lab servers (yet), so we don't control their configuration either.
,
Oct 13 2017
I'm going to call this fixed, at least from the disabling-|unbound| point of view. Other remaining flake that manifests as DNS lookup failures (but might actually be due to something else) is tracked elsewhere, for instance in Issue 712682.
Comment 1 by akes...@chromium.org
, Sep 22 2017