HWTest failure due to "Could not resolve host: storage.googleapis.com"
Issue description

Failed on nyan_kitty-paladin: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934351839432758688

> tast.mustpass-system_SERVER_JOB FAIL

Logs from ssp_logs/debug show:

> 09/25 23:30:56.037 DEBUG| utils:0219| Running 'sudo lxc-attach -P /usr/local/autotest/containers -n test_241855598_1537942996_214536 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"'
> 09/25 23:31:25.979 DEBUG| container:0344| Command <sudo lxc-attach -P /usr/local/autotest/containers -n test_241855598_1537942996_214536 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"> failed, rc=6, Command returned non-zero exit status
> * Command:
>     sudo lxc-attach -P /usr/local/autotest/containers -n test_241855598_1537942996_214536 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"
> Exit status: 6
> Duration: 29.5143070221
> curl: (6) Could not resolve host: storage.googleapis.com
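For orientation: the failing step is the SSP container setup probing the base container image over HTTPS from inside the lxc container, and curl's exit code 6 means "could not resolve host". A minimal sketch of what that check amounts to is below; the helper name and structure are hypothetical, not the actual autotest/lxc code.

# Rough sketch only: run the curl probe inside the container and treat
# curl exit code 6 ("could not resolve host") as a DNS failure.
# Hypothetical helper, not the real autotest/lxc code.
import subprocess

CONTAINER_PATH = '/usr/local/autotest/containers'
IMAGE_URL = ('https://storage.googleapis.com/abci-ssp/'
             'autotest-containers/base_09.tar.xz')

def probe_image_from_container(container_name):
    cmd = ['sudo', 'lxc-attach', '-P', CONTAINER_PATH, '-n', container_name,
           '--', 'bash', '-c', 'curl --head %s' % IMAGE_URL]
    rc = subprocess.call(cmd)
    if rc == 6:
        raise RuntimeError('DNS resolution failed inside %s' % container_name)
    return rc == 0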
,
Sep 26
I don't know of any recent DNS issues in the lab. +johndhong for the current state of DNS configuration in the lab.
,
Sep 26
I assume this is the relevant log: https://storage.cloud.google.com/chromeos-autotest-results/241855598-chromeos-test/chromeos4-row13-rack1-host3/ssp_logs/debug/autoserv.DEBUG

In terms of DNS, it is possibly flake.

johndhong@phobrz:~$ ssh chromeos4-row13-rack1-host3
Warning: Permanently added 'chromeos4-row13-rack1-host3,100.115.216.3' (ED25519) to the list of known hosts.
localhost ~ # ping storage.googleapis.com
PING storage.l.googleusercontent.com (173.194.222.128) 56(84) bytes of data.
--- storage.l.googleusercontent.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5007ms
rtt min/avg/max/mdev = 173.592/173.699/173.989/0.437 ms
localhost ~ # cd /tmp/
localhost /tmp # curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz
HTTP/1.1 200 OK
X-GUploader-Customer: cloud-storage
X-GUploader-UploadID: AEnB2Uq6Ph0pSBrq6Cu3KP3c0ceiYSKCsClFoThF8xraN2JAaHqOjFj9KKPcufV_o9EP60zia6QuYgm6RZmMR1pjJaFD8eTBU7geazAW5VDQ6-RO3egftnM
Expires: Wed, 26 Sep 2018 23:27:50 GMT
,
Sep 26
To reduce the odds of flake, I'll make this a higher priority for some network work I am doing.
,
Sep 27
Happened on nyan_kitty-paladin again: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934252588913134544 I believe this time it's a different host: chromeos4-row13-rack2-host4.
,
Sep 28
localhost /tmp # curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz
HTTP/1.1 200 OK
X-GUploader-Customer: cloud-storage
X-GUploader-UploadID: AEnB2UrcjW6y55l2Wj7jx7m5-kiW91CtAfqmK85wZ8wnKYB7k-YmiuWQX7pYZfKATdGM5aFnHHdrxe51i7z0KabhtVp3SgqN2w
Expires: Fri, 28 Sep 2018 01:47:01 GMT
Date: Fri, 28 Sep 2018 00:47:01 GMT

Logs: https://storage.cloud.google.com/chromeos-autotest-results/242368223-chromeos-test/chromeos4-row13-rack1-host4/ssp_logs/debug/autoserv.DEBUG

This seems to happen during the nightly runs (if I recall, a lot of tests do run at that time), so it would not surprise me if it happened again in tonight's run.

New network gear has been prepped, so we're aiming to upgrade tomorrow. The tricky part is that the network and all the AIO DUTs will be offline intermittently as we start cutting things over. We'll monitor very carefully to not blow up CQ runs.

Are there any other instances of a similar issue?
,
Oct 1
Reassigning to the new deputy.
,
Oct 1
Issue 890428 has been merged into this issue.
,
Oct 1
John, how is the network update going? Can I assume this issue shouldn't happen anymore?
,
Oct 1
We ran into installation issues, so we're aiming for today at best, but it will probably be tomorrow before we get it up and running. I will sync with everyone here since it does involve messing with CQ DUTs.
,
Oct 1
John, this code doesn't run in your lab; it runs on the Ganeti servers, which are overloaded (issue 885199). So there's nothing for you guys here.

Looking at the log, I think this is the container starting very slowly (like 5 minutes)?
https://stainless.corp.google.com/browse/chromeos-autotest-results/241855598-chromeos-test/

Indeed, there was a load spike around that time:
https://viceroy.corp.google.com/chromeos/machines/?hostname=cros-full-0010&duration=11913&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537951683
,
Oct 2
To be absolutely sure it is not my lab network, the upgraded switch will be installed anyway; it is part of a larger lab network overhaul. The source of the flake could be either, but soon it won't be the lab network :)
,
Oct 4
Issue 892364 has been merged into this issue.
,
Oct 4
,
Oct 5
Issue 892520 has been merged into this issue.
,
Oct 5
This is still happening; marking edgar-paladin experimental.
,
Oct 5
,
Oct 5
From my understanding of ihf's comments, it looks like the server behind edgar is overloaded too? https://viceroy.corp.google.com/chromeos/machines?duration=1h&hostname=cros-full-0003&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537951683 It's hard to match the timeline of the test failure to the graph, as I'm not sure if they use the same timezone.
,
Oct 5
Not sure if I can do anything here. Open to suggestions.
,
Oct 5
Work with the lab-eng team to make DNS more reliable, or work with the test authors to get rid of the need for working DNS lookups.
,
Oct 5
Re #c21, thanks. If I am right, the problem doesn't happen in the lab; it happens on the Ganeti servers instead. Additionally, the code doesn't seem to be in tests; it's in infra code (creating the lxc process). +jkop anyway.
,
Oct 5
1) Escalate DNS failures to the Ganeti team.
2) Look for ways to harden the infra code, maybe with retries?
No reason not to do both in parallel.
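For 2), a minimal sketch of what a retry around the curl probe might look like; this is a hypothetical helper assuming the probe is a plain subprocess call, not the actual autotest infra code.

# Hypothetical retry wrapper with exponential backoff around the curl probe.
# Illustrative sketch only, not the actual autotest infra code.
import subprocess
import time

def curl_head_with_retries(url, attempts=5, first_delay=2.0):
    delay = first_delay
    for attempt in range(attempts):
        # curl exits 0 on success and 6 when the host could not be resolved.
        if subprocess.call(['curl', '--head', '--silent', url]) == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
            delay *= 2
    return False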
,
Oct 5
Re #23:
1) Filed b/117332879 to the Ganeti team.
2) I don't think a retry can help. The pattern is that all curl requests to googleapis.com fail within a test, and that can last for about 5 minutes. My gut feeling is that this is related, as mentioned in some comments, to server load rather than the network.
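For a rough sense of scale: with backoff starting at 2 s and doubling, as in the sketch in #23, even eight attempts add up to only about four minutes of waiting (2 + 4 + 8 + ... + 128 = 254 s), so a retry loop would have to be tuned to outlast a roughly five-minute window to make a difference here.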
,
Oct 5
The log in #15 and the graph in #19 clearly show that cros-full-0003 (a Ganeti server) is deeply overloaded. This can be confirmed by running "time sudo true" locally: when the host is overloaded, this takes on the order of 30 s (instead of <1 s). Using "sudo" everywhere doesn't help the situation, but the root cause is that the Ganeti team is not giving our virtual Ganeti servers as many CPU cycles as they used to. Aviv and the Ganeti folks are on that.
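As a side note, that spot check is easy to script if we want to monitor it; a minimal sketch, assuming passwordless sudo on the drone as in the manual check:

# Minimal sketch of the "time sudo true" spot check: if a no-op sudo takes
# tens of seconds instead of well under a second, the host is overloaded.
import subprocess
import time

def sudo_noop_latency():
    start = time.time()
    subprocess.check_call(['sudo', 'true'])
    return time.time() - start

if __name__ == '__main__':
    elapsed = sudo_noop_latency()
    verdict = ' (host looks overloaded)' if elapsed > 10 else ''
    print('sudo true took %.1f s%s' % (elapsed, verdict))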
,
Oct 8
Issue 893320 has been merged into this issue.
,
Oct 10
,
Oct 11
Issue 894250 has been merged into this issue.
,
Oct 11
,
Oct 11
Just a heads up, this is blocking the chrome PFQ.
,
Oct 11
peach-pit-chrome-pfq and tricky-chrome-pfq are both failing due to HWTest, but the failure looks a bit different. Are they related? Link: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932989727528596960
,
Oct 11
,
Oct 11
Re #31: both tricky and peach_pit passed their hwtests as far as I can tell, but something happened after that. Can you file a new issue and CC me on it? I will leave some notes.
,
Oct 11
TY sir, 894526 was born!
,
Oct 18
This is no longer blocking the PFQ; removing RBB.
,
Oct 25
Comment 1 by jclinton@chromium.org, Sep 26
Labels: -Pri-2 Pri-1
Owner: zamorzaev@chromium.org
Status: Available (was: Untriaged)