
Issue 889403

Starred by 5 users

Issue metadata

Status: ExternalDependency
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 885199
issue 878403
issue 882562




HWTest failure due to Could not resolve host: storage.googleapis.com

Project Member Reported by emaxx@chromium.org, Sep 26

Issue description

Failed on nyan_kitty-paladin:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934351839432758688

> tast.mustpass-system_SERVER_JOB     FAIL

Logs from ssp_logs/debug show:

> 09/25 23:30:56.037 DEBUG|             utils:0219| Running 'sudo lxc-attach -P /usr/local/autotest/containers -n test_241855598_1537942996_214536 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"'
> 09/25 23:31:25.979 DEBUG|         container:0344| Command <sudo lxc-attach -P /usr/local/autotest/containers -n test_241855598_1537942996_214536 -- bash -c "curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz"> failed, rc=6, Command returned non-zero exit status
> * Command: 
>     sudo lxc-attach -P /usr/local/autotest/containers -n
>     test_241855598_1537942996_214536 -- bash -c "curl --head
>     https://storage.googleapis.com/abci-ssp/autotest-
>     containers/base_09.tar.xz"
> Exit status: 6
> Duration: 29.5143070221

> curl: (6) Could not resolve host: storage.googleapis.com
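
For context: curl exit status 6 is CURLE_COULDNT_RESOLVE_HOST, i.e. DNS resolution failed inside the test container. A minimal reproduction sketch of the failing check (not the actual Autotest code; assumes curl is on PATH, with the URL taken from the log above):

# Hypothetical reproduction sketch, not Autotest code.
# curl exit status 6 == CURLE_COULDNT_RESOLVE_HOST.
import subprocess

URL = ("https://storage.googleapis.com/abci-ssp/"
       "autotest-containers/base_09.tar.xz")

rc = subprocess.call(
    ["curl", "--head", "--silent", "--output", "/dev/null", URL])
if rc == 6:
    print("DNS resolution failed for storage.googleapis.com")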
 
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Labels: -Pri-2 Pri-1
Owner: zamorzaev@chromium.org
Status: Available (was: Untriaged)
Are the DNS resolution problems in the lab back?
Cc: johndhong@chromium.org
I don't know of any recent DNS issues in the lab.

+johndhong for the current state of DNS configuration in the lab.
Cc: haoweiw@chromium.org
I assume this is the relevant log:
https://storage.cloud.google.com/chromeos-autotest-results/241855598-chromeos-test/chromeos4-row13-rack1-host3/ssp_logs/debug/autoserv.DEBUG

In terms of DNS, this is possibly a flake.

johndhong@phobrz:~$ ssh chromeos4-row13-rack1-host3
Warning: Permanently added 'chromeos4-row13-rack1-host3,100.115.216.3' (ED25519) to the list of known hosts.
localhost ~ # ping storage.googleapis.com
PING storage.l.googleusercontent.com (173.194.222.128) 56(84) bytes of data.
--- storage.l.googleusercontent.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5007ms
rtt min/avg/max/mdev = 173.592/173.699/173.989/0.437 ms
localhost ~ # cd /tmp/
localhost /tmp # curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz
HTTP/1.1 200 OK
X-GUploader-Customer: cloud-storage
X-GUploader-UploadID: AEnB2Uq6Ph0pSBrq6Cu3KP3c0ceiYSKCsClFoThF8xraN2JAaHqOjFj9KKPcufV_o9EP60zia6QuYgm6RZmMR1pjJaFD8eTBU7geazAW5VDQ6-RO3egftnM
Expires: Wed, 26 Sep 2018 23:27:50 GMT


To reduce the odds of flake, I'll make this a higher priority for some network work I am doing.
Happened on nyan_kitty-paladin again:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934252588913134544
I believe this time it's a different host: chromeos4-row13-rack2-host4.
Cc: gu...@chromium.org wryan@google.com
localhost /tmp # curl --head https://storage.googleapis.com/abci-ssp/autotest-containers/base_09.tar.xz
HTTP/1.1 200 OK
X-GUploader-Customer: cloud-storage
X-GUploader-UploadID: AEnB2UrcjW6y55l2Wj7jx7m5-kiW91CtAfqmK85wZ8wnKYB7k-YmiuWQX7pYZfKATdGM5aFnHHdrxe51i7z0KabhtVp3SgqN2w
Expires: Fri, 28 Sep 2018 01:47:01 GMT
Date: Fri, 28 Sep 2018 00:47:01 GMT

Logs:
https://storage.cloud.google.com/chromeos-autotest-results/242368223-chromeos-test/chromeos4-row13-rack1-host4/ssp_logs/debug/autoserv.DEBUG

Seems to happen during the nightly runs (if I recall correctly, a lot of tests run at that time), so it would not surprise me if it happened again in tonight's run.

New network gear has been prepped, so we're aiming to upgrade tomorrow.

The tricky part is that all the AIO DUTs will be offline intermittently as we start cutting the network over... We'll monitor very carefully to avoid blowing up CQ runs...

Are there any other instances of a similar issue?
Labels: Hotlist-Deputy
Owner: gu...@chromium.org
Reassigning to the new deputy.
Issue 890428 has been merged into this issue.
John, how is the network update going? Can I assume this issue shouldn't happen anymore?
We ran into installation issues, so we're aiming for today at best, but it will probably be tomorrow before we get it up and running.

I will sync with everyone here since it does involve messing with CQ DUTs...
Blockedon: 885199
Cc: akes...@chromium.org ihf@chromium.org
John, this code doesn't run in your lab; it runs on the Ganeti servers, which are overloaded (issue 885199). So nothing for you guys here.

Looking at the log, I think this is the container starting very slowly (like 5 minutes)?
https://stainless.corp.google.com/browse/chromeos-autotest-results/241855598-chromeos-test/

Indeed there was a load spike around that time:
https://viceroy.corp.google.com/chromeos/machines/?hostname=cros-full-0010&duration=11913&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537951683
To be absolutely sure it is not my lab network, the upgraded switch will be installed regardless; this is part of a larger lab network overhaul anyway.

The source of the flake could be either, but soon it won't be the lab network :)
Issue 892364 has been merged into this issue.
Labels: Hotlist-CrOS-Sheriffing
Cc: apronin@chromium.org yamaguchi@chromium.org briannorris@chromium.org
Issue 892520 has been merged into this issue.
Labels: -Pri-1 Pri-0
This is still happening; marking edgar-paladin experimental.

Labels: -Pri-0 Pri-1
From my understanding of ihf's comment, it looks like the edgar server is overloaded too?
https://viceroy.corp.google.com/chromeos/machines?duration=1h&hostname=cros-full-0003&refresh=-1&scheduler_host=cros-full-0036&sentinel_host=chromeos-server156&staging_master=chromeos-staging-master2&utc_end=1537951683

Hard to match the timeline of the test failure to the graph, as I'm not sure they use the same timezone...
Not sure if I can do anything here. Open to suggestions.
Work with the lab-eng team to make DNS more reliable, or work with the test authors to get rid of the need for working DNS lookups.
Cc: jkop@chromium.org
Re #c21, thanks.
If I am right, the problem doesn't happen in the lab; it happens on Ganeti instead.
Additionally, the code in question is not in the tests; it's in infra code (creating the lxc process).

+jkop anyway.
1) Escalate DNS failures to the Ganeti team.
2) Look for ways to harden the infra code, maybe with retries?

No reason not to do both in parallel.
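
A minimal sketch of what (2) could look like, assuming the check boils down to a plain curl subprocess call (the helper name and parameters below are hypothetical, not existing infra code):

import subprocess
import time

def curl_head_with_retries(url, attempts=3, backoff_s=10):
    # Retry transient failures such as DNS errors (curl rc=6).
    for attempt in range(attempts):
        rc = subprocess.call(
            ["curl", "--head", "--silent", "--output", "/dev/null", url])
        if rc == 0:
            return True
        time.sleep(backoff_s * (attempt + 1))  # back off longer each try
    return False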
Re #23:
1) Filed b/117332879 to the Ganeti team.
2) I don't think retry can help. The pattern is that all curl requests to googleapis.com fail in a test, and it may last for 5 minutes. My gut feeling is that this is related, as mentioned in some comments, to server load instead of the network.

Blockedon: 882562 878403
Cc: ayatane@chromium.org rasputin@google.com
Owner: ihf@chromium.org
Status: ExternalDependency (was: Available)
The log in #15 and the graph in #19 clearly show that cros-full-0003 (a Ganeti server) is deeply overloaded.

This can be confirmed locally by running "time sudo true": when the host is overloaded this takes on the order of 30s (instead of <1s). Using "sudo" everywhere doesn't help the situation, but the root cause is that the Ganeti team is not giving our virtual Ganeti servers as many CPU cycles as they used to.

Aviv and Ganeti folks are on that.
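
For reference, the overload check from above, scripted (a hypothetical stdlib-only helper; the 5-second threshold is an assumption, not a documented limit):

import subprocess
import time

def sudo_latency_s():
    # Time `sudo true`: ~30s on an overloaded host vs. <1s normally.
    start = time.monotonic()
    subprocess.check_call(["sudo", "true"])
    return time.monotonic() - start

if sudo_latency_s() > 5:
    print("host looks overloaded")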
Issue 893320 has been merged into this issue.
Cc: stagenut@chromium.org
Cc: newcomer@chromium.org kbleicher@chromium.org lgcheng@google.com levarum@google.com
Issue 894250 has been merged into this issue.
Cc: -apronin@chromium.org
Labels: ReleaseBlock-Beta M-71
Just a heads up, this is blocking the chrome PFQ.
peach-pit-chrome-pfq and tricky-chrome-pfq are both failing due to HWTest, but it looks a bit different. Are they related?

Link:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8932989727528596960
Cc: -briannorris@chromium.org
Re #31: both tricky and peach_pit passed their hwtests as far as I can tell, but something happened after that. Can you file a new issue and CC me on it? I will leave some notes.
TY sir, 894526 was born!
Labels: -ReleaseBlock-Beta
This is no longer blocking the PFQ; removing ReleaseBlock-Beta.
Labels: -Hotlist-Deputy
