
Issue 855664

Starred by 5 users

Issue metadata

Status: Fixed
Merged: issue 829871
Owner:
Closed: Jul 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




moblab-generic-vm-pre-cq can't download gsutils from storage.googleapis.com

Project Member Reported by briannorris@chromium.org, Jun 22 2018

Issue description

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1099190

and its companion (listed as co-dependent, but the only true dependency should be the latter to the former):

https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1087527

have had a very hard time passing the PreCQ, for over a week now.

There are a couple different symptoms, some of which were known issues:

#1
The moblab-generic-vm-pre-cq trybot for your change crashed.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943451098642350224


#2
moblab-generic-vm-pre-cq: The MoblabVMTest stage failed: return code: 1; command: cros_sdk -- test_that --no-quickmerge -b moblab-generic-vm --results_dir /tmp/cbuildbotE7POj9/results localhost:11722 moblab_DummyServerNoSspSuite --args 'services_init_timeout_m=10 target_build="moblab-generic-vm-p in https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943449783213160960

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943363905347811232

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943325067823785472

Most (all?) of these show this in the MoblabVMTest logs:

tko parser:   Error Message: DevServerException: All devservers in subnet: unrestricted subnet are currently down: set(['http://192.168.231.1:8080']). (dut hostname: None)


#3

moblab-generic-vm-pre-cq: The BuildPackages stage failed: Packages failed in ./build_packages: chromeos-base/chromeos-bsp-moblab in https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943235816715022688

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943092965854764736


chromeos-bsp-moblab-0.0.1-r31: GIT update -->
chromeos-bsp-moblab-0.0.1-r31:    repository:               https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git
chromeos-bsp-moblab-0.0.1-r31:    at the commit:            7f4dc7ac892038caf67b4e3a325422e23994a244
chromeos-bsp-moblab-0.0.1-r31:    commit:                   dfc54e352903a0b43ae65e5224399a86b25a1b0b
chromeos-bsp-moblab-0.0.1-r31:    branch:                   master
chromeos-bsp-moblab-0.0.1-r31:    storage directory:        "/var/cache/chromeos-cache/distfiles/target/egit-src/chromiumos/overlays/board-overlays"
chromeos-bsp-moblab-0.0.1-r31:    checkout type:            bare repository
chromeos-bsp-moblab-0.0.1-r31: Cloning into '/build/moblab-generic-vm/tmp/portage/chromeos-base/chromeos-bsp-moblab-0.0.1-r31/work/chromeos-bsp-moblab-0.0.1'...
chromeos-bsp-moblab-0.0.1-r31: done.
chromeos-bsp-moblab-0.0.1-r31: fatal: reference is not a tree: dfc54e352903a0b43ae65e5224399a86b25a1b0b
chromeos-bsp-moblab-0.0.1-r31:  * ERROR: chromeos-base/chromeos-bsp-moblab-0.0.1-r31::moblab failed (unpack phase):
chromeos-bsp-moblab-0.0.1-r31:  *   git-2_branch: changing the branch failed


---

I believe symptom #2 is the most common.
 
Issue #3 is covered by http://crbug.com/854633 and can be ignored here.
#1 is interesting, but unrelated to moblab in any way.

https://chrome-swarming.appspot.com/task?id=3e27a9c3f7a98a10&refresh=10&show_raw=1

Let's focus on #2 in this bug.
Cc: mortonm@chromium.org
To be clear, that is the reason moblab put out the error message: DevServerException: All devservers in subnet: unrestricted subnet are currently down: set(['http://192.168.231.1:8080']). (dut hostname: None)

What needs work is why curl cannot access 'storage.googleapis.com' from the VM.
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Labels: -Pri-3 Pri-1
Owner: jkop@chromium.org
Status: Assigned (was: Untriaged)
This is squarely in the hardware testing space. Over to the current Infra Deputy.
Cc: jclinton@chromium.org
Components: Infra>Client>ChromeOS>CI
Don reminded me that some of this (the VM component) is still CI until we complete the VM move to HWTest. So, re-adding the CI component to monitor.

Still, the immediate issue seems to be related to DevServer so it would be good if jkop@ can take a look at those error messages and root cause.

Comment 9 by jkop@chromium.org, Jun 22 2018

Summary: moblab-generic-vm-pre-cq can't download gsutils from storage.googleapis.com (was: moblab-generic-vm-pre-cq: multiple failures over a week)
Cc: pprabhu@chromium.org ihf@chromium.org
The working theory from IRC chat is that the networking bridge between the VM instance and the outside world is failing for unknown reasons.

That suggests to me that this is either an issue with the GCE kernel, or a timing issue related to starting up the VM.

Either way, we need someone with more VM expertise to diagnose.
How are the tests getting started and test logs getting off of the VM if networking is broken?
Specifically, the test is not able to resolve a host, so DNS is down at that point, not necessarily all networking:

curl: (6) Couldn't resolve host 'storage.googleapis.com'
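
A minimal sketch of how the "DNS down" and "all networking down" cases could be told apart from inside the VM. This is not taken from the moblab code; the function name `diagnose`, the injected `resolve`/`connect` parameters, and the probe address 8.8.8.8:53 are assumptions chosen for illustration.

```python
import socket


def diagnose(hostname, ip_probe=("8.8.8.8", 53), timeout=3.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Classify connectivity as 'ok', 'dns-failure', or 'no-network'.

    ip_probe is a hypothetical well-known IP/port reached without DNS;
    resolve and connect are injectable so the logic can be exercised offline.
    """
    try:
        # Name resolution is the step curl error (6) complains about.
        resolve(hostname, 443)
        return "ok"
    except socket.gaierror:
        pass  # Resolution failed; find out whether raw IP traffic still works.

    try:
        conn = connect(ip_probe, timeout=timeout)
        conn.close()
        return "dns-failure"  # IP connectivity works, only DNS is broken.
    except OSError:
        return "no-network"  # Neither DNS nor a direct IP connection works.
```

Calling `diagnose("storage.googleapis.com")` on the VM would then show whether only name resolution is affected.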
#11 is a good question.

Pure speculation: the bridge works for host-to-VM connections, but not VM-to-world?
Mergedinto: 829871
Status: Duplicate (was: Assigned)
We've seen it before but it didn't happen often enough and we didn't see it for a long while afterwards. See duped bug.
I have done what I can to make sure that DNS and a network connection are present before moblab starts. However, I am still just playing whack-a-mole with the network issues: now downloads are timing out (the timeout is 180 secs; the same file takes 3 secs to download on my desktop). crbug.com/855941

It seems to me the crosvm networking is not working correctly. I will keep trying to figure out band-aids, but it would be good if someone who knows the VM system could look into the root cause, or at least point me to how to investigate networking issues on the VM.
Cc: davidri...@chromium.org grundler@chromium.org
Owner: haddowk@chromium.org
Status: Assigned (was: Duplicate)
Merging Issue 829871 here since @Keith is actively working on this.

cc @grundler & @david in case it's the same issue described by @grundler at today's handoff meeting.
Cc: gu...@chromium.org xixuan@chromium.org
Issue 829871 has been merged into this issue.
Xixuan: a bit more context would have been helpful to David. :)

And I don't think this is the same issue David saw in the provision issues he is working on. This bug is about DNS failures in a VM, not DNS failures on machines directly sending DNS requests to the wrong DNS servers (or to misconfigured DNS servers).
I have a physical moblab that is experiencing similar issues. What would make DNS fail for user moblab but work for user root?

localhost /home/moblab # dig google.com

; <<>> DiG 9.10.2 <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47780
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             211     IN      A       172.217.2.46

;; Query time: 64 msec
;; SERVER: 100.109.178.168#53(100.109.178.168)
;; WHEN: Tue Jun 26 20:34:54 PDT 2018
;; MSG SIZE  rcvd: 55

localhost /home/moblab # exit
moblab@localhost ~ $ dig google.com

; <<>> DiG 9.10.2 <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached
Best guess is that the problem is not being able to access resolv.conf:

moblab@localhost ~ $ cat /etc/resolv.conf 
cat: /etc/resolv.conf: Permission denied
moblab@localhost ~ $ ls -lrt /etc/resolv.conf 
lrwxrwxrwx 1 root root 22 Jun 26 15:04 /etc/resolv.conf -> /run/shill/resolv.conf

moblab@localhost ~ $ ls -lrt /run/shill/resolv.conf
ls: cannot access '/run/shill/resolv.conf': Permission denied
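
The manual `cat`/`ls` steps above can be sketched as a small helper that follows the symlink and reports, for every directory on the way to the real file, whether the current user can get through it. The function name `path_access_report` is hypothetical and not part of moblab or autotest; this is only an illustration of the check.

```python
import os
import stat


def path_access_report(path):
    """Return (component, mode, accessible) for the resolved path and each
    ancestor, root first. Directories need search (x) permission; the final
    file needs read (r) permission."""
    real = os.path.realpath(path)  # follows symlinks such as /etc/resolv.conf

    # Collect the resolved target and all of its ancestors.
    chain = []
    current = real
    while True:
        chain.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    chain.reverse()

    report = []
    for component in chain:
        try:
            st = os.stat(component)
            mode = stat.filemode(st.st_mode)
            is_dir = stat.S_ISDIR(st.st_mode)
        except OSError:
            # stat itself fails when a parent directory is not searchable.
            mode, is_dir = "??????????", False
        needed = os.X_OK if is_dir else os.R_OK
        report.append((component, mode, os.access(component, needed)))
    return report
```

Run as the moblab user against `/etc/resolv.conf`, the first entry whose last field is `False` would point at the directory or file with the restrictive permissions.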

So...I guess one of those CLs *was* problematic:

https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1087527/9

It's setting the wrong permissions for /run/shill/.

So is the $subject problem happening only on trybot runs with that CL? If so, then I think it's working as intended (rejecting the bad CL). It's just very opaque about why.

And if this is indeed only a problem with this CL: can this vm test be improved to make this easier to discern? For one, I don't recall seeing /var/log/net.log in the sysinfo. I thought we normally collect that?
So in the CQ, https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/9774

and https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/9775

were correctly rejected because of the bad CL. I was hoping that I had a reproducible issue on a real device to make it easier to debug; I doubt that is the case now. I will keep working on the precq failures.
I'm still kinda disappointed that we didn't understand the (in retrospect) blatantly broken CL until now though :( At least it was mostly dying in the PreCQ, so it didn't affect so many other people.

> I will keep working on the precq failures

Is that to say that you've found recent $subject PreCQ failures that did *not* include CL:1087527?
I need to check. I think there were some; the only one I have at the moment is this one:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942651661289006544

but it failed for a totally different reason (DUT VM did not start). I just want to be thorough and check through all the recent failures.

I added an "internet ready check" on moblab bootup, thinking this was being caused by a network timing issue in the VM. Those checks only run as root; I need to have some checks as the moblab user as well. At least this would make similar issues in the future easier to debug.
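
A rough sketch of what a per-user readiness check could look like, not the actual moblab bootup check. The names `wait_until` and `dns_works_as` are hypothetical, and the sketch assumes the caller is root, that `sudo` is available on the image, and that the target user can run the same Python interpreter.

```python
import subprocess
import sys
import time


def wait_until(check, timeout_s=60.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def dns_works_as(user, hostname="storage.googleapis.com"):
    """Resolve hostname in a child process running as `user` (via sudo -u).

    Running the lookup as the unprivileged user exercises that user's view
    of /etc/resolv.conf, which a root-only check would miss.
    """
    probe = "import socket, sys; socket.getaddrinfo(sys.argv[1], 443)"
    cmd = ["sudo", "-u", user, sys.executable, "-c", probe, hostname]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

A bootup script could then call something like `wait_until(lambda: dns_works_as("moblab"))` in addition to the existing root check and log which user failed.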
Status: Fixed (was: Assigned)
Marking this as fixed: the root cause (the bad CL) was fixed and things have been stable for some time.
