moblab-generic-vm-pre-cq can't download gsutils from storage.googleapis.com
Issue description

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1099190 and its companion (listed as co-dependent, but the only true dependency should be the latter to the former): https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1087527 have had a very hard time passing the PreCQ, for over a week now. There are a couple different symptoms, some of which were known issues:

#1 The moblab-generic-vm-pre-cq trybot for your change crashed.
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943451098642350224

#2 moblab-generic-vm-pre-cq: The MoblabVMTest stage failed: return code: 1; command: cros_sdk -- test_that --no-quickmerge -b moblab-generic-vm --results_dir /tmp/cbuildbotE7POj9/results localhost:11722 moblab_DummyServerNoSspSuite --args 'services_init_timeout_m=10 target_build="moblab-generic-vm-p
in
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943449783213160960
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943363905347811232
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943325067823785472
Most (all?) of these show this in the MoblabVMTest logs:
tko parser: Error Message: DevServerException: All devservers in subnet: unrestricted subnet are currently down: set(['http://192.168.231.1:8080']). (dut hostname: None)

#3 moblab-generic-vm-pre-cq: The BuildPackages stage failed: Packages failed in ./build_packages: chromeos-base/chromeos-bsp-moblab
in
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943235816715022688
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8943092965854764736
chromeos-bsp-moblab-0.0.1-r31: GIT update -->
chromeos-bsp-moblab-0.0.1-r31: repository: https://chromium.googlesource.com/chromiumos/overlays/board-overlays.git
chromeos-bsp-moblab-0.0.1-r31: at the commit: 7f4dc7ac892038caf67b4e3a325422e23994a244
chromeos-bsp-moblab-0.0.1-r31: commit: dfc54e352903a0b43ae65e5224399a86b25a1b0b
chromeos-bsp-moblab-0.0.1-r31: branch: master
chromeos-bsp-moblab-0.0.1-r31: storage directory: "/var/cache/chromeos-cache/distfiles/target/egit-src/chromiumos/overlays/board-overlays"
chromeos-bsp-moblab-0.0.1-r31: checkout type: bare repository
chromeos-bsp-moblab-0.0.1-r31: Cloning into '/build/moblab-generic-vm/tmp/portage/chromeos-base/chromeos-bsp-moblab-0.0.1-r31/work/chromeos-bsp-moblab-0.0.1'...
chromeos-bsp-moblab-0.0.1-r31: done.
chromeos-bsp-moblab-0.0.1-r31: fatal: reference is not a tree: dfc54e352903a0b43ae65e5224399a86b25a1b0b
chromeos-bsp-moblab-0.0.1-r31: * ERROR: chromeos-base/chromeos-bsp-moblab-0.0.1-r31::moblab failed (unpack phase):
chromeos-bsp-moblab-0.0.1-r31: * git-2_branch: changing the branch failed

---
I believe symptom #2 is the most common.
,
Jun 22 2018
#1 is interesting, but unrelated to moblab in any way. https://chrome-swarming.appspot.com/task?id=3e27a9c3f7a98a10&refresh=10&show_raw=1 Let's focus on #2 in this bug.
,
Jun 22 2018
,
Jun 22 2018
#2 is caused by the dev server being unable to download gsutils:
curl: (6) Couldn't resolve host 'storage.googleapis.com'
https://storage.cloud.google.com/chromeos-image-archive/moblab-generic-vm-pre-cq/R69-10807.0.0-b2686953/moblab_vm_test_results/results-1-moblab_DummyServerNoSspSuite/moblab_RunSuite/sysinfo/var/log/devserver/console.log?_ga=2.209903698.-61875620.1524681804
,
Jun 22 2018
To be clear, that is the reason moblab put out the error message:
DevServerException: All devservers in subnet: unrestricted subnet are currently down: set(['http://192.168.231.1:8080']). (dut hostname: None)
Why curl cannot access 'storage.googleapis.com' from the VM is what needs work.
,
Jun 22 2018
This is squarely in the hardware testing space. Over to the current Infra Deputy.
,
Jun 22 2018
,
Jun 22 2018
Don reminded me that some of this (the VM component) is still CI until we complete the VM move to HWTest. So, readding the CI component to monitor. Still, the immediate issue seems to be related to DevServer so it would be good if jkop@ can take a look at those error messages and root cause.
,
Jun 22 2018
,
Jun 22 2018
The working theory from IRC chat is that the networking bridge between the VM instance and the outside world is failing for unknown reasons. That suggests to me that this is either an issue with the GCE kernel, or a timing issue related to starting up the VM. Either way, we need someone with more VM expertise to diagnose.
,
Jun 22 2018
How are the tests getting started and test logs getting off of the VM if networking is broken?
,
Jun 22 2018
Specifically, the test is not able to resolve a host, so DNS is down at that point, not necessarily all networking:
curl: (6) Couldn't resolve host 'storage.googleapis.com'
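One way to tell "DNS is down" apart from "all networking is down" is to probe a literal IP address when the name lookup fails. The following is only a sketch of that idea, not anything moblab or the devserver actually runs: the function name `classify_failure`, the probe address 8.8.8.8 on TCP port 53, and the 5 second timeout are all assumptions chosen for illustration. The `resolve` and `connect` parameters exist so the logic can be exercised without a real network.

```python
import socket


def classify_failure(host, port=443, probe_ip="8.8.8.8",
                     resolve=socket.getaddrinfo,
                     connect=socket.create_connection):
    """Return 'ok', 'dns' (lookup fails but IP traffic works), or 'network'."""
    try:
        resolve(host, port)
    except socket.gaierror:
        # Name lookup failed. Probe a literal IP (assumed reachable when the
        # network is healthy) to see whether packets still leave the machine.
        try:
            connect((probe_ip, 53), timeout=5).close()
        except OSError:
            return "network"
        return "dns"
    try:
        connect((host, port), timeout=5).close()
    except OSError:
        return "network"
    return "ok"
```

A result of "dns" would match the curl error above (error 6 is a resolution failure), while "network" would point at the bridge itself.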
,
Jun 22 2018
#11 is a good question. Pure speculation.... the bridge works for host to VM connections, but not VM to world?
,
Jun 23 2018
We've seen it before but it didn't happen often enough and we didn't see it for a long while afterwards. See duped bug.
,
Jun 24 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/925a306d3bde497d239357e85518e6985d746772

commit 925a306d3bde497d239357e85518e6985d746772
Author: Keith Haddow <haddowk@chromium.org>
Date: Sun Jun 24 10:36:43 2018

[moblab] Add in some dns debugging commands for devserver.

TEST=ad hoc test on moblab, tryjob
BUG= chromium:855664

Change-Id: Ie3ba96da60f16b5544cd43479e10152af08f9103
Reviewed-on: https://chromium-review.googlesource.com/1112743
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Keith Haddow <haddowk@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[add] https://crrev.com/925a306d3bde497d239357e85518e6985d746772/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-internetcheck.conf
[modify] https://crrev.com/925a306d3bde497d239357e85518e6985d746772/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-devserver-init.conf
[modify] https://crrev.com/925a306d3bde497d239357e85518e6985d746772/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/moblab-base-container-init.conf
,
Jun 24 2018
I have done what I can to make sure that DNS and a network connection are present before moblab starts. However, I am still just playing whack-a-mole with the network issues: now downloads are timing out (the timeout is 180 secs; the same file takes 3 secs to download on my desktop), see crbug.com/855941. It seems to me the crosvm networking is not working correctly. I will keep trying to figure out band-aids, but it would be good if someone who knows about the VM system could look into the root cause, or could at least point me to how to investigate networking issues on the VM.
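For readers unfamiliar with this kind of band-aid, a "wait until DNS works before starting" check might look roughly like the sketch below. It is not the contents of moblab-internetcheck.conf (that file is an init job and is not shown in this thread); the function name `wait_for_dns` and the default timeout and interval values are hypothetical. The `resolve`, `clock`, and `sleep` parameters are injectable only so the loop can be tested without waiting.

```python
import socket
import time


def wait_for_dns(host, timeout_s=60.0, interval_s=2.0,
                 resolve=socket.gethostbyname,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until `host` resolves; return True on success, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror is a subclass of OSError, so failed lookups
            # land here. Give up once the deadline has passed.
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

A check like this only proves that the calling user can resolve names at boot; it says nothing about later download throughput or about other users on the same machine.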
,
Jun 25 2018
Merge Issue 829871 here since @Keith is actively working on this. cc @grundler & @david in case it's the same issue described by @grundler at today's handoff meeting.
,
Jun 25 2018
,
Jun 25 2018
Xixuan: a bit more context would have been helpful to David. :) And I don't think this is the same issue David saw in the provision issues he worked on. This bug is about DNS failures in a VM, not DNS failures on machines directly sending DNS requests to the wrong DNS servers (or to misconfigured DNS servers).
,
Jun 27 2018
I have a physical moblab that is experiencing similar issues. What would make DNS fail for user moblab but work for user root?

localhost /home/moblab # dig google.com

; <<>> DiG 9.10.2 <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47780
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com. IN A

;; ANSWER SECTION:
google.com. 211 IN A 172.217.2.46

;; Query time: 64 msec
;; SERVER: 100.109.178.168#53(100.109.178.168)
;; WHEN: Tue Jun 26 20:34:54 PDT 2018
;; MSG SIZE rcvd: 55

localhost /home/moblab # exit
moblab@localhost ~ $ dig google.com

; <<>> DiG 9.10.2 <<>> google.com
;; global options: +cmd
;; connection timed out; no servers could be reached
,
Jun 27 2018
Best guess is that not being able to access resolv.conf is the problem:

moblab@localhost ~ $ cat /etc/resolv.conf
cat: /etc/resolv.conf: Permission denied
moblab@localhost ~ $ ls -lrt /etc/resolv.conf
lrwxrwxrwx 1 root root 22 Jun 26 15:04 /etc/resolv.conf -> /run/shill/resolv.conf
moblab@localhost ~ $ ls -lrt /run/shill/resolv.conf
ls: cannot access '/run/shill/resolv.conf': Permission denied
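The shell session above finds the blocked path by hand. The same walk can be sketched in code: follow the symlink, then check each directory on the way for search permission and the final file for read permission. This is an illustrative helper only; the name `first_blocked_component` is hypothetical, it assumes POSIX-style paths, and it reports permissions for whichever user runs it (so it would need to run as moblab to reproduce the failure).

```python
import os


def first_blocked_component(path, access=os.access):
    """Return the first path component the caller cannot traverse or read.

    Returns None when every directory is searchable and the file is readable.
    """
    # Follow symlinks such as /etc/resolv.conf -> /run/shill/resolv.conf.
    target = os.path.realpath(path)
    parts = target.strip(os.sep).split(os.sep)
    current = os.sep
    for part in parts[:-1]:
        current = os.path.join(current, part)
        # Directories need search (x) permission to be traversed.
        if not access(current, os.X_OK):
            return current
    # The file itself needs read permission.
    if not access(target, os.R_OK):
        return target
    return None
```

Called as the affected user on `/etc/resolv.conf`, a helper like this would be expected to name the directory or file whose mode is wrong, rather than only reporting "Permission denied" on the symlink.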
,
Jun 27 2018
So...I guess one of those CLs *was* problematic: https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1087527/9 It's setting the wrong permissions for /run/shill/. So is the $subject problem happening only on trybot runs with that CL? If so, then I think it's working as intended (rejecting the bad CL). It's just very opaque about why. And if this is indeed only a problem with this CL: can this vm test be improved to make this easier to discern? For one, I don't recall seeing /var/log/net.log in the sysinfo. I thought we normally collect that?
,
Jun 27 2018
So in the CQ, https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/9774 and https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/9775 were correctly rejected because of the bad CL. I was hoping that I had a reproducible issue on a real device to make it easier to debug, but I doubt that is the case now. I will keep working on the precq failures
,
Jun 27 2018
I'm still kinda disappointed that we didn't understand the (in retrospect) blatantly broken CL until now though :( At least it was mostly dying in the PreCQ, so it didn't affect so many other people.
> I will keep working on the precq failures
Is that to say that you've found recent $subject PreCQ failures that did *not* include CL:1087527?
,
Jun 27 2018
I need to check. I think there were some; the only one I have at the moment is this one: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942651661289006544 but it failed for a totally different reason (DUT VM did not start). I just want to be thorough and check through all the recent failures. I added an "internet ready check" on moblab bootup, thinking this was being caused by a network timing issue in the VM. Those checks only work as root; I need to have some checks as the moblab user as well. At least this would make similar issues in the future easier to debug.
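A per-user version of such a check could be sketched as follows. This is a guess at the shape of the idea, not the actual moblab check: the function name `dns_check_by_user` is hypothetical, and it assumes that `sudo` and `getent` are available on the image and that the caller is allowed to run commands as the listed users without a password prompt. The `run` parameter is injectable so the logic can be tested without sudo.

```python
import subprocess


def dns_check_by_user(users, host, run=subprocess.run):
    """Map each user to True/False: did a name lookup succeed as that user?"""
    results = {}
    for user in users:
        # sudo -n never prompts for a password; getent hosts exits non-zero
        # when the lookup fails.
        cmd = ["sudo", "-n", "-u", user, "getent", "hosts", host]
        proc = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        results[user] = (proc.returncode == 0)
    return results
```

With the symptoms reported in this thread, a call such as `dns_check_by_user(["root", "moblab"], "storage.googleapis.com")` would be expected to succeed for root and fail for moblab, which narrows the problem to permissions rather than the VM network.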
,
Jul 16
Marking this as fixed - the root cause was fixed (bad CL) and things have been stable for some time.
Comment 1 by dgarr...@chromium.org
, Jun 22 2018