VM(betty) losts its internet connection to host after long run test. |
||||||||
Issue descriptionVM Chrome OS Version: 9932.0.0 (Official Build) dev-channel betty test VM Chrome Version: 63.0.3212.0 (Official Build) Host OS: Linux 4.4.0-89-generic, Goobuntu What steps will reproduce the problem? (1) Download betty image from goldeneye. (2) Launch betty through kvm. The command I used as followed "sudo kvm -m 2G -smp 4 -vga cirrus -pidfile /tmp/kvm.pid -net nic,model=virtio -net user,hostfwd=tcp:127.0.0.1:9222-:22 -hda chromiumos_qemu_image.bin" (3) In chroot, type "test_that 127.0.0.1:9222 cheets_CTS_N.7.1_r9.arm.CtsDeqpTestCases --board=betty --debug" which will launch a long run test. What is the expected result? Host can access the VM through ssh, and VM can connect to the host network. What happens instead? In about 3-4 hours of running, Host can't connect to the VM through ssh anymore. Even though host network can connect to public internet, VM can't access it as well. (See attachment for VM screenshot)
,
Sep 18 2017
Kevin, could you take a look?
,
Sep 18 2017
The symptom looks suspiciously similar to https://bugs.chromium.org/p/chromium/issues/detail?id=748634#c29 This was originally attributed to the inbound connection rules (which have been reverted), but the failure was intermittent enough that we couldn't quite be sure they were to blame.
,
Sep 18 2017
We have a experiment builder constantly running long run test in the VM, which the connection failure seems quite consistent. https://uberchromegw.corp.google.com/i/chromeos/builders/betty-vmtest-informational (Please skip build #13/#12 as they are different failure)
,
Sep 18 2017
,
Sep 18 2017
,
Sep 19 2017
I was able to reproduce this locally. This is the first dhcp-failed in the logs: 2017-09-18T17:37:08.426158-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 4.7 seconds 2017-09-18T17:37:13.173896-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 8.5 seconds 2017-09-18T17:37:21.647208-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 16.0 seconds 2017-09-18T17:37:32.526569-07:00 ERR shill[847]: [ERROR:dhcp_config.cc(285)] Timed out waiting for DHCP lease on eth0 (after 30 seconds). 2017-09-18T17:37:32.526993-07:00 INFO shill[847]: [INFO:service.cc(332)] Disconnecting from service 1: OnIPConfigFailure 2017-09-18T17:37:32.559965-07:00 INFO dhcpcd[1012]: status changed to Release 2017-09-18T17:37:32.560315-07:00 INFO shill[847]: [INFO:dhcp_config.cc(210)] Stopping 1012 (ReleaseIP) 2017-09-18T17:37:32.560460-07:00 INFO dhcpcd[1012]: received SIGTERM, stopping 2017-09-18T17:37:32.560472-07:00 INFO dhcpcd[1012]: eth0: removing interface 2017-09-18T17:37:32.560498-07:00 INFO dhcpcd[1012]: status changed to Release 2017-09-18T17:37:32.560843-07:00 INFO dhcpcd[1012]: dhcpcd exited 2017-09-18T17:37:32.564764-07:00 INFO shill[847]: [INFO:service.cc(400)] Service 1: state Configuring -> Idle 2017-09-18T17:37:32.564857-07:00 INFO shill[847]: [INFO:manager.cc(1456)] Service 1 updated; state: Idle failure Unknown 2017-09-18T17:37:32.565698-07:00 INFO shill[847]: [INFO:service.cc(400)] Service 1: state Idle -> Failure 2017-09-18T17:37:32.565747-07:00 INFO shill[847]: [INFO:service.cc(986)] Noting an unexpected failure to connect. 2017-09-18T17:37:32.565821-07:00 INFO shill[847]: [INFO:manager.cc(1456)] Service 1 updated; state: Failure failure dhcp-failed The last dhcpcd activity happened an hour prior: 2017-09-18T16:25:38.376643-07:00 NOTICE dhcpcd[969]: dhcpcd is running with reduced privileges 2017-09-18T16:25:38.377323-07:00 INFO dhcpcd[969]: status changed to Init 2017-09-18T16:25:38.377968-07:00 INFO shill[816]: [INFO:dhcp_config.cc(145)] Init DHCP Proxy: eth0 at :1.12 2017-09-18T16:25:38.615542-07:00 INFO dhcpcd[969]: status changed to Reboot 2017-09-18T16:25:38.615604-07:00 INFO dhcpcd[969]: eth0: rebinding lease of 10.0.2.15 2017-09-18T16:25:38.615648-07:00 INFO dhcpcd[969]: eth0: sending REQUEST (xid 0xe73ba565), next in 3.8 seconds 2017-09-18T16:25:38.615956-07:00 INFO dhcpcd[969]: eth0: received ACK with xid 0xe73ba565 2017-09-18T16:25:38.615967-07:00 INFO dhcpcd[969]: eth0: acknowledged 10.0.2.15 from 10.0.2.2 2017-09-18T16:25:38.615988-07:00 INFO dhcpcd[969]: eth0: leased 10.0.2.15 for 86400 seconds 2017-09-18T16:25:38.616099-07:00 INFO dhcpcd[969]: event REBOOT on interface eth0 2017-09-18T16:25:38.616195-07:00 INFO dhcpcd[969]: status changed to Bound The dhcp-failure happened right after an OOM and a reboot: 2017-09-18T17:36:47.109613-07:00 ERR kernel: [ 4276.152210] Out of memory: Kill process 10326 (awelements.deqp) score 581 or sacrifice child 2017-09-18T17:36:47.109618-07:00 ERR kernel: [ 4276.152243] Killed process 10326 (awelements.deqp) total-vm:1267288kB, anon-rss:10532kB, file-rss:77336kB 2017-09-18T17:36:47.111498-07:00 ERR kernel: [ 4276.155478] Out of memory: Kill process 5183 (id.ext.services) score 487 or sacrifice child 2017-09-18T17:36:47.111507-07:00 ERR kernel: [ 4276.155500] Killed process 5183 (id.ext.services) total-vm:1032100kB, anon-rss:1432kB, file-rss:49960kB [...] 2017-09-18T17:36:58.209893-07:00 NOTICE kernel: [ 0.000000] Linux version 4.4.86-11631-g2c4551465d0b (chrome-bot@build109-m2.golo.chromium.org) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r163-0c5a656a1322e137fa4a251f2ccc6c4022918c0a_4.9.2-r163) ) #1 SMP PREEMPT Mon Sep 11 12:15:43 PDT 2017 2017-09-18T17:36:58.209899-07:00 INFO kernel: [ 0.000000] Command line: BOOT_IMAGE=vmlinuz.A init=/sbin/init boot=local rootwait ro noresume noswap loglevel=7 noinitrd console= dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 root=/dev/dm-0 i915.modeset=1 cros_legacy cros_debug dm="1 vroot none ro 1,0 4096000 verity payload=PARTUUID=9A54AA70-AC44-914D-A892-2FB2746F3002 hashtree=PARTUUID=9A54AA70-AC44-914D-A892-2FB2746F3002 hashstart=4096000 alg=sha1 root_hexdigest=e9ff8cb0f588425d38ac25a7d2c512e17fcd5b92 salt=cf2b20e473ed1c620a7b2e60a3ea675aa42778d8addd9abb9e1e7cf82f5854c8" If I reboot again without stopping qemu, dhcpcd still fails. So I don't think this is a Chrome OS issue. I can manually configure the network interface with `ifconfig eth0 10.0.2.15 netmask 255.255.255.0 up` and then ssh to the VM again. However, I cannot ping 10.0.2.2 (which is possible when the VM is working normally). The fact that ssh still works in this setup indicates that two-way data traffic is OK, but something is probably wrong with qemu's internal networking services. I wondered if random MAC addresses were being used (which might confuse a simple DHCP server that only allocates one IP), but the MAC address does not change from one boot to the next. It is always 52:54:00:12:34:56. These are the networking options we are using: -net nic,model=virtio -net user,hostfwd=tcp:127.0.0.1:9222-:22 I'm retesting with a ToT build of qemu. `kvm --version` reports v2.0.0 from 2014. Is that what you are using too? BTW do we really need to start qemu as root? For me it seems to be working OK as a normal user.
,
Sep 19 2017
`kvm --version` reports 'QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.35), Copyright (c) 2003-2008 Fabrice Bellard' in my side. I also tried to use the src/script/bin/cros_start_vm, which using chroot/usr/bin/qemu-system-x86_64(QEMU emulator version 2.6.0), to start vm, but got the same result.
,
Sep 19 2017
ToT qemu (commit a9158a5cba955b79) also lost networking during this test. So the bad news is that upgrading qemu won't fix it, but the good news is that I can attach to qemu with gdb and debug it with symbols. I see another problem with OOBE not adjusting to the VM screen resolution, so the system menu (and, more importantly, the power off button) overlap with the buttons at the bottom. This makes it hard to navigate through the UI, and caused my VM to power off when I was trying to click a button. Could someone please take a look?
,
Sep 19 2017
@9 you can resize the window, or navigate with the keyboard only, or use serial console (you don't really need UI to debug network problems).
,
Sep 19 2017
I don't think I can resize this window. Also, why does it need all of that extra whitespace if it doesn't fit on the screen? Keyboard navigation sometimes works, sometimes not. It is flaky. Serial console isn't enabled on these images AFAICT. I do need the UI to debug this problem. At the minimum I need to get into guest mode and open up a shell.
,
Sep 19 2017
@11: I just walked to your desk, and you were past OOBE, so you should be good to go. In the future, running any autotest that makes an account will skip OOBE for you, so that's another solution.
,
Sep 19 2017
@12: Thanks for looking. There have been cases where I needed to repeat OOBE in the course of debugging, like after the VM decides to powerwash itself. I'm not stuck _at the moment_, but that doesn't mean the bug isn't worth fixing. It did cost me a few hours earlier today. Since this is separate from the networking issue and it is a lower priority, I have filed bug 766883 to track it.
,
Sep 20 2017
I attached to the qemu process when the system was in a bad state, poked around at the slirp/mbuf data structures to see what was wrong, and came up with a test case that reproduces the same issue (attached). Proposed fix: http://lists.nongnu.org/archive/html/qemu-devel/2017-09/msg05795.html The first patch seems to fix the networking hang. This vmtest is still not stable (it seems to run out of memory and reboot the VM every 10 minutes or so) but I assume that if the networking problem is fixed, it will now be possible to debug the other test failures one by one. I have filed a ticket to patch our installed qemu package: http://t/28764965
,
Sep 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/1df5d18341378887bcf20375b65f22bd3ee15633 commit 1df5d18341378887bcf20375b65f22bd3ee15633 Author: Kevin Cernekee <cernekee@chromium.org> Date: Fri Sep 22 02:38:23 2017 app-emulation/qemu: Move package from portage-stable We need to carry a local patch for a networking stability bug until there is a new upstream release with the fix. BUG= chromium:766323 TEST=`sudo emerge qemu` CQ-DEPEND=CL:676530 Change-Id: I5645ad2b9469eeda8700b829c4348fdbf804f6d7 Reviewed-on: https://chromium-review.googlesource.com/676481 Commit-Ready: Kevin Cernekee <cernekee@chromium.org> Tested-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Bernie Thompson <bhthompson@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.5.0-cflags.patch [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.5.0-sysmacros.patch [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/metadata.xml [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/qemu-2.6.0.ebuild [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/65-kvm.rules [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/bridge.conf [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-binfmt.initd-r1 [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/Manifest [add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.6.0-crypto-static.patch
,
Sep 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/portage-stable/+/8f69adc22682b5d815eef597618f60ef6c5615bb commit 8f69adc22682b5d815eef597618f60ef6c5615bb Author: Kevin Cernekee <cernekee@chromium.org> Date: Fri Sep 22 02:38:23 2017 app-emulation/qemu: Move package to chromiumos_overlay We need to carry a local patch for a networking stability bug until there is a new upstream release with the fix. BUG= chromium:766323 TEST=`sudo emerge qemu` CQ-DEPEND=CL:676481 Change-Id: I20ff5cfe7f0b9ea611bee0700f69ad15b3f2733c Reviewed-on: https://chromium-review.googlesource.com/676530 Commit-Ready: Kevin Cernekee <cernekee@chromium.org> Tested-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Bernie Thompson <bhthompson@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.5.0-cflags.patch [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.5.0-sysmacros.patch [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/metadata.xml [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/qemu-2.6.0.ebuild [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/65-kvm.rules [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/bridge.conf [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-binfmt.initd-r1 [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/Manifest [delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.6.0-crypto-static.patch
,
Sep 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/78a164a4def15652ee834927b8cfda02eb7e6229 commit 78a164a4def15652ee834927b8cfda02eb7e6229 Author: Kevin Cernekee <cernekee@chromium.org> Date: Fri Sep 22 09:02:01 2017 app-emulation/qemu: Patch network send queue bug This corrects a problem that causes betty vmtests to intermittently lose connectivity. BUG= chromium:766323 TEST=`sudo emerge qemu` Change-Id: I765299d602a71980839aa54af08b1f12c4c91a03 Reviewed-on: https://chromium-review.googlesource.com/676532 Commit-Ready: Kevin Cernekee <cernekee@chromium.org> Tested-by: Kevin Cernekee <cernekee@chromium.org> Reviewed-by: Ilja H. Friedel <ihf@chromium.org> Reviewed-by: Bernie Thompson <bhthompson@chromium.org> Reviewed-by: Mike Frysinger <vapier@chromium.org> [modify] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/qemu-2.6.0.ebuild [add] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/files/qemu-2.6.0-slirp-send-queue-hangs.patch [add] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/qemu-2.6.0-r1.ebuild
,
Sep 25 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/cheets-scripts/+/84f43c5f0ac96a7033c38cea92371f7b57c9de56 commit 84f43c5f0ac96a7033c38cea92371f7b57c9de56 Author: Kevin Cernekee <cernekee@chromium.org> Date: Mon Sep 25 18:32:18 2017
,
Sep 25 2017
This is believed to be fixed with the qemu patch.
,
Sep 25 2017
Thanks Kevin. Verified with the betty-vmtest-informational builder.
,
Sep 26 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/cheets-scripts/+/d81122d1a7870ce3e4f711c2e35874c65e92792c commit d81122d1a7870ce3e4f711c2e35874c65e92792c Author: Kevin Cernekee <cernekee@chromium.org> Date: Tue Sep 26 22:04:24 2017 |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by marc...@chromium.org
, Sep 18 2017