New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 766323 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

VM(betty) losts its internet connection to host after long run test.

Project Member Reported by pwang@chromium.org, Sep 18 2017

Issue description

VM Chrome OS Version: 9932.0.0 (Official Build) dev-channel betty test
VM Chrome Version: 63.0.3212.0 (Official Build)
Host OS: Linux 4.4.0-89-generic, Goobuntu

What steps will reproduce the problem?
(1) Download betty image from goldeneye.
(2) Launch betty through kvm. The command I used as followed 
"sudo kvm -m 2G -smp 4 -vga cirrus -pidfile /tmp/kvm.pid -net nic,model=virtio -net user,hostfwd=tcp:127.0.0.1:9222-:22 -hda chromiumos_qemu_image.bin"
(3) In chroot, type 
"test_that 127.0.0.1:9222 cheets_CTS_N.7.1_r9.arm.CtsDeqpTestCases --board=betty --debug" which will launch a long run test.

What is the expected result?
Host can access the VM through ssh, and VM can connect to the host network. 

What happens instead?
In about 3-4 hours of running, Host can't connect to the VM through ssh anymore.
Even though host network can connect to public internet, VM can't access it as well. 
(See attachment for VM screenshot)



 
Screenshot from 2017-09-18 13:54:23.png
135 KB View Download
Owner: snanda@chromium.org

Comment 2 by snanda@chromium.org, Sep 18 2017

Cc: snanda@chromium.org norvez@chromium.org
Owner: cernekee@chromium.org
Kevin, could you take a look?
The symptom looks suspiciously similar to https://bugs.chromium.org/p/chromium/issues/detail?id=748634#c29

This was originally attributed to the inbound connection rules (which have been reverted), but the failure was intermittent enough that we couldn't quite be sure they were to blame.

Comment 4 by pwang@chromium.org, Sep 18 2017

We have a experiment builder constantly running long run test in the VM, which the connection failure seems quite consistent.
https://uberchromegw.corp.google.com/i/chromeos/builders/betty-vmtest-informational
(Please skip build #13/#12 as they are different failure)
Components: -Internals>Network>Connectivity OS>Systems>Network
Labels: OS-Chrome
I was able to reproduce this locally.  This is the first dhcp-failed in the logs:

2017-09-18T17:37:08.426158-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 4.7 seconds
2017-09-18T17:37:13.173896-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 8.5 seconds
2017-09-18T17:37:21.647208-07:00 INFO dhcpcd[1012]: eth0: sending DISCOVER (xid 0xfb8106ed), next in 16.0 seconds
2017-09-18T17:37:32.526569-07:00 ERR shill[847]: [ERROR:dhcp_config.cc(285)] Timed out waiting for DHCP lease on eth0 (after 30 seconds).
2017-09-18T17:37:32.526993-07:00 INFO shill[847]: [INFO:service.cc(332)] Disconnecting from service 1: OnIPConfigFailure
2017-09-18T17:37:32.559965-07:00 INFO dhcpcd[1012]: status changed to Release
2017-09-18T17:37:32.560315-07:00 INFO shill[847]: [INFO:dhcp_config.cc(210)] Stopping 1012 (ReleaseIP)
2017-09-18T17:37:32.560460-07:00 INFO dhcpcd[1012]: received SIGTERM, stopping
2017-09-18T17:37:32.560472-07:00 INFO dhcpcd[1012]: eth0: removing interface
2017-09-18T17:37:32.560498-07:00 INFO dhcpcd[1012]: status changed to Release
2017-09-18T17:37:32.560843-07:00 INFO dhcpcd[1012]: dhcpcd exited
2017-09-18T17:37:32.564764-07:00 INFO shill[847]: [INFO:service.cc(400)] Service 1: state Configuring -> Idle
2017-09-18T17:37:32.564857-07:00 INFO shill[847]: [INFO:manager.cc(1456)] Service 1 updated; state: Idle failure Unknown
2017-09-18T17:37:32.565698-07:00 INFO shill[847]: [INFO:service.cc(400)] Service 1: state Idle -> Failure
2017-09-18T17:37:32.565747-07:00 INFO shill[847]: [INFO:service.cc(986)] Noting an unexpected failure to connect.
2017-09-18T17:37:32.565821-07:00 INFO shill[847]: [INFO:manager.cc(1456)] Service 1 updated; state: Failure failure dhcp-failed

The last dhcpcd activity happened an hour prior:

2017-09-18T16:25:38.376643-07:00 NOTICE dhcpcd[969]: dhcpcd is running with reduced privileges
2017-09-18T16:25:38.377323-07:00 INFO dhcpcd[969]: status changed to Init
2017-09-18T16:25:38.377968-07:00 INFO shill[816]: [INFO:dhcp_config.cc(145)] Init DHCP Proxy: eth0 at :1.12
2017-09-18T16:25:38.615542-07:00 INFO dhcpcd[969]: status changed to Reboot
2017-09-18T16:25:38.615604-07:00 INFO dhcpcd[969]: eth0: rebinding lease of 10.0.2.15
2017-09-18T16:25:38.615648-07:00 INFO dhcpcd[969]: eth0: sending REQUEST (xid 0xe73ba565), next in 3.8 seconds
2017-09-18T16:25:38.615956-07:00 INFO dhcpcd[969]: eth0: received ACK with xid 0xe73ba565
2017-09-18T16:25:38.615967-07:00 INFO dhcpcd[969]: eth0: acknowledged 10.0.2.15 from 10.0.2.2
2017-09-18T16:25:38.615988-07:00 INFO dhcpcd[969]: eth0: leased 10.0.2.15 for 86400 seconds
2017-09-18T16:25:38.616099-07:00 INFO dhcpcd[969]: event REBOOT on interface eth0
2017-09-18T16:25:38.616195-07:00 INFO dhcpcd[969]: status changed to Bound

The dhcp-failure happened right after an OOM and a reboot:

2017-09-18T17:36:47.109613-07:00 ERR kernel: [ 4276.152210] Out of memory: Kill process 10326 (awelements.deqp) score 581 or sacrifice child
2017-09-18T17:36:47.109618-07:00 ERR kernel: [ 4276.152243] Killed process 10326 (awelements.deqp) total-vm:1267288kB, anon-rss:10532kB, file-rss:77336kB
2017-09-18T17:36:47.111498-07:00 ERR kernel: [ 4276.155478] Out of memory: Kill process 5183 (id.ext.services) score 487 or sacrifice child
2017-09-18T17:36:47.111507-07:00 ERR kernel: [ 4276.155500] Killed process 5183 (id.ext.services) total-vm:1032100kB, anon-rss:1432kB, file-rss:49960kB
[...]
2017-09-18T17:36:58.209893-07:00 NOTICE kernel: [    0.000000] Linux version 4.4.86-11631-g2c4551465d0b (chrome-bot@build109-m2.golo.chromium.org) (gcc version 4.9.x 20150123 (prerelease) (4.9.2_cos_gg_4.9.2-r163-0c5a656a1322e137fa4a251f2ccc6c4022918c0a_4.9.2-r163) ) #1 SMP PREEMPT Mon Sep 11 12:15:43 PDT 2017
2017-09-18T17:36:58.209899-07:00 INFO kernel: [    0.000000] Command line: BOOT_IMAGE=vmlinuz.A init=/sbin/init boot=local rootwait ro noresume noswap loglevel=7 noinitrd console=  dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 root=/dev/dm-0       i915.modeset=1 cros_legacy cros_debug dm="1 vroot none ro 1,0 4096000 verity payload=PARTUUID=9A54AA70-AC44-914D-A892-2FB2746F3002 hashtree=PARTUUID=9A54AA70-AC44-914D-A892-2FB2746F3002 hashstart=4096000 alg=sha1 root_hexdigest=e9ff8cb0f588425d38ac25a7d2c512e17fcd5b92 salt=cf2b20e473ed1c620a7b2e60a3ea675aa42778d8addd9abb9e1e7cf82f5854c8"


If I reboot again without stopping qemu, dhcpcd still fails.  So I don't think this is a Chrome OS issue.  I can manually configure the network interface with `ifconfig eth0 10.0.2.15 netmask 255.255.255.0 up` and then ssh to the VM again.  However, I cannot ping 10.0.2.2 (which is possible when the VM is working normally).  The fact that ssh still works in this setup indicates that two-way data traffic is OK, but something is probably wrong with qemu's internal networking services.

I wondered if random MAC addresses were being used (which might confuse a simple DHCP server that only allocates one IP), but the MAC address does not change from one boot to the next.  It is always 52:54:00:12:34:56.

These are the networking options we are using: -net nic,model=virtio -net user,hostfwd=tcp:127.0.0.1:9222-:22

I'm retesting with a ToT build of qemu.  `kvm --version` reports v2.0.0 from 2014.  Is that what you are using too?

BTW do we really need to start qemu as root?  For me it seems to be working OK as a normal user.

Comment 8 by pwang@chromium.org, Sep 19 2017

`kvm --version` reports 'QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.35), Copyright (c) 2003-2008 Fabrice Bellard' in my side.

I also tried to use the src/script/bin/cros_start_vm, which using chroot/usr/bin/qemu-system-x86_64(QEMU emulator version 2.6.0), to start vm, but got the same result. 
ToT qemu (commit a9158a5cba955b79) also lost networking during this test.  So the bad news is that upgrading qemu won't fix it, but the good news is that I can attach to qemu with gdb and debug it with symbols.

I see another problem with OOBE not adjusting to the VM screen resolution, so the system menu (and, more importantly, the power off button) overlap with the buttons at the bottom.  This makes it hard to navigate through the UI, and caused my VM to power off when I was trying to click a button.  Could someone please take a look?
qemu.png
44.2 KB View Download
Labels: -Pri-2 Pri-1
@9 you can resize the window, or navigate with the keyboard only, or use serial console (you don't really need UI to debug network problems).
I don't think I can resize this window.  Also, why does it need all of that extra whitespace if it doesn't fit on the screen?

Keyboard navigation sometimes works, sometimes not.  It is flaky.

Serial console isn't enabled on these images AFAICT.

I do need the UI to debug this problem.  At the minimum I need to get into guest mode and open up a shell.
@11: I just walked to your desk, and you were past OOBE, so you should be good to go.

In the future, running any autotest that makes an account will skip OOBE for you, so that's another solution.
@12: Thanks for looking.  There have been cases where I needed to repeat OOBE in the course of debugging, like after the VM decides to powerwash itself.  I'm not stuck _at the moment_, but that doesn't mean the bug isn't worth fixing.  It did cost me a few hours earlier today.

Since this is separate from the networking issue and it is a lower priority, I have filed  bug 766883  to track it.
I attached to the qemu process when the system was in a bad state, poked around at the slirp/mbuf data structures to see what was wrong, and came up with a test case that reproduces the same issue (attached).

Proposed fix:

http://lists.nongnu.org/archive/html/qemu-devel/2017-09/msg05795.html

The first patch seems to fix the networking hang.  This vmtest is still not stable (it seems to run out of memory and reboot the VM every 10 minutes or so) but I assume that if the networking problem is fixed, it will now be possible to debug the other test failures one by one.

I have filed a ticket to patch our installed qemu package: http://t/28764965
slirp_test.c
10.2 KB View Download
Project Member

Comment 15 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/1df5d18341378887bcf20375b65f22bd3ee15633

commit 1df5d18341378887bcf20375b65f22bd3ee15633
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Fri Sep 22 02:38:23 2017

app-emulation/qemu: Move package from portage-stable

We need to carry a local patch for a networking stability bug until
there is a new upstream release with the fix.

BUG= chromium:766323 
TEST=`sudo emerge qemu`
CQ-DEPEND=CL:676530

Change-Id: I5645ad2b9469eeda8700b829c4348fdbf804f6d7
Reviewed-on: https://chromium-review.googlesource.com/676481
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.5.0-cflags.patch
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.5.0-sysmacros.patch
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/metadata.xml
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/qemu-2.6.0.ebuild
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/65-kvm.rules
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/bridge.conf
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-binfmt.initd-r1
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/Manifest
[add] https://crrev.com/1df5d18341378887bcf20375b65f22bd3ee15633/app-emulation/qemu/files/qemu-2.6.0-crypto-static.patch

Project Member

Comment 16 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/portage-stable/+/8f69adc22682b5d815eef597618f60ef6c5615bb

commit 8f69adc22682b5d815eef597618f60ef6c5615bb
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Fri Sep 22 02:38:23 2017

app-emulation/qemu: Move package to chromiumos_overlay

We need to carry a local patch for a networking stability bug until
there is a new upstream release with the fix.

BUG= chromium:766323 
TEST=`sudo emerge qemu`
CQ-DEPEND=CL:676481

Change-Id: I20ff5cfe7f0b9ea611bee0700f69ad15b3f2733c
Reviewed-on: https://chromium-review.googlesource.com/676530
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.5.0-cflags.patch
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.5.0-sysmacros.patch
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/metadata.xml
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/qemu-2.6.0.ebuild
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/65-kvm.rules
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/bridge.conf
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-binfmt.initd-r1
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/Manifest
[delete] https://crrev.com/4da757cd0b47ace46eb6b7b5d5ba8a8d2b08bf5e/app-emulation/qemu/files/qemu-2.6.0-crypto-static.patch

Project Member

Comment 17 by bugdroid1@chromium.org, Sep 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/78a164a4def15652ee834927b8cfda02eb7e6229

commit 78a164a4def15652ee834927b8cfda02eb7e6229
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Fri Sep 22 09:02:01 2017

app-emulation/qemu: Patch network send queue bug

This corrects a problem that causes betty vmtests to intermittently
lose connectivity.

BUG= chromium:766323 
TEST=`sudo emerge qemu`

Change-Id: I765299d602a71980839aa54af08b1f12c4c91a03
Reviewed-on: https://chromium-review.googlesource.com/676532
Commit-Ready: Kevin Cernekee <cernekee@chromium.org>
Tested-by: Kevin Cernekee <cernekee@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/qemu-2.6.0.ebuild
[add] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/files/qemu-2.6.0-slirp-send-queue-hangs.patch
[add] https://crrev.com/78a164a4def15652ee834927b8cfda02eb7e6229/app-emulation/qemu/qemu-2.6.0-r1.ebuild

Project Member

Comment 18 by bugdroid1@chromium.org, Sep 25 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/cheets-scripts/+/84f43c5f0ac96a7033c38cea92371f7b57c9de56

commit 84f43c5f0ac96a7033c38cea92371f7b57c9de56
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Mon Sep 25 18:32:18 2017

Status: Fixed (was: Untriaged)
This is believed to be fixed with the qemu patch.

Comment 20 by pwang@chromium.org, Sep 25 2017

Status: Verified (was: Fixed)
Thanks Kevin.
Verified with the betty-vmtest-informational builder.
Project Member

Comment 21 by bugdroid1@chromium.org, Sep 26 2017

Labels: merge-merged-release-R62-9901.B
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/cheets-scripts/+/d81122d1a7870ce3e4f711c2e35874c65e92792c

commit d81122d1a7870ce3e4f711c2e35874c65e92792c
Author: Kevin Cernekee <cernekee@chromium.org>
Date: Tue Sep 26 22:04:24 2017

Sign in to add a comment