
Issue 843807


Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




vm_CrosVmStart is flakey on Kevin

Project Member Reported by sonnyrao@chromium.org, May 16 2018

Issue description

On our builders we're seeing a lot of flakiness:

https://stainless.corp.google.com/search?exclude_cts=true&exclude_non_release=true&board=%5Ekevin%24&test=%5Evm%5C_&suite=%5Ebvt%5C-perbuild%24&model=%5Ekevin%24&view=matrix&col=build&row=test&first_date=2018-05-10&last_date=2018-05-16

It looks like there were core files in the failing runs.

If I run it manually it passes most of the time, but sometimes I get core dumps and SIG31 crashes -- SIG31 (SIGSYS) usually indicates a seccomp violation.
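
For reference, signal 31 is SIGSYS on x86 and ARM Linux, which is what the kernel delivers when a seccomp filter kills a process. Below is a minimal sketch of how a harness could tell such an exit apart from other failures. The helper name `describe_exit` is hypothetical and is not part of crosvm or the vm_CrosVmStart test; the statuses in `main` are synthetic raw wait statuses, not captured from a real run.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

// SIGSYS is 31 on x86 and ARM Linux (assumption: other architectures may differ).
const SIGSYS: i32 = 31;

/// Hypothetical helper: summarize how a child process ended.
fn describe_exit(status: ExitStatus) -> String {
    match status.signal() {
        Some(SIGSYS) if status.core_dumped() => {
            "SIGSYS (likely seccomp violation), core dumped".to_string()
        }
        Some(SIGSYS) => "SIGSYS (likely seccomp violation)".to_string(),
        Some(sig) => format!("killed by signal {}", sig),
        None => match status.code() {
            Some(code) => format!("exited with code {}", code),
            None => "unknown exit status".to_string(),
        },
    }
}

fn main() {
    // In a raw wait status the low 7 bits hold the terminating signal
    // and bit 0x80 marks that a core file was written.
    let killed = ExitStatus::from_raw(31);
    let dumped = ExitStatus::from_raw(31 | 0x80);
    println!("{}", describe_exit(killed));
    println!("{}", describe_exit(dumped));
}
```

In a real harness the `ExitStatus` would come from waiting on the child rather than from `from_raw`.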

It seems like whatever the violation is, it only happens some of the time, and maybe doesn't always cause the test to fail.  I notice if I do a debug build that I don't ever seem to get the core files.

We also have issues with minijail not properly reporting the cause of the seccomp violations on Arm so we should fix that as well.

 
Also I did have some backtraces from gdb of these crashes:

#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xf3f74a9a in __GI___gettimeofday (tv=tv@entry=0xffe79f80, tz=tz@entry=0x0) at ../sysdeps/unix/sysv/linux/gettimeofday.c:35
#2  0xf3f74a56 in __GI_time (t=0xffe7a410) at ../sysdeps/posix/time.c:31
#3  0x0e6461b4 in sys_util::syslog::log ()
#4  0x0e639774 in devices::proxy::child_proc ()
#5  0x0e621a24 in device_manager::DeviceManager::register_mmio ()
#6  0x0e6134e4 in crosvm::linux::run_config ()
#7  0x0e6770c8 in crosvm::crosvm_main ()
#8  0x0e674f10 in crosvm::main ()
#9  0x0e6648a8 in std::rt::lang_start::{{closure}} ()
#10 0x0e674d6c in main ()


So I tried an experiment where I commented out all of the "error!" macro calls in devices::proxy::child_proc(), and I haven't been able to reproduce the seccomp violation since.  So it's looking like there's another syscall that gets invoked in the error logging path that's not currently covered by the policy.

I'm guessing this is the race -- we're triggering the seccomp violation in the device processes, and the main process is dying too before we can cleanly shut down the VM.  Does that sound plausible?
It turns out it really was gettimeofday that was causing the seccomp violation.
I think this is usually a vdso call on x86 -- I'm not sure why it's a real syscall on ARM, but it appears to be using the real syscall when it's outputting error messages.  I think it's okay to enable this call, so let's just whitelist it.
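
To illustrate what whitelisting means here: minijail policy files list one syscall per line, and an entry of the form `name: 1` allows that syscall unconditionally. The sketch below checks a policy text for such an entry. The helper `policy_allows` and the two sample policies are hypothetical and much shorter than the real files under seccomp/aarch64/; the parsing is deliberately simplified and ignores argument filters and `@include` lines.

```rust
/// Hypothetical helper: true if `policy` has a line "<syscall>: 1",
/// i.e. the syscall is allowed unconditionally.
fn policy_allows(policy: &str, syscall: &str) -> bool {
    policy
        .lines()
        .map(str::trim)
        // Skip blank lines and '#' comments.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        // Split "name: rule" at the first colon.
        .filter_map(|line| line.split_once(':'))
        .any(|(name, rule)| name.trim() == syscall && rule.trim() == "1")
}

// Illustrative policies only, not the actual crosvm device policies.
const BEFORE: &str = "# sample device policy\nread: 1\nwrite: 1\nexit_group: 1\n";
const AFTER: &str = "# sample device policy\nread: 1\nwrite: 1\nexit_group: 1\ngettimeofday: 1\n";

fn main() {
    // Without the entry, a logging call that reaches the real
    // gettimeofday syscall would be killed by the filter.
    println!("before: {}", policy_allows(BEFORE, "gettimeofday"));
    println!("after:  {}", policy_allows(AFTER, "gettimeofday"));
}
```

A check like this could flag device policies that are missing an entry the logging path needs, but it is no substitute for running the test on the target board.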

Project Member

Comment 4 by bugdroid1@chromium.org, Jun 2 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosvm/+/90c50419d4ed58f226a65a0751f404be26aa97c1

commit 90c50419d4ed58f226a65a0751f404be26aa97c1
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Sat Jun 02 00:44:30 2018

crosvm: aarch64: whitelist gettimeofday for error messages

It looks like on ARM we use the real gettimeofday system call when
we're outputting error messages, so we need to whitelist this to avoid
crashing instead of seeing the error messages.

BUG= chromium:843807 
TEST=run vm_CrosVmStart and make sure there are no crashes for crosvm

Change-Id: I9f47da8dabe31f0677bcaa1d431e56545e20c9c9
Reviewed-on: https://chromium-review.googlesource.com/1081390
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-by: Stephen Barber <smbarber@chromium.org>

[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/rng_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/block_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/vhost_net_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/net_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/vhost_vsock_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/wl_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/balloon_device.policy

Status: Fixed (was: Started)
This seems much better after the fix landed in 10744.0.0:

https://stainless.corp.google.com/search?view=matrix&row=test&col=build&first_date=2018-05-31&last_date=2018-06-04&suite=%5Ebvt%5C-perbuild%24&test=%5Evm%5C_&board=%5Ekevin%24&model=%5Ekevin%24&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false

There was one failure, but the test duration was very short at 17 seconds vs. more like 45 seconds for a normal run, so I don't think it actually ran.
