vm_CrosVmStart is flaky on Kevin
Issue description
On our builders we're seeing a lot of flakiness: https://stainless.corp.google.com/search?exclude_cts=true&exclude_non_release=true&board=%5Ekevin%24&test=%5Evm%5C_&suite=%5Ebvt%5C-perbuild%24&model=%5Ekevin%24&view=matrix&col=build&row=test&first_date=2018-05-10&last_date=2018-05-16
It looks like there were core files in the failing runs. If I run it manually it generally passes, but sometimes I get core dumps and SIG31 (SIGSYS) crashes, which usually indicate a seccomp violation. It seems like whatever the violation is, it only happens some of the time, and maybe doesn't always cause the test to fail. I notice that with a debug build I never seem to get the core files.
We also have issues with minijail not properly reporting the cause of seccomp violations on ARM, so we should fix that as well.
May 17 2018
I'm guessing this is the race -- we're triggering the seccomp violation in the device processes, and the main process is dying too before we can cleanly shut down the VM. Does that sound plausible?
Jun 1 2018
It turns out it really was gettimeofday that was causing the seccomp violation. I think this is usually a vDSO call on x86 -- I'm not sure why it's a real syscall on ARM, but it appears to be using the real syscall when it's outputting error messages. I think it's okay to enable this call, so let's just whitelist it.
Jun 2 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/crosvm/+/90c50419d4ed58f226a65a0751f404be26aa97c1

commit 90c50419d4ed58f226a65a0751f404be26aa97c1
Author: Sonny Rao <sonnyrao@chromium.org>
Date: Sat Jun 02 00:44:30 2018

crosvm: aarch64: whitelist gettimeofday for error messages

It looks like on ARM we use the real gettimeofday system call when
we're outputting error messages, so we need to whitelist this to avoid
crashing instead of seeing the error messages.

BUG= chromium:843807
TEST=run vm_CrosVmStart and make sure there are no crashes for crosvm

Change-Id: I9f47da8dabe31f0677bcaa1d431e56545e20c9c9
Reviewed-on: https://chromium-review.googlesource.com/1081390
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Sonny Rao <sonnyrao@chromium.org>
Reviewed-by: Stephen Barber <smbarber@chromium.org>

[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/rng_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/block_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/vhost_net_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/net_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/vhost_vsock_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/wl_device.policy
[modify] https://crrev.com/90c50419d4ed58f226a65a0751f404be26aa97c1/seccomp/aarch64/balloon_device.policy
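For anyone checking other device policies for the same gap, here is a rough sketch of what the change amounts to: one unconditional allow rule per policy file. It assumes the usual minijail policy layout of `syscall: expression` lines with `#` comments. The helper names `allowed_syscalls` and `ensure_allowed` are hypothetical, and the sample policy text is made up for illustration rather than copied from the crosvm tree.

```python
def allowed_syscalls(policy_text):
    """Return the set of syscall names that have a rule in the policy text."""
    names = set()
    for line in policy_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not a rule
        names.add(line.split(":", 1)[0].strip())
    return names


def ensure_allowed(policy_text, syscall):
    """Append an unconditional allow rule for syscall if none exists yet."""
    if syscall in allowed_syscalls(policy_text):
        return policy_text
    if policy_text and not policy_text.endswith("\n"):
        policy_text += "\n"
    return policy_text + f"{syscall}: 1\n"


# Made-up sample policy, NOT the real contents of any crosvm policy file.
SAMPLE_POLICY = """\
# illustrative device policy
read: 1
write: 1
exit_group: 1
"""

patched = ensure_allowed(SAMPLE_POLICY, "gettimeofday")
print(patched)
```

Running the same check over each file under seccomp/aarch64/ would show which device policies still lack the rule; the helper leaves a policy unchanged if the rule is already there.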
Jun 5 2018
This seems much better after this landed in 10744.0.0: https://stainless.corp.google.com/search?view=matrix&row=test&col=build&first_date=2018-05-31&last_date=2018-06-04&suite=%5Ebvt%5C-perbuild%24&test=%5Evm%5C_&board=%5Ekevin%24&model=%5Ekevin%24&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false
There was one failure, but the test duration was very short at 17 seconds vs. more like 45 seconds for a normal run, so I don't think it actually ran.
Comment 1 by sonnyrao@chromium.org, May 17 2018
Also I did have some backtraces from gdb of these crashes:

#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1 0xf3f74a9a in __GI___gettimeofday (tv=tv@entry=0xffe79f80, tz=tz@entry=0x0) at ../sysdeps/unix/sysv/linux/gettimeofday.c:35
#2 0xf3f74a56 in __GI_time (t=0xffe7a410) at ../sysdeps/posix/time.c:31
#3 0x0e6461b4 in sys_util::syslog::log ()
#4 0x0e639774 in devices::proxy::child_proc ()
#5 0x0e621a24 in device_manager::DeviceManager::register_mmio ()
#6 0x0e6134e4 in crosvm::linux::run_config ()
#7 0x0e6770c8 in crosvm::crosvm_main ()
#8 0x0e674f10 in crosvm::main ()
#9 0x0e6648a8 in std::rt::lang_start::{{closure}} ()
#10 0x0e674d6c in main ()

So I tried an experiment where I commented out all of the "error!" macro calls in devices::proxy::child_proc() and I haven't been able to reproduce the seccomp violation. So, it's looking like there's another syscall that gets invoked in the error logging path that's not currently covered.
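If more cores turn up, a small script can pull the function names out of each gdb backtrace and point at the first crosvm-side frame above the libc call. This is only a sketch: `frame_functions` and `first_crate_frame` are hypothetical helpers, the frame regex assumes the plain `#N [0xADDR in] func (...)` layout shown in the trace, and the crate prefixes are a guess at what counts as crosvm code.

```python
import re

# Matches "#N func (" and "#N 0xADDR in func (" frame lines.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")


def frame_functions(backtrace):
    """Return a list of (frame_number, function_name) from gdb backtrace text."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def first_crate_frame(frames, prefixes=("sys_util::", "devices::", "crosvm::")):
    """Return the innermost frame whose name starts with one of the prefixes."""
    for _, name in frames:
        if name.startswith(prefixes):
            return name
    return None


# The innermost five frames of the trace from this comment.
TRACE = """\
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1 0xf3f74a9a in __GI___gettimeofday (tv=tv@entry=0xffe79f80, tz=tz@entry=0x0) at ../sysdeps/unix/sysv/linux/gettimeofday.c:35
#2 0xf3f74a56 in __GI_time (t=0xffe7a410) at ../sysdeps/posix/time.c:31
#3 0x0e6461b4 in sys_util::syslog::log ()
#4 0x0e639774 in devices::proxy::child_proc ()
"""

frames = frame_functions(TRACE)
print(frames[1][1])               # libc wrapper that issued the syscall
print(first_crate_frame(frames))  # crosvm-side caller
```

For this trace the two names it reports are the libc gettimeofday wrapper and the syslog logging function, which matches the reading that the logging path is what trips the filter.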