crosvm startup failure due to TSC init error |
||||||
Issue descriptionThe tast tests for Crostini have been hitting lots of failures with the CrosStartVM test. I was able to reproduce this on eve (which very rarely fails this test in tast) by putting the device under load. I reproduced this with 10997.0.0; but likely anything recent will make it happen. I repro'd it by scp'ing a large file to the DUT while running this in another SSH session on the device (some of this may need to be modified for a specific setup depending upon termina version): crosvm run --socket /tmp/crosvm_socket12345 --root /run/imageloader/cros-termina/10975.0.0/vm_rootfs.img -p init=/bin/bash /run/imageloader/cros-termina/10975.0.0/vm_kernel It may take a few tries to happen...but it was always less than 10 for me under load; happened only once w/out load. The following lines relating to tsc appear in the kernel log: On Success: 6,17,0,-;tsc: Fast TSC calibration using PIT 6,18,0,-;tsc: Detected 1608.152 MHz processor .... 6,248,1440522,-;clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x172e39dcbf2, max_idle_ns: 440795258581 ns On Failure: 6,17,0,-;tsc: Fast TSC calibration failed ... 4,95,1000,-;WARNING: CPU: 0 PID: 0 at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_14-4.14.63-r449/work/chromeos-kernel-4_14-4.14.63/arch/x86/kernel/tsc.c:1315 determine_cpu_tsc_frequencies+0xf3/0xfa 4,96,1000,-;CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.63-06698-g8e70c938c971 #1 4,97,1000,-;task: ffffffff81e1a480 task.stack: ffffffff81e00000 4,98,1000,-;RIP: 0010:determine_cpu_tsc_frequencies+0xf3/0xfa 4,99,1000,-;RSP: 0000:ffffffff81e03ef0 EFLAGS: 00010212 4,100,1000,-;RAX: ffffffff8101948c RBX: ffffffff8226e2d0 RCX: ffff88000e800270 4,101,1000,-;RDX: 0000000000000000 RSI: ffffffff81e03d80 RDI: 0000000000000000 4,102,1000,-;RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000246 4,103,1000,-;R10: 0000000000015a00 R11: ffffffff8102f2db R12: 0000000000000000 4,104,1000,-;R13: 0000000000000000 R14: ffffffff81e03f38 R15: 0000000000000000 4,105,1000,-;FS: 0000000000000000(0000) GS:ffff88000fc00000(0000) knlGS:0000000000000000 4,106,1000,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 4,107,1000,-;CR2: 00000000ffffffff CR3: 0000000002014001 CR4: 00000000000606b0 4,108,1000,-;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 4,109,1000,-;DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 4,110,1000,-;Call Trace: 4,111,1000,-; tsc_init+0x4d/0x20f 4,112,1000,-; start_kernel+0x2a7/0x362 4,113,1000,-; ? start_kernel+0x362/0x362 4,114,1000,-; secondary_startup_64+0xa5/0xb0 4,115,1000,-;Code: e8 81 74 1b 31 d2 f7 f5 89 c1 48 c7 c7 55 db d4 81 31 c0 48 89 ce e8 da 89 ea fe eb 02 31 db 89 d8 5b 5d c3 0f 0b e9 21 ff ff ff <0f> 0b e9 65 ff ff ff 55 53 50 48 c7 c0 22 91 01 81 48 39 04 25 5,116,1000,-;random: get_random_bytes called from init_oops_id+0x29/0x2c with crng_init=0 4,117,1000,-;---[ end trace caef8185e121a4b0 ]--- 4,118,2067,-;tsc: Unable to calibrate against PIT 5,119,3003,-;tsc: No reference (HPET/PMTIMER) available 6,120,4004,-;tsc: Marking TSC unstable due to could not calculate TSC khz
,
Sep 11
I tried setting CONFIG_X86_TSC=n, but it turns out it's implied by having an x86_64 processor. We can try setting clocksource on the kernel command line, but I think there are kernel pieces that will always use tsc when it's available. We might be better off seeing what we're doing wrong with it.
,
Sep 17
Issue 884911 has been merged into this issue.
,
Sep 19
Assigning to myself while we wait for mutexlox@ (who will be working on it) to get a chromium.org account.
,
Sep 24
,
Sep 29
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/2f05c69330879c875332b1b816dc0125251a4835 commit 2f05c69330879c875332b1b816dc0125251a4835 Author: Jeffrey Kardatzke <jkardatzke@google.com> Date: Sat Sep 29 07:27:31 2018 tast-tests: Fix localhost prompt checking in StartCrosvm Previously this test worked because the VM was logging out a line from the kernel after the bash prompt was shown, which then caused a line terminator to come through stdout. There was a kernel change which made that line come earlier, so we are not seeing the line terminator on stdout after the 'localhost #' prompt like we used to. This fixes that problem by scanning for bytes from stdout rather than lines. BUG= chromium:878908 TEST=tast run vm.StartCrosvm Change-Id: I3aebffe588a0c9a6ae9f012e9d74877cb1833455 Reviewed-on: https://chromium-review.googlesource.com/1249764 Commit-Ready: Jeffrey Kardatzke <jkardatzke@google.com> Tested-by: Jeffrey Kardatzke <jkardatzke@google.com> Reviewed-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/2f05c69330879c875332b1b816dc0125251a4835/src/chromiumos/tast/local/bundles/cros/vm/start_crosvm.go
,
Oct 1
The above fix I did causes the test to pass now...but the underlying TSC issue is still there (it was causing the test to fail because it was flakey).
,
Oct 6
I believe that the root cause of this bug is a partially-applied patchset: https://lore.kernel.org/lkml/20180719205545.16512-12-pasha.tatashin@oracle.com/T/#m6f1410941dcc6efb59be58d06f9e1fffb4a21d54 Specifically, we have applied 8 / 26 commits from that patch applied in third_party/kernel/v4.14. The patchset is rather large: 33 files changed, 439 insertions(+), 447 deletions(-) and does not merge cleanly. (For example, one of the early patches in that set comes after https://github.com/torvalds/linux/commit/819aeee065e5d1b417ecd633897427c89f3253ec, and trying to apply the patchset without 819aeee causes non-trivial merge conflicts.) Note that I'm not positive that applying the patchset I linked earlier will fix the problem -- I haven't successfully applied it to test yet. As I see it, we have a few options: 1) Apply the patchset, manually resolving merge conflicts and applying older patches as needed. This could easily take weeks, and would be very error-prone. 2) Upgrade to a newer kernel version (e.g. 4.18). While not LTS, a later kernel version would have more of the patches we need. However, I'm not sure how much work it is to do a kernel upgrade. How far does our tree diverge from upstream? 3) Wait for version 4.19. It's not released yet, but should be soon for some value of "soon". Does anyone have thoughts on these options?
,
Oct 6
we only need this in the VM kernel right ? if we can't find something small to resolve it, waiting for 4.19 is prob our best bet imo. applying the patches would affect our host kernels too which might be dicey.
,
Oct 6
We should only need it in the guest. One downside of waiting that I didn't mention is that it is somewhat likely that this bug is causing user-noticeable issues ( https://crbug.com/891024 and https://crbug.com/823406 ).
,
Oct 6
can we workaround it on the host side ? have cicerone or something listen to the resume event and then have it force the time to change in the VMs ? not sure if that's something kvm offers.
,
Oct 8
Can we try a 4.19-rcX build to see if that really fixes it?
,
Oct 8
Good idea. Are there any critical kernel patches in our 4.14 that aren't in the latest 4.19-rcX?
,
Oct 8
Indeed, it seems to be fixed in next/chromeos-next (I can no longer see the warning).
,
Oct 24
We've got an official 4.19 branch now, smbarber is working to get the VMs updated.
,
Nov 21
Dylan, smbarber: have you been able to confirm that the TSC warnings don't happen in the 4.19 branch? (Just confirming that they start up with the tast test isn't enough; that was only failing due to a test infrastructure issue) What milestone are we targeting for 4.19 kernel?
,
Nov 26
assign to smbarber to confirm.
,
Nov 27
,
Nov 28
I haven't seen this since we switched to 4.19, on M72.
,
Nov 28
Excellent! Thanks :-) |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by jkardatzke@chromium.org
, Aug 29