New issue
Advanced search Search tips

Issue 878908 link

Starred by 4 users

Issue metadata

Status: Verified
Owner:
Closed: Nov 28
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

crosvm startup failure due to TSC init error

Project Member Reported by jkardatzke@chromium.org, Aug 29

Issue description

The tast tests for Crostini have been hitting lots of failures with the CrosStartVM test.  I was able to reproduce this on eve (which very rarely fails this test in tast) by putting the device under load.

I reproduced this with 10997.0.0; but likely anything recent will make it happen.

I repro'd it by scp'ing a large file to the DUT while running this in another SSH session on the device (some of this may need to be modified for a specific setup depending upon termina version):

crosvm run --socket /tmp/crosvm_socket12345 --root /run/imageloader/cros-termina/10975.0.0/vm_rootfs.img  -p init=/bin/bash /run/imageloader/cros-termina/10975.0.0/vm_kernel

It may take a few tries to happen...but it was always less than 10 for me under load; happened only once w/out load.

The following lines relating to tsc appear in the kernel log:

On Success:

6,17,0,-;tsc: Fast TSC calibration using PIT
6,18,0,-;tsc: Detected 1608.152 MHz processor
....
6,248,1440522,-;clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x172e39dcbf2, max_idle_ns: 440795258581 ns

On Failure:

6,17,0,-;tsc: Fast TSC calibration failed
...
4,95,1000,-;WARNING: CPU: 0 PID: 0 at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_14-4.14.63-r449/work/chromeos-kernel-4_14-4.14.63/arch/x86/kernel/tsc.c:1315 determine_cpu_tsc_frequencies+0xf3/0xfa
4,96,1000,-;CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.63-06698-g8e70c938c971 #1
4,97,1000,-;task: ffffffff81e1a480 task.stack: ffffffff81e00000
4,98,1000,-;RIP: 0010:determine_cpu_tsc_frequencies+0xf3/0xfa
4,99,1000,-;RSP: 0000:ffffffff81e03ef0 EFLAGS: 00010212
4,100,1000,-;RAX: ffffffff8101948c RBX: ffffffff8226e2d0 RCX: ffff88000e800270
4,101,1000,-;RDX: 0000000000000000 RSI: ffffffff81e03d80 RDI: 0000000000000000
4,102,1000,-;RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000246
4,103,1000,-;R10: 0000000000015a00 R11: ffffffff8102f2db R12: 0000000000000000
4,104,1000,-;R13: 0000000000000000 R14: ffffffff81e03f38 R15: 0000000000000000
4,105,1000,-;FS:  0000000000000000(0000) GS:ffff88000fc00000(0000) knlGS:0000000000000000
4,106,1000,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,107,1000,-;CR2: 00000000ffffffff CR3: 0000000002014001 CR4: 00000000000606b0
4,108,1000,-;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
4,109,1000,-;DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
4,110,1000,-;Call Trace:
4,111,1000,-; tsc_init+0x4d/0x20f
4,112,1000,-; start_kernel+0x2a7/0x362
4,113,1000,-; ? start_kernel+0x362/0x362
4,114,1000,-; secondary_startup_64+0xa5/0xb0
4,115,1000,-;Code: e8 81 74 1b 31 d2 f7 f5 89 c1 48 c7 c7 55 db d4 81 31 c0 48 89 ce e8 da 89 ea fe eb 02 31 db 89 d8 5b 5d c3 0f 0b e9 21 ff ff ff <0f> 0b e9 65 ff ff ff 55 53 50 48 c7 c0 22 91 01 81 48 39 04 25 
5,116,1000,-;random: get_random_bytes called from init_oops_id+0x29/0x2c with crng_init=0
4,117,1000,-;---[ end trace caef8185e121a4b0 ]---
4,118,2067,-;tsc: Unable to calibrate against PIT
5,119,3003,-;tsc: No reference (HPET/PMTIMER) available
6,120,4004,-;tsc: Marking TSC unstable due to could not calculate TSC khz
 
On a kefka board (braswell) this happens every time I run it.
I tried setting CONFIG_X86_TSC=n, but it turns out it's implied by having an x86_64 processor.

We can try setting clocksource on the kernel command line, but I think there are kernel pieces that will always use tsc when it's available. We might be better off seeing what we're doing wrong with it.
 Issue 884911  has been merged into this issue.
Components: OS>Systems>Containers
Owner: jkardatzke@chromium.org
Status: Assigned (was: Available)
Assigning to myself while we wait for mutexlox@ (who will be working on it) to get a chromium.org account.
Owner: mutexlox@chromium.org
Project Member

Comment 6 by bugdroid1@chromium.org, Sep 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/2f05c69330879c875332b1b816dc0125251a4835

commit 2f05c69330879c875332b1b816dc0125251a4835
Author: Jeffrey Kardatzke <jkardatzke@google.com>
Date: Sat Sep 29 07:27:31 2018

tast-tests: Fix localhost prompt checking in StartCrosvm

Previously this test worked because the VM was logging out a line from
the kernel after the bash prompt was shown, which then caused a line
terminator to come through stdout. There was a kernel change which made
that line come earlier, so we are not seeing the line terminator on
stdout after the 'localhost #' prompt like we used to. This fixes that
problem by scanning for bytes from stdout rather than lines.

BUG= chromium:878908 
TEST=tast run vm.StartCrosvm

Change-Id: I3aebffe588a0c9a6ae9f012e9d74877cb1833455
Reviewed-on: https://chromium-review.googlesource.com/1249764
Commit-Ready: Jeffrey Kardatzke <jkardatzke@google.com>
Tested-by: Jeffrey Kardatzke <jkardatzke@google.com>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/2f05c69330879c875332b1b816dc0125251a4835/src/chromiumos/tast/local/bundles/cros/vm/start_crosvm.go

The above fix I did causes the test to pass now...but the underlying TSC issue is still there (it was causing the test to fail because it was flakey).
I believe that the root cause of this bug is a partially-applied patchset: https://lore.kernel.org/lkml/20180719205545.16512-12-pasha.tatashin@oracle.com/T/#m6f1410941dcc6efb59be58d06f9e1fffb4a21d54

Specifically, we have applied 8 / 26 commits from that patch applied in third_party/kernel/v4.14. The patchset is rather large:
  33 files changed, 439 insertions(+), 447 deletions(-)
and does not merge cleanly.  (For example, one of the early patches in that set comes after https://github.com/torvalds/linux/commit/819aeee065e5d1b417ecd633897427c89f3253ec, and trying to apply the patchset without 819aeee causes non-trivial merge conflicts.)

Note that I'm not positive that applying the patchset I linked earlier will fix the problem -- I haven't successfully applied it to test yet.

As I see it, we have a few options:

1) Apply the patchset, manually resolving merge conflicts and applying older patches as needed. This could easily take weeks, and would be very error-prone.
2) Upgrade to a newer kernel version (e.g. 4.18). While not LTS, a later kernel version would have more of the patches we need. However, I'm not sure how much work it is to do a kernel upgrade. How far does our tree diverge from upstream?
3) Wait for version 4.19. It's not released yet, but should be soon for some value of "soon".

Does anyone have thoughts on these options?
we only need this in the VM kernel right ?

if we can't find something small to resolve it, waiting for 4.19 is prob our best bet imo.  applying the patches would affect our host kernels too which might be dicey.
We should only need it in the guest.

One downside of waiting that I didn't mention is that it is somewhat likely that this bug is causing user-noticeable issues ( https://crbug.com/891024  and  https://crbug.com/823406 ).
can we workaround it on the host side ?  have cicerone or something listen to the resume event and then have it force the time to change in the VMs ?  not sure if that's something kvm offers.
Can we try a 4.19-rcX build to see if that really fixes it?
Good idea. Are there any critical kernel patches in our 4.14 that aren't in the latest 4.19-rcX?

Indeed, it seems to be fixed in next/chromeos-next (I can no longer see the warning).
We've got an official 4.19 branch now, smbarber is working to get the VMs updated.
Dylan, smbarber: have you been able to confirm that the TSC warnings don't happen in the 4.19 branch? (Just confirming that they start up with the tast test isn't enough; that was only failing due to a test infrastructure issue)

What milestone are we targeting for 4.19 kernel?
Owner: smbar...@chromium.org
assign to smbarber to confirm.
Cc: mutexlox@chromium.org
Status: Verified (was: Assigned)
I haven't seen this since we switched to 4.19, on M72.
Excellent! Thanks :-)

Sign in to add a comment