
Issue 906523

Starred by 9 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 920875




Guest VM timers are not stopped when the device is suspended

Project Member Reported by ljulliard@google.com, Nov 19

Issue description

Chrome version: 70.0.3538.76 (Official Build) (64-bit)
OS: Chrome

Repro steps:
1. Launch a process using 100% of the CPUs (e.g. compiling TensorFlow)
2. Let the Pixelbook go to sleep after the normal grace period
3. Wake up the Pixelbook 

Expected: the Linux container should still be there and the running job should keep going
Actual: the Crostini terminal is gone (crashed) and Crostini must be restarted from scratch

 
Cc: mutexlox@chromium.org
Does this crash happen consistently if the chromebook goes to sleep while the guest is using 100% CPU?

Can you please file detailed feedback using 'Alt+Shift+I' so that we can get logs? This sounds to me like a similar symptom to https://crbug.com/891024, and I'd like to see if logs confirm or refute that it's the same issue. (If it is, that's great, as we don't have a reliable way to reproduce that bug yet.)
Hi there,

Yes, it crashed systematically when I was compiling TensorFlow and using all 4 cores (the Bazel tool used to compile TensorFlow mobilizes all available cores unless instructed otherwise).

As for the Alt+Shift+I feedback, shall I do it right after the resume, from the Chrome desktop, or shall I open a terminal first?

Laurent

Components: OS>Systems>Containers
Right after you notice the issue should be fine. Please also include "#crostini" in your report.

Thanks!
Cc: za...@chromium.org
Cc: jkardatzke@chromium.org
Is this something you know about, Jeff?
No, I haven't heard of this one before that I recall.
Cc: dgreid@chromium.org
I just had the VM crash after waking up a sand device that I'd had the lid closed on since yesterday. This is what was in messages:

2018-11-26T11:40:23.974823-08:00 INFO VM(3)[3736]:  tremplin[142]: 2018/11/26 19:40:21 Received GetContainerUsername RPC: penguin#012
2018-11-27T15:22:49.186328-08:00 INFO VM(3)[3787]: [99833.984887] watchdog: BUG: soft lockup - CPU#1 stuck for 92463s! [swapper/1:0]#015
2018-11-27T15:22:49.205371-08:00 INFO VM(3)[3787]: [99834.010082] watchdog: BUG: soft lockup - CPU#0 stuck for 92463s! [swapper/0:0]#015
2018-11-27T15:22:49.218535-08:00 INFO VM(3)[3787]: [99834.010501] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.2-02071-g869138e6e829 #1#015
2018-11-27T15:22:49.230321-08:00 INFO VM(3)[3787]: [99834.010568] RIP: 0010:native_safe_halt+0x2/0x3#015
2018-11-27T15:22:49.242025-08:00 INFO VM(3)[3787]: [99834.010575] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015
2018-11-27T15:22:49.246819-08:00 INFO VM(3)[3787]: [99834.010578] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13#015
2018-11-27T15:22:49.273251-08:00 INFO VM(3)[3787]: [99834.010583] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 254887d6caec9600#015
2018-11-27T15:22:49.284401-08:00 INFO VM(3)[3787]: [99834.010585] RDX: ffffffff8203c040 RSI: ffff8800b0520680 RDI: 0000000000000001#015
2018-11-27T15:22:49.308445-08:00 INFO VM(3)[3787]: [99834.010588] RBP: ffff8800af8b5a00 R08: 0000000000000000 R09: 0000000000000008#015
2018-11-27T15:22:49.316591-08:00 INFO VM(3)[3787]: [99834.010590] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015
2018-11-27T15:22:49.331927-08:00 INFO VM(3)[3787]: [99834.010592] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000#015
2018-11-27T15:22:49.360556-08:00 INFO VM(3)[3787]: [99834.010596] FS:  0000000000000000(0000) GS:ffff8800b0500000(0000) knlGS:0000000000000000#015
2018-11-27T15:22:49.362365-08:00 INFO VM(3)[3787]: [99834.010599] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033#015
2018-11-27T15:22:49.370985-08:00 INFO vm_concierge[3754]: Setting resolv config for VM 3
2018-11-27T15:22:49.378746-08:00 INFO VM(3)[3787]: [99834.010601] CR2: 00007cd57abffc68 CR3: 00000000aeeba000 CR4: 00000000003406a0#015
2018-11-27T15:22:49.379515-08:00 INFO VM(3)[3787]: [99834.010609] Call Trace:#015
2018-11-27T15:22:49.394828-08:00 INFO VM(3)[3787]: [99834.010699]  default_idle+0xd/0x18#015
2018-11-27T15:22:49.395782-08:00 INFO VM(3)[3787]: [99834.010782]  do_idle+0xeb/0x21f#015
2018-11-27T15:22:49.406586-08:00 INFO VM(3)[3787]: [99834.010835]  cpu_startup_entry+0x60/0x62#015
2018-11-27T15:22:49.415784-08:00 INFO VM(3)[3787]: [99834.010859]  start_secondary+0x1a8/0x1c3#015
2018-11-27T15:22:49.417052-08:00 INFO VM(3)[3787]: [99834.010893]  secondary_startup_64+0xa4/0xb0#015
2018-11-27T15:22:49.432442-08:00 INFO VM(3)[3787]: [99834.010921] Kernel panic - not syncing: softlockup: hung tasks#015
2018-11-27T15:22:49.436029-08:00 INFO VM(3)[3787]: [99834.010928] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G             L    4.19.2-02071-g869138e6e829 #1#015
2018-11-27T15:22:49.446913-08:00 INFO VM(3)[3787]: [99834.010929] Call Trace:#015
2018-11-27T15:22:49.447499-08:00 INFO VM(3)[3787]: [99834.010965]  <IRQ>#015
2018-11-27T15:22:49.452571-08:00 INFO VM(3)[3787]: [99834.010974]  dump_stack+0x43/0x5c#015
2018-11-27T15:22:49.456972-08:00 INFO VM(3)[3787]: [99834.010992]  panic+0xf0/0x25c#015
2018-11-27T15:22:49.458444-08:00 INFO VM(3)[3787]: [99834.011018]  ? release_firmware_map_entry+0x81/0x97#015
2018-11-27T15:22:49.477278-08:00 INFO VM(3)[3787]: [99834.011038]  watchdog_timer_fn+0x267/0x267#015
2018-11-27T15:22:49.481728-08:00 INFO VM(3)[3787]: [99834.011069]  ? proc_watchdog_cpumask+0x5b/0x5b#015
2018-11-27T15:22:49.491011-08:00 INFO VM(3)[3787]: [99834.011091]  __hrtimer_run_queues+0x93/0xef#015
2018-11-27T15:22:49.493765-08:00 INFO VM(3)[3787]: [99834.011108]  hrtimer_run_queues+0xab/0xdb#015
2018-11-27T15:22:49.502848-08:00 INFO VM(3)[3787]: [99834.011110]  run_local_timers+0x16/0x40#015
2018-11-27T15:22:49.504145-08:00 INFO VM(3)[3787]: [99834.011112]  update_process_times+0x1c/0x46#015
2018-11-27T15:22:49.510919-08:00 INFO VM(3)[3787]: [99834.011115]  tick_handle_periodic+0x1c/0x5f#015
2018-11-27T15:22:49.512319-08:00 INFO VM(3)[3787]: [99834.011126]  smp_apic_timer_interrupt+0x5a/0x95#015
2018-11-27T15:22:49.529793-08:00 INFO VM(3)[3787]: [99834.011139]  apic_timer_interrupt+0xf/0x20#015
2018-11-27T15:22:49.530307-08:00 INFO VM(3)[3787]: [99834.011141]  </IRQ>#015
2018-11-27T15:22:49.534419-08:00 INFO VM(3)[3787]: [99834.011143] RIP: 0010:native_safe_halt+0x2/0x3#015
2018-11-27T15:22:49.558191-08:00 INFO VM(3)[3787]: [99834.011145] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015
2018-11-27T15:22:49.563931-08:00 INFO VM(3)[3787]: [99834.011146] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13#015
2018-11-27T15:22:49.584463-08:00 INFO VM(3)[3787]: [99834.011147] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 254887d6caec9600#015
2018-11-27T15:22:49.599145-08:00 INFO VM(3)[3787]: [99834.011148] RDX: ffffffff8203c040 RSI: ffff8800b0520680 RDI: 0000000000000001#015
2018-11-27T15:22:49.603643-08:00 INFO VM(3)[3787]: [99834.011149] RBP: ffff8800af8b5a00 R08: 0000000000000000 R09: 0000000000000008#015
2018-11-27T15:22:49.627067-08:00 INFO VM(3)[3787]: [99834.011150] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015
2018-11-27T15:22:49.632083-08:00 INFO VM(3)[3787]: [99834.011151] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000#015
2018-11-27T15:22:49.633266-08:00 INFO VM(3)[3787]: [99834.011153]  ? __sched_text_end+0x1/0x1#015
2018-11-27T15:22:49.663913-08:00 INFO VM(3)[3787]: [99834.011155]  default_idle+0xd/0x18#015
2018-11-27T15:22:49.664869-08:00 INFO VM(3)[3787]: [99834.011158]  do_idle+0xeb/0x21f#015
2018-11-27T15:22:49.671335-08:00 INFO VM(3)[3787]: [99834.011160]  cpu_startup_entry+0x60/0x62#015
2018-11-27T15:22:49.672558-08:00 INFO VM(3)[3787]: [99834.011163]  start_secondary+0x1a8/0x1c3#015
2018-11-27T15:22:49.673825-08:00 INFO VM(3)[3787]: [99834.011164]  secondary_startup_64+0xa4/0xb0#015
2018-11-27T15:22:49.691269-08:00 INFO VM(3)[3787]: [99834.494282] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L    4.19.2-02071-g869138e6e829 #1#015
2018-11-27T15:22:49.694761-08:00 INFO VM(3)[3787]: [99834.513021] RIP: 0010:native_safe_halt+0x2/0x3#015
2018-11-27T15:22:49.712295-08:00 INFO VM(3)[3787]: [99834.515210] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015
2018-11-27T15:22:49.722794-08:00 INFO VM(3)[3787]: [99834.532730] RSP: 0018:ffffffff82003ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcf#015
2018-11-27T15:22:49.727623-08:00 INFO VM(3)[3787]: [99834.543225] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 531447993d38cc00#015
2018-11-27T15:22:49.733596-08:00 INFO VM(3)[3787]: [99834.548055] RDX: ffffffff8203c040 RSI: ffff8800b0420680 RDI: 0000000000000000#015
2018-11-27T15:22:49.735775-08:00 INFO VM(3)[3787]: [99834.554025] RBP: ffffffff8201b480 R08: 0000000000000000 R09: 8080808080808080#015
2018-11-27T15:22:49.741454-08:00 INFO VM(3)[3787]: [99834.556206] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015
2018-11-27T15:22:49.754653-08:00 INFO VM(3)[3787]: [99834.561906] R13: 0000000000000000 R14: ffffffff82003eb6 R15: 0000000000000000#015
2018-11-27T15:22:49.762370-08:00 INFO VM(3)[3787]: [99834.575086] FS:  0000000000000000(0000) GS:ffff8800b0400000(0000) knlGS:0000000000000000#015
2018-11-27T15:22:49.764072-08:00 INFO VM(3)[3787]: [99834.582800] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033#015
2018-11-27T15:22:49.801588-08:00 INFO VM(3)[3787]: [99834.584514] CR2: 00007cd5740020c0 CR3: 00000000aeeba000 CR4: 00000000003406b0#015
2018-11-27T15:22:49.808174-08:00 INFO VM(3)[3787]: [99834.622116] Call Trace:#015
2018-11-27T15:22:49.809277-08:00 INFO VM(3)[3787]: [99834.628618]  default_idle+0xd/0x18#015
2018-11-27T15:22:49.826315-08:00 INFO VM(3)[3787]: [99834.629723]  do_idle+0xeb/0x21f#015
2018-11-27T15:22:49.827437-08:00 INFO VM(3)[3787]: [99834.646711]  cpu_startup_entry+0x60/0x62#015
2018-11-27T15:22:49.839249-08:00 INFO VM(3)[3787]: [99834.647923]  start_kernel+0x37e/0x3e0#015
2018-11-27T15:22:49.840388-08:00 INFO VM(3)[3787]: [99834.659704]  ? start_kernel+0x3e0/0x3e0#015
2018-11-27T15:22:49.853732-08:00 INFO VM(3)[3787]: [99834.660857]  secondary_startup_64+0xa4/0xb0#015
2018-11-27T15:22:51.654711-08:00 INFO VM(3)[3787]: [99836.473483] Shutting down cpus with NMI#015
2018-11-27T15:22:51.656515-08:00 INFO VM(3)[3787]: [99836.475475] Kernel Offset: disabled#015
2018-11-27T15:22:51.658436-08:00 INFO VM(3)[3787]: [99836.477151] Rebooting in 1 seconds..#015

Cc: dverkamp@chromium.org
Cc: chirantan@chromium.org
So this looks like the problem: 

2018-11-27T15:22:49.186328-08:00 INFO VM(3)[3787]: [99833.984887] watchdog: BUG: soft lockup - CPU#1 stuck for 92463s! [swapper/1:0]#015
2018-11-27T15:22:49.205371-08:00 INFO VM(3)[3787]: [99834.010082] watchdog: BUG: soft lockup - CPU#0 stuck for 92463s! 


92463 seconds is about 25.7 hours, which sounds like it matches the time that the device was suspended. I'm guessing the VM's timer should be stopped while the device is suspended, since the guest VM has no idea when we suspend and resume the system. I know ARM has separate physical and virtual timers for dealing with this. Does x86 have something similar?
I had thought that x86 used the kvmclock which would handle this kind of detail, but perhaps either it doesn't handle the issue or we're not using it correctly?
I know this is a recent issue, as I had never experienced it before last week when I was traveling with my Pixelbook. It seemed to happen pretty regularly.

No recent change in crosvm would have started or stopped plumbing a virtualized clock into the guest, which would indicate the problem is in a recent change to the guest or host kernel.
Owner: chirantan@chromium.org
Status: Assigned (was: Untriaged)
<triage>Assigning to chirantan from triage</triage>
Cc: smbar...@chromium.org tbuck...@chromium.org
Issue 918552 has been merged into this issue.
Labels: M-73 ReleaseBlock-Beta
Labels: -ReleaseBlock-Beta -M-73 ReleaseBlock-Stable M-72
Actually moving to M72 stable blocker for now since this affects 71.
Cc: rohi...@chromium.org
Issue 891024 has been merged into this issue.
Cc: groeck@chromium.org
Summary: Guest VM timers are not stopped when the device is suspended (was: Linux container crashes when Pixelbook suspended)
So I tried reproducing this.  I noticed some interesting guest kernel warnings when the system resumed:


2019-01-09T16:05:20.240360-08:00 INFO VM(3)[4070]: [ 2001.168173] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:#015
2019-01-09T16:05:20.270963-08:00 INFO VM(3)[4070]: [ 2001.174743] clocksource:                       'kvm-clock' wd_now: 1e6346da20b wd_last: 1de17eb7607 mask: ffffffffffffffff#015
2019-01-09T16:05:20.294693-08:00 INFO VM(3)[4070]: [ 2001.205707] clocksource:                       'tsc' cs_now: 2e05b0850b0 cs_last: 2e03d524cd4 mask: ffffffffffffffff#015
2019-01-09T16:05:20.308640-08:00 INFO VM(3)[4070]: [ 2001.229356] tsc: Marking TSC unstable due to clocksource watchdog#015


These lines come from the clocksource_watchdog function in https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-4.19/kernel/time/clocksource.c



I'm not sure how marking a clocksource unstable affects the rest of the system, but at least this confirms my suspicion that the timer value the guest kernel reads does not stop while the system is suspended.


+groeck in case he's seen anything like this before
Labels: -ReleaseBlock-Stable
I was able to reproduce this problem in both R68 and R69, so I know this is *not* a recent issue. It has been a problem since the very beginning. The important thing is to have a process with high CPU usage inside the VM at the time the device suspends. Reading the kernel watchdog code, it seems there are scheduler hooks that prevent the softlockup trigger when the kernel scheduler is idle, which is why suspend/resume is fine when the VM is idle. Someone who knows more about this can correct me if I'm wrong.


There is a KVM ioctl, KVM_KVMCLOCK_CTRL, which sets a bit to tell the VM that it was paused. The watchdog code has an explicit check for this flag when it would otherwise trigger a softlockup, so setting this bit for every vCPU when the device suspends seems like it would make the problem go away. However, I haven't been able to figure out, from the patch series that added this bit or from the associated documentation, whether this is the proper way to solve the problem or just a workaround. Everything I have found talks about migrating VMs between kernels, not about suspending and resuming on the same device.



Also, since this has been broken since launch I don't think it should be ReleaseBlock-Stable.
I have been unable to reproduce this on my kevin so it's likely a kvmclock-specific issue, which makes sense since kvmclock is only used on x86.
The clock skew and "Marking TSC unstable" sound a whole lot like Issue 878908, which we thought was due to an incompletely-applied patch set.

Does our 4.19 kernel that this device was running have the full patch set mentioned in https://bugs.chromium.org/p/chromium/issues/detail?id=878908#c8?
Blockedon: 920875
