Guest VM timers are not stopped when the device is suspended |
|||||||||||||
Issue descriptionChrome version: 70.0.3538.76 (Official Build) (64-bit) OS: Chrome Repro steps: 1. Launch a process using 100% CPUs (e.g. compiling tensorflow) 2. Let the Pixelbook go to sleep after the normal grace period 3. Wake up the Pixelbook Expected: the linux container should still be here and the running job should keep going Actual: the crostini terminal is gone (crashed) and crostini must be restarted from scrach
,
Nov 22
Hi there, Yes it crashed systematically when I was compiling tensorflow and using all 4 cores (the Bazel tool used to compile Tensorflow mobilizes all cores available unless instructed otherwise). As to the ctrl-shf-I shall I do it right after the resume from the chrome desktop or shall I terminal first ? Laurent On Thu, Nov 22, 2018, 01:11 mutexlox via monorail < monorail+v2.1285344933@chromium.org wrote:
,
Nov 23
,
Nov 26
Right after you notice the issue should be fine. Please also include "#crostini" in your report. Thanks!
,
Nov 26
,
Nov 27
Is this something you know about Jeff?
,
Nov 27
No, I haven't heard of this one before that I recall.
,
Nov 27
,
Nov 27
I just had the VM crash after waking up a sand device that I'd had the lid closed on since yesterday. This is what was in messages: 2018-11-26T11:40:23.974823-08:00 INFO VM(3)[3736]: tremplin[142]: 2018/11/26 19:40:21 Received GetContainerUsername RPC: penguin#012 2018-11-27T15:22:49.186328-08:00 INFO VM(3)[3787]: [99833.984887] watchdog: BUG: soft lockup - CPU#1 stuck for 92463s! [swapper/1:0]#015 2018-11-27T15:22:49.205371-08:00 INFO VM(3)[3787]: [99834.010082] watchdog: BUG: soft lockup - CPU#0 stuck for 92463s! [swapper/0:0]#015 2018-11-27T15:22:49.218535-08:00 INFO VM(3)[3787]: [99834.010501] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.2-02071-g869138e6e829 #1#015 2018-11-27T15:22:49.230321-08:00 INFO VM(3)[3787]: [99834.010568] RIP: 0010:native_safe_halt+0x2/0x3#015 2018-11-27T15:22:49.242025-08:00 INFO VM(3)[3787]: [99834.010575] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015 2018-11-27T15:22:49.246819-08:00 INFO VM(3)[3787]: [99834.010578] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13#015 2018-11-27T15:22:49.273251-08:00 INFO VM(3)[3787]: [99834.010583] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 254887d6caec9600#015 2018-11-27T15:22:49.284401-08:00 INFO VM(3)[3787]: [99834.010585] RDX: ffffffff8203c040 RSI: ffff8800b0520680 RDI: 0000000000000001#015 2018-11-27T15:22:49.308445-08:00 INFO VM(3)[3787]: [99834.010588] RBP: ffff8800af8b5a00 R08: 0000000000000000 R09: 0000000000000008#015 2018-11-27T15:22:49.316591-08:00 INFO VM(3)[3787]: [99834.010590] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015 2018-11-27T15:22:49.331927-08:00 INFO VM(3)[3787]: [99834.010592] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000#015 2018-11-27T15:22:49.360556-08:00 INFO VM(3)[3787]: [99834.010596] FS: 0000000000000000(0000) GS:ffff8800b0500000(0000) knlGS:0000000000000000#015 2018-11-27T15:22:49.362365-08:00 INFO VM(3)[3787]: [99834.010599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033#015 2018-11-27T15:22:49.370985-08:00 INFO vm_concierge[3754]: Setting resolv config for VM 3 2018-11-27T15:22:49.378746-08:00 INFO VM(3)[3787]: [99834.010601] CR2: 00007cd57abffc68 CR3: 00000000aeeba000 CR4: 00000000003406a0#015 2018-11-27T15:22:49.379515-08:00 INFO VM(3)[3787]: [99834.010609] Call Trace:#015 2018-11-27T15:22:49.394828-08:00 INFO VM(3)[3787]: [99834.010699] default_idle+0xd/0x18#015 2018-11-27T15:22:49.395782-08:00 INFO VM(3)[3787]: [99834.010782] do_idle+0xeb/0x21f#015 2018-11-27T15:22:49.406586-08:00 INFO VM(3)[3787]: [99834.010835] cpu_startup_entry+0x60/0x62#015 2018-11-27T15:22:49.415784-08:00 INFO VM(3)[3787]: [99834.010859] start_secondary+0x1a8/0x1c3#015 2018-11-27T15:22:49.417052-08:00 INFO VM(3)[3787]: [99834.010893] secondary_startup_64+0xa4/0xb0#015 2018-11-27T15:22:49.432442-08:00 INFO VM(3)[3787]: [99834.010921] Kernel panic - not syncing: softlockup: hung tasks#015 2018-11-27T15:22:49.436029-08:00 INFO VM(3)[3787]: [99834.010928] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G L 4.19.2-02071-g869138e6e829 #1#015 2018-11-27T15:22:49.446913-08:00 INFO VM(3)[3787]: [99834.010929] Call Trace:#015 2018-11-27T15:22:49.447499-08:00 INFO VM(3)[3787]: [99834.010965] <IRQ>#015 2018-11-27T15:22:49.452571-08:00 INFO VM(3)[3787]: [99834.010974] dump_stack+0x43/0x5c#015 2018-11-27T15:22:49.456972-08:00 INFO VM(3)[3787]: [99834.010992] panic+0xf0/0x25c#015 2018-11-27T15:22:49.458444-08:00 INFO VM(3)[3787]: [99834.011018] ? release_firmware_map_entry+0x81/0x97#015 2018-11-27T15:22:49.477278-08:00 INFO VM(3)[3787]: [99834.011038] watchdog_timer_fn+0x267/0x267#015 2018-11-27T15:22:49.481728-08:00 INFO VM(3)[3787]: [99834.011069] ? proc_watchdog_cpumask+0x5b/0x5b#015 2018-11-27T15:22:49.491011-08:00 INFO VM(3)[3787]: [99834.011091] __hrtimer_run_queues+0x93/0xef#015 2018-11-27T15:22:49.493765-08:00 INFO VM(3)[3787]: [99834.011108] hrtimer_run_queues+0xab/0xdb#015 2018-11-27T15:22:49.502848-08:00 INFO VM(3)[3787]: [99834.011110] run_local_timers+0x16/0x40#015 2018-11-27T15:22:49.504145-08:00 INFO VM(3)[3787]: [99834.011112] update_process_times+0x1c/0x46#015 2018-11-27T15:22:49.510919-08:00 INFO VM(3)[3787]: [99834.011115] tick_handle_periodic+0x1c/0x5f#015 2018-11-27T15:22:49.512319-08:00 INFO VM(3)[3787]: [99834.011126] smp_apic_timer_interrupt+0x5a/0x95#015 2018-11-27T15:22:49.529793-08:00 INFO VM(3)[3787]: [99834.011139] apic_timer_interrupt+0xf/0x20#015 2018-11-27T15:22:49.530307-08:00 INFO VM(3)[3787]: [99834.011141] </IRQ>#015 2018-11-27T15:22:49.534419-08:00 INFO VM(3)[3787]: [99834.011143] RIP: 0010:native_safe_halt+0x2/0x3#015 2018-11-27T15:22:49.558191-08:00 INFO VM(3)[3787]: [99834.011145] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015 2018-11-27T15:22:49.563931-08:00 INFO VM(3)[3787]: [99834.011146] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13#015 2018-11-27T15:22:49.584463-08:00 INFO VM(3)[3787]: [99834.011147] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 254887d6caec9600#015 2018-11-27T15:22:49.599145-08:00 INFO VM(3)[3787]: [99834.011148] RDX: ffffffff8203c040 RSI: ffff8800b0520680 RDI: 0000000000000001#015 2018-11-27T15:22:49.603643-08:00 INFO VM(3)[3787]: [99834.011149] RBP: ffff8800af8b5a00 R08: 0000000000000000 R09: 0000000000000008#015 2018-11-27T15:22:49.627067-08:00 INFO VM(3)[3787]: [99834.011150] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015 2018-11-27T15:22:49.632083-08:00 INFO VM(3)[3787]: [99834.011151] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000#015 2018-11-27T15:22:49.633266-08:00 INFO VM(3)[3787]: [99834.011153] ? __sched_text_end+0x1/0x1#015 2018-11-27T15:22:49.663913-08:00 INFO VM(3)[3787]: [99834.011155] default_idle+0xd/0x18#015 2018-11-27T15:22:49.664869-08:00 INFO VM(3)[3787]: [99834.011158] do_idle+0xeb/0x21f#015 2018-11-27T15:22:49.671335-08:00 INFO VM(3)[3787]: [99834.011160] cpu_startup_entry+0x60/0x62#015 2018-11-27T15:22:49.672558-08:00 INFO VM(3)[3787]: [99834.011163] start_secondary+0x1a8/0x1c3#015 2018-11-27T15:22:49.673825-08:00 INFO VM(3)[3787]: [99834.011164] secondary_startup_64+0xa4/0xb0#015 2018-11-27T15:22:49.691269-08:00 INFO VM(3)[3787]: [99834.494282] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G L 4.19.2-02071-g869138e6e829 #1#015 2018-11-27T15:22:49.694761-08:00 INFO VM(3)[3787]: [99834.513021] RIP: 0010:native_safe_halt+0x2/0x3#015 2018-11-27T15:22:49.712295-08:00 INFO VM(3)[3787]: [99834.515210] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 8e 99 96 ff fb f4 <c3> f4 c3 e8 55 ca 98 ff 84 c0 0f 84 b8 a2 91 ff e9 40 ca 98 ff 53#015 2018-11-27T15:22:49.722794-08:00 INFO VM(3)[3787]: [99834.532730] RSP: 0018:ffffffff82003ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcf#015 2018-11-27T15:22:49.727623-08:00 INFO VM(3)[3787]: [99834.543225] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 531447993d38cc00#015 2018-11-27T15:22:49.733596-08:00 INFO VM(3)[3787]: [99834.548055] RDX: ffffffff8203c040 RSI: ffff8800b0420680 RDI: 0000000000000000#015 2018-11-27T15:22:49.735775-08:00 INFO VM(3)[3787]: [99834.554025] RBP: ffffffff8201b480 R08: 0000000000000000 R09: 8080808080808080#015 2018-11-27T15:22:49.741454-08:00 INFO VM(3)[3787]: [99834.556206] R10: 0000000000000000 R11: ffffffff81701638 R12: 0000000000000000#015 2018-11-27T15:22:49.754653-08:00 INFO VM(3)[3787]: [99834.561906] R13: 0000000000000000 R14: ffffffff82003eb6 R15: 0000000000000000#015 2018-11-27T15:22:49.762370-08:00 INFO VM(3)[3787]: [99834.575086] FS: 0000000000000000(0000) GS:ffff8800b0400000(0000) knlGS:0000000000000000#015 2018-11-27T15:22:49.764072-08:00 INFO VM(3)[3787]: [99834.582800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033#015 2018-11-27T15:22:49.801588-08:00 INFO VM(3)[3787]: [99834.584514] CR2: 00007cd5740020c0 CR3: 00000000aeeba000 CR4: 00000000003406b0#015 2018-11-27T15:22:49.808174-08:00 INFO VM(3)[3787]: [99834.622116] Call Trace:#015 2018-11-27T15:22:49.809277-08:00 INFO VM(3)[3787]: [99834.628618] default_idle+0xd/0x18#015 2018-11-27T15:22:49.826315-08:00 INFO VM(3)[3787]: [99834.629723] do_idle+0xeb/0x21f#015 2018-11-27T15:22:49.827437-08:00 INFO VM(3)[3787]: [99834.646711] cpu_startup_entry+0x60/0x62#015 2018-11-27T15:22:49.839249-08:00 INFO VM(3)[3787]: [99834.647923] start_kernel+0x37e/0x3e0#015 2018-11-27T15:22:49.840388-08:00 INFO VM(3)[3787]: [99834.659704] ? start_kernel+0x3e0/0x3e0#015 2018-11-27T15:22:49.853732-08:00 INFO VM(3)[3787]: [99834.660857] secondary_startup_64+0xa4/0xb0#015 2018-11-27T15:22:51.654711-08:00 INFO VM(3)[3787]: [99836.473483] Shutting down cpus with NMI#015 2018-11-27T15:22:51.656515-08:00 INFO VM(3)[3787]: [99836.475475] Kernel Offset: disabled#015 2018-11-27T15:22:51.658436-08:00 INFO VM(3)[3787]: [99836.477151] Rebooting in 1 seconds..#015
,
Nov 27
,
Nov 27
,
Nov 28
So this looks like the problem: 2018-11-27T15:22:49.186328-08:00 INFO VM(3)[3787]: [99833.984887] watchdog: BUG: soft lockup - CPU#1 stuck for 92463s! [swapper/1:0]#015 2018-11-27T15:22:49.205371-08:00 INFO VM(3)[3787]: [99834.010082] watchdog: BUG: soft lockup - CPU#0 stuck for 92463s! 92463 seconds is about 25 hours, which sounds like it matches the time that the device was suspended. I'm guessing the VM's timer should be stopped while the device is suspended since the guest VM has no idea when we suspend and resume the system. I know arm has separate physical and virtual timers for dealing with this. Does x86 have something similar?
,
Nov 28
I had thought that x86 used the kvmclock which would handle this kind of detail, but perhaps either it doesn't handle the issue or we're not using it correctly?
,
Nov 28
I know this is a recent issue as I had never experienced this issue before last week when I was traveling with my pixelbook. It seemed to happen pretty regularly. No recent change in crosvm would have started or stopped plumbing a virtualized clock into the guest, which would indicate the problem is in a recent change to the guest or host kernel.
,
Nov 29
<triage>Assigning to chirantan from triage</triage>
,
Jan 2
,
Jan 3
,
Jan 3
Actually moving to M72 stable blocker for now since this affects 71.
,
Jan 4
,
Jan 10
So I tried reproducing this. I noticed some interesting guest kernel warnings when the system resumed: 2019-01-09T16:05:20.240360-08:00 INFO VM(3)[4070]: [ 2001.168173] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:#015 2019-01-09T16:05:20.270963-08:00 INFO VM(3)[4070]: [ 2001.174743] clocksource: 'kvm-clock' wd_now: 1e6346da20b wd_last: 1de17eb7607 mask: ffffffffffffffff#015 2019-01-09T16:05:20.294693-08:00 INFO VM(3)[4070]: [ 2001.205707] clocksource: 'tsc' cs_now: 2e05b0850b0 cs_last: 2e03d524cd4 mask: ffffffffffffffff#015 2019-01-09T16:05:20.308640-08:00 INFO VM(3)[4070]: [ 2001.229356] tsc: Marking TSC unstable due to clocksource watchdog#015 These lines come from the clocksource_watchdog function in https://chromium.googlesource.com/chromiumos/third_party/kernel/+/chromeos-4.19/kernel/time/clocksource.c I'm not sure how marking a clocksource unstable affects the rest of the system but at least this confirms my suspicion that the timer value that the guest kernel reads is not stopping when the system is suspended. +groeck in case he's seen anything like this before
,
Jan 11
I was able to reproduce this problem in both R68 and R69 so I know this is *not* a recent issue. It has been a problem since the very beginning. The important thing is to have a process that has high cpu usage inside the VM at the time that the device goes to suspend. Reading the kernel watchdog code it seems like there are some scheduler hooks to prevent the softlockup trigger when the kernel scheduler is idle, which is why suspend/resume is fine when the VM is idle. Someone who knows more about this can correct me if I'm wrong. There is a kvm ioctl, KVM_KVMCLOCK_CTRL, which sets a bit to tell the VM that it was paused. The watchdog code has an explicit check for this flag when it would otherwise trigger a softlockup. So setting this bit for every vcpu when the device suspends seems like it would make the problem go away. However I haven't been able to figure out from the patch series that added this bit or the associated documentation whether this is the proper way to solve this problem or just a workaround. Everything I have found just talks about migrating VMs between kernels and not about suspending and resuming on the same device. Also, since this has been broken since launch I don't think it should be ReleaseBlock-Stable.
,
Jan 11
I have been unable to reproduce this on my kevin so it's likely a kvmclock-specific issue, which makes sense since kvmclock is only used on x86.
,
Jan 11
The clock skew and "Marking TSC unstable" sound a whole lot like Issue 878908 , which we thought was due to an incompletely-applied patch set. Does our 4.19 kernel that this device was running have the full patch set mentioned in https://bugs.chromium.org/p/chromium/issues/detail?id=878908#c8?
,
Jan 11
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by mutexlox@chromium.org
, Nov 22