Kernel lockup starting VM
Issue description

Observed in stainless results for 11241.0.0 on celes. https://stainless.corp.google.com/browse/chromeos-autotest-results/255973409-chromeos-test/

2018-11-08T00:54:30.583988-08:00 INFO vm_cicerone[8013]: Received SetUpLxdContainerUser request
2018-11-08T00:54:30.978005-08:00 INFO VM(3)[8040]: [ 3089.506646] new mount options do not match the existing superblock, will be ignored
2018-11-08T00:54:32.801949-08:00 INFO VM(3)[8008]: lxd[120]: action=start created=2018-11-08T08:52:49+0000 ephemeral=false lvl=info msg="Starting container" name=penguin stateful=false t=2018-11-08T08:54:28+0000 used=1970-01-01T00:00:00+0000
2018-11-08T00:54:32.801961-08:00 INFO VM(3)[8008]: lxd[120]: action=start created=2018-11-08T08:52:49+0000 ephemeral=false lvl=info msg="Started container" name=penguin stateful=false t=2018-11-08T08:54:29+0000 used=1970-01-01T00:00:00+0000
2018-11-08T00:54:32.801965-08:00 INFO VM(3)[8008]: tremplin[142]: 2018/11/08 08:54:30 Received SetUpUser RPC: penguin
2018-11-08T00:55:30.594398-08:00 ERR vm_cicerone[8013]: SetUpUser RPC failed: Deadline Exceeded
2018-11-08T00:55:30.599191-08:00 INFO VM(3)[8040]: [ 3149.131569] watchdog: BUG: soft lockup - CPU#1 stuck for 55s! [swapper/1:0]
2018-11-08T00:55:30.604158-08:00 INFO VM(3)[8040]: [ 3149.131577] watchdog: BUG: soft lockup - CPU#0 stuck for 55s! [swapper/0:0]
2018-11-08T00:55:30.623724-08:00 INFO VM(3)[8040]: [ 3149.131588] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0-00950-g35c37b4312d4 #1
2018-11-08T00:55:30.629201-08:00 INFO VM(3)[8040]: [ 3149.132515] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-00950-g35c37b4312d4 #1
2018-11-08T00:55:30.646375-08:00 INFO VM(3)[8040]: [ 3149.132465] RIP: 0010:native_safe_halt+0x2/0x3
2018-11-08T00:55:30.679837-08:00 INFO VM(3)[8040]: [ 3149.132465] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 c6 5f 99 ff fb f4 <c3> f4 c3 e8 85 90 9b ff 84 c0 0f 84 b0 69 94 ff e9 70 90 9b ff 53
2018-11-08T00:55:30.690202-08:00 INFO VM(3)[8040]: [ 3149.132465] RSP: 0018:ffffffff81e03ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcf
2018-11-08T00:55:30.710992-08:00 INFO VM(3)[8040]: [ 3149.132465] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0ed1a7c2d0008a00
2018-11-08T00:55:30.720418-08:00 INFO VM(3)[8040]: [ 3149.132515] RIP: 0010:native_safe_halt+0x2/0x3
2018-11-08T00:55:30.743945-08:00 INFO VM(3)[8040]: [ 3149.132515] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 c6 5f 99 ff fb f4 <c3> f4 c3 e8 85 90 9b ff 84 c0 0f 84 b0 69 94 ff e9 70 90 9b ff 53
2018-11-08T00:55:30.760316-08:00 INFO VM(3)[8040]: [ 3149.132515] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
2018-11-08T00:55:30.767640-08:00 INFO VM(3)[8040]: [ 3149.132515] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 1a3b14df22114400
2018-11-08T00:55:30.773218-08:00 INFO VM(3)[8040]: [ 3149.132515] RDX: ffffffff81e3c040 RSI: ffff880059520680 RDI: 0000000000000001
2018-11-08T00:55:30.781494-08:00 INFO VM(3)[8040]: [ 3149.132515] RBP: ffff880058ca3300 R08: 0000000000000001 R09: ffffc9000040bd60
2018-11-08T00:55:30.791403-08:00 INFO VM(3)[8040]: [ 3149.132515] R10: 0000000000000001 R11: ffffffff816d4f40 R12: 0000000000000000
2018-11-08T00:55:30.796175-08:00 INFO VM(3)[8040]: [ 3149.132515] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000
2018-11-08T00:55:30.801918-08:00 INFO VM(3)[8040]: [ 3149.132515] FS:  0000000000000000(0000) GS:ffff880059500000(0000) knlGS:0000000000000000
2018-11-08T00:55:30.807461-08:00 INFO VM(3)[8040]: [ 3149.132515] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-11-08T00:55:30.812007-08:00 INFO VM(3)[8040]: [ 3149.132515] CR2: 000056571ef48bf8 CR3: 0000000008d12000 CR4: 00000000001006a0
2018-11-08T00:55:30.813687-08:00 INFO VM(3)[8040]: [ 3149.132515] Call Trace:
2018-11-08T00:55:30.815753-08:00 INFO VM(3)[8040]: [ 3149.132515]  default_idle+0xd/0x18
2018-11-08T00:55:30.817774-08:00 INFO VM(3)[8040]: [ 3149.132515]  do_idle+0xeb/0x21f
2018-11-08T00:55:30.822575-08:00 INFO VM(3)[8040]: [ 3149.132515]  cpu_startup_entry+0x60/0x62
2018-11-08T00:55:30.828763-08:00 INFO VM(3)[8040]: [ 3149.132515]  start_secondary+0x1a8/0x1c3
2018-11-08T00:55:30.831835-08:00 INFO VM(3)[8040]: [ 3149.132515]  secondary_startup_64+0xa4/0xb0
2018-11-08T00:55:30.835744-08:00 INFO VM(3)[8040]: [ 3149.132515] Kernel panic - not syncing: softlockup: hung tasks
2018-11-08T00:55:30.840639-08:00 INFO VM(3)[8040]: [ 3149.132465] RDX: ffffffff81e3c040 RSI: ffff880059420680 RDI: 0000000000000000
2018-11-08T00:55:30.845601-08:00 INFO VM(3)[8040]: [ 3149.132465] RBP: ffffffff81e1b480 R08: 0000000000000000 R09: 0000000000000000
2018-11-08T00:55:30.850932-08:00 INFO VM(3)[8040]: [ 3149.132465] R10: 0000000000000000 R11: ffffffff816d4f40 R12: 0000000000000000
2018-11-08T00:55:30.856125-08:00 INFO VM(3)[8040]: [ 3149.132465] R13: 0000000000000000 R14: ffffffff81e03eb6 R15: 0000000000000000
2018-11-08T00:55:30.861446-08:00 INFO VM(3)[8040]: [ 3149.132465] FS:  0000000000000000(0000) GS:ffff880059400000(0000) knlGS:0000000000000000
2018-11-08T00:55:30.874590-08:00 INFO VM(3)[8040]: [ 3149.132465] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-11-08T00:55:30.882160-08:00 INFO VM(3)[8040]: [ 3149.132465] CR2: 000056571cf80462 CR3: 0000000002014000 CR4: 00000000001006b0
2018-11-08T00:55:30.885821-08:00 INFO VM(3)[8040]: [ 3149.132465] Call Trace:
2018-11-08T00:55:30.890065-08:00 INFO VM(3)[8040]: [ 3149.132465]  default_idle+0xd/0x18
2018-11-08T00:55:30.893592-08:00 INFO VM(3)[8040]: [ 3149.132465]  do_idle+0xeb/0x21f
2018-11-08T00:55:30.896243-08:00 INFO VM(3)[8040]: [ 3149.132465]  cpu_startup_entry+0x60/0x62
2018-11-08T00:55:30.901582-08:00 INFO VM(3)[8040]: [ 3149.132465]  start_kernel+0x37e/0x3e0
2018-11-08T00:55:30.912751-08:00 INFO VM(3)[8040]: [ 3149.132465]  ? start_kernel+0x3e0/0x3e0
2018-11-08T00:55:30.917910-08:00 INFO VM(3)[8040]: [ 3149.132465]  secondary_startup_64+0xa4/0xb0
2018-11-08T00:55:30.928905-08:00 INFO VM(3)[8040]: [ 3149.132515] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G L 4.19.0-00950-g35c37b4312d4 #1
2018-11-08T00:55:30.932569-08:00 INFO VM(3)[8040]: [ 3149.132515] Call Trace:
2018-11-08T00:55:30.935749-08:00 INFO VM(3)[8040]: [ 3149.132515]  <IRQ>
2018-11-08T00:55:30.940291-08:00 INFO VM(3)[8040]: [ 3149.132515]  dump_stack+0x43/0x5c
2018-11-08T00:55:30.945214-08:00 INFO VM(3)[8040]: [ 3149.132515]  panic+0xf0/0x25c
2018-11-08T00:55:30.951059-08:00 INFO VM(3)[8040]: [ 3149.132515]  ? release_firmware_map_entry+0x21/0x97
2018-11-08T00:55:30.956529-08:00 INFO VM(3)[8040]: [ 3149.132515]  watchdog_timer_fn+0x267/0x267
2018-11-08T00:55:30.960982-08:00 INFO VM(3)[8040]: [ 3149.132515]  ? proc_watchdog_cpumask+0x5b/0x5b
2018-11-08T00:55:30.966628-08:00 INFO VM(3)[8040]: [ 3149.132515]  __hrtimer_run_queues+0x93/0xef
2018-11-08T00:55:30.971544-08:00 INFO VM(3)[8040]: [ 3149.132515]  hrtimer_run_queues+0xab/0xdb
2018-11-08T00:55:30.976057-08:00 INFO VM(3)[8040]: [ 3149.132515]  run_local_timers+0x16/0x40
2018-11-08T00:55:30.980201-08:00 INFO VM(3)[8040]: [ 3149.132515]  update_process_times+0x1c/0x46
2018-11-08T00:55:30.984951-08:00 INFO VM(3)[8040]: [ 3149.132515]  tick_handle_periodic+0x1c/0x5f
2018-11-08T00:55:30.988865-08:00 INFO VM(3)[8040]: [ 3149.132515]  smp_apic_timer_interrupt+0x5a/0x95
2018-11-08T00:55:30.992634-08:00 INFO VM(3)[8040]: [ 3149.132515]  apic_timer_interrupt+0xf/0x20
2018-11-08T00:55:30.994794-08:00 INFO VM(3)[8040]: [ 3149.132515]  </IRQ>
2018-11-08T00:55:30.998579-08:00 INFO VM(3)[8040]: [ 3149.132515] RIP: 0010:native_safe_halt+0x2/0x3
2018-11-08T00:55:31.015179-08:00 INFO VM(3)[8040]: [ 3149.132515] Code: 74 0c 65 81 24 25 14 4d 01 00 ff ff ff 7f 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 05 48 83 c4 10 c3 e8 c6 5f 99 ff fb f4 <c3> f4 c3 e8 85 90 9b ff 84 c0 0f 84 b0 69 94 ff e9 70 90 9b ff 53
2018-11-08T00:55:31.024544-08:00 INFO VM(3)[8040]: [ 3149.132515] RSP: 0018:ffffc90000073ec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
2018-11-08T00:55:31.034993-08:00 INFO VM(3)[8040]: [ 3149.132515] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 1a3b14df22114400
2018-11-08T00:55:31.046071-08:00 INFO VM(3)[8040]: [ 3149.132515] RDX: ffffffff81e3c040 RSI: ffff880059520680 RDI: 0000000000000001
2018-11-08T00:55:31.053823-08:00 INFO VM(3)[8040]: [ 3149.132515] RBP: ffff880058ca3300 R08: 0000000000000001 R09: ffffc9000040bd60
2018-11-08T00:55:31.063327-08:00 INFO VM(3)[8040]: [ 3149.132515] R10: 0000000000000001 R11: ffffffff816d4f40 R12: 0000000000000000
2018-11-08T00:55:31.074107-08:00 INFO VM(3)[8040]: [ 3149.132515] R13: 0000000000000000 R14: ffffc90000073ed6 R15: 0000000000000000
2018-11-08T00:55:31.081745-08:00 INFO VM(3)[8040]: [ 3149.132515]  ? __sched_text_end+0x1/0x1
2018-11-08T00:55:31.099528-08:00 INFO VM(3)[8040]: [ 3149.132515]  default_idle+0xd/0x18
2018-11-08T00:55:31.109117-08:00 INFO VM(3)[8040]: [ 3149.132515]  do_idle+0xeb/0x21f
2018-11-08T00:55:31.127827-08:00 INFO VM(3)[8040]: [ 3149.132515]  cpu_startup_entry+0x60/0x62
2018-11-08T00:55:31.134821-08:00 INFO VM(3)[8040]: [ 3149.132515]  start_secondary+0x1a8/0x1c3
2018-11-08T00:55:31.148977-08:00 INFO VM(3)[8040]: [ 3149.132515]  secondary_startup_64+0xa4/0xb0
2018-11-08T00:55:33.350682-08:00 INFO vm_concierge[8022]: Shutting down due to SIGTERM
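For triage scripts, the soft-lockup reports above can be pulled out mechanically. A minimal sketch, assuming the message format shown in this 4.19 guest log (the regex is fitted to these exact lines, not a general dmesg grammar):

```python
import re

# Matches lines like:
#   watchdog: BUG: soft lockup - CPU#1 stuck for 55s! [swapper/1:0]
LOCKUP_RE = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)

def parse_soft_lockup(line):
    """Extract (cpu, seconds_stuck, task) from a soft-lockup report, or None."""
    m = LOCKUP_RE.search(line)
    if not m:
        return None
    return int(m.group("cpu")), int(m.group("secs")), m.group("task")

line = ("[ 3149.131569] watchdog: BUG: soft lockup - "
        "CPU#1 stuck for 55s! [swapper/1:0]")
print(parse_soft_lockup(line))  # -> (1, 55, 'swapper/1:0')
```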
Nov 8
dverkamp, can you take a look? You might have to beg/borrow/steal a Celes.
Nov 8
At first glance, it looks like something odd must be going on with timekeeping - the timestamps in the kernel log are >3000 seconds (50 minutes), but we should have just booted this VM. This seems to happen even on successful celes runs: https://stainless.corp.google.com/browse/chromeos-autotest-results/255973409-chromeos-test/

The jump happens right here:

2018-11-08T00:52:14.835270-08:00 INFO VM(3)[8040]: [    0.000000] NX (Execute Disable) protection: active
2018-11-08T00:52:14.836750-08:00 INFO VM(3)[8040]: [    0.000000] DMI not present or invalid.
2018-11-08T00:52:14.839551-08:00 INFO VM(3)[8040]: [    0.000000] Hypervisor detected: KVM
2018-11-08T00:52:14.843655-08:00 INFO VM(3)[8040]: [    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
2018-11-08T00:52:14.848046-08:00 INFO VM(3)[8040]: [ 2952.532183] kvm-clock: cpu 0, msr 24ef001, primary cpu clock
2018-11-08T00:52:14.855953-08:00 INFO VM(3)[8040]: [ 2952.532184] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
2018-11-08T00:52:14.859123-08:00 INFO VM(3)[8040]: [ 2952.532202] tsc: Initial usec timer 913970
2018-11-08T00:52:14.862675-08:00 INFO VM(3)[8040]: [ 2952.532205] tsc: Detected 1600.000 MHz processor
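The size of the jump can be quantified directly from the bracketed dmesg timestamps, which count seconds since the clocksource's zero point. A small sketch using the two lines around the kvm-clock registration above:

```python
import re

# dmesg-style "[ seconds.micros ]" prefix, as seen in the VM log above.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

log = [
    "[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00",
    "[ 2952.532183] kvm-clock: cpu 0, msr 24ef001, primary cpu clock",
]
stamps = [float(TS_RE.search(line).group(1)) for line in log]
jump = stamps[1] - stamps[0]  # offset introduced when kvm-clock takes over
print(f"timestamp jump: {jump:.1f}s (~{jump / 60:.0f} minutes)")
```

That is roughly 49 minutes, consistent with the ">3000 second" observation and with the offset appearing exactly when the guest switches to kvm-clock.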
Nov 8
That's definitely new behavior. It may have something to do with the new clock sync code that was added...but I'm pretty sure that code isn't getting called yet. Could also be from the kernel uprev to 4.19.
Nov 8
Yeah, all of that output is from kernel boot, long before userspace is started. I vaguely recall clock jumps like that being "normal", but it's probably worth double-checking.
Nov 8
I don't think this could have anything to do with my changes:
- If userspace hasn't started running yet, there's no way for maitred to make any changes.
- My changes set the time with settimeofday(), which as far as I know does NOT affect the clocksource used for timekeeping in the sense of "seconds since boot". Users should be able to call settimeofday() however they want; if that causes kernel crashes, it's a bug in the kernel.
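The distinction being drawn above is observable from userspace: settimeofday() steps the wall clock (CLOCK_REALTIME), while the "seconds since boot"-style timestamps come from a monotonic clocksource that cannot be set. A minimal sketch reading both clocks (it only observes them; actually stepping the wall clock requires root and a settimeofday() call, which is not done here):

```python
import time

# time.time() reads the settable wall clock (what settimeofday() changes);
# time.monotonic() reads a clock that cannot be set and only moves forward,
# like the clocksource behind the kernel log timestamps.
wall_before = time.time()
mono_before = time.monotonic()

time.sleep(0.01)  # a settimeofday() step could happen here without
                  # disturbing the monotonic readings

wall_after = time.time()
mono_after = time.monotonic()

# Monotonic time never goes backwards, even if the wall clock is stepped.
assert mono_after >= mono_before
print(f"monotonic advanced by {mono_after - mono_before:.3f}s")
```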
Nov 8
Yes, sorry for the confusion - I don't think this has anything to do with the settimeofday() RPC. The timestamp offset may just be a red herring anyway, but it seems odd (I always see what look like 0-based timestamps on my test systems). It might just be a host kernel difference; this board is still on 3.18, unfortunately. I believe this is also one of the systems where TSC calibration was failing on the 4.14 guest kernel: https://stainless.corp.google.com/browse/chromeos-autotest-results/253778387-chromeos-test/ zachr also noted that the host load numbers in the 'uptime' output look very high for a 2-core system - is this captured at the end of the test? I'm not sure why the system would be so loaded if it's just running our test (or do multiple tests get scheduled on the same machine at once?).
Nov 8
Only one test should be running at a time. But I'd imagine starting the VM/container would introduce a high load average during that phase (and the other tests that ran likely also induce a high loadavg).
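A quick way to sanity-check the load observation is to compare the 1/5/15-minute load averages (the same numbers 'uptime' prints) against the core count; sustained values well above the core count mean runnable tasks are queueing. A small sketch (os.getloadavg() is Unix-only, and the "high" threshold here is a rough rule of thumb, not a hard limit):

```python
import os

# os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
# On a 2-core system like celes, sustained values well above 2
# suggest the run queue is backed up.
load1, load5, load15 = os.getloadavg()
cores = os.cpu_count() or 1

for label, value in (("1min", load1), ("5min", load5), ("15min", load15)):
    status = "high" if value > cores else "ok"
    print(f"load {label}: {value:.2f} ({status} for {cores} cores)")
```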
Comment 1 by jkardatzke@chromium.org, Nov 8