reusing vsock cids causes gRPC from VM to host to fail
Issue description

Once a vsock cid is used by a VM, it can spuriously fail to be reused after a concierge restart. This appears to be a host kernel issue.

Steps to reproduce:
1) Start a few VMs (expected cids: 3, 4, 5, ...).
2) Shut down concierge. All VMs will be shut down.
3) Start concierge and attempt to start VMs again (expected cids: 3, 4, 5, ...). Some VMs will not be able to notify concierge that they have started up, even though the VMs that previously occupied those cids have been shut down.
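For context on why the same cids come back after a restart: concierge hands out guest cids sequentially starting at 3 (cids 0-2 are reserved, with 2 being the host). The following is a minimal sketch of a purely in-memory allocator of that kind; the class and member names are illustrative assumptions, not the actual concierge code.

// Illustrative only: an in-process counter that starts over at 3 every time
// the daemon restarts, so the first VMs started after a restart get CIDs
// 3, 4, 5, ... that earlier VMs already used.
#include <cstdint>

class VsockCidAllocator {
 public:
  // Guest CIDs start at 3; 0-2 are reserved (2 is the host).
  uint64_t Allocate() { return next_cid_++; }

 private:
  uint64_t next_cid_ = 3;  // Lives only in this process; lost on restart.
};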
Mar 15 2018
Mar 30 2018
Interestingly, there are two separate kernel BUGs that may be related to this. The first happens when we try to set the owner of a reused cid:

2018-03-29T19:24:45.205756-07:00 ERR kernel: [ 1635.822286] BUG: sleeping function called from invalid context at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.121-r1424/work/chromeos-kernel-4_4-4.4.121/mm/page_alloc.c:3279
2018-03-29T19:24:45.205769-07:00 ERR kernel: [ 1635.822292] in_atomic(): 0, irqs_disabled(): 1, pid: 12325, name: vhost_vsock
2018-03-29T19:24:45.205770-07:00 WARNING kernel: [ 1635.822296] CPU: 1 PID: 12325 Comm: vhost_vsock Tainted: G W 4.4.121-13277-gf2a506858314 #2
2018-03-29T19:24:45.205771-07:00 WARNING kernel: [ 1635.822302] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
2018-03-29T19:24:45.205772-07:00 WARNING kernel: [ 1635.822316] 0000000000000000 facd550212af6408 ffff880075de7ab0 ffffffffa6f2ef3d
2018-03-29T19:24:45.205773-07:00 WARNING kernel: [ 1635.822327] ffff880075de7ae8 ffffffffa6f2eef2 0000003000000028 ffff880075de7af8
2018-03-29T19:24:45.205773-07:00 WARNING kernel: [ 1635.822332] 0000000000000086 facd550212af6408 ffff88042793c740 ffff880075de7b10
2018-03-29T19:24:45.205774-07:00 WARNING kernel: [ 1635.822337] Call Trace:
2018-03-29T19:24:45.205775-07:00 WARNING kernel: [ 1635.822344] [<ffffffffa6f2ef3d>] __dump_stack+0x19/0x1b
2018-03-29T19:24:45.205776-07:00 WARNING kernel: [ 1635.822347] [<ffffffffa6f2eef2>] dump_stack+0x4f/0x81
2018-03-29T19:24:45.205777-07:00 WARNING kernel: [ 1635.822350] [<ffffffffa6d5af75>] ___might_sleep+0x134/0x159
2018-03-29T19:24:45.205777-07:00 WARNING kernel: [ 1635.822353] [<ffffffffa6d5adfa>] __might_sleep+0x34/0x7b
2018-03-29T19:24:45.205778-07:00 WARNING kernel: [ 1635.822356] [<ffffffffa6db2693>] __alloc_pages_nodemask+0x94/0x1d14
2018-03-29T19:24:45.205779-07:00 WARNING kernel: [ 1635.822360] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205780-07:00 WARNING kernel: [ 1635.822363] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205780-07:00 WARNING kernel: [ 1635.822365] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205781-07:00 WARNING kernel: [ 1635.822367] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205782-07:00 WARNING kernel: [ 1635.822370] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205783-07:00 WARNING kernel: [ 1635.822372] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205783-07:00 WARNING kernel: [ 1635.822374] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205784-07:00 WARNING kernel: [ 1635.822376] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205785-07:00 WARNING kernel: [ 1635.822379] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205785-07:00 WARNING kernel: [ 1635.822381] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205786-07:00 WARNING kernel: [ 1635.822383] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205787-07:00 WARNING kernel: [ 1635.822385] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205788-07:00 WARNING kernel: [ 1635.822388] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205788-07:00 WARNING kernel: [ 1635.822390] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205789-07:00 WARNING kernel: [ 1635.822392] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205790-07:00 WARNING kernel: [ 1635.822395] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205791-07:00 WARNING kernel: [ 1635.822397] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205791-07:00 WARNING kernel: [ 1635.822400] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205792-07:00 WARNING kernel: [ 1635.822402] [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205793-07:00 WARNING kernel: [ 1635.822404] [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205793-07:00 WARNING kernel: [ 1635.822407] [<ffffffffa6db4362>] alloc_pages_node+0x18/0x1a
2018-03-29T19:24:45.205794-07:00 WARNING kernel: [ 1635.822410] [<ffffffffa6d80bf8>] alloc_kmem_pages+0x18/0x1a
2018-03-29T19:24:45.205795-07:00 WARNING kernel: [ 1635.822413] [<ffffffffa6d91d1e>] kmalloc_order+0x18/0x38
2018-03-29T19:24:45.205796-07:00 WARNING kernel: [ 1635.822416] [<ffffffffa6d91d6f>] kmalloc_order_trace+0x31/0xed
2018-03-29T19:24:45.205796-07:00 WARNING kernel: [ 1635.822419] [<ffffffffa6dcf80c>] __kmalloc+0x126/0x169
2018-03-29T19:24:45.205797-07:00 WARNING kernel: [ 1635.822422] [<ffffffffa6d6a53d>] ? prepare_to_wait_event+0xf8/0xf8
2018-03-29T19:24:45.205798-07:00 WARNING kernel: [ 1635.822428] [<ffffffffc03df5da>] vhost_dev_set_owner+0xf0/0x1b0 [vhost]
2018-03-29T19:24:45.205799-07:00 WARNING kernel: [ 1635.822432] [<ffffffffc03e101a>] ? vhost_disable_notify+0x4e/0x4e [vhost]
2018-03-29T19:24:45.205799-07:00 WARNING kernel: [ 1635.822436] [<ffffffffc03e11db>] vhost_dev_ioctl+0x39/0x419 [vhost]
2018-03-29T19:24:45.205800-07:00 WARNING kernel: [ 1635.822438] [<ffffffffa6d5adfa>] ? __might_sleep+0x34/0x7b
2018-03-29T19:24:45.205801-07:00 WARNING kernel: [ 1635.822442] [<ffffffffc043029b>] vhost_vsock_dev_ioctl+0x182/0x384 [vhost_vsock]
2018-03-29T19:24:45.205801-07:00 WARNING kernel: [ 1635.822446] [<ffffffffa6e80606>] SyS_ioctl+0x100/0x24e
2018-03-29T19:24:45.205802-07:00 WARNING kernel: [ 1635.822449] [<ffffffffa73ec313>] entry_SYSCALL_64_fastpath+0x27/0x9b

The second happens when we try to release the handle:

2018-03-29T19:24:20.135909-07:00 ERR kernel: [ 1610.751801] BUG: sleeping function called from invalid context at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.121-r1424/work/chromeos-kernel-4_4-4.4.121/kernel/sched/completion.c:90
2018-03-29T19:24:20.135916-07:00 ERR kernel: [ 1610.751807] in_atomic(): 0, irqs_disabled(): 1, pid: 11210, name: vhost_vsock
2018-03-29T19:24:20.135918-07:00 WARNING kernel: [ 1610.751811] CPU: 2 PID: 11210 Comm: vhost_vsock Tainted: G W 4.4.121-13277-gf2a506858314 #2
2018-03-29T19:24:20.135919-07:00 WARNING kernel: [ 1610.751814] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
2018-03-29T19:24:20.135920-07:00 WARNING kernel: [ 1610.751816] 0000000000000000 5f10ba78695b55d2 ffff880400823a38 ffffffffa6f2ef3d
2018-03-29T19:24:20.135920-07:00 WARNING kernel: [ 1610.751823] ffff880400823a70 ffffffffa6f2eef2 0000000000000000 0000003000000028
2018-03-29T19:24:20.135921-07:00 WARNING kernel: [ 1610.751828] 0000000000000086 5f10ba78695b55d2 ffff88006d1e0e40 ffff880400823a98
2018-03-29T19:24:20.135922-07:00 WARNING kernel: [ 1610.751833] Call Trace:
2018-03-29T19:24:20.135923-07:00 WARNING kernel: [ 1610.751839] [<ffffffffa6f2ef3d>] __dump_stack+0x19/0x1b
2018-03-29T19:24:20.135924-07:00 WARNING kernel: [ 1610.751845] [<ffffffffa6f2eef2>] dump_stack+0x4f/0x81
2018-03-29T19:24:20.135928-07:00 WARNING kernel: [ 1610.751855] [<ffffffffa6d5af75>] ___might_sleep+0x134/0x159
2018-03-29T19:24:20.135929-07:00 WARNING kernel: [ 1610.751859] [<ffffffffa6d5adfa>] __might_sleep+0x34/0x7b
2018-03-29T19:24:20.135930-07:00 WARNING kernel: [ 1610.751871] [<ffffffffa73e8f71>] wait_for_common+0x2a/0x63
2018-03-29T19:24:20.135931-07:00 WARNING kernel: [ 1610.751874] [<ffffffffa73e8f45>] wait_for_completion+0x1b/0x1d
2018-03-29T19:24:20.135931-07:00 WARNING kernel: [ 1610.751877] [<ffffffffa6cc7d0b>] kthread_stop+0xb9/0x154
2018-03-29T19:24:20.135932-07:00 WARNING kernel: [ 1610.751882] [<ffffffffc03dfa19>] vhost_dev_cleanup+0x189/0x1bc [vhost]
2018-03-29T19:24:20.135933-07:00 WARNING kernel: [ 1610.751885] [<ffffffffc04306b6>] vhost_vsock_dev_release+0xd4/0xee [vhost_vsock]
2018-03-29T19:24:20.135934-07:00 WARNING kernel: [ 1610.751889] [<ffffffffa6e7a9d0>] __fput+0xe2/0x1e0
2018-03-29T19:24:20.135935-07:00 WARNING kernel: [ 1610.751892] [<ffffffffa6e7a8ec>] ____fput+0xe/0x10
2018-03-29T19:24:20.135936-07:00 WARNING kernel: [ 1610.751894] [<ffffffffa6d581f0>] task_work_run+0x69/0xac
2018-03-29T19:24:20.135936-07:00 WARNING kernel: [ 1610.751897] [<ffffffffa6cb0e43>] do_exit+0x25d/0x7d2
2018-03-29T19:24:20.135937-07:00 WARNING kernel: [ 1610.751899] [<ffffffffa6cb148b>] do_group_exit+0x97/0xa6
2018-03-29T19:24:20.135938-07:00 WARNING kernel: [ 1610.751902] [<ffffffffa6cba148>] get_signal+0x135/0x56e
2018-03-29T19:24:20.135939-07:00 WARNING kernel: [ 1610.751904] [<ffffffffa6c0955f>] ? queue_work+0x18/0x1a
2018-03-29T19:24:20.135940-07:00 WARNING kernel: [ 1610.751907] [<ffffffffa6c4462a>] do_signal+0x35/0x573
2018-03-29T19:24:20.135940-07:00 WARNING kernel: [ 1610.751909] [<ffffffffa6d6a257>] ? __wake_up_common+0x54/0x78
2018-03-29T19:24:20.135941-07:00 WARNING kernel: [ 1610.751912] [<ffffffffa73ebc90>] ? _raw_spin_unlock_irq+0x1f/0x53
2018-03-29T19:24:20.135942-07:00 WARNING kernel: [ 1610.751914] [<ffffffffa6e04376>] ? eventfd_write+0x1b7/0x1dd
2018-03-29T19:24:20.135942-07:00 WARNING kernel: [ 1610.751916] [<ffffffffa6d79c78>] ? seccomp_phase1+0x86/0x2e2
2018-03-29T19:24:20.135943-07:00 WARNING kernel: [ 1610.751919] [<ffffffffa6ca83e6>] prepare_exit_to_usermode+0x80/0x89
2018-03-29T19:24:20.135944-07:00 WARNING kernel: [ 1610.751921] [<ffffffffa6ca842c>] syscall_return_slowpath+0x3d/0x154
2018-03-29T19:24:20.135945-07:00 WARNING kernel: [ 1610.751923] [<ffffffffa6e820ed>] ? SyS_ppoll+0x1e6/0x262
2018-03-29T19:24:20.135945-07:00 WARNING kernel: [ 1610.751926] [<ffffffffa73ec494>] int_ret_from_sys_call+0x25/0x93
2018-03-29T19:24:20.191086-07:00 INFO localhos[11183]: crosvm[6]: [src/main.rs:408] crosvm has exited normally

Adding to the weirdness: vm_syslog from inside the VM has no problems connecting to vmlog_forwarder on the host (this is how we're still getting VM logs), and concierge has no problems sending maitre'd a Shutdown RPC. It's only maitre'd that has trouble connecting to concierge when it first starts up.
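The common thread in both traces is the "sleeping function called from invalid context" check firing with irqs_disabled(): 1, first inside the kmalloc in vhost_dev_set_owner during the VHOST_SET_OWNER ioctl, and second inside kthread_stop via vhost_dev_cleanup when the device file is released. For reference, the userspace sequence that reaches those two kernel paths is roughly the following; this is an illustrative sketch of the syscall path, assuming Linux uapi headers that define the vhost-vsock ioctls, not a verified reproducer for the warnings, and the CID value is an assumption.

// Sketch: the two kernel paths in the traces correspond to an ioctl on
// /dev/vhost-vsock and to releasing the device fd.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vhost.h>
#include <cstdint>
#include <cstdio>

int main() {
  int vhost_fd = open("/dev/vhost-vsock", O_RDWR);
  if (vhost_fd < 0) {
    perror("open /dev/vhost-vsock");
    return 1;
  }

  // Assign a guest CID that a previous (now shut down) VM had used.
  uint64_t guest_cid = 3;
  if (ioctl(vhost_fd, VHOST_VSOCK_SET_GUEST_CID, &guest_cid) < 0)
    perror("VHOST_VSOCK_SET_GUEST_CID");

  // vhost_vsock_dev_ioctl -> vhost_dev_ioctl -> vhost_dev_set_owner is the
  // call chain in the first trace; the kmalloc inside set_owner is where the
  // "sleeping function called from invalid context" warning fires.
  if (ioctl(vhost_fd, VHOST_SET_OWNER) < 0)
    perror("VHOST_SET_OWNER");

  // Dropping the last reference runs vhost_vsock_dev_release ->
  // vhost_dev_cleanup -> kthread_stop, the path in the second trace (there it
  // happens via __fput when the crosvm process exits).
  close(vhost_fd);
  return 0;
}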
Mar 31 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform2/+/e154b2b89b12a1733e92844a91fb330b0a3bba5a

commit e154b2b89b12a1733e92844a91fb330b0a3bba5a
Author: Chirantan Ekbote <chirantan@chromium.org>
Date: Sat Mar 31 02:46:58 2018

vm_tools: concierge: Persist used vsock cids

There is some state in the vhost_vsock kernel subsystem that is not being cleaned up properly when the crosvm process exits. This means that if a user logs out and logs back in, their first few attempts to start a VM will fail because we reuse vsock context ids.

Save the next context id in a file in /run so that we will never reuse context ids when vm_concierge restarts. This change should be reverted once we fix the underlying problem in the kernel.

BUG=chromium:821478
TEST='vmc start gentoo', 'stop vm_concierge', 'vmc start gentoo', log out, log back in, 'vmc start gentoo' works. Also manually inspecting the file with 'xxd /run/vm/next_cid' to ensure that the values are being properly persisted.

Change-Id: Id4d13cfe5247cb79a44f63df69127c523db3d3cd
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/987389
Reviewed-by: Stephen Barber <smbarber@chromium.org>

[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/host.gypi
[add] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/vsock_cid_pool.cc
[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/service.cc
[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/vsock_cid_pool.h
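The workaround keeps the next cid monotonically increasing across vm_concierge restarts by persisting it under /run. A rough sketch of that idea is below; it is not the actual VsockCidPool implementation, and the class name and on-disk format here are assumptions (the TEST line's 'xxd /run/vm/next_cid' only suggests a small binary file).

// Sketch of persisting the "next CID" counter so a restart never reissues a
// previously used CID. Illustrative assumptions: raw little-endian uint64 on
// disk at the given path.
#include <cstdint>
#include <fstream>
#include <string>

class PersistentCidPool {
 public:
  explicit PersistentCidPool(const std::string& path) : path_(path) {
    std::ifstream in(path_, std::ios::binary);
    // Fall back to 3 (the first guest CID) if the file is missing or invalid.
    if (!in.read(reinterpret_cast<char*>(&next_cid_), sizeof(next_cid_)) ||
        next_cid_ < 3) {
      next_cid_ = 3;
    }
  }

  // Hands out the next CID and persists the updated counter.
  uint64_t Allocate() {
    uint64_t cid = next_cid_++;
    std::ofstream out(path_, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&next_cid_), sizeof(next_cid_));
    return cid;
  }

 private:
  std::string path_;
  uint64_t next_cid_ = 3;
};

// Usage: PersistentCidPool pool("/run/vm/next_cid"); uint64_t cid = pool.Allocate();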
May 10 2018
May 14 2018
chirantan, does this still happen? sonny, can you take a look if it does? Thanks.
May 14 2018
I haven't seen this because we just don't re-use CIDs right now; they just keep incrementing. I can revert that CL in comment #4 and see if it happens then, but I'm guessing it does.
Jul 27
<triage>What's the status of this?</triage>
Jul 27
Haven't looked at it, but it's not a problem currently because of the workaround -- will bump the priority down.
Comment 1 by chirantan@chromium.org, Mar 13 2018