
Issue 821478

Starred by 6 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

reusing vsock cids causes gRPC from VM to host to fail

Project Member Reported by smbar...@chromium.org, Mar 13 2018

Issue description

Once a vsock cid has been used by a VM, reusing it after a concierge restart can spuriously fail. This seems to be a host kernel issue.

Steps to repro:

1) Start a few VMs (expected cids: 3, 4, 5, ...)
2) Shut down concierge. All VMs will be shut down.
3) Start concierge and attempt to start the VMs again (expected cids: 3, 4, 5, ...). Some VMs will not be able to notify concierge that they've started up, even though the VM that previously occupied the cid has been shut down.
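
To make "using a cid" concrete, here is a minimal sketch of how a VMM claims a vsock cid on the host. This is illustrative C I wrote for this bug writeup, not crosvm's actual code (crosvm is Rust); /dev/vhost-vsock and both ioctls are real kernel interfaces, while the claim_cid helper is hypothetical. These two vhost calls are the ones that show up in the kernel traces below.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

int claim_cid(uint64_t cid)
{
	int fd = open("/dev/vhost-vsock", O_RDWR);
	if (fd < 0) {
		perror("open /dev/vhost-vsock");
		return -1;
	}

	/* Bind the vhost device to this process, then assign the guest cid.
	 * The first trace below fires under the VHOST_SET_OWNER path. */
	if (ioctl(fd, VHOST_SET_OWNER) < 0 ||
	    ioctl(fd, VHOST_VSOCK_SET_GUEST_CID, &cid) < 0) {
		perror("ioctl");
		close(fd);	/* release path is where the second trace fires */
		return -1;
	}
	return fd;	/* closing this fd is what should free the cid */
}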
 
Cc: dtor@chromium.org
+dtor in case he has some ideas about what's happening here
Labels: Proj-Containers
Interestingly, there are two separate kernel BUGs that may be related to this. The first happens when we try to set the owner of a reused cid:

2018-03-29T19:24:45.205756-07:00 ERR kernel: [ 1635.822286] BUG: sleeping function called from invalid context at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.121-r1424/work/chromeos-kernel-4_4-4.4.121/mm/page_alloc.c:3279
2018-03-29T19:24:45.205769-07:00 ERR kernel: [ 1635.822292] in_atomic(): 0, irqs_disabled(): 1, pid: 12325, name: vhost_vsock
2018-03-29T19:24:45.205770-07:00 WARNING kernel: [ 1635.822296] CPU: 1 PID: 12325 Comm: vhost_vsock Tainted: G        W       4.4.121-13277-gf2a506858314 #2
2018-03-29T19:24:45.205771-07:00 WARNING kernel: [ 1635.822302] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
2018-03-29T19:24:45.205772-07:00 WARNING kernel: [ 1635.822316]  0000000000000000 facd550212af6408 ffff880075de7ab0 ffffffffa6f2ef3d
2018-03-29T19:24:45.205773-07:00 WARNING kernel: [ 1635.822327]  ffff880075de7ae8 ffffffffa6f2eef2 0000003000000028 ffff880075de7af8
2018-03-29T19:24:45.205773-07:00 WARNING kernel: [ 1635.822332]  0000000000000086 facd550212af6408 ffff88042793c740 ffff880075de7b10
2018-03-29T19:24:45.205774-07:00 WARNING kernel: [ 1635.822337] Call Trace:
2018-03-29T19:24:45.205775-07:00 WARNING kernel: [ 1635.822344]  [<ffffffffa6f2ef3d>] __dump_stack+0x19/0x1b
2018-03-29T19:24:45.205776-07:00 WARNING kernel: [ 1635.822347]  [<ffffffffa6f2eef2>] dump_stack+0x4f/0x81
2018-03-29T19:24:45.205777-07:00 WARNING kernel: [ 1635.822350]  [<ffffffffa6d5af75>] ___might_sleep+0x134/0x159
2018-03-29T19:24:45.205777-07:00 WARNING kernel: [ 1635.822353]  [<ffffffffa6d5adfa>] __might_sleep+0x34/0x7b
2018-03-29T19:24:45.205778-07:00 WARNING kernel: [ 1635.822356]  [<ffffffffa6db2693>] __alloc_pages_nodemask+0x94/0x1d14
2018-03-29T19:24:45.205779-07:00 WARNING kernel: [ 1635.822360]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205780-07:00 WARNING kernel: [ 1635.822363]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205780-07:00 WARNING kernel: [ 1635.822365]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205781-07:00 WARNING kernel: [ 1635.822367]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205782-07:00 WARNING kernel: [ 1635.822370]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205783-07:00 WARNING kernel: [ 1635.822372]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205783-07:00 WARNING kernel: [ 1635.822374]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205784-07:00 WARNING kernel: [ 1635.822376]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205785-07:00 WARNING kernel: [ 1635.822379]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205785-07:00 WARNING kernel: [ 1635.822381]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205786-07:00 WARNING kernel: [ 1635.822383]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205787-07:00 WARNING kernel: [ 1635.822385]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205788-07:00 WARNING kernel: [ 1635.822388]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205788-07:00 WARNING kernel: [ 1635.822390]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205789-07:00 WARNING kernel: [ 1635.822392]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205790-07:00 WARNING kernel: [ 1635.822395]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205791-07:00 WARNING kernel: [ 1635.822397]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205791-07:00 WARNING kernel: [ 1635.822400]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205792-07:00 WARNING kernel: [ 1635.822402]  [<ffffffffa73e7eb0>] ? __schedule+0x40b/0x9fd
2018-03-29T19:24:45.205793-07:00 WARNING kernel: [ 1635.822404]  [<ffffffffa73e7ebc>] ? __schedule+0x417/0x9fd
2018-03-29T19:24:45.205793-07:00 WARNING kernel: [ 1635.822407]  [<ffffffffa6db4362>] alloc_pages_node+0x18/0x1a
2018-03-29T19:24:45.205794-07:00 WARNING kernel: [ 1635.822410]  [<ffffffffa6d80bf8>] alloc_kmem_pages+0x18/0x1a
2018-03-29T19:24:45.205795-07:00 WARNING kernel: [ 1635.822413]  [<ffffffffa6d91d1e>] kmalloc_order+0x18/0x38
2018-03-29T19:24:45.205796-07:00 WARNING kernel: [ 1635.822416]  [<ffffffffa6d91d6f>] kmalloc_order_trace+0x31/0xed
2018-03-29T19:24:45.205796-07:00 WARNING kernel: [ 1635.822419]  [<ffffffffa6dcf80c>] __kmalloc+0x126/0x169
2018-03-29T19:24:45.205797-07:00 WARNING kernel: [ 1635.822422]  [<ffffffffa6d6a53d>] ? prepare_to_wait_event+0xf8/0xf8
2018-03-29T19:24:45.205798-07:00 WARNING kernel: [ 1635.822428]  [<ffffffffc03df5da>] vhost_dev_set_owner+0xf0/0x1b0 [vhost]
2018-03-29T19:24:45.205799-07:00 WARNING kernel: [ 1635.822432]  [<ffffffffc03e101a>] ? vhost_disable_notify+0x4e/0x4e [vhost]
2018-03-29T19:24:45.205799-07:00 WARNING kernel: [ 1635.822436]  [<ffffffffc03e11db>] vhost_dev_ioctl+0x39/0x419 [vhost]
2018-03-29T19:24:45.205800-07:00 WARNING kernel: [ 1635.822438]  [<ffffffffa6d5adfa>] ? __might_sleep+0x34/0x7b
2018-03-29T19:24:45.205801-07:00 WARNING kernel: [ 1635.822442]  [<ffffffffc043029b>] vhost_vsock_dev_ioctl+0x182/0x384 [vhost_vsock]
2018-03-29T19:24:45.205801-07:00 WARNING kernel: [ 1635.822446]  [<ffffffffa6e80606>] SyS_ioctl+0x100/0x24e
2018-03-29T19:24:45.205802-07:00 WARNING kernel: [ 1635.822449]  [<ffffffffa73ec313>] entry_SYSCALL_64_fastpath+0x27/0x9b
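
Note that the splat reports in_atomic(): 0 but irqs_disabled(): 1; the code is not in an atomic section, yet someone has left local interrupts disabled while calling an allocator that may sleep. A hypothetical minimal module (my illustration, not code from this bug) that reproduces the same class of splat on a kernel built with CONFIG_DEBUG_ATOMIC_SLEEP:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/irqflags.h>

/* A GFP_KERNEL allocation large enough to go through kmalloc_order() ->
 * __alloc_pages_nodemask(), attempted with local interrupts disabled,
 * matching the call path in the first trace. */
static int __init splat_demo_init(void)
{
	unsigned long flags;
	void *buf;

	local_irq_save(flags);			/* irqs_disabled() == 1 */
	buf = kmalloc(16 * 1024, GFP_KERNEL);	/* may sleep -> BUG splat */
	local_irq_restore(flags);

	kfree(buf);
	return 0;
}

static void __exit splat_demo_exit(void) { }

module_init(splat_demo_init);
module_exit(splat_demo_exit);
MODULE_LICENSE("GPL");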


The second happens when we try to release the handle:

2018-03-29T19:24:20.135909-07:00 ERR kernel: [ 1610.751801] BUG: sleeping function called from invalid context at ../../../../../tmp/portage/sys-kernel/chromeos-kernel-4_4-4.4.121-r1424/work/chromeos-kernel-4_4-4.4.121/kernel/sched/completion.c:90
2018-03-29T19:24:20.135916-07:00 ERR kernel: [ 1610.751807] in_atomic(): 0, irqs_disabled(): 1, pid: 11210, name: vhost_vsock
2018-03-29T19:24:20.135918-07:00 WARNING kernel: [ 1610.751811] CPU: 2 PID: 11210 Comm: vhost_vsock Tainted: G        W       4.4.121-13277-gf2a506858314 #2
2018-03-29T19:24:20.135919-07:00 WARNING kernel: [ 1610.751814] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.95.0 09/27/2017
2018-03-29T19:24:20.135920-07:00 WARNING kernel: [ 1610.751816]  0000000000000000 5f10ba78695b55d2 ffff880400823a38 ffffffffa6f2ef3d
2018-03-29T19:24:20.135920-07:00 WARNING kernel: [ 1610.751823]  ffff880400823a70 ffffffffa6f2eef2 0000000000000000 0000003000000028
2018-03-29T19:24:20.135921-07:00 WARNING kernel: [ 1610.751828]  0000000000000086 5f10ba78695b55d2 ffff88006d1e0e40 ffff880400823a98
2018-03-29T19:24:20.135922-07:00 WARNING kernel: [ 1610.751833] Call Trace:
2018-03-29T19:24:20.135923-07:00 WARNING kernel: [ 1610.751839]  [<ffffffffa6f2ef3d>] __dump_stack+0x19/0x1b
2018-03-29T19:24:20.135924-07:00 WARNING kernel: [ 1610.751845]  [<ffffffffa6f2eef2>] dump_stack+0x4f/0x81
2018-03-29T19:24:20.135928-07:00 WARNING kernel: [ 1610.751855]  [<ffffffffa6d5af75>] ___might_sleep+0x134/0x159
2018-03-29T19:24:20.135929-07:00 WARNING kernel: [ 1610.751859]  [<ffffffffa6d5adfa>] __might_sleep+0x34/0x7b
2018-03-29T19:24:20.135930-07:00 WARNING kernel: [ 1610.751871]  [<ffffffffa73e8f71>] wait_for_common+0x2a/0x63
2018-03-29T19:24:20.135931-07:00 WARNING kernel: [ 1610.751874]  [<ffffffffa73e8f45>] wait_for_completion+0x1b/0x1d
2018-03-29T19:24:20.135931-07:00 WARNING kernel: [ 1610.751877]  [<ffffffffa6cc7d0b>] kthread_stop+0xb9/0x154
2018-03-29T19:24:20.135932-07:00 WARNING kernel: [ 1610.751882]  [<ffffffffc03dfa19>] vhost_dev_cleanup+0x189/0x1bc [vhost]
2018-03-29T19:24:20.135933-07:00 WARNING kernel: [ 1610.751885]  [<ffffffffc04306b6>] vhost_vsock_dev_release+0xd4/0xee [vhost_vsock]
2018-03-29T19:24:20.135934-07:00 WARNING kernel: [ 1610.751889]  [<ffffffffa6e7a9d0>] __fput+0xe2/0x1e0
2018-03-29T19:24:20.135935-07:00 WARNING kernel: [ 1610.751892]  [<ffffffffa6e7a8ec>] ____fput+0xe/0x10
2018-03-29T19:24:20.135936-07:00 WARNING kernel: [ 1610.751894]  [<ffffffffa6d581f0>] task_work_run+0x69/0xac
2018-03-29T19:24:20.135936-07:00 WARNING kernel: [ 1610.751897]  [<ffffffffa6cb0e43>] do_exit+0x25d/0x7d2
2018-03-29T19:24:20.135937-07:00 WARNING kernel: [ 1610.751899]  [<ffffffffa6cb148b>] do_group_exit+0x97/0xa6
2018-03-29T19:24:20.135938-07:00 WARNING kernel: [ 1610.751902]  [<ffffffffa6cba148>] get_signal+0x135/0x56e
2018-03-29T19:24:20.135939-07:00 WARNING kernel: [ 1610.751904]  [<ffffffffa6c0955f>] ? queue_work+0x18/0x1a
2018-03-29T19:24:20.135940-07:00 WARNING kernel: [ 1610.751907]  [<ffffffffa6c4462a>] do_signal+0x35/0x573
2018-03-29T19:24:20.135940-07:00 WARNING kernel: [ 1610.751909]  [<ffffffffa6d6a257>] ? __wake_up_common+0x54/0x78
2018-03-29T19:24:20.135941-07:00 WARNING kernel: [ 1610.751912]  [<ffffffffa73ebc90>] ? _raw_spin_unlock_irq+0x1f/0x53
2018-03-29T19:24:20.135942-07:00 WARNING kernel: [ 1610.751914]  [<ffffffffa6e04376>] ? eventfd_write+0x1b7/0x1dd
2018-03-29T19:24:20.135942-07:00 WARNING kernel: [ 1610.751916]  [<ffffffffa6d79c78>] ? seccomp_phase1+0x86/0x2e2
2018-03-29T19:24:20.135943-07:00 WARNING kernel: [ 1610.751919]  [<ffffffffa6ca83e6>] prepare_exit_to_usermode+0x80/0x89
2018-03-29T19:24:20.135944-07:00 WARNING kernel: [ 1610.751921]  [<ffffffffa6ca842c>] syscall_return_slowpath+0x3d/0x154
2018-03-29T19:24:20.135945-07:00 WARNING kernel: [ 1610.751923]  [<ffffffffa6e820ed>] ? SyS_ppoll+0x1e6/0x262
2018-03-29T19:24:20.135945-07:00 WARNING kernel: [ 1610.751926]  [<ffffffffa73ec494>] int_ret_from_sys_call+0x25/0x93
2018-03-29T19:24:20.191086-07:00 INFO localhos[11183]: crosvm[6]: [src/main.rs:408] crosvm has exited normally
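
The second trace is the same class of problem on the teardown side: vhost_dev_cleanup() calls kthread_stop() on the vhost worker, kthread_stop() waits on a completion (which may sleep), and local interrupts are again disabled. A hypothetical module sketch of just that pattern (again my illustration, only for observing the warning on a CONFIG_DEBUG_ATOMIC_SLEEP kernel; stopping a thread with interrupts off is itself broken):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>
#include <linux/irqflags.h>

static int idle_worker(void *unused)
{
	while (!kthread_should_stop())
		msleep(10);	/* idle until asked to stop */
	return 0;
}

static int __init splat2_init(void)
{
	unsigned long flags;
	struct task_struct *t = kthread_run(idle_worker, NULL, "splat_worker");

	if (IS_ERR(t))
		return PTR_ERR(t);

	local_irq_save(flags);	/* irqs_disabled() == 1 */
	kthread_stop(t);	/* wait_for_completion() -> might_sleep splat */
	local_irq_restore(flags);
	return 0;
}

static void __exit splat2_exit(void) { }

module_init(splat2_init);
module_exit(splat2_exit);
MODULE_LICENSE("GPL");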
Adding to the weirdness: vm_syslog inside the VM has no problem connecting to vmlog_forwarder on the host (this is how we're still getting VM logs), and concierge has no problem sending maitre'd a Shutdown RPC. It's only maitre'd that has trouble connecting to concierge when it first starts up.
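
For reference, the failing direction at the socket level is a plain AF_VSOCK connect from the guest to the host. A minimal sketch (illustrative C, not maitre'd's actual code; connect_to_host and the port are hypothetical) of what maitre'd effectively does when it tries to reach concierge:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int connect_to_host(unsigned int port)
{
	struct sockaddr_vm addr;
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return -1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = VMADDR_CID_HOST;	/* the host side is always cid 2 */
	addr.svm_port = port;		/* hypothetical service port */

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");	/* the symptom hit on a reused cid */
		close(fd);
		return -1;
	}
	return fd;
}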
Comment 4 by bugdroid1@chromium.org (Project Member), Mar 31 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/e154b2b89b12a1733e92844a91fb330b0a3bba5a

commit e154b2b89b12a1733e92844a91fb330b0a3bba5a
Author: Chirantan Ekbote <chirantan@chromium.org>
Date: Sat Mar 31 02:46:58 2018

vm_tools: concierge: Persist used vsock cids

There is some state in the vhost_vsock kernel subsystem that is not
being cleaned up properly when the crosvm process exits.  This means
that if a user logs out and logs back in, their first few attempts to
start a VM will fail because we reuse vsock context ids.

Save the next context id in a file in /run so that we will never reuse
context ids when vm_concierge restarts.

This change should be reverted once we fix the underlying problem in the
kernel.

BUG=chromium:821478
TEST='vmc start gentoo', 'stop vm_concierge', 'vmc start gentoo',
     log out, log back in, 'vmc start gentoo' works.  Also manually
     inspecting the file with 'xxd /run/vm/next_cid' to ensure that the
     values are being properly persisted.

Change-Id: Id4d13cfe5247cb79a44f63df69127c523db3d3cd
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/987389
Reviewed-by: Stephen Barber <smbarber@chromium.org>

[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/host.gypi
[add] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/vsock_cid_pool.cc
[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/service.cc
[modify] https://crrev.com/e154b2b89b12a1733e92844a91fb330b0a3bba5a/vm_tools/concierge/vsock_cid_pool.h
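
The actual change is C++ (vsock_cid_pool.cc); as a rough C sketch of the workaround's logic (file handling simplified, no locking or fsync; allocate_cid is a hypothetical name), concierge now hands out monotonically increasing cids starting at 3 and persists the next one in /run/vm/next_cid:

#include <stdint.h>
#include <stdio.h>

#define CID_FILE  "/run/vm/next_cid"	/* path from the commit's TEST= line */
#define FIRST_CID 3			/* cids 0-2 are reserved by vsock */

uint32_t allocate_cid(void)
{
	uint32_t next = FIRST_CID;
	FILE *f = fopen(CID_FILE, "rb");

	/* Resume from the persisted counter, falling back to FIRST_CID if
	 * the file is missing, short, or holds a bogus value. */
	if (f) {
		if (fread(&next, sizeof(next), 1, f) != 1 || next < FIRST_CID)
			next = FIRST_CID;
		fclose(f);
	}

	/* Persist the *next* unused cid so a restart never reuses one. */
	f = fopen(CID_FILE, "wb");
	if (f) {
		uint32_t following = next + 1;
		fwrite(&following, sizeof(following), 1, f);
		fclose(f);
	}
	return next;
}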

Labels: Hotlist-Crostini-Platform

Comment 6 by dgreid@chromium.org, May 14 2018

Owner: sonnyrao@chromium.org
Status: Assigned (was: Available)
chirantan, does this still happen?
sonny, can you take a look if it does?

Thanks.
I haven't seen this because we just don't re-use CIDs right now; they just keep incrementing. I can revert the CL in comment #4 and see if it happens then, but I'm guessing it does.
Triage: What's the status of this?
Labels: -Pri-1 Pri-3
Haven't looked at it, but it's not a problem currently because of the workaround -- will bump the priority down.
