LXD 3.0.2 sometimes fails to add image to dqlite |
||
Issue descriptionThis shows up in vm.CrostiniStartEverything after the container is downloaded. Failed to set up default container: failed to create container: status: 4 reason: UNIQUE constraint failed: images.fingerprint I believe this is the sequence of events: 1) LXD downloads the image from the simplestreams remote. 2) The guest kernel begins flushing the buffer cache to disk (virtio-blk) 3) LXD attempts to add the image to its dqlite db: https://github.com/lxc/lxd/blob/master/lxd/daemon_images.go#L548 4) The underlying sqlite write is now blocked behind all of the image writes. go-dqlite will time out on executing its transaction (the timeout is 5 seconds). 5) go-dqlite assumes the connection is broken (ErrBadConn), and go's sql package will shut down and restart the connection. 6) LXD notices the ErrBadConn error and attempts the transaction again. https://github.com/lxc/lxd/blob/4ab8394f6364daf2c875990f646ae02ecc951139/lxd/db/query/retry.go#L16 The transaction actually did succeed, though, and attempting to INSERT again causes "UNIQUE constraint failed: images.fingerprint" I believe we can split this into a few bugs. * I think the retry logic in LXD needs some rework. INSERT isn't an idempotent operation, so maybe replacing those with INSERT OR REPLACE or UPSERT would let the retries work consistently. I'll file this upstream. * crosvm should run virtio-blk I/O asynchronously to prevent problems like the dqlite hang. This is allowed by the virtio-blk spec. "The driver queues requests to the virtqueue, and they are used by the device (not necessarily in order)." * workaround for now - increase go-dqlite timeouts. I'll track that in this bug.
,
Nov 1
Upstream issue for retry logic filed: https://github.com/lxc/lxd/issues/5234 Filed issue 901139 for out-of-order support in virtio-blk for crosvm.
,
Nov 2
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/80f4155396d7a4bd5e95987e83a10df6d83ff7fd commit 80f4155396d7a4bd5e95987e83a10df6d83ff7fd Author: Stephen Barber <smbarber@chromium.org> Date: Fri Nov 02 07:26:27 2018 app-emulation/lxd: increase dqlite timeouts BUG= chromium:901108 TEST=vm.CrostiniStartEverything succeeds over 100 times on kevin Change-Id: Iaf1a6d5a8158cffae1ae2a081c4661c317fa4d80 Reviewed-on: https://chromium-review.googlesource.com/1313771 Commit-Ready: Stephen Barber <smbarber@chromium.org> Tested-by: Stephen Barber <smbarber@chromium.org> Reviewed-by: Chirantan Ekbote <chirantan@chromium.org> [rename] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/lxd-3.0.2-r3.ebuild [add] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/files/lxd-3.0.2-dqlite-timeout.patch [modify] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/lxd-3.0.2.ebuild
,
Nov 2
Verified on component 11220.0.0. |
||
►
Sign in to add a comment |
||
Comment 1 by smbar...@chromium.org
, Nov 1