New issue
Advanced search Search tips

Issue 901108 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Closed: Nov 2
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

LXD 3.0.2 sometimes fails to add image to dqlite

Project Member Reported by smbar...@chromium.org, Nov 1

Issue description

This shows up in vm.CrostiniStartEverything after the container is downloaded.

Failed to set up default container: failed to create container: status: 4 reason: UNIQUE constraint failed: images.fingerprint

I believe this is the sequence of events:
1) LXD downloads the image from the simplestreams remote.
2) The guest kernel begins flushing the buffer cache to disk (virtio-blk)
3) LXD attempts to add the image to its dqlite db: https://github.com/lxc/lxd/blob/master/lxd/daemon_images.go#L548
4) The underlying sqlite write is now blocked behind all of the image writes. go-dqlite will time out on executing its transaction (the timeout is 5 seconds).
5) go-dqlite assumes the connection is broken (ErrBadConn), and go's sql package will shut down and restart the connection.
6) LXD notices the ErrBadConn error and attempts the transaction again. https://github.com/lxc/lxd/blob/4ab8394f6364daf2c875990f646ae02ecc951139/lxd/db/query/retry.go#L16 The transaction actually did succeed, though, and attempting to INSERT again causes "UNIQUE constraint failed: images.fingerprint"

I believe we can split this into a few bugs.
* I think the retry logic in LXD needs some rework. INSERT isn't an idempotent operation, so maybe replacing those with INSERT OR REPLACE or UPSERT would let the retries work consistently. I'll file this upstream.
* crosvm should run virtio-blk I/O asynchronously to prevent problems like the dqlite hang. This is allowed by the virtio-blk spec. "The driver queues requests to the virtqueue, and they are used by the device (not necessarily in order)."
* workaround for now - increase go-dqlite timeouts. I'll track that in this bug.
 
Relevant logs:

2018-10-30T23:58:30.469625+00:00 INFO VM(4)[17987]:  lxd[172]: lvl=info msg="Dqlite: closing client" t=2018-10-30T23:58:28+0000#012
2018-10-30T23:58:30.469646+00:00 INFO VM(4)[17987]:  lxd[172]: lvl=info msg="Dqlite: handling new connection (fd=26)" t=2018-10-30T23:58:28+0000#012
2018-10-30T23:58:30.469652+00:00 INFO VM(4)[17987]:  lxd[172]: lvl=info msg="Dqlite: connected address=0 attempt=0" t=2018-10-30T23:58:29+0000#012

https://stainless.corp.google.com/search?status=FAIL&status=ERROR&status=ABORT&reason=Failed+to+set+up+default+container%3A+failed+to+create+container%3A+status%3A+4+reason%3A+UNIQUE+constraint+failed%3A+images.fingerprint&exclude_non_release=true&exclude_cts=true&view=list&days=14
Upstream issue for retry logic filed: https://github.com/lxc/lxd/issues/5234

Filed issue 901139 for out-of-order support in virtio-blk for crosvm.
Project Member

Comment 3 by bugdroid1@chromium.org, Nov 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/80f4155396d7a4bd5e95987e83a10df6d83ff7fd

commit 80f4155396d7a4bd5e95987e83a10df6d83ff7fd
Author: Stephen Barber <smbarber@chromium.org>
Date: Fri Nov 02 07:26:27 2018

app-emulation/lxd: increase dqlite timeouts

BUG= chromium:901108 
TEST=vm.CrostiniStartEverything succeeds over 100 times on kevin

Change-Id: Iaf1a6d5a8158cffae1ae2a081c4661c317fa4d80
Reviewed-on: https://chromium-review.googlesource.com/1313771
Commit-Ready: Stephen Barber <smbarber@chromium.org>
Tested-by: Stephen Barber <smbarber@chromium.org>
Reviewed-by: Chirantan Ekbote <chirantan@chromium.org>

[rename] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/lxd-3.0.2-r3.ebuild
[add] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/files/lxd-3.0.2-dqlite-timeout.patch
[modify] https://crrev.com/80f4155396d7a4bd5e95987e83a10df6d83ff7fd/app-emulation/lxd/lxd-3.0.2.ebuild

Status: Verified (was: Started)
Verified on component 11220.0.0.

Sign in to add a comment