crosvm slow to start writing lots of data |
|||
Issue descriptionStarting a tatl image which has been used many times and has a number of containers is taking a minute writing 10MB/s for the majority of that time. localhost ~ # time concierge_client --start_termina_vm --cryptohome_id=f66ea3860115aa350cc20760139aea05625a4169 --name=tatl [INFO:client.cc(670)] Starting Termina VM 'tatl' [INFO:client.cc(371)] Creating disk image [INFO:client.cc(759)] Started Termina VM with [INFO:client.cc(760)] ip address: 100.115.92.6 [INFO:client.cc(761)] vsock cid: 13 [INFO:client.cc(762)] process id: 130 [INFO:client.cc(763)] seneschal server handle: 11 real 1m3.901s user 0m0.007s sys 0m0.002s Snippet from iostat -xc 10 on host: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0 0.00 162.90 0.00 419.90 0.00 11810.80 56.26 1.67 3.98 0.00 3.98 0.75 31.45 mmcblk0rpmb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0boot1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0boot0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 zram0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 17.85 0.00 3.17 6.57 0.00 72.40 qcow2 file is 23G: localhost /mnt/stateful_partition/home/.shadow/f66ea3860115aa350cc20760139aea056 25a4169/mount/root/crosvm # ls -l total 23256296 -rw-rw-r--. 1 crosvm crosvm 14024704 Apr 4 2018 'YW1kNjQ=.qcow2' -rw-rw-r--. 1 crosvm crosvm 196607 Apr 17 11:54 'YmV0dHk=.qcow2' -rw-rw-r--. 1 crosvm crosvm 78679179264 Oct 4 17:26 'dGF0bA==.qcow2' -rw-rw-r--. 1 crosvm crosvm 13762560 Apr 4 2018 'dGF0bG0=.qcow2' localhost /mnt/stateful_partition/home/.shadow/f66ea3860115aa350cc20760139aea056 25a4169/mount/root/crosvm # du -hs * 13M YW1kNjQ=.qcow2 12K YmV0dHk=.qcow2 23G dGF0bA==.qcow2 13M dGF0bG0=.qcow2
,
Oct 5
No noticeable difference. localhost /mnt/stateful_partition/home/.shadow/f66ea3860115aa350cc20760139aea056 25a4169/mount/root/crosvm # time concierge_client --start_termina_vm --cryptohome_id=f66ea3860115aa350cc20760139aea05625a4169 --name=tatl [INFO:client.cc(670)] Starting Termina VM 'tatl' [INFO:client.cc(371)] Creating disk image [INFO:client.cc(759)] Started Termina VM with [INFO:client.cc(760)] ip address: 100.115.92.6 [INFO:client.cc(761)] vsock cid: 14 [INFO:client.cc(762)] process id: 146 [INFO:client.cc(763)] seneschal server handle: 12 real 1m4.857s user 0m0.006s sys 0m0.004s Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0 0.00 286.40 2.20 359.30 11.20 9528.40 52.78 0.96 2.63 0.18 2.64 0.84 30.40 mmcblk0rpmb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0boot1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 mmcblk0boot0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 zram0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 avg-cpu: %user %nice %system %iowait %steal %idle 17.42 0.00 4.22 6.69 0.00 71.68
,
Oct 8
,
Oct 8
<triage>Assigning to owner</triage>
,
Oct 9
My VM is now non-booting which appears to be from a timeout. If I backed out a whole string of qcow changes, things boot again: 80b27f3 (HEAD -> xwayland-gpu2) Revert "qcow: Cache address and refcount tables" 29c5990 Revert "qcow: Add accessors for more file state" be81a27 Revert "qcow: buffer reads and writes in QcowRawFile" 3a082d1 Revert "qcow: optimize sync_caches to avoid extra writes" ea5c624 Revert "qcow: track deallocated clusters as unreferenced" b1570f2 qcow: track deallocated clusters as unreferenced 46f70e8 qcow: optimize sync_caches to avoid extra writes eaaf3e1 qcow: buffer reads and writes in QcowRawFile 39401ff qcow: Add accessors for more file state 32e17bc qcow: Cache address and refcount tables 80b27f3 - boots okay localhost /mnt/stateful_partition # concierge_client --start_termina_vm --cryptohome_id=f66ea3860115aa350cc20760139aea05625a4169 --name=tatl [INFO:client.cc(670)] Starting Termina VM 'tatl' [INFO:client.cc(371)] Creating disk image [INFO:client.cc(759)] Started Termina VM with [INFO:client.cc(760)] ip address: 100.115.92.6 [INFO:client.cc(761)] vsock cid: 7 [INFO:client.cc(762)] process id: 68 [INFO:client.cc(763)] seneschal server handle: 5 b1570f2 - fails to boot localhost /mnt/stateful_partition # concierge_client --start_termina_vm --cryptohome_id=f66ea3860115aa350cc20760139aea05625a4169 --name=tatl [INFO:client.cc(670)] Starting Termina VM 'tatl' [INFO:client.cc(371)] Creating disk image [ERROR:object_proxy.cc(580)] Failed to call method: org.chromium.VmConcierge.StartVm: object_path= /org/chromium/VmConcierge: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. [ERROR:client.cc(739)] Failed to send dbus message to concierge service
,
Oct 9
I'm copying my qcow2 image up for debugging, but I think regardless of what we turn out we might want to increase the timeout if we can.
,
Nov 27
Just checking in on this - has it reproduced after switching to a raw block device? I believe the "writing lots of data" part of the original description is probably caused by I/O to the VM disk image filling up the host kernel's page cache and causing writeback of dirty pages, so I don't think that is specific to qcow. I investigated opening the disk images with O_DIRECT, which would cause I/O to bypass the host page cache, but it's not easily possible to do this in general because our disk emulation is exposing 512-byte sectors, whereas the host may require 4096-byte aligned I/O when opened with O_DIRECT. Using O_DIRECT would also potentially regress performance, since we lose any benefit of the host page cache. (See https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/1324772 for the WIP O_DIRECT implementation.)
,
Nov 27
I haven't run into these issues recently, but: a) I haven't been restarting my VM as much recently b) I don't think I've installed everything in my VM that I had before |
|||
►
Sign in to add a comment |
|||
Comment 1 by dverkamp@google.com
, Oct 5