Investigate slow texture upload times for crosvm |
||
Issue descriptionSpun off from 896920. glbench shows slow texture upload times with virtio-gpu/crosvm. https://docs.google.com/spreadsheets/d/1soaP06c8DJ01uLV8eaATjh4q_J8gcjPe1dl3KmKri7I/edit#gid=1487276513 On small textures, guest texture upload speed is about 4% that of the host. Analysis of 10k iterations of texture_upload_rgba_teximage2d_32 shows that virgl processing is fast (15-20us), but since the command is fenced, crosvm waits a while longer for virgl to indicate that the fence has been signaled before responding to the descriptor. I have timings that I don't completely trust because I think the act of logging to get the timings affects the results, but I think it's about 10x-15x overhead - a 15us initial call to texture upload ends up taking 150us-225us before it's acked back to the guest. Here's a typical instrumentation (host side) I took which follows a single request. 53 2018-10-24T13:24:28.902309-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:421] Some(TransferToHost3d) -> OkNoData (processed 18.375µs) (39.803µs) fence 714 desc.index 6 - log done at the end of process_descriptor, 39.803us for the entire call, 18.375us to do the first half of decode + process_gpu_command/virgl request. 21 2018-10-24T13:24:28.902330-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:610] signal_used_queue 118 2018-10-24T13:24:28.902448-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:465] fence poll current seq 714 duration 8.529µs - fence signalled by virgl 17 2018-10-24T13:24:28.902465-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:592] add_used index 6 - acknowledge the control descriptor 52 2018-10-24T13:24:28.902517-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:421] Some(TransferToHost3d) -> OkNoData (processed 18.206µs) (39.419µs) fence 715 desc.index 16 22 2018-10-24T13:24:28.902539-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:610] signal_used_queue - signal back to the guest In total, the operation took 270us from the time that the command started getting processed until the descriptor was acked back to the guest. On the guest side, the kernel accepts VIRTGPU_TRANSFER_TO_HOST ioctls fast until the control descriptor queue fills up. At that point there's a call chain that looks like: virtio_gpu_transfer_to_host_ioctl virtio_gpu_cmd_transfer_to_host_3d virtio_gpu_queue_fenced_ctrl_buffer Within virtio_gpu_queue_fenced_ctrl_buffer, the guest kernel blocks waiting on acks until the control descriptor has room for three descriptors. When that ack comes in, the next pending operation gets queued and the cycle continues. 22.000 glbench-411 [002] .... 764333.038079: virtgpu_queue_fenced_ctrl_buffer: duration=248329 num_free=2 seq=712 1.000 glbench-411 [002] .... 764333.038080: virtgpu_transfer_to_host_ioctl: duration=250677 seq=712 163.000 kworker/0:1-34 [000] .... 764333.038243: virtgpu_fence_event_process: last_seq=202 22.000 glbench-411 [002] .... 764333.038265: virtgpu_queue_fenced_ctrl_buffer: duration=172684 num_free=2 seq=713 1.000 glbench-411 [002] .... 764333.038266: virtgpu_transfer_to_host_ioctl: duration=174942 seq=713 163.000 kworker/0:1-34 [000] .... 764333.038429: virtgpu_fence_event_process: last_seq=203 21.000 glbench-411 [002] .... 764333.038450: virtgpu_queue_fenced_ctrl_buffer: duration=170217 num_free=2 seq=714 1.000 glbench-411 [002] .... 764333.038451: virtgpu_transfer_to_host_ioctl: duration=172804 seq=714 In this case, I had increased the queue size from 256 to 1024, which results in the kernel getting to a steady state of being ~512 transfer behind at all times. The actual ack in this case comes back 166ms later, an average of 324us per transfer: 210.000 kworker/0:1-34 [000] .... 764333.204217: virtgpu_fence_event_process: last_seq=714 (This analysis is in-progress/incomplete, just wanted to get things recorded before I go on vacation and switch gears next week).
,
Nov 9
From the graphics perspective, we're thinking of: 1) Putting transfers in a command buffer (https://gitlab.freedesktop.org/virgl/virglrenderer/issues/51). 2) Reducing copies (udmabuf + PCI bar memory) It would nice to know the cost of a vm-exit (in microseconds) for this.
,
Nov 20
I had to rebuild my setup due to losing my container fs again and my glbench texture_upload numbers are much higher than before: davidriley@stretch-x:~/autotest/client/deps/glbench$ ./glbench -notemp -tests texture_upload # board_id: Red Hat - Gallium 0.4 on virgl # MAX_VIEWPORT_DIMS=(16384, 16384) # GL_MAX_TEXTURE_SIZE=16384 # Running: ./glbench -notemp -tests texture_upload # DateTime: Tue Nov 20 18:54:58 2018 @RESULT: texture_upload_luminance_teximage2d_32 = 18.27 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_128 = 248.05 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_256 = 909.93 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_512 = 2233.60 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_768 = 2451.31 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_1024 = 1987.26 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_1536 = 1858.69 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_2048 = 1705.08 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_32 = 17.39 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_128 = 258.76 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_256 = 696.15 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_512 = 1307.85 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_768 = 1931.87 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_1024 = 1889.34 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_1536 = 1850.51 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_2048 = 1792.57 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_32 = 71.66 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_128 = 773.61 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_256 = 1471.47 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_512 = 2690.15 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_768 = 2565.93 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_1024 = 2697.45 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_1536 = 2931.45 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_2048 = 3397.03 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_32 = 58.86 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_128 = 622.08 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_256 = 2006.99 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_512 = 2574.90 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_768 = 2565.87 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_1024 = 2739.74 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_1536 = 3239.20 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_2048 = 3587.97 mtexel_sec [none] # DateTime: Tue Nov 20 18:56:32 2018 @TEST_END Some of it appears to be debug vs release (which I had tested before and it didn't make much difference), but that only explains a little of the difference. I also hacked together a version of crosvm which returned TransferToHost3d commands immediately without waiting for the fences to be written by the GPU and it made another 2-2.5x improvement for small transfers and should more closely reflect the overhead caused by crosvm command processing (instead of gpu/fence processing overhead): davidriley@stretch-x:~/autotest/client/deps/glbench$ ./glbench -notemp -tests texture_upload # board_id: Red Hat - Gallium 0.4 on virgl # MAX_VIEWPORT_DIMS=(16384, 16384) # GL_MAX_TEXTURE_SIZE=16384 # Running: ./glbench -notemp -tests texture_upload # DateTime: Tue Nov 20 19:54:06 2018 @RESULT: texture_upload_luminance_teximage2d_32 = 45.40 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_128 = 665.57 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_256 = 2043.34 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_512 = 3105.42 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_768 = 2951.39 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_1024 = 2383.29 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_1536 = 1987.47 mtexel_sec [none] @RESULT: texture_upload_luminance_teximage2d_2048 = 1791.65 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_32 = 45.04 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_128 = 461.71 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_256 = 1326.15 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_512 = 2370.53 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_768 = 2398.60 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_1024 = 2353.80 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_1536 = 2028.28 mtexel_sec [none] @RESULT: texture_upload_luminance_texsubimage2d_2048 = 1838.56 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_32 = 149.91 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_128 = 1582.76 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_256 = 2903.33 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_512 = 3195.76 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_768 = 2903.60 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_1024 = 2951.51 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_1536 = 3167.94 mtexel_sec [none] @RESULT: texture_upload_rgba_teximage2d_2048 = 3354.82 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_32 = 149.34 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_128 = 1553.83 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_256 = 2971.28 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_512 = 3310.81 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_768 = 2927.56 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_1024 = 2906.78 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_1536 = 3008.60 mtexel_sec [none] @RESULT: texture_upload_rgba_texsubimage2d_2048 = 3351.15 mtexel_sec [none] # DateTime: Tue Nov 20 19:55:40 2018 @TEST_END
,
Dec 5
|
||
►
Sign in to add a comment |
||
Comment 1 by tbuck...@chromium.org
, Nov 2