
Issue 898725

Starred by 4 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components: Gfx
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug




Investigate slow texture upload times for crosvm

Project Member Reported by davidri...@chromium.org, Oct 25

Issue description

Spun off from 896920.

glbench shows slow texture upload times with virtio-gpu/crosvm.  https://docs.google.com/spreadsheets/d/1soaP06c8DJ01uLV8eaATjh4q_J8gcjPe1dl3KmKri7I/edit#gid=1487276513  On small textures, guest texture upload speed is about 4% that of the host.

Analysis of 10k iterations of texture_upload_rgba_teximage2d_32 shows that virgl processing is fast (15-20µs), but because the command is fenced, crosvm waits considerably longer for virgl to indicate that the fence has been signaled before responding to the descriptor.  I don't completely trust my timings, since the act of logging to collect them likely affects the results, but the overhead appears to be roughly 10x-15x: a 15µs initial texture upload call takes 150µs-225µs before it is acked back to the guest.

Here's a typical host-side instrumentation trace I took, following a single request.
53 2018-10-24T13:24:28.902309-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:421] Some(TransferToHost3d) -> OkNoData (processed 18.375µs) (39.803µs) fence 714 desc.index 6 - log done at the end of process_descriptor, 39.803us for the entire call, 18.375us to do the first half of decode + process_gpu_command/virgl request.
21 2018-10-24T13:24:28.902330-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:610] signal_used_queue
118 2018-10-24T13:24:28.902448-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:465] fence poll current seq 714 duration 8.529µs - fence signalled by virgl
17 2018-10-24T13:24:28.902465-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:592] add_used index 6 - acknowledge the control descriptor
52 2018-10-24T13:24:28.902517-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:421] Some(TransferToHost3d) -> OkNoData (processed 18.206µs) (39.419µs) fence 715 desc.index 16
22 2018-10-24T13:24:28.902539-07:00 DEBUG localhos[22008]: crosvm[285]: [devices/src/virtio/gpu/mod.rs:610] signal_used_queue - signal back to the guest

In total, the operation took 270us from the time that the command started getting processed until the descriptor was acked back to the guest.
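As a quick cross-check of that 270µs figure (not part of the original trace): the "processed" log line is emitted at the end of process_descriptor, so the call began ~39.803µs before its timestamp, and the final signal_used_queue at .902539 is the ack the guest observes.

```rust
// Sanity check of the ~270µs end-to-end figure from the log above.
// Assumes the "processed" line is logged at the *end* of process_descriptor
// and that signal_used_queue at .902539 is when the guest is signalled.
fn main() {
    let processed_logged_at = 902_309.0_f64; // µs past the second, first log line
    let entire_call = 39.803;                // µs, whole process_descriptor call
    let guest_signalled_at = 902_539.0;      // µs, final signal_used_queue
    let start = processed_logged_at - entire_call;
    let total = guest_signalled_at - start;
    println!("end-to-end: {:.1}µs", total);  // ~269.8µs
}
```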

On the guest side, the kernel accepts VIRTGPU_TRANSFER_TO_HOST ioctls quickly until the control descriptor queue fills up.  At that point there's a call chain that looks like:
virtio_gpu_transfer_to_host_ioctl
virtio_gpu_cmd_transfer_to_host_3d
virtio_gpu_queue_fenced_ctrl_buffer

Within virtio_gpu_queue_fenced_ctrl_buffer, the guest kernel blocks waiting on acks until the control descriptor has room for three descriptors.  When that ack comes in, the next pending operation gets queued and the cycle continues.
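The blocking behavior above can be sketched as a toy userspace model (queue size, latencies, and transfer counts here are invented; only the control flow mirrors the kernel's wait-for-free-descriptors logic):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

const QUEUE_SIZE: usize = 8; // toy value; real queue was 256, then 1024
const NEEDED: usize = 3;     // free descriptors required before queueing

fn main() {
    // Shared count of free descriptors, plus a condvar to wake the guest.
    let state = Arc::new((Mutex::new(QUEUE_SIZE), Condvar::new()));

    // "Host": acks one descriptor per simulated fence latency.
    let host = {
        let state = Arc::clone(&state);
        thread::spawn(move || {
            for _ in 0..32 {
                thread::sleep(Duration::from_micros(300));
                let (lock, cvar) = &*state;
                *lock.lock().unwrap() += 1; // descriptor acked back
                cvar.notify_one();
            }
        })
    };

    // "Guest": submits 32 fenced transfers, blocking under backpressure
    // whenever fewer than NEEDED descriptors are free, like
    // virtio_gpu_queue_fenced_ctrl_buffer.
    for _ in 0..32 {
        let (lock, cvar) = &*state;
        let mut free = lock.lock().unwrap();
        while *free < NEEDED {
            free = cvar.wait(free).unwrap(); // blocks waiting on an ack
        }
        *free -= 1; // consume a descriptor for this transfer
    }
    host.join().unwrap();
    println!("32 transfers completed under backpressure");
}
```

Once the queue fills, each new submission admits exactly one transfer per host ack, which is why throughput degenerates to the per-ack round-trip time.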

22.000          glbench-411   [002] .... 764333.038079: virtgpu_queue_fenced_ctrl_buffer: duration=248329 num_free=2 seq=712
1.000          glbench-411   [002] .... 764333.038080: virtgpu_transfer_to_host_ioctl: duration=250677 seq=712
163.000      kworker/0:1-34    [000] .... 764333.038243: virtgpu_fence_event_process: last_seq=202
22.000          glbench-411   [002] .... 764333.038265: virtgpu_queue_fenced_ctrl_buffer: duration=172684 num_free=2 seq=713
1.000          glbench-411   [002] .... 764333.038266: virtgpu_transfer_to_host_ioctl: duration=174942 seq=713
163.000      kworker/0:1-34    [000] .... 764333.038429: virtgpu_fence_event_process: last_seq=203
21.000          glbench-411   [002] .... 764333.038450: virtgpu_queue_fenced_ctrl_buffer: duration=170217 num_free=2 seq=714
1.000          glbench-411   [002] .... 764333.038451: virtgpu_transfer_to_host_ioctl: duration=172804 seq=714

In this case, I had increased the queue size from 256 to 1024, which results in the kernel reaching a steady state of being ~512 transfers behind at all times.

The actual ack in this case comes back 166ms later, an average of 324us per transfer:
210.000      kworker/0:1-34    [000] .... 764333.204217: virtgpu_fence_event_process: last_seq=714
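The 324µs average quoted above follows from the backlog size and drain time (a quick check, not from the original report):

```rust
// ~512 transfers were outstanding and the final ack landed 166ms after
// the backlog started draining.
fn main() {
    let total_us = 166_000.0_f64; // 166ms drain time
    let transfers = 512.0;        // outstanding transfers
    println!("{:.0}µs per transfer", total_us / transfers); // ~324µs
}
```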

(This analysis is in progress/incomplete; I just wanted to get things recorded before I go on vacation and switch gears next week.)
 
Status: Assigned (was: Untriaged)
From the graphics perspective, we're thinking of:

1) Putting transfers in a command buffer (https://gitlab.freedesktop.org/virgl/virglrenderer/issues/51).

2) Reducing copies (udmabuf + PCI bar memory)

It would be nice to know the cost of a vm-exit (in microseconds) for this.
I had to rebuild my setup after losing my container fs again, and my glbench texture_upload numbers are much higher than before:
davidriley@stretch-x:~/autotest/client/deps/glbench$ ./glbench -notemp -tests texture_upload
# board_id: Red Hat - Gallium 0.4 on virgl
# MAX_VIEWPORT_DIMS=(16384, 16384)
# GL_MAX_TEXTURE_SIZE=16384
# Running: ./glbench -notemp -tests texture_upload
# DateTime: Tue Nov 20 18:54:58 2018
@RESULT: texture_upload_luminance_teximage2d_32        =      18.27 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_128       =     248.05 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_256       =     909.93 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_512       =    2233.60 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_768       =    2451.31 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_1024      =    1987.26 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_1536      =    1858.69 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_2048      =    1705.08 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_32     =      17.39 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_128    =     258.76 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_256    =     696.15 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_512    =    1307.85 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_768    =    1931.87 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_1024   =    1889.34 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_1536   =    1850.51 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_2048   =    1792.57 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_32             =      71.66 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_128            =     773.61 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_256            =    1471.47 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_512            =    2690.15 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_768            =    2565.93 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_1024           =    2697.45 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_1536           =    2931.45 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_2048           =    3397.03 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_32          =      58.86 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_128         =     622.08 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_256         =    2006.99 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_512         =    2574.90 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_768         =    2565.87 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_1024        =    2739.74 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_1536        =    3239.20 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_2048        =    3587.97 mtexel_sec      [none]
# DateTime: Tue Nov 20 18:56:32 2018
@TEST_END

Some of the difference appears to be debug vs. release builds (which I had tested before without seeing much impact), but that explains only a little of it.

I also hacked together a version of crosvm that returns TransferToHost3d commands immediately, without waiting for the fences to be written by the GPU.  This gave another 2-2.5x improvement for small transfers and should more closely reflect the overhead of crosvm command processing (rather than gpu/fence processing):
davidriley@stretch-x:~/autotest/client/deps/glbench$ ./glbench -notemp -tests texture_upload
# board_id: Red Hat - Gallium 0.4 on virgl
# MAX_VIEWPORT_DIMS=(16384, 16384)
# GL_MAX_TEXTURE_SIZE=16384
# Running: ./glbench -notemp -tests texture_upload
# DateTime: Tue Nov 20 19:54:06 2018
@RESULT: texture_upload_luminance_teximage2d_32        =      45.40 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_128       =     665.57 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_256       =    2043.34 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_512       =    3105.42 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_768       =    2951.39 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_1024      =    2383.29 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_1536      =    1987.47 mtexel_sec      [none]
@RESULT: texture_upload_luminance_teximage2d_2048      =    1791.65 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_32     =      45.04 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_128    =     461.71 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_256    =    1326.15 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_512    =    2370.53 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_768    =    2398.60 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_1024   =    2353.80 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_1536   =    2028.28 mtexel_sec      [none]
@RESULT: texture_upload_luminance_texsubimage2d_2048   =    1838.56 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_32             =     149.91 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_128            =    1582.76 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_256            =    2903.33 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_512            =    3195.76 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_768            =    2903.60 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_1024           =    2951.51 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_1536           =    3167.94 mtexel_sec      [none]
@RESULT: texture_upload_rgba_teximage2d_2048           =    3354.82 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_32          =     149.34 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_128         =    1553.83 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_256         =    2971.28 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_512         =    3310.81 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_768         =    2927.56 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_1024        =    2906.78 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_1536        =    3008.60 mtexel_sec      [none]
@RESULT: texture_upload_rgba_texsubimage2d_2048        =    3351.15 mtexel_sec      [none]
# DateTime: Tue Nov 20 19:55:40 2018
@TEST_END
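For reference, the small-texture speedup from the fence-skipping hack, computed from the 32x32 cases in the two runs above:

```rust
// Speedup of the fence-skipping crosvm hack on the 32x32 cases
// (values in mtexel_sec, taken from the two benchmark runs above).
fn main() {
    let cases = [
        ("luminance_teximage2d_32", 18.27_f64, 45.40),
        ("luminance_texsubimage2d_32", 17.39, 45.04),
        ("rgba_teximage2d_32", 71.66, 149.91),
        ("rgba_texsubimage2d_32", 58.86, 149.34),
    ];
    for (name, fenced, unfenced) in cases {
        // Roughly 2.1x-2.6x across the four cases.
        println!("{name}: {:.2}x", unfenced / fenced);
    }
}
```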
Labels: -Grfx Gfx
