I investigated the regressions in first_gesture_scroll_update_latency and max_frame_delay; both of them come from changes in shader compilation behavior with OOP raster. Shader compilation is a costly operation that bogs down the GPU main thread, and with GPU raster it was naturally throttled by the command buffer: since the work was chunked, the display context's swaps could sneak in between the renderer worker's commands.
But with OOP raster, all of this work is driven internally by Skia and in some cases can hold the GPU main thread for > 200 ms, blocking the entire display pipeline. It's rare to hit this in practice, though, because we commonly get a hit from the framework's shader disk cache (thanks Eric for all the debugging help!). The telemetry test runner starts with a cold cache, which is why this worst case shows up in the benchmarks.
DDL won't fix this either, since the compilation will still run on the GPU main thread when we process the DDL. One option would be to pull out the shaders required for a DDL so they can be compiled one at a time (rough sketch below). But do we even need to optimize for this case if we commonly get a hit from the shader disk cache?
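To make the "one at a time" idea concrete, here is a minimal sketch of what a DDL shader precompiler could look like. It assumes a hypothetical API for enumerating the programs a DDL will need; none of these names (ExtractPendingPrograms, CompileProgram) are real Skia or Chromium entry points, they only illustrate compiling one program per scheduler slice instead of all of them in one block.

```cpp
#include <deque>

struct PendingProgram { /* shader source, pipeline state, caps, ... */ };

class DdlShaderPrecompiler {
 public:
  // Hypothetical: walk the DDL and collect the programs it will require.
  void ExtractPendingPrograms(/* const SkDeferredDisplayList& ddl */) {
    // pending_.push_back(...);
  }

  // Compile at most one program per call, so the scheduler can interleave
  // display work between calls instead of blocking on the whole batch.
  bool CompileOne() {
    if (pending_.empty())
      return false;
    CompileProgram(pending_.front());  // hypothetical compile entry point
    pending_.pop_front();
    return !pending_.empty();  // true if the caller should reschedule us
  }

 private:
  void CompileProgram(const PendingProgram&) { /* driver compile + link */ }
  std::deque<PendingProgram> pending_;
};
```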
Could this be a general issue with how pre-emption granularity is affected under OOP-R, since we're more likely to run a whole tile's worth of commands at a time?
Could it make sense to have the RasterCHROMIUM handler yield back to the scheduler based on some heuristic? RasterDecoder would need to keep enough state to be able to resume.
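A minimal sketch of such a yield heuristic, assuming a time-slice budget: the handler rasters serialized paint ops until it exhausts its slice, records where it stopped, and returns so other streams (e.g. the display compositor) can run. All names here (PaintOpStream, kTimeSlice, YieldingRasterHandler) are made up for illustration; the real RasterDecoder would also have to keep its GrContext / translation state alive across yields.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct PaintOp { void Raster() const {} };
using PaintOpStream = std::vector<PaintOp>;

class YieldingRasterHandler {
 public:
  enum class Result { kFinished, kYielded };

  // Returns kYielded once the time budget is used up; the scheduler can then
  // run other command streams and call Execute() again later.
  Result Execute(const PaintOpStream& ops) {
    constexpr auto kTimeSlice = std::chrono::milliseconds(2);  // heuristic
    const auto deadline = std::chrono::steady_clock::now() + kTimeSlice;
    while (next_op_ < ops.size()) {
      ops[next_op_++].Raster();
      if (std::chrono::steady_clock::now() >= deadline)
        return Result::kYielded;  // resume from next_op_ on the next call
    }
    next_op_ = 0;
    return Result::kFinished;
  }

 private:
  size_t next_op_ = 0;  // resume point across yields
};
```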
I tried that: I chunked up paint commands on the client into multiple RasterCHROMIUM commands and yielded back to the scheduler after each one, but that didn't help. That's because Skia internally batches up all the GPU work in a GrOp list and executes it together in prepareForExternalIO; it's not possible to interrupt that work.
^ I meant that it didn't help in this case, because all the shader work was for a single tile and was executed in prepareForExternalIO for that tile. In general we could do something like #9 to make OOP work interruptible, but I haven't seen a case come up where it was an issue.
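For reference, the chunking experiment above was roughly shaped like the sketch below: split a tile's paint ops across several RasterCHROMIUM commands so the service-side scheduler can yield between them. IssueRasterChromium is a placeholder, not the real client API. As noted, this doesn't help the shader-compile case, because Skia only records GrOps during these calls and performs the actual GPU work (including the compiles) at the end-of-tile flush, so the expensive part remains one uninterruptible block.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct PaintOp {};

void IssueRasterChromium(const std::vector<PaintOp>& chunk) {
  // Placeholder: serialize |chunk| into one RasterCHROMIUM command.
}

void RasterTileInChunks(const std::vector<PaintOp>& ops, size_t chunk_size) {
  for (size_t start = 0; start < ops.size(); start += chunk_size) {
    const size_t end = std::min(start + chunk_size, ops.size());
    IssueRasterChromium({ops.begin() + start, ops.begin() + end});
    // The scheduler can yield after each command here, but the deferred GPU
    // work still executes en masse when the tile is flushed.
  }
}
```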
I looked up some old issues that had motivated thinking about scheduling and pre-empting at a sub-tile level.
Issue 495344 involved blur filters on Search results. The Search implementation has changed since then, so it's no longer an issue there, but I wonder whether we could find other cases in the CT corpus. Blurs use many framebuffer switches, and they were very slow, at least on hardware of the time like the Nexus 6. Yielding between FBO swaps seemed like a good idea.
Issue 492861 involved MSAA FBO resolve on Mac.