Issue metadata
Sign in to add a comment
|
3.2% regression in page_cycler.top_10_mobile at 387300:387345 |
||||||||||||||||||||
Issue descriptionSee the link to graphs below.
,
Apr 16 2016
=== Auto-CCing suspected CL author scroggo@chromium.org === Hi scroggo@chromium.org, the bisect results pointed to your CL below as possibly causing a regression. Please have a look at this info and see whether your CL be related. ===== BISECT JOB RESULTS ===== Status: completed ===== SUSPECTED CL(s) ===== Subject : Eliminate copies of encoded image data Author : scroggo Commit description: Remove the ThreadSafeDataTransport, which required making two copies of the encoded data. Instead, copy the data (once) into an SkRWBuffer, which allows creating read only snapshots that can be safely read in another thread. Design doc can be found at https://docs.google.com/document/d/1aNmMaWfjkSSkdnVkAJoBUN7mNwdCXT1wLghrhDhhSUw/edit#heading=h.hvbfr8rfyj9k Benchmark results can be found at https://docs.google.com/spreadsheets/d/1Y_9UbzBvyg1HRs2uieeV2Twb7seJFKZulRVg7OQOqEo/ DeferredImageDecoder: Copy into an SkRWBuffer, and create snapshots to pass to each DecodingImageGenerator. DecodingImageGenerator: Store a snapshot of the data, along with whether the snapshot contains all of the data. Pass this to the ImageFrameGenerator when it is time to decode. Since it already knows whether it has received all the data, never attempt to YUV decode if not. When created from an SkData, skip the copy into a SharedBuffer. Instead, wrap in a SegmentReader. ImageFrameGenerator: Its decoding APIs now take the encoded data and whether it's complete as parameters. Does not own the encoded data, and no longer has the implementation of onRefEncodedData. Make the object thread-safe where possible, so we can only lock the mutex during operations that truly need it. SegmentReader: Add a new interface that looks like a read only SharedBuffer. Add three implementations: - SharedBufferSegmentReader: Used when we already have a SharedBuffer (e.g. in the main thread). - ROBufferSegmentReader: Facilitates decoding in a worker thread without requiring mutex locks/ an extra copy of the data. (Note - this still requires a mutex for using the SkROBuffer::Iter. But in practice, there is no contention, unlike the old implementation, which had to lock in order to copy into the reader thread.) - DataSegmentReader: Allows using an SkData directly without copying. ImageDecoders/FastSharedBufferReader: Read data from a SegmentReader instead of directly from a SharedBuffer. (Add an ImageDecoder helper that takes a SharedBuffer parameter, reducing the need to change code that already has a SharedBuffer.) Since the API is the same, the changes here are small, except for WEBPImageDecoder, which no longer calls SharedBuffer::data(), and instead calls SegmentReader::refAsSkData(). Use PassRefPtr instead of raw pointers for parameter passing in setData. Tests: Adapt where a SharedBuffer is no longer used. PictureSnapshot.cpp Wrap SkData in SegmentReader. BUG= 467772 BUG= 483369 BUG= 335694 This may also help with crrbug.com/335694. The DecodingImageGenerator will no longer attempt to read beyond the data that had been received when it was created. e.g DeferredImageDecoder did = ... did.setData(...); SkImage image = did.createFrameAtIndex(0); did.setData(...); // More data image will not attempt to decode the extra data provided in the last call to setData. However, it could use a decoder that had previously been used to decode more data than it was created with. e.g. DeferredImageDecoder did = ... did.setData(...); SkImage imageA = did.createFrameAtIndex(0); // DecodingImageGenerator did.setData(...); // More data SkImage imageB = did.createFrameAtIndex(0); // draw imageB - this decodes to the end of data provided in the second // call to setData // draw imageA - this can use the same ImageDecoder retrieved from the // ImageDecodingStore, meaning it will have already // decoded beyond the data supplied to imageA imageA *may* look like imageB, even though it had less data. Review URL: https://codereview.chromium.org/1812273003 Cr-Commit-Position: refs/heads/master@{#387344} Commit : d2234904faee943bd987bd38d620096db808efca Date : Thu Apr 14 17:05:12 2016 ===== TESTED REVISIONS ===== Revision Mean Value Std. Dev. Num Values Good? chromium@387299 27431.5 133.712752 6 good chromium@387322 27363.8 193.312441 5 good chromium@387334 27508.6 190.410872 5 good chromium@387340 27349.4 179.217187 5 good chromium@387343 27482.0 221.286014 5 good chromium@387344 28347.6 147.23213 5 bad <- chromium@387345 28205.4 215.779517 5 bad Bisect job ran on: android_s5_perf_bisect Bug ID: 604008 Test Command: src/tools/perf/run_benchmark -v --browser=android-chromium --output-format=chartjson --also-run-disabled-tests page_cycler.top_10_mobile Test Metric: vm_private_dirty_final_renderer/vm_private_dirty_final_renderer Relative Change: 3.08% Score: 99.8 Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/android_s5_perf_bisect/builds/603 Job details: https://chromeperf.appspot.com/buildbucket_job_status/9015284054956236992 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=604008 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Apr 18 2016
I expected an improvement (less copies, less mutex contention), and if I look at https://chromeperf.appspot.com//group_report?rev=387344 (improvements and regressions around the time of my change), I see many improvements. In particular, I see a couple of improvements (and no regressions) for the image_decoding.image_decoding_measurement Test Suite (2.0% on chromium-rel-win7-gpu-intel and 18.6% on android-nexus9). That test suite is specifically for image decoding, and my CL should only have affected image decoding. That said, most sites have images (citation needed), and this test suite only times a few images, so it's possible that it misses interesting cases. Looking further in that list, I see several large performance differences (some improvements and some regressions) that may be worth investigating to see whether they're related to my CL.
,
May 2 2016
As stated before, my CL should actually improve image decoding (and does for the image decoding specific tests) and should not affect other things.
But there is a case where I could speculate that my CL would cause a slow down. The change is in DecodingImageGenerator::onRefEncodedData:
Prior to my CL, onRefEncodedData did the following:
1. lock a mutex and copy encoded data from the writer thread to the reader thread (in ThreadSafeDataTransport::data)
2. a) If the data is not all received, return nullptr
b) If onRefEncodedData has been called before, return the existing SkData object
c) Call SharedBuffer::data() to copy all the data into a contiguous buffer (throwing away the original segments).
Create an SkData object wrapping it. Return the SkData, and keep a pointer (ref) to it, for use next time.
With my CL:
- we skip 1, meaning that during load (not all data has been received) we go straight to 2a and return nullptr. This is faster than the old code.
- Once all the data is loaded, we call SegmentReader::refAsSkData, which copies all the data into into a contiguous buffer *without* deleting the old data.[1]
- Here's the potentially bad part. On future calls to onRefEncodedData, we do the copy again, since we did not hang on to the contiguous buffer.
So I expect this method to be faster during load, but slower when called after all the data has been received.
So the question in my mind is, are the cases where we slow down a result of calling this method when all data has been received? The primary caller is inside SkImageCacherator::lockTexture [2]. It only calls onRefEncodedData if the image is not already in the cache, though based on https://codereview.chromium.org/1925533003/, it sounds like we are not using the cache properly. For an animated GIF, I could see how this would be slower. (This assumes we're still calling lockTexture in the new ImageDecodeController world.)
We already have a bug around the fact that onRefEncodedData is so expensive to begin with ( issue 568016 ). Before and after my patch, we should be able to skip this, due to the fact that we are only calling onRefEncodedData to allow the GPU to directly support files like pkm and ktx. DecodingImageGenerator should never have to copy all the data, since it never holds one of those types of files internally.
I have a CL in progress for issue 568016 in https://codereview.chromium.org/1862133002/. That should also address this regression, if it is due to my CL.
[1] Typically. We do have a fast path for when the data is already contiguous in an SkROBuffer, but this won't be the case unless all the data was loaded initially. (Since we're talking about a DecodingImageGenerator, the SegmentReader must be an ROBufferSegmentReader, so we can ignore the fact that DataSegmentReader also need not do a copy, and that we *could* add a fast path for SharedBufferSegmentReader.)
[2] https://code.google.com/p/chromium/codesearch#chromium/src/third_party/skia/src/core/SkImageCacherator.cpp&sq=package:chromium&l=280&rcl=1449620985
,
May 10 2016
This is still pending: https://codereview.chromium.org/1862133002/
,
May 11 2016
That CL has been replaced by https://codereview.chromium.org/1970773002/ in Skia. Once that merges into Chromium (https://codereview.chromium.org/1970803002/), this should improve.
,
May 11 2016
That merge CL got replaced with https://codereview.chromium.org/1971543003/ (the first one failed.)
,
Jun 1 2016
Moving this nonessential bug to the next milestone. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jul 8 2016
The metrics have recovered, I think we can close this. |
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by rsch...@chromium.org
, Apr 15 2016