Typing in Inbox slow |
|||||||||||
Issue descriptionTyping in an Inbox compose window uses an incredible amount of CPU and is most certainly draining the battery on my Chromebook. Find trace and screenshot of trace attached.
,
Sep 12 2016
From the trace I believe this is sw raster.
,
Sep 12 2016
@#2, correct. SW raster as this is on ChromeOS. Who on the skia team can help us determine why raster is so expensive in this case?
,
Sep 12 2016
I don't believe any of us is set up to develop on ChromeOS, or even has development hardware. Is there a comprehensive guide somewhere? Do we have any hint of when this might have regressed? Could it be connected to SkRasterPipeline blitter changes?
,
Sep 12 2016
We don't know if it's a regression. It might have always been this slow or if it used to be fast then maybe Inbox changed somehow. I don't think this is ChromeOS specific. SW raster on Linux should allow you to reproduce it.
,
Sep 12 2016
I think reed@ could take a look or find someone who can.
,
Sep 12 2016
> Could it be connected to SkRasterPipeline blitter changes? Probably not. Can you guys get a perf trace? The systrace there is stimulating but we'd need to see the stack the whole way down to have a hope of remote diagnosis. Given the ratio of cpu to wall time spent there inside drawRRect()s, it's probably just some very expensive effect being drawn. If we sort those out we're probably gold... it looks like the rest of the drawing fits within a millisecond.
,
Sep 12 2016
Here's what looks like the slow stack to me:
- 6.03% 5.88% CompositorTileW libskia.so [.] boxBlurInterp
5.86% start_thread
base::(anonymous namespace)::ThreadFunc
base::SimpleThread::ThreadMain
content::(anonymous namespace)::CategorizedWorkerPoolThread::Run
content::CategorizedWorkerPool::Run
content::CategorizedWorkerPool::RunTaskWithLockAcquired
content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired
cc::(anonymous namespace)::RasterTaskImpl::RunOnWorkerThread
cc::OneCopyRasterBufferProvider::RasterBufferImpl::Playback
cc::OneCopyRasterBufferProvider::PlaybackAndCopyOnWorkerThread
cc::OneCopyRasterBufferProvider::PlaybackToStagingBuffer
cc::RasterBufferProvider::PlaybackToMemory
cc::RasterSource::PlaybackToCanvas
cc::RasterSource::PlaybackToCanvas
cc::RasterSource::RasterCommon
cc::DisplayItemList::Raster
cc::DrawingDisplayItem::Raster
SkCanvas::drawPicture
- SkCanvas::drawPicture
- 5.57% cc::ImageHijackCanvas::onDrawPicture
SkCanvas::onDrawPicture
SkBigPicture::playback
SkRecordDraw
SkRecord::visit<SkRecords::Draw&>
SkRecord::Record::visit<SkRecords::Draw&>
SkRecords::Draw::operator()<SkRecords::DrawRRect>
SkRecords::Draw::draw<SkRecords::DrawRRect>
SkCanvas::drawRRect
SkNWayCanvas::onDrawRRect
SkCanvas::drawRRect
- SkCanvas::onDrawRRect
- 5.56% SkBitmapDevice::drawRRect
SkBitmapDevice::drawPath
SkDraw::drawPath
SkDraw::drawPath
SkDraw::drawDevPath
SkMaskFilter::filterPath
SkBlurMaskFilterImpl::filterMask
SkBlurMask::BoxBlur
boxBlurInterp
As far as I can tell, we're blurring this rrect (whatever it is) every keystroke. It's also re-blurred any time I change the highlighting on the text. Seems like changes in there are invalidating something quite expensive, even when visibly only the text is changing.
I notice also that SkBlurMask::BoxBlur() and below do not appear to be particularly optimized, unlike, say, the box blurs used by SkBlurImageFilter. Once we figure out why we're drawing and redrawing and redrawing this blur mask, we might want to see if we can rewrite it to also use Skopts::box_blur_{xx,xy,yx}().
,
Oct 21 2016
[Adding a few folks from Elm too]
reveman@: How do you get the Skia events into tracing? I enabled skia but my profile stack ends at the OneCopyRasterBuffer::PlayBack().
mtklein@: -g doesn't seem to work on Kevin so I can only use perf to get flat profile. How did you get yours/on what platform? Mine looks like following:
17.92% chrome chrome [.] _ZL13boxBlurInterpPKhiPhiiibh
1.89% chrome chrome [.] _init
1.84% chrome chrome [.] _ZN14neon_and_crc32L16blit_mask_d32_a8EPjjPKhjjii
1.70% chrome chrome [.] _ZNK3gpu5gles212ContextState26RestoreTextureUnitBindingsEjPKS1_
1.48% chrome libc-2.19.so [.] 0x0005dc4e
1.03% chrome libc-2.19.so [.] 0x0005dc66
1.02% chrome libc-2.19.so [.] 0x0005dc48
0.96% chrome [kernel.kallsyms] [k] 0xffffffc000203e18
0.92% chrome chrome [.] _ZN8tcmalloc11ThreadCache22CreateCacheIfNecessaryEv
0.90% chrome libmali.so.0.1.20 [.] 0x000a2b1c
0.88% chrome chrome [.] _ZN8tcmalloc15CentralFreeList18ReleaseListToSpansEPv
0.88% chrome chrome [.] _ZN8tcmalloc15CentralFreeList11InsertRangeEPvS1_i
0.75% chrome chrome [.] _ZN8tcmalloc6FL_PopEPPv
0.74% chrome chrome [.] _ZN8tcmalloc15CentralFreeList11RemoveRangeEPPvS2_i
And the skia symbols seem to be static linked into chrome instead of in libskia.so.
,
Oct 21 2016
Even without a non-flat profile, you're showing 18% in boxBlurInterp, even worse than Mike's data.
,
Oct 21 2016
I think my profile was likely a desktop x86-64 Linux component build, so the fact that mine sees symbols in libskia.so is likely the anomaly. As far as I know, Skia is ordinarily statically linked into Chrome on every platform. I take it that Kevin is an ARMv8 Chromebook?
,
Oct 21 2016
Yes both Kevin and Elm are ARMv8 Chromebooks, but user space code (including Chrome) is still compiled as 32-bit. Do we also want to reach out to the Inbox team to see if they can use simpler effect?
,
Oct 21 2016
Wouldn't you like to figure out why Chrome's re-drawing everything instead of just the changes? This seems like some sort of layering problem. It never hurts to draw less or simpler, but this doesn't seem like Inbox's fault.
,
Oct 21 2016
Oh to clarify - we definitely want to understand the re-drawing problem too.
,
Oct 25 2016
y=0xb37b7000 "", src_y_stride=551, dst=0xb4016000 "",
dst@entry=0xf187be80 " ", radius=radius@entry=10, width=551, height=573,
height@entry=-149352784, transpose=true, transpose@entry=false,
outer_weight=outer_weight@entry=146 '\222')
at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284 in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.
Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb37b7000 "",
src_y_stride=498, dst=0xb3d4d000 "", radius=5, width=498, height=512,
transpose=false, outer_weight=243 '\363')
at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284 in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.
Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb3d4d000 "",
src@entry=0xacc88d21 <SkBlurMaskFilterImpl::filterMask(SkMask*, SkMask const&, SkMatrix const&, SkIPoint*) const> "\370\034M\004F\315\351\003x\315\351\001V\315\351\005\236\203\260\005j\002\257\220F\211F\352\a>j\037Ր\355\002\n\337\355\022z\264\356\347\n\361\356\020\372\001\324\260\356g\n\342hHF",
src_y_stride=508, dst=0xb4016000 "", dst@entry=0xf187be80 " ",
radius=radius@entry=5, width=508, width@entry=-149352784, height=512,
height@entry=-242762096, transpose=transpose@entry=false,
outer_weight=outer_weight@entry=243 '\363')
at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284 in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.
Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb4016000 "",
src@entry=0xf187be80 " ", src_y_stride=518, dst=0xb3d4d000 "",
dst@entry=0xb37b7000 "", radius=radius@entry=5, width=518, height=512,
height@entry=-242762096, transpose=true, transpose@entry=false,
outer_weight=outer_weight@entry=243 '\363')
at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284 in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.
Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb3d4d000 "",
src@entry=0xb37b7000 "", src_y_stride=512, dst=0xb4016000 "",
dst@entry=0xf187be80 " ", radius=radius@entry=5, width=512, height=528,
height@entry=-242762096, transpose=false, transpose@entry=true,
outer_weight=outer_weight@entry=243 '\363')
at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284 in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.
Looks like src_y_stride continues to change in the repeated calls. However I don't know Skia/raster well enough to judge what's happening there...
,
Oct 25 2016
[I missed some text in the beginning part in comment #15] I compiled Chrome locally with debugging information and I set a breakpoint at boxBlurInterp. It appears that the boxBlurInterp function is invoked 30 ~ 60 times after a single character is typed.
,
Oct 25 2016
Every application of a blur mask results in 2-6 calls to boxBlurInterp (https://cs.chromium.org/chromium/src/third_party/skia/src/effects/SkBlurMask.cpp?q=boxBlurInterp&sq=package:chromium&l=569&dr=C). With those widths/heights/strides this sounds vaguely like compositor tiles being blurred, but they don't quite match? Sounds like you're doing a fairly large blur (radius 5?) on a large tiled layer (10-15 tiles?)
,
Oct 25 2016
,
Oct 27 2016
I found sometimes the typing response is pretty fast but sometimes isn't, so I did an experiment to disable small cores and all of a sudden Inbox feels very smooth. Here are some additional performance data I collected: Octane scores: 2 little cores: ~3000 2 big cores: ~9000 FPS while typing in Inbox: 2 little cores: fluctuating between 6 ~ 19 2 big cores: stable at 19 perf profile: 2 little cores: 16% in boxBlurInterp 2 big cores: 10% in boxBlurInterp I think if we can pin the CompositorTileW thread to the big core, it will greatly improve the user experience on Inbox.
,
Oct 27 2016
Pinning of threads to big cores is currently reserved to high priority threads that are critical for producing smooth animations and allowing well behaved pages and apps to run at 60fps. Using big cores for CompositorTileWorker might improve inbox but it will regress arc++ apps, smooth scrolling and animations. CompositorTileWorker threads are worker threads that perform expensive tasks such as image decode and sw raster. If we pin these threads to big cores then for example decoding of invisible images will be more likely cause jank when scrolling a page. I don't think we want that.
,
Oct 27 2016
For sonnyrao@ and dtor@: Will Sonny's patches that disable EAS help here? If so, how much improvement do we see in this case? dtor@ - what information is unavailable to the kernel scheduler here to make the right call? If we are not running an arc++ app and inbox at the same time, why is the right thing not happening - on keyboard input, interactive governor hints should be boosting the CPUs as well as try to schedule more work on the big cores. Is this not true?
,
Oct 27 2016
@21 - I believe the interactive governor only boosts CPU frequency, it does not "try to schedule more work on the big cores". On chat, reveman noted that trace in the original posting was actually taken on Samus, where typing in Inbox is also slow. So, moving CompositorTileWorker to big cores isn't going to help there. One theory is that this issue is common to samus & kevin because they both use dsf 2.0. As a reference point, I tried typing in Inbox on elm (arm64 2x2 big.LITTLE, dsf 1.25) @ [8743.76.0/54.0.2840.79], but Inbox didn't seem slow at all. Attached is the trace. I do see ~25-30 ms CompositorTileWorker events, however they do not contain any "skCanvas::drawPicture()" events. The "InputLatency::Char" trace shows ~50-100 ms latency for most of the key strokes (The largest I saw was one @ 125 ms).
,
Oct 28 2016
FYI: This is the CL referred to in #21 https://chromium-review.googlesource.com/#/c/384434/ This will bias more towards bigs rather than littles, so may help improve consistency
,
Oct 28 2016
I tried the EAS patch but it doesn't help Kevin. I used the FPS meter from Chrome devtools as the metrics and it cannot stay at 19 FPS. For the Elm trace in #22, if you look at the Kernel section near the top you can see that most of the time the CompositorTileW thread is scheduled at the big cores 2/3 (eg starting from 3037.406ms). If I explicitly disable the two big cores typing in Inbox feels sluggish too. For Kevin, since it has 4 little cores will it have higher chance to schedule the worker thread to little cores than Elm? Also, being a high resolution device won't help either.
,
Dec 16 2016
So with the scheduler changes we can ensure that all the important processes are running on the big cores but we still see significant latency on the order of 200 ms for typing. Here is a trace of this case: https://drive.google.com/a/google.com/file/d/0B3pFZucxX_SvSzZHMVVtQUY0aFk/view?usp=sharing To me it looks like V8 is the long pole here but it would be good to get additional input from other folks
,
Dec 16 2016
,
Dec 16 2016
Yeah, this is all limited by V8. I'll see if we are doing something suboptimal while building V8, but in any case I don't think there's a lot of headroom. Should we tell people to use the android inbox app instead? :)
,
Dec 17 2016
I took inbox compose traces on gru and elm (1080p screens vs 2K on kevin) just to see and indeed it does seem much better, character latency is generally well under 100ms on both machines -- if I look at the traces it seems like there's significantly less V8 happening on gru/elm and significantly longer compositor tile worker slices on kevin vs gru/elm. The v8 slices seem to be related to timers, so I'm hypothesizing that the v8 slices are running timers and waiting for the compositor tile workers to finish? Could it be doing something dumb like busy waiting? Maybe pras@ could shed some light on that. I believe the big chrome tiles would be 4x larger on kevin vs the other two, but the time seems much worse on Kevin -- like 10x worse, so that seems strange. The function in the trace for the tile workers is OneCopyRasterBuffer::Playback Here's a folder on drive with various traces https://drive.google.com/open?id=0B3pFZucxX_Svb1ZpS1lPdThXLTA
,
Dec 17 2016
sorry should have read earlier comments -- I think the traces in #28 are just showing that even with a good scheduler -- we're still limited by the tile workers doing those blurs, and that V8 cpu time is somehow tied to how long this blur takes.
,
Dec 19 2016
,
Dec 19 2016
I tried the experiment with the Jackal compiler vs the old compiler and jackal might be slightly faster but it seems like it's almost within the noise I put both traces in this folder https://drive.google.com/open?id=0B3pFZucxX_Svem9uWXFuR2RPT2M both show char input latencies around 200ms
,
Dec 19 2016
I also tried the test where I put an external monitor on my Kevin and closed the lid putting it in docked mode and keyboard latency on inbox became reasonable after that.
,
Dec 19 2016
Wow interesting. Is it due to resolution change or PSR? Attached is the ddrmon utility that shows the memory traffic amount. Could you try it to see if there is noticeable difference?
,
Dec 19 2016
re #33 -- it's most likely due to lower resolution and matches with the fact that inbox is reasonable on Gru and Elm
,
Dec 20 2016
,
Dec 22 2016
See analysis of issue by reveman@: https://docs.google.com/document/d/1UiAwBZbaHAA6xg3rdZ6hx287z8ZQ7lDR0_GK2d7neBc/edit#heading=h.ul2pcnvd3yzl
,
Sep 7
,
Dec 14
GPU Triage: adlr@ is this something that is still a problem / worth investigating? With CrOS now using GPU raster, and at ~2yrs since the last update, seems likely we can just close this out? |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by reve...@chromium.org
, Sep 10 2016