New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 645778 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Typing in Inbox slow

Project Member Reported by reve...@chromium.org, Sep 10 2016

Issue description

Typing in an Inbox compose window uses an incredible amount of CPU and is most certainly draining the battery on my Chromebook. Find trace and screenshot of trace attached.
 
trace_inbox.json.gz
6.7 MB Download
Screenshot 2016-09-09 at 12.13.00 PM.png
151 KB View Download
Cc: reed@chromium.org tomhud...@chromium.org bsalomon@chromium.org

Comment 2 by bsalo...@google.com, Sep 12 2016

From the trace I believe this is sw raster.
Labels: -OS-All OS-Chrome
@#2, correct. SW raster as this is on ChromeOS. Who on the skia team can help us determine why raster is so expensive in this case?
Cc: mtklein@chromium.org
I don't believe any of us is set up to develop on ChromeOS, or even has development hardware. Is there a comprehensive guide somewhere?

Do we have any hint of when this might have regressed? Could it be connected to SkRasterPipeline blitter changes?
We don't know if it's a regression. It might have always been this slow or if it used to be fast then maybe Inbox changed somehow. I don't think this is ChromeOS specific. SW raster on Linux should allow you to reproduce it.

Comment 6 by bsalo...@google.com, Sep 12 2016

I think reed@ could take a look or find someone who can.

Comment 7 by mtkl...@google.com, Sep 12 2016

> Could it be connected to SkRasterPipeline blitter changes?
Probably not.

Can you guys get a perf trace?  The systrace there is stimulating but we'd need to see the stack the whole way down to have a hope of remote diagnosis.  Given the ratio of cpu to wall time spent there inside drawRRect()s, it's probably just some very expensive effect being drawn.  If we sort those out we're probably gold... it looks like the rest of the drawing fits within a millisecond.
Here's what looks like the slow stack to me:

-    6.03%     5.88%  CompositorTileW  libskia.so                  [.] boxBlurInterp
     5.86% start_thread
        base::(anonymous namespace)::ThreadFunc
        base::SimpleThread::ThreadMain
        content::(anonymous namespace)::CategorizedWorkerPoolThread::Run
        content::CategorizedWorkerPool::Run
        content::CategorizedWorkerPool::RunTaskWithLockAcquired
        content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired
        cc::(anonymous namespace)::RasterTaskImpl::RunOnWorkerThread
        cc::OneCopyRasterBufferProvider::RasterBufferImpl::Playback
        cc::OneCopyRasterBufferProvider::PlaybackAndCopyOnWorkerThread
        cc::OneCopyRasterBufferProvider::PlaybackToStagingBuffer
        cc::RasterBufferProvider::PlaybackToMemory
        cc::RasterSource::PlaybackToCanvas
        cc::RasterSource::PlaybackToCanvas
        cc::RasterSource::RasterCommon
        cc::DisplayItemList::Raster
        cc::DrawingDisplayItem::Raster
        SkCanvas::drawPicture
      - SkCanvas::drawPicture
         - 5.57% cc::ImageHijackCanvas::onDrawPicture
              SkCanvas::onDrawPicture
              SkBigPicture::playback
              SkRecordDraw
              SkRecord::visit<SkRecords::Draw&>
              SkRecord::Record::visit<SkRecords::Draw&>
              SkRecords::Draw::operator()<SkRecords::DrawRRect>
              SkRecords::Draw::draw<SkRecords::DrawRRect>
              SkCanvas::drawRRect
              SkNWayCanvas::onDrawRRect
              SkCanvas::drawRRect
            - SkCanvas::onDrawRRect
               - 5.56% SkBitmapDevice::drawRRect
                    SkBitmapDevice::drawPath
                    SkDraw::drawPath
                    SkDraw::drawPath
                    SkDraw::drawDevPath
                    SkMaskFilter::filterPath
                    SkBlurMaskFilterImpl::filterMask
                    SkBlurMask::BoxBlur
                    boxBlurInterp

As far as I can tell, we're blurring this rrect (whatever it is) every keystroke.  It's also re-blurred any time I change the highlighting on the text.   Seems like changes in there are invalidating something quite expensive, even when visibly only the text is changing.

I notice also that SkBlurMask::BoxBlur() and below do not appear to be particularly optimized, unlike, say, the box blurs used by SkBlurImageFilter.  Once we figure out why we're drawing and redrawing and redrawing this blur mask, we might want to see if we can rewrite it to also use Skopts::box_blur_{xx,xy,yx}().
Cc: djkurtz@chromium.org conradlo@chromium.org llozano@chromium.org tipp@chromium.org
[Adding a few folks from Elm too]

reveman@: How do you get the Skia events into tracing? I enabled skia but my profile stack ends at the OneCopyRasterBuffer::PlayBack().

mtklein@: -g doesn't seem to work on Kevin so I can only use perf to get flat profile. How did you get yours/on what platform? Mine looks like following:

    17.92%  chrome           chrome                    [.] _ZL13boxBlurInterpPKhiPhiiibh                                                                                                                                                                                             
     1.89%  chrome           chrome                    [.] _init                                                                                                                                                                                                                     
     1.84%  chrome           chrome                    [.] _ZN14neon_and_crc32L16blit_mask_d32_a8EPjjPKhjjii                                                                                                                                                                         
     1.70%  chrome           chrome                    [.] _ZNK3gpu5gles212ContextState26RestoreTextureUnitBindingsEjPKS1_                                                                                                                                                           
     1.48%  chrome           libc-2.19.so              [.] 0x0005dc4e                                                                                                                                                                                                                
     1.03%  chrome           libc-2.19.so              [.] 0x0005dc66                                                                                                                                                                                                                
     1.02%  chrome           libc-2.19.so              [.] 0x0005dc48                                                                                                                                                                                                                
     0.96%  chrome           [kernel.kallsyms]         [k] 0xffffffc000203e18                                                                                                                                                                                                        
     0.92%  chrome           chrome                    [.] _ZN8tcmalloc11ThreadCache22CreateCacheIfNecessaryEv                                                                                                                                                                       
     0.90%  chrome           libmali.so.0.1.20         [.] 0x000a2b1c                                                                                                                                                                                                                
     0.88%  chrome           chrome                    [.] _ZN8tcmalloc15CentralFreeList18ReleaseListToSpansEPv                                                                                                                                                                      
     0.88%  chrome           chrome                    [.] _ZN8tcmalloc15CentralFreeList11InsertRangeEPvS1_i                                                                                                                                                                         
     0.75%  chrome           chrome                    [.] _ZN8tcmalloc6FL_PopEPPv                                                                                                                                                                                                   
     0.74%  chrome           chrome                    [.] _ZN8tcmalloc15CentralFreeList11RemoveRangeEPPvS2_i   

And the skia symbols seem to be static linked into chrome instead of in libskia.so.
Even without a non-flat profile, you're showing 18% in boxBlurInterp, even worse than Mike's data.

Comment 11 by mtkl...@google.com, Oct 21 2016

I think my profile was likely a desktop x86-64 Linux component build, so the fact that mine sees symbols in libskia.so is likely the anomaly.  As far as I know, Skia is ordinarily statically linked into Chrome on every platform.

I take it that Kevin is an ARMv8 Chromebook?
Yes both Kevin and Elm are ARMv8 Chromebooks, but user space code (including Chrome) is still compiled as 32-bit.

Do we also want to reach out to the Inbox team to see if they can use simpler effect?

Comment 13 by mtkl...@google.com, Oct 21 2016

Wouldn't you like to figure out why Chrome's re-drawing everything instead of just the changes?  This seems like some sort of layering problem.  It never hurts to draw less or simpler, but this doesn't seem like Inbox's fault.
Oh to clarify - we definitely want to understand the re-drawing problem too. 
y=0xb37b7000 "", src_y_stride=551, dst=0xb4016000 "", 
    dst@entry=0xf187be80 " ", radius=radius@entry=10, width=551, height=573, 
    height@entry=-149352784, transpose=true, transpose@entry=false, 
    outer_weight=outer_weight@entry=146 '\222')
    at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284	in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.

Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb37b7000 "", 
    src_y_stride=498, dst=0xb3d4d000 "", radius=5, width=498, height=512, 
    transpose=false, outer_weight=243 '\363')
    at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284	in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.

Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb3d4d000 "", 
    src@entry=0xacc88d21 <SkBlurMaskFilterImpl::filterMask(SkMask*, SkMask const&, SkMatrix const&, SkIPoint*) const> "\370\034M\004F\315\351\003x\315\351\001V\315\351\005\236\203\260\005j\002\257\220F\211F\352\a>j\037Ր\355\002\n\337\355\022z\264\356\347\n\361\356\020\372\001\324\260\356g\n\342hHF", 
    src_y_stride=508, dst=0xb4016000 "", dst@entry=0xf187be80 " ", 
    radius=radius@entry=5, width=508, width@entry=-149352784, height=512, 
    height@entry=-242762096, transpose=transpose@entry=false, 
    outer_weight=outer_weight@entry=243 '\363')
    at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284	in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.

Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb4016000 "", 
    src@entry=0xf187be80 " ", src_y_stride=518, dst=0xb3d4d000 "", 
    dst@entry=0xb37b7000 "", radius=radius@entry=5, width=518, height=512, 
    height@entry=-242762096, transpose=true, transpose@entry=false, 
    outer_weight=outer_weight@entry=243 '\363')
    at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284	in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.

Thread 5 "CompositorTileW" hit Breakpoint 1, boxBlurInterp (src=0xb3d4d000 "", 
    src@entry=0xb37b7000 "", src_y_stride=512, dst=0xb4016000 "", 
    dst@entry=0xf187be80 " ", radius=radius@entry=5, width=512, height=528, 
    height@entry=-242762096, transpose=false, transpose@entry=true, 
    outer_weight=outer_weight@entry=243 '\363')
    at ../../third_party/skia/src/effects/SkBlurMask.cpp:284
284	in ../../third_party/skia/src/effects/SkBlurMask.cpp
(gdb) c
Continuing.

Looks like src_y_stride continues to change in the repeated calls. However I don't know Skia/raster well enough to judge what's happening there...
[I missed some text in the beginning part in comment #15]

I compiled Chrome locally with debugging information and I set a breakpoint at boxBlurInterp. It appears that the boxBlurInterp function is invoked 30 ~ 60 times after a single character is typed.
Every application of a blur mask results in 2-6 calls to boxBlurInterp (https://cs.chromium.org/chromium/src/third_party/skia/src/effects/SkBlurMask.cpp?q=boxBlurInterp&sq=package:chromium&l=569&dr=C).

With those widths/heights/strides this sounds vaguely like compositor tiles being blurred, but they don't quite match?

Sounds like you're doing a fairly large blur (radius 5?) on a large tiled layer (10-15 tiles?)
Cc: -tomhud...@chromium.org
I found sometimes the typing response is pretty fast but sometimes isn't, so I did an experiment to disable small cores and all of a sudden Inbox feels very smooth. Here are some additional performance data I collected:

Octane scores:
2 little cores: ~3000
2 big cores: ~9000

FPS while typing in Inbox:
2 little cores: fluctuating between 6 ~ 19
2 big cores: stable at 19

perf profile:
2 little cores: 16% in boxBlurInterp
2 big cores: 10% in boxBlurInterp

I think if we can pin the CompositorTileW thread to the big core, it will greatly improve the user experience on Inbox.
Pinning of threads to big cores is currently reserved to high priority threads that are critical for producing smooth animations and allowing well behaved pages and apps to run at 60fps.

Using big cores for CompositorTileWorker might improve inbox but it will regress arc++ apps, smooth scrolling and animations. CompositorTileWorker threads are worker threads that perform expensive tasks such as image decode and sw raster. If we pin these threads to big cores then for example decoding of invisible images will be more likely cause jank when scrolling a page. I don't think we want that.
Cc: sonnyrao@chromium.org
For sonnyrao@ and dtor@: 
Will Sonny's patches that disable EAS help here?  If so, how much improvement do we see in this case?

dtor@ - what information is unavailable to the kernel scheduler here to make the right call?  If we are not running an arc++ app and inbox at the same time, why is the right thing not happening - on keyboard input, interactive governor hints should be boosting the CPUs as well as try to schedule more work on the big cores.  Is this not true?
@21 - I believe the interactive governor only boosts CPU frequency, it does not "try to schedule more work on the big cores".

On chat, reveman noted that trace in the original posting was actually taken on Samus, where typing in Inbox is also slow.  So, moving CompositorTileWorker to big cores isn't going to help there.  One theory is that this issue is common to samus & kevin because they both use dsf 2.0.

As a reference point, I tried typing in Inbox on elm (arm64 2x2 big.LITTLE, dsf 1.25) @ [8743.76.0/54.0.2840.79], but Inbox didn't seem slow at all.

Attached is the trace.  I do see ~25-30 ms CompositorTileWorker events, however they do not contain any "skCanvas::drawPicture()" events.

The "InputLatency::Char" trace shows ~50-100 ms latency for most of the key strokes (The largest I saw was one @ 125 ms).
trace_inbox_compose_typing_elm.json.gz
5.8 MB Download
FYI: This is the CL referred to in #21
https://chromium-review.googlesource.com/#/c/384434/

This will bias more towards bigs rather than littles, so may help improve consistency
I tried the EAS patch but it doesn't help Kevin. I used the FPS meter from Chrome devtools as the metrics and it cannot stay at 19 FPS.

For the Elm trace in #22, if you look at the Kernel section near the top you can see that most of the time the CompositorTileW thread is scheduled at the big cores 2/3 (eg starting from 3037.406ms). If I explicitly disable the two big cores typing in Inbox feels sluggish too.

For Kevin, since it has 4 little cores will it have higher chance to schedule the worker thread to little cores than Elm? Also, being a high resolution device won't help either.
Cc: marc...@chromium.org
Labels: -Pri-2 Pri-1
So with the scheduler changes we can ensure that all the important processes are running on the big cores but we still see significant latency on the order of 200 ms for typing.

Here is a trace of this case: https://drive.google.com/a/google.com/file/d/0B3pFZucxX_SvSzZHMVVtQUY0aFk/view?usp=sharing

To me it looks like V8 is the long pole here but it would be good to get additional input from other folks
Cc: pras@google.com
Yeah, this is all limited by V8. I'll see if we are doing something suboptimal while building V8, but in any case I don't think there's a lot of headroom.

Should we tell people to use the android inbox app instead? :)
I took inbox compose traces on gru and elm (1080p screens vs 2K on kevin) just to see and indeed it does seem much better, character latency is generally well under 100ms on both machines -- if I look at the traces it seems like there's significantly less V8 happening on gru/elm and significantly longer compositor tile worker slices on kevin vs gru/elm.

The v8 slices seem to be related to timers, so I'm hypothesizing that the v8 slices are running timers and waiting for the compositor tile workers to finish?  Could it be doing something dumb like busy waiting? 

Maybe pras@ could shed some light on that.

I believe the big chrome tiles would be 4x larger on kevin vs the other two, but the time seems much worse on Kevin -- like 10x worse, so that seems strange.

The function in the trace for the tile workers is OneCopyRasterBuffer::Playback

Here's a folder on drive with various traces
https://drive.google.com/open?id=0B3pFZucxX_Svb1ZpS1lPdThXLTA
sorry should have read earlier comments -- I think the traces in #28 are just showing that even with a good scheduler -- we're still limited by the tile workers doing those blurs, and that V8 cpu time is somehow tied to how long this blur takes.
Cc: bjanakiraman@chromium.org
I tried the experiment with the Jackal compiler vs the old compiler and jackal might be slightly faster but it seems like it's almost within the noise

I put both traces in this folder

https://drive.google.com/open?id=0B3pFZucxX_Svem9uWXFuR2RPT2M

both show char input latencies around 200ms
I also tried the test where I put an external monitor on my Kevin and closed the lid putting it in docked mode and keyboard latency on inbox became reasonable after that.
Wow interesting. Is it due to resolution change or PSR?

Attached is the ddrmon utility that shows the memory traffic amount. Could you try it to see if there is noticeable difference?
ddrmon
506 KB View Download
re #33 -- it's most likely due to lower resolution and matches with the fact that inbox is reasonable on Gru and Elm
Cc: danno@chromium.org
Components: Internals>GPU
GPU Triage: adlr@ is this something that is still a problem / worth investigating? With CrOS now using GPU raster, and at ~2yrs since the last update, seems likely we can just close this out?

Sign in to add a comment