New issue
Advanced search Search tips

Issue 746518 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocked on:
issue 598854
issue crashpad:192



Sign in to add a comment

FastSetActiveURL isn't always fast

Project Member Reported by liber...@chromium.org, Jul 19 2017

Issue description

this call accounts for 3 CPU profiler samples during video playback:

https://cs.chromium.org/chromium/src/gpu/ipc/service/gpu_command_buffer_stub.cc?rcl=daf2a068f3270585fe91baed4608078613a49b76&l=982

assigning to me until i find an owner for it.
 
On one trace during our power hackathon (2017-07-19_10-16-40 144p_silent.etl) I see 23 samples out of 467 in the GPU process inside of gpu::`anonymous namespace'::FastSetActiveURL, mostly in base::debug::SetCrashKeyValue:

  |    |    |    |    |    |    |    |    |    |    |    |- chrome_child.dll!gpu::`anonymous namespace'::FastSetActiveURL
  |    |    |    |    |    |    |    |    |    |    |    |    chrome_child.dll!ui::GpuService::SetActiveURL
  |    |    |    |    |    |    |    |    |    |    |    |    |- chrome_child.dll!base::debug::SetCrashKeyValue

There may have been four other samples that were misattributed due to stack walking errors - I can't otherwise explain why there were 26 samples in base::debug::SetCrashKeyValue.

Either way this looks like more expense than we want to pay, and an easy win.

It's possible that this was related to some other debugging code, now removed in change https://chromium-review.googlesource.com/c/578066.

This is completely unrelated to https://chromium-review.googlesource.com/c/578066
I thought the changes in https://chromium-review.googlesource.com/c/578066 are down the stack from FastSetActiveURL. That change directly removes some code calling base::debug::SetCrashKeyValue, which is shown in bruce's trace to be in a subcall from FastSetActiveURL. 

Comment 4 by crouleau@google.com, Jul 25 2017

Cc: liber...@chromium.org
Owner: brucedaw...@chromium.org
Status: Assigned (was: Available)
Actually, maybe I'm wrong. Bruce is looking into this.

Comment 5 by piman@chromium.org, Aug 10 2017

Components: Internals>CrashReporting
FastSetActiveURL is super important to be able to associate GPU process crash reports to the URL that generated the commands that are associated with the crash. Without a URL, GPU crash reports become even less actionable (it's hard enough already).

Is there a way to optimize SetCrashKeyValue? I don't think our use case fundamentally requires string processing on every call (the key is constant, and the value is one-of-N where the set of possible values changes rarely, e.g. when opening/closing a tab). Maybe there's a different way of storing the data?
I went to investigate this again and couldn't find FastSetActiveURL. Apparently the compiler change has altered inlining or something like that, and now it shows up under chrome_child.dll!viz::GpuServiceImpl::SetActiveURL. Searching on SetActiveURL or SetCrashKeyValue in the CPU Sampled data is probably the robust thing to do.

Acknowledged that we shouldn't plan on removing this function. I suspect that it can be optimized.

I've pasted in some CPU sampling data below. The left column is the number of samples that went through that call stack. In this trace there were 654 samples going through SetActiveURL, five in the function itself (see the last line) and 649 going into SetCrashKeyValue, and downwards from there:

654	chrome_child.dll!viz::GpuServiceImpl::SetActiveURL							
649	|- chrome_child.dll!base::debug::SetCrashKeyValue							
305	|    |- chrome_child.dll!child_process_logging::`anonymous namespace'::ClearCrashKeyValueTrampoline							
214	|    |    |- chrome_elf.dll!ClearCrashKeyValueImpl							
159	|    |    |    |- chrome_elf.dll!crashpad::TSimpleStringDictionary<256,256,64>::RemoveKey							
124	|    |    |    |    |- chrome_elf.dll!strncmp							
35	|    |    |    |    |- chrome_elf.dll!crashpad::TSimpleStringDictionary<256,256,64>::RemoveKey<itself>							
37	|    |    |    |- chrome_elf.dll!base::UTF16ToUTF8							
11	|    |    |    |- chrome_elf.dll!wcslen							
6	|    |    |    |- chrome_elf.dll!ClearCrashKeyValueImpl<itself>							
1	|    |    |    |- chrome_elf.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string							
55	|    |    |- chrome_child.dll!base::UTF8ToWide							
27	|    |    |- chrome_child.dll!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy							
9	|    |    |- chrome_child.dll!child_process_logging::`anonymous namespace'::ClearCrashKeyValueTrampoline<itself>							
160	|    |- chrome_child.dll!base::StringPrintf							
127	|    |- chrome_child.dll!child_process_logging::`anonymous namespace'::SetCrashKeyValueTrampoline							
88	|    |    |- chrome_elf.dll!SetCrashKeyValueImpl							
30	|    |    |- chrome_child.dll!base::UTF8ToWide							
6	|    |    |- chrome_child.dll!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy							
3	|    |    |- chrome_child.dll!child_process_logging::`anonymous namespace'::SetCrashKeyValueTrampoline<itself>							
32	|    |- chrome_child.dll!base::debug::ChunkCrashKeyValue							
11	|    |- chrome_child.dll!base::debug::SetCrashKeyValue<itself>							
8	|    |- chrome_child.dll!base::debug::LookupCrashKey							
3	|    |- chrome_child.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string							
2	|    |    |- ntdll.dll!RtlFreeHeap							
1	|    |    |- chrome_child.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string<itself>							
3	|    |- chrome_child.dll!std::allocator<v8::CpuProfileDeoptInfo>::deallocate							
5	|- chrome_child.dll!viz::GpuServiceImpl::SetActiveURL<itself>							

ClearCrashKeyValueTrampoline seems quite expensive, and all of the string manipulations (wcslen, strncmp, UTF8ToWide, UTF16ToUTF8) seems unfortunate and perhaps avoidable. I'm posting this information in case anyone else has any ideas, but I plan to try optimizing this sometime.

For context, the 654 samples were out of a total of 13,140 samples in the GPU process during playback of a 144p video.
Some observations:

SetCrashKeyValueTrampoline and ClearCrashKeyValueTrampoline both take char* based strings, convert them to wchar_t* based strings, then call SetCrashKeyValueImpl and ClearCrashKeyValueImpl respectively. The two Impl functions then convert the wchar_t* based strings back to char* based strings. The two conversions can be avoided, thus getting rid of the UTF16ToUTF8, wcslen, UTF8ToWide costs, plus some basic_string<wchar_t> constructor/destructor overhead. This accounts for at least 25% of the samples.

The kActiveURL crash key ("url-chunk") is configured to use kLargeSize which means that it has 16 chunks. The unused chunks are always cleared by base::debug::SetCrashKeyValue. On the particular test case that we are using the URLs alternate between fitting into one and two chunks. This means that 14 or 15 chunks are cleared, which explains why ClearCrashKeyValueTrampoline consumes almost half of the CPU time. If you add in the cost of the 15 or 16 StringPrintf calls associated with ClearCrashKeyValueTrampoline we see that clearing the excess chunks accounts for well over half of the CPU time. Does anybody know if clearing the unused chunks is actually necessary? Intuitively I would guess that clearing one unused chunk would be sufficient, but I am speculating.

There is some double counting going on here so if we avoided both of these costs but even then if we could avoid these costs we would probably quadruple the performance of this function.

ChunkCrashKeyValue() looks like it could be slimmed down a bit in order to avoid creating so many temporary strings. Some combination of chunks.reserve(), std::move, and not creating value_string should reduce its cost but would still leave it allocating a vector of strings. It could instead fill in an array of StringPiece objects, thus avoiding any allocations or string copies. It's not a super expensive function but ...?

The one remaining cost that bothers me is the StringPrintf calls in SetCrashKeyValue. I'm not thrilled with that design but I'm not sure I want to try changing it. Avoiding the full clearing loop would mitigate that significantly, and printing to a local array would also help, and that's probably good enough.

The TL;DR is that it should be practical to make SetActiveURL significantly faster, perhaps fast enough that it will disappear into the background noise. We should still be careful not to use it to excess.

Labels: Performance-Power
Owner: rsesek@chromium.org
Conveniently enough there are already plans to rework this system. As such I am reassigning this bug, since it should be resolved as part of those efforts. The design doc is here:

https://docs.google.com/document/d/1IeWeuvwhYqEK66V4-4TChU6UmdFbuEr5lgu-WvlLlOA/edit

Comment 9 by rsesek@chromium.org, Aug 16 2017

Blockedon: crashpad:192
Thanks for the insights, Bruce. A couple of thoughts:

The string conversions do seem unnecessary. It looks like those were added by Scott in https://codereview.chromium.org/1517503004, but I'm not sure why the wchar_t variants were added instead of exposing const char* ones instead. I think it's because that's how breakpad_win.cc exports them?

Both Set and Clear are more expensive than they need to be because of the linear scan of the crash key table. This will be going away in v2.

The chunking will also be going away with v2, which will also remove the StringPrintfs. All the chunks do need to be cleared, though, otherwise e.g. if a shorter URL replaces a longer active URL, then garbage data will be appended when the chunks are concatenated on the server.
I talked to the syzygy team about changing the signature of SetCrashKeyValueImpl/ClearCrashKeyValueImpl and they think it should be fine. I have an initial pass at changing those at crrev.com/c/616321 (needs double-checking). Is that change still relevant?

> All the chunks do need to be cleared

Can't the URL reconstruction code be made smart enough to stop when it hits a blank chunk? It sounds like this will be irrelevant soon so I guess it doesn't really matter.

> I have an initial pass at changing those at crrev.com/c/616321 (needs double-checking). Is that change still relevant?

Yes, that change is probably worth pursuing since it already exists and looks like it would cut down on some of the more significant samples.

> Can't the URL reconstruction code be made smart enough to stop when it hits a blank chunk? It sounds like this will be irrelevant soon so I guess it doesn't really matter.

I suppose it could, yes. But I'm trying to get v2 done by EOQ so it's probably not pursuing right now.
Project Member

Comment 12 by bugdroid1@chromium.org, Aug 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/1853e218a9f3634c89f476f0b5aad4b8a87eca08

commit 1853e218a9f3634c89f476f0b5aad4b8a87eca08
Author: Bruce Dawson <brucedawson@chromium.org>
Date: Fri Aug 18 05:36:19 2017

Add char* versions of *CrashKeyValueImpl* funcs

The wchar_t* versions of SetCrashKeyValueImpl and ClearCrashKeyValueImpl
mean that base::debug::SetCrashKeyValue has to convert keys and values
from char* to wchar_t* and then immediately back. This accounts for
roughly 25% of the cost of base::debug::SetCrashKeyValue and helps make
it expensive enough that it shows up on profiles. This CL adds 'Ex'
versions of these functions which avoids the conversions. This will make
a 144p-silent-video test case use about 1% less CPU in the GPU process.
The remaining costs of base::debug::SetCrashKeyValue are being addressed
in a separate project.

Bug:  746518 
Cq-Include-Trybots: master.tryserver.chromium.win:win10_chromium_x64_rel_ng
Change-Id: Iedf0c5efef2142bf522b3fa7275e616b4cca2b7b
Reviewed-on: https://chromium-review.googlesource.com/616321
Reviewed-by: Scott Graham <scottmg@chromium.org>
Reviewed-by: Justin Schuh <jschuh@chromium.org>
Commit-Queue: Bruce Dawson <brucedawson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#495472}
[modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/chrome/common/child_process_logging_win.cc
[modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/chrome_elf/chrome_elf.def
[modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/components/crash/content/app/crashpad.cc

Issue 599250 is a related bug where I tried to optimize calls to FastSetActiveURL to avoid setting internal URLs like chrome://gpu/GpuProcessTransportFactory::CreateContextCommon.
WIP fix: https://chromium-review.googlesource.com/c/chromium/src/+/728707

Comment 14 by piman@chromium.org, Nov 10 2017

Cc: sunn...@chromium.org piman@chromium.org vmi...@chromium.org
 Issue 781983  has been merged into this issue.
From the merged bug: the UMA Sampling Profiler is collecting up-to-date, somewhat higher resolution profiling information on this in the GPU process:

https://uma.googleplex.com/p/chrome/callstacks?sid=a433e6687829f40f45da2c4d7616df5d

Note that almost half the runtime associated with setting crash key values is coming from code paths outside of SetActiveURL.
Blockedon: 598854
The existing crash keys system should be replaced hopefully within the next week or two.
Status: Fixed (was: Assigned)
This should be fixed by https://chromium.googlesource.com/chromium/src.git/+/f88863f430c77b1261531cbc6c4320481d4caf3d.

Sign in to add a comment