FastSetActiveURL isn't always fast |
||||||||
Issue descriptionthis call accounts for 3 CPU profiler samples during video playback: https://cs.chromium.org/chromium/src/gpu/ipc/service/gpu_command_buffer_stub.cc?rcl=daf2a068f3270585fe91baed4608078613a49b76&l=982 assigning to me until i find an owner for it.
,
Jul 25 2017
This is completely unrelated to https://chromium-review.googlesource.com/c/578066
,
Jul 25 2017
I thought the changes in https://chromium-review.googlesource.com/c/578066 are down the stack from FastSetActiveURL. That change directly removes some code calling base::debug::SetCrashKeyValue, which is shown in bruce's trace to be in a subcall from FastSetActiveURL.
,
Jul 25 2017
Actually, maybe I'm wrong. Bruce is looking into this.
,
Aug 10 2017
FastSetActiveURL is super important to be able to associate GPU process crash reports to the URL that generated the commands that are associated with the crash. Without a URL, GPU crash reports become even less actionable (it's hard enough already). Is there a way to optimize SetCrashKeyValue? I don't think our use case fundamentally requires string processing on every call (the key is constant, and the value is one-of-N where the set of possible values changes rarely, e.g. when opening/closing a tab). Maybe there's a different way of storing the data?
,
Aug 15 2017
I went to investigate this again and couldn't find FastSetActiveURL. Apparently the compiler change has altered inlining or something like that, and now it shows up under chrome_child.dll!viz::GpuServiceImpl::SetActiveURL. Searching on SetActiveURL or SetCrashKeyValue in the CPU Sampled data is probably the robust thing to do. Acknowledged that we shouldn't plan on removing this function. I suspect that it can be optimized. I've pasted in some CPU sampling data below. The left column is the number of samples that went through that call stack. In this trace there were 654 samples going through SetActiveURL, five in the function itself (see the last line) and 649 going into SetCrashKeyValue, and downwards from there: 654 chrome_child.dll!viz::GpuServiceImpl::SetActiveURL 649 |- chrome_child.dll!base::debug::SetCrashKeyValue 305 | |- chrome_child.dll!child_process_logging::`anonymous namespace'::ClearCrashKeyValueTrampoline 214 | | |- chrome_elf.dll!ClearCrashKeyValueImpl 159 | | | |- chrome_elf.dll!crashpad::TSimpleStringDictionary<256,256,64>::RemoveKey 124 | | | | |- chrome_elf.dll!strncmp 35 | | | | |- chrome_elf.dll!crashpad::TSimpleStringDictionary<256,256,64>::RemoveKey<itself> 37 | | | |- chrome_elf.dll!base::UTF16ToUTF8 11 | | | |- chrome_elf.dll!wcslen 6 | | | |- chrome_elf.dll!ClearCrashKeyValueImpl<itself> 1 | | | |- chrome_elf.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string 55 | | |- chrome_child.dll!base::UTF8ToWide 27 | | |- chrome_child.dll!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy 9 | | |- chrome_child.dll!child_process_logging::`anonymous namespace'::ClearCrashKeyValueTrampoline<itself> 160 | |- chrome_child.dll!base::StringPrintf 127 | |- chrome_child.dll!child_process_logging::`anonymous namespace'::SetCrashKeyValueTrampoline 88 | | |- chrome_elf.dll!SetCrashKeyValueImpl 30 | | |- chrome_child.dll!base::UTF8ToWide 6 | | |- chrome_child.dll!std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >::_Tidy 3 | | |- chrome_child.dll!child_process_logging::`anonymous namespace'::SetCrashKeyValueTrampoline<itself> 32 | |- chrome_child.dll!base::debug::ChunkCrashKeyValue 11 | |- chrome_child.dll!base::debug::SetCrashKeyValue<itself> 8 | |- chrome_child.dll!base::debug::LookupCrashKey 3 | |- chrome_child.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string 2 | | |- ntdll.dll!RtlFreeHeap 1 | | |- chrome_child.dll!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::~basic_string<itself> 3 | |- chrome_child.dll!std::allocator<v8::CpuProfileDeoptInfo>::deallocate 5 |- chrome_child.dll!viz::GpuServiceImpl::SetActiveURL<itself> ClearCrashKeyValueTrampoline seems quite expensive, and all of the string manipulations (wcslen, strncmp, UTF8ToWide, UTF16ToUTF8) seems unfortunate and perhaps avoidable. I'm posting this information in case anyone else has any ideas, but I plan to try optimizing this sometime. For context, the 654 samples were out of a total of 13,140 samples in the GPU process during playback of a 144p video.
,
Aug 15 2017
Some observations:
SetCrashKeyValueTrampoline and ClearCrashKeyValueTrampoline both take char* based strings, convert them to wchar_t* based strings, then call SetCrashKeyValueImpl and ClearCrashKeyValueImpl respectively. The two Impl functions then convert the wchar_t* based strings back to char* based strings. The two conversions can be avoided, thus getting rid of the UTF16ToUTF8, wcslen, UTF8ToWide costs, plus some basic_string<wchar_t> constructor/destructor overhead. This accounts for at least 25% of the samples.
The kActiveURL crash key ("url-chunk") is configured to use kLargeSize which means that it has 16 chunks. The unused chunks are always cleared by base::debug::SetCrashKeyValue. On the particular test case that we are using the URLs alternate between fitting into one and two chunks. This means that 14 or 15 chunks are cleared, which explains why ClearCrashKeyValueTrampoline consumes almost half of the CPU time. If you add in the cost of the 15 or 16 StringPrintf calls associated with ClearCrashKeyValueTrampoline we see that clearing the excess chunks accounts for well over half of the CPU time. Does anybody know if clearing the unused chunks is actually necessary? Intuitively I would guess that clearing one unused chunk would be sufficient, but I am speculating.
There is some double counting going on here so if we avoided both of these costs but even then if we could avoid these costs we would probably quadruple the performance of this function.
ChunkCrashKeyValue() looks like it could be slimmed down a bit in order to avoid creating so many temporary strings. Some combination of chunks.reserve(), std::move, and not creating value_string should reduce its cost but would still leave it allocating a vector of strings. It could instead fill in an array of StringPiece objects, thus avoiding any allocations or string copies. It's not a super expensive function but ...?
The one remaining cost that bothers me is the StringPrintf calls in SetCrashKeyValue. I'm not thrilled with that design but I'm not sure I want to try changing it. Avoiding the full clearing loop would mitigate that significantly, and printing to a local array would also help, and that's probably good enough.
The TL;DR is that it should be practical to make SetActiveURL significantly faster, perhaps fast enough that it will disappear into the background noise. We should still be careful not to use it to excess.
,
Aug 16 2017
Conveniently enough there are already plans to rework this system. As such I am reassigning this bug, since it should be resolved as part of those efforts. The design doc is here: https://docs.google.com/document/d/1IeWeuvwhYqEK66V4-4TChU6UmdFbuEr5lgu-WvlLlOA/edit
,
Aug 16 2017
Thanks for the insights, Bruce. A couple of thoughts: The string conversions do seem unnecessary. It looks like those were added by Scott in https://codereview.chromium.org/1517503004, but I'm not sure why the wchar_t variants were added instead of exposing const char* ones instead. I think it's because that's how breakpad_win.cc exports them? Both Set and Clear are more expensive than they need to be because of the linear scan of the crash key table. This will be going away in v2. The chunking will also be going away with v2, which will also remove the StringPrintfs. All the chunks do need to be cleared, though, otherwise e.g. if a shorter URL replaces a longer active URL, then garbage data will be appended when the chunks are concatenated on the server.
,
Aug 16 2017
I talked to the syzygy team about changing the signature of SetCrashKeyValueImpl/ClearCrashKeyValueImpl and they think it should be fine. I have an initial pass at changing those at crrev.com/c/616321 (needs double-checking). Is that change still relevant? > All the chunks do need to be cleared Can't the URL reconstruction code be made smart enough to stop when it hits a blank chunk? It sounds like this will be irrelevant soon so I guess it doesn't really matter.
,
Aug 16 2017
> I have an initial pass at changing those at crrev.com/c/616321 (needs double-checking). Is that change still relevant? Yes, that change is probably worth pursuing since it already exists and looks like it would cut down on some of the more significant samples. > Can't the URL reconstruction code be made smart enough to stop when it hits a blank chunk? It sounds like this will be irrelevant soon so I guess it doesn't really matter. I suppose it could, yes. But I'm trying to get v2 done by EOQ so it's probably not pursuing right now.
,
Aug 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/1853e218a9f3634c89f476f0b5aad4b8a87eca08 commit 1853e218a9f3634c89f476f0b5aad4b8a87eca08 Author: Bruce Dawson <brucedawson@chromium.org> Date: Fri Aug 18 05:36:19 2017 Add char* versions of *CrashKeyValueImpl* funcs The wchar_t* versions of SetCrashKeyValueImpl and ClearCrashKeyValueImpl mean that base::debug::SetCrashKeyValue has to convert keys and values from char* to wchar_t* and then immediately back. This accounts for roughly 25% of the cost of base::debug::SetCrashKeyValue and helps make it expensive enough that it shows up on profiles. This CL adds 'Ex' versions of these functions which avoids the conversions. This will make a 144p-silent-video test case use about 1% less CPU in the GPU process. The remaining costs of base::debug::SetCrashKeyValue are being addressed in a separate project. Bug: 746518 Cq-Include-Trybots: master.tryserver.chromium.win:win10_chromium_x64_rel_ng Change-Id: Iedf0c5efef2142bf522b3fa7275e616b4cca2b7b Reviewed-on: https://chromium-review.googlesource.com/616321 Reviewed-by: Scott Graham <scottmg@chromium.org> Reviewed-by: Justin Schuh <jschuh@chromium.org> Commit-Queue: Bruce Dawson <brucedawson@chromium.org> Cr-Commit-Position: refs/heads/master@{#495472} [modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/chrome/common/child_process_logging_win.cc [modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/chrome_elf/chrome_elf.def [modify] https://crrev.com/1853e218a9f3634c89f476f0b5aad4b8a87eca08/components/crash/content/app/crashpad.cc
,
Oct 25 2017
Issue 599250 is a related bug where I tried to optimize calls to FastSetActiveURL to avoid setting internal URLs like chrome://gpu/GpuProcessTransportFactory::CreateContextCommon. WIP fix: https://chromium-review.googlesource.com/c/chromium/src/+/728707
,
Nov 10 2017
Issue 781983 has been merged into this issue.
,
Nov 10 2017
From the merged bug: the UMA Sampling Profiler is collecting up-to-date, somewhat higher resolution profiling information on this in the GPU process: https://uma.googleplex.com/p/chrome/callstacks?sid=a433e6687829f40f45da2c4d7616df5d Note that almost half the runtime associated with setting crash key values is coming from code paths outside of SetActiveURL.
,
Nov 10 2017
The existing crash keys system should be replaced hopefully within the next week or two.
,
Dec 21 2017
This should be fixed by https://chromium.googlesource.com/chromium/src.git/+/f88863f430c77b1261531cbc6c4320481d4caf3d. |
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by brucedaw...@chromium.org
, Jul 20 2017