>3% of renderer time spent in NtSetTimerResolution.
Issue description
Based on data from the sampling profiler. From analysis by wittman@: "the top 0.15% of non-idle profiles are responsible for 3/4 of this total. These profiles are seeing 20-100% of their samples in this function."

It appears this is happening here: https://cs.chromium.org/chromium/src/base/message_loop/message_loop.cc?q=activatehighre&sq=package:chromium&l=669&dr=C

Perhaps Windows folks might have some ideas?
Jul 12
Adding hysteresis seems reasonable to me. How often would we want to check if we should lower the frequency? At this point we could raise/lower it multiple times per message loop pump. Any thoughts on reducing this to a maximum of once per pump of the message loop? My guess is that it's quite rare for this to cycle more than once during a pump of the message loop, but it's hard to say. Another option would be tweaking the requirements for a task to be considered high resolution (https://cs.chromium.org/chromium/src/base/message_loop/incoming_task_queue.cc?rcl=1447fae2e377e748f4ddb532a99f6c5ba7319c98&l=69).
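A minimal sketch of what such hysteresis could look like, purely illustrative - the class, method, and constant names are hypothetical, not the actual Chromium message loop code:

// Hypothetical sketch: debounce the process-wide high-resolution timer
// toggle so it is raised at most once per pump and lowered only after it
// has been unneeded for a while, instead of potentially cycling
// timeBeginPeriod/timeEndPeriod several times per pump.
#include <Windows.h>
#include <timeapi.h>
#pragma comment(lib, "winmm.lib")

class HighResTimerVote {
 public:
  // Called once per message loop pump with whether any pending delayed
  // task currently needs ~1 ms precision.
  void OnPump(bool needs_high_res) {
    DWORD now = GetTickCount();
    if (needs_high_res) {
      last_needed_ms_ = now;
      if (!active_) {
        timeBeginPeriod(1);  // Raise the global timer interrupt frequency.
        active_ = true;
      }
    } else if (active_ && now - last_needed_ms_ > kLowerAfterMs) {
      // Hysteresis: only lower the frequency after it has been unneeded
      // for kLowerAfterMs, avoiding rapid raise/lower cycles.
      timeEndPeriod(1);
      active_ = false;
    }
  }

 private:
  static const DWORD kLowerAfterMs = 100;  // Arbitrary illustrative value.
  bool active_ = false;
  DWORD last_needed_ms_ = 0;
};

With this shape, the lowering check runs once per pump, so in the worst case the resolution toggles at most once per pump in each direction.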
Jul 12
@wittman: which use case is this? The only place where I can see a significant amount of time spent is 1.8% on Renderer Main + Periodic Collection + Non-Idle [1], but that's likely because Renderer Main is idle a lot: when not excluding idle I get 0.01% of time spent in NtSetTimerResolution.

[1] https://uma.googleplex.com/p/chrome/callstacks/?sid=9f99e26b1a5a096c8797ae0e0ca37e8b

Whether r574259 affected this is an interesting question, but the UMA sampling profiler hasn't caught up to it (69.0.3489.0) yet (a sampling profiler diff between 69.0.3488.0 and 69.0.3489.0 will be revealing when it comes out).
Jul 12
Yes, this is Renderer Main + Periodic Collection + Non-Idle, but on dev channel. Copying the full analysis from a separate email thread:

Re: NtSetTimerResolution in 69.0.3472.3 dev
https://uma.googleplex.com/p/chrome/callstacks/?sid=72a9fd0e1f56be144926bd7e01ff3474

This looks like a legitimate problem to me, and getting worse over time:

date      NtSetTimerResolution %
20180626  1.03
20180627  1.86
20180628  2.32
20180629  2.98
20180630  3.69
20180701  3.8
20180702  3.8
20180703  4.06
20180704  4.35
20180705  4.81

Breaking down the 4.81% mean execution percentage for 20180705 in the attached graph, the top 0.15% of non-idle profiles are responsible for 3/4 of this total. These profiles are seeing 20-100% of their samples in this function.

---

Taking a look at historical data it appears this problem has been happening since at least mid-April, when we started collecting profiles from the render thread. Some releases on some days are seeing >10% of non-idle samples in NtSetTimerResolution:

dev_version       non_idle_sample_percentage_by_day_of_release
67.0.3396.10-64   1,1,2,3,2,2,2,4,4,7,4,7,3,6,4,5,3,5
67.0.3396.18-64   1,2,2,2,3,4,3,5,4,4,3
67.0.3396.87-64   2,3,2,2,1,2,2
67.0.3396.99-64   2,2,2
68.0.3409.2-64    3,4,4,5,5,5,5,6,7,7,9,8,10,9,8
68.0.3418.2-64    3,4,4,5,8,11,9,11,8
68.0.3423.2-64    2,3,5,5,5,5,6,5,5,6,7,7,8,8,8,8,8,7,9,8,9,8,9
68.0.3432.3-64    3,4,4,4,5,6,6,8,9,10,10,9,9,9,8,8
68.0.3438.3-64    2,3,4,4,5,5,6,7,9,8,8,9,10,9,9,12,10,12
68.0.3440.15-64   1,3,4,5,7,8,7,7,10
68.0.3440.7-64    2,3,4,5,5,5,5,8,9,9,8,10,8,9,9
69.0.3452.0-64    2,3,3,4,4,5,5,4,4,4,4,4,5,5,6,6,6,6,7,6,7,7,9,7,6,7,6
69.0.3464.0-64    2,3,2,3,3,3,4,5,6,6,5,5,6,5,6,6
69.0.3472.3-64    1,2,2,3,4,4,4,4,4,5,4,5,4
Jul 13
How interesting is "% of non-idle" though? Can't that vary if the denominator (the idle vs. non-idle split) changes? What's the "top 0.15% of non-idle profiles"? Also, I'm trying to make sense of that graph but I'm probably lacking nomenclature knowledge (what exactly is a profile, an execution, etc. in this world)?

From my end, it looks like the CLs for crbug.com/854237 mostly landed after the implicated time range, so I'll yield back to Bruce. Thanks!
Jul 13
The important thing the graph is saying is that for a non-trivial percentage of active execution on the thread (x axis), large parts of that execution are occurring in NtSetTimerResolution (color mapping).

Definitions:
- profile: a set of 300 stack samples collected from a single client on a single thread over a period of 30 seconds, sampling at 10Hz
- non-idle profile: a profile with at least one sample not idling in the message loop. This is a crude but straightforward way of identifying profiles occurring during times the thread was 'active', in an attempt to avoid consideration of uninteresting execution. It's intended to provide an 'order-of-magnitude accurate' measure of activity.
- percentage of execution (per profile): # of samples recorded in NtSetTimerResolution / 300 samples
- top X% of non-idle profiles: the X% of non-idle profiles with the highest percentage of execution time in NtSetTimerResolution

The graph shows contribution to the 4.81% mean execution percentage of NtSetTimerResolution (y axis) broken down by percentage of execution observed in NtSetTimerResolution (color mapping) and fraction of non-idle profiles seen with that percentage of execution (x axis).

The % of non-idle varies by 10% over the time period above, so it has minimal effect. The order-of-magnitude accuracy of the non-idle profile distinction likely has a much bigger effect, but if anything this significantly understates the severity of the problem since it's an overestimate of true thread activity.
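To make the per-profile measure concrete: a profile in which 60 of its 300 samples landed in NtSetTimerResolution has an execution percentage of 60 / 300 = 20%, which is the low end of the 20-100% range seen in the top 0.15% of non-idle profiles.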
Jul 17
I just ran a quick test on my workstation, on Chrome Stable (M67). During 6 s of normal usage I see 293 requests to change the timer frequency. I have 58 different chrome.exe processes running and the requests come from 18 of them. Since Windows 10 and WPA now support thread names I can see that the Compositor and CrRendererMain threads account for the majority of the timer frequency adjustments. Here are the stats by thread type:
32 - CrBrowserMain
19 - Chrome_IOThread
40 - CrGpuMain
94 - CrRendererMain
108 - Compositor

I then opened the browser to about:blank and did a proper idle test - much more scientific. Over a 10 second period Chrome requested timer frequency changes 139 times from 14 different processes, from these two thread types:
12 - Chrome_IOThread
127 - CrRendererMain

I don't know why 14 different hidden tabs are keeping the CrRendererMain thread so busy, but that suggests one possible improvement - never raise the timer interrupt frequency when not visible, at least not from CrRendererMain.

Disclaimer: there is no guarantee that the pathological behavior that some customers are seeing is related to what I am seeing. Counterpoint: fixing the behavior I am seeing might be easy and productive even if it isn't the same cause.

Methodology: I recorded a trace with UIforETW with call stacks turned off (to reduce overhead). In the Generic Events table I changed the view to Randomascii System Time Resolution and then drilled down into Microsoft-Windows-Kernel-Power -> SystemTimeResolutionChange -> Process Name. I added the ThreadName column, turned off the ThreadId column, and grouped by various combinations of Thread Name and Pid (Field 2) to gather this data.
Jul 17
It occurred to me that we might be seeing some degree of synchronization bias in this data, although it seems unlikely this can explain the full effect of what's happening.

The sampling profiler is indirectly dependent on the timer tick for scheduling its samples, via PostDelayedTask. Raising the timer frequency would tend to have the effect of causing the next sample to happen sooner due to the increased precision. If the desired sample time had already passed but the sample had not yet been taken because it was waiting for the next low-resolution timer tick, then the sample would occur on the first high-resolution tick after changing the resolution. Assuming the first high-resolution tick is equally likely to occur any time within the 1ms period after the resolution is raised by NtSetTimerResolution, then the chance of the next sample being in that function would be dependent on how long NtSetTimerResolution continues to execute after increasing the timer resolution. (Or alternately, would be dependent on the OS taking the thread off CPU when calling NtSetTimerResolution.)

Do we have some idea how long NtSetTimerResolution takes to execute, or if there's some mechanism by which the calling thread is more likely to be context switched during its execution?

(Issue 567194 covers synchronization bias with the profiler. I'll update the bug to reference this scenario.)
Jul 17
Note: I wrote this up before comment #9 :-), but yes, I also wonder about synchronization bias. This data may all be a statistical anomaly.

If the sampling profiler is waking up periodically after sleeping with Sleep(n) or WaitForSingleObject(h, n), then it is being woken up by the timer interrupt, which is what NtSetTimerResolution controls. So the expense of this function may be artificial - just a statistical anomaly triggered by a timing correlation between sampling and the calls to NtSetTimerResolution.

As some evidence for that I grabbed another idle trace, this time with high-speed (8 kHz) ETW sampling, with ~58 chrome processes. I saw 22,112 samples (representing ~2.67 s of CPU time) in the trace and only 35 samples (~4.2 ms) were in NtSetTimerResolution.

So...
- The large number of samples in NtSetTimerResolution could be an artifact of the sampling times being correlated with when we call that function
- The large number of samples could also be indicative of some pathological situation where we change the frequency in a loop
- The large number of samples could also indicate that NtSetTimerResolution is very slow on some machines
- Changing the frequency frequently is wasteful regardless, especially when it is being done from CrRendererMain in an invisible process

So, some action items (probably for me):
- Measure the cost of NtSetTimerResolution (but maybe it varies depending on machine type?)
- Add code to detect and report on excessive changes of the frequency, or the frequency of frequency changes, or something like that (a rough sketch follows below)
- Add code to discourage or prevent raising the timer frequency in hidden tabs
- Rethink the frequency raising logic given that it takes a little while to take effect

Maybe wittman@ can devise some local experiment to see if his sampling profiler reports more time spent in NtSetTimerResolution than other measures such as ETW or elapsed-time-wrappers show.

I'll take this bug for now for my action items.
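For the detection action item, a rough sketch of what this could look like - the names and thresholds here are made up for illustration, not existing Chromium code:

// Hypothetical sketch: count timer resolution changes per time window so
// an excessive rate can be flagged (and, in real code, reported via a
// metric). kWindowMs and kMaxChangesPerWindow are illustrative values.
#include <Windows.h>

class TimerResolutionChangeMonitor {
 public:
  // Call whenever the process raises or lowers the timer resolution.
  // Returns true if the change rate in the current window looks excessive.
  bool RecordChange() {
    DWORD now = GetTickCount();
    if (now - window_start_ms_ > kWindowMs) {
      // Window elapsed - this is where a count-per-window metric could be
      // emitted before resetting.
      window_start_ms_ = now;
      changes_in_window_ = 0;
    }
    ++changes_in_window_;
    return changes_in_window_ > kMaxChangesPerWindow;
  }

 private:
  static const DWORD kWindowMs = 1000;          // One-second window.
  static const int kMaxChangesPerWindow = 100;  // Illustrative threshold.
  DWORD window_start_ms_ = 0;
  int changes_in_window_ = 0;
};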
Jul 17
Bruce, see MessageLoop::SetTimerSlack(). Currently it's a no-op on non-Mac platforms, but the goal is to increase the "slack" (lack of high precision) on timers in some situations. I think it's already invoked from backgrounded renderers.

For Windows we probably want this to affect the process-wide bit instead of the per-thread MessageLoop, but I suspect the required hooks can be dropped in the same place :). We should also have MessageLoop::SetTimerSlack() make it such that MessageLoops that are never high priority also never trigger a vote to activate Windows hi-res timers.
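A sketch of how the suggested Windows hook might be shaped - only MessageLoop::SetTimerSlack() itself comes from the comment above; the surrounding class, enum, and method names are hypothetical:

// Hypothetical sketch: a loop that has been given maximum timer slack
// (e.g. a backgrounded renderer) never votes to raise the global Windows
// timer resolution.
enum TimerSlack { TIMER_SLACK_NONE, TIMER_SLACK_MAXIMUM };

class MessageLoopTimerPolicy {
 public:
  void SetTimerSlack(TimerSlack slack) { slack_ = slack; }

  // Called wherever the loop currently decides whether to activate the
  // high-resolution timer for a soon-to-fire delayed task.
  bool ShouldVoteForHighResolution(bool task_wants_high_res) const {
    if (slack_ == TIMER_SLACK_MAXIMUM)
      return false;  // Backgrounded: never request the process-wide bit.
    return task_wants_high_res;
  }

 private:
  TimerSlack slack_ = TIMER_SLACK_NONE;
};

The point of this shape is that the slack setting gates the vote itself, so loops that never need high precision never touch the process-wide resolution at all.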
Sep 28
I wrote a simple test program to measure the cost of calling timeBeginPeriod/timeEndPeriod. On my workstation it can do ~720,000 matched pairs of these calls per second, and over 800,000 when run as a 64-bit process. On a Surface Pro laptop it was much slower, but it could still do at least 120,000 calls in a second. I grabbed some ETW traces to see where the time is spent (flame graph attached) but they didn't reveal anything useful.
It's still unclear what is happening here. The likely possibilities are:
1) On some machines NtSetTimerResolution is much more expensive - could be, especially on VMs perhaps?
2) In some situations NtSetTimerResolution is called too often - 3% of a core would take ~3,600 calls per second (rough arithmetic below), 3% of non-idle time would take less, so, maybe?
3) Maybe this is synchronization bias
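(One way to arrive at that ~3,600 figure, assuming the slower Surface Pro rate of roughly 120,000 calls per second: 3% of 120,000 is 3,600, i.e. burning 3% of a core at that per-call cost corresponds to about 3,600 calls per second.)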
I guess I'm just repeating my thoughts from comment #10. Here's my sample code:
#include <Windows.h>
#include <timeapi.h>
#include <inttypes.h>
#include <stdio.h>

#pragma comment(lib, "winmm.lib")

// How long to run the test, in milliseconds.
const DWORD kDuration = 1000;

int main() {
  DWORD start = GetTickCount();
  int64_t count = 0;
  for (;;) {
    ++count;
    // Raise and then restore the timer interrupt frequency.
    timeBeginPeriod(1);
    timeEndPeriod(1);
    DWORD elapsed = GetTickCount() - start;
    if (elapsed > kDuration)
      break;
  }
  printf("Did %lld pairs of timer period changes in %lu ms.\n", count, kDuration);
}
Oct 11
Possibly relevant - bug 892113 (Googler-visible only) records timeBeginPeriod taking multiple seconds to return. The reason why is not clear yet, but if this happens to some percentage of customers then it could explain the sampling results, maybe.
Comment 1 by brucedaw...@chromium.org, Jul 12
Labels: -Pri-1 Pri-2