New issue
Advanced search Search tips

Issue 751340 link

Starred by 4 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 3
Type: Bug

Blocking:
issue 704794



Sign in to add a comment

Improve distribution of BeginFrame latency with D3DVsync experiment

Project Member Reported by stanisc@chromium.org, Aug 2 2017

Issue description

The goal of this is to investigate and improve distribution of Scheduling.Browser.BeginImplFrameLatency2 and Scheduling.Renderer.BeginImplFrameLatency2 UMA with D3DVsync experiment enabled

Here is what I currently see when comparing these metrics between the experiment enabled group and the control group

1) GPU v-sync improves mean and median latency by about 200-400 us which is great

2) But at 90% percentile GPU V-sync latency gets worse than the delay based v-sync, which is especially noticeable at [3,000 - 14,000] us range

3) After ~14,000 us there is an artificially looking drop which I suspect is caused by the DWM timestamp adjustment formula

4) GPU V-sync latency has a noticeably larger [0 - 1] us bucket which could be attributed to either negative latency due to incorrect DWM timestamp adjustment or low resolution time

5) GPU V-sync latency distribution has a longer tail which even shows a small increase after ~1 sec. About 0.24% of all samples fall into >1 sec buckets.
 
Project Member

Comment 1 by bugdroid1@chromium.org, Aug 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e2c169757c7cefab9081da1c332d8886621ea4b2

commit e2c169757c7cefab9081da1c332d8886621ea4b2
Author: Stanislav Chiknavaryan <stanisc@chromium.org>
Date: Wed Aug 02 22:21:50 2017

Increase GPU v-sync worker thread priority to reduce v-sync latency

This applies only to code running under D3DVsync experiment.

My experiments show that increasing the GPU V-sync worker thread priority has 
a noticeable impact on latency on the GPU process side, even before the v-sync
signal is sent over IPC to the browser.

The tables below show the deltas (in us) between the adjusted DWM v-blank timestamp
and the now() time when on the thread wakeup on v-blank.
This was tested on a consumer grade DELL XPS laptop with 4 cores.

With the current DISPLAY priority the distribution looks like this:
< 0       9.8%
0-99      78.7%
100-199   4.8%
200-499   1.7%
500-999   2.2%
1000-1999 1.4%
2000-4999 1.0%
5000-9999 0.4%
10000+    <0.1%

With the higher REALTIME_AUDIO priority the distribution looks much improved
with a much steeper drop after 200 us:
< 0       8.6%
0-99      82.4%
100-199   8.5%
200-499   0.2%
500-999   0.2%
1000-1999 <0.1%

Bug: 751340
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: I082d4561e6b27fc23e2242a9023c19d6787492fd
Reviewed-on: https://chromium-review.googlesource.com/596625
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Commit-Queue: Stanislav Chiknavaryan <stanisc@chromium.org>
Cr-Commit-Position: refs/heads/master@{#491528}
[modify] https://crrev.com/e2c169757c7cefab9081da1c332d8886621ea4b2/gpu/ipc/service/gpu_vsync_provider_win.cc

When looking at crash dumps for Issue 754142 I've noticed at least 3 crash dumps which indicated 32Hz display refresh rate.

That seems fairly unlikely but looking at the code I realized a significant flaw that might explain some of the weirdness in the latency distribution.

This code puts too much trust in the display frequency reported by EnumDisplaySettings. The returned value is used to seed the initial refresh interval as 1/display_frequency and is also used to define the accepted interval range.

Where this goes wrong is that the initial interval is then used in the adjustment formula which shifts the DWM reported timestamp forward or backward a whole number of the refresh intervals. Initially we have just one interval so using that to adjust the DWM timestamp might produce a pretty much arbitrary result if the display frequency is wrong due to a driver issue. For example if the actual refresh rate is 60 Hz but the driver reported 32Hz for whatever reason the formula would shift the timestamp by a number of 31 ms increments rather by a number of 16 ms increments. That might easily produce an error in several milliseconds or potentially any value in [0, interval] range. 

Comment 3 by jer...@duckware.com, Aug 29 2017

stanisc, regarding comment #2 above, I looked at the code and noticed it is asking for display settings (EnumDisplaySettings), calculating min/max accepted intervals, etc.

TIP: A technique used by vsynctester.com that figures out 'valid' intervals automatically -- without any external information (only 'tFrame', the time of rAF callbacks, or for you, the D3DKMT loop wake up time) -- is as follows (view html source at vsynctester and look for the 'var HZ=' object):

  function add(tFrame) {
    L0=L1; L1=L2; L2=L3; L3=L4; L4=tFrame;
    var grouping=Math.max(L4-L3,L3-L2,L2-L1,L1-L0)-Math.min(L4-L3,L3-L2,L2-L1,L1-L0);
    if (grouping<TIGHTGROUPMS) {
      if (--nJavaScriptSkip<0 && --nSkip<0) {
        var avems = (L4-L0)/4;
        if (avems<(1000/LOWESTVALIDHZ)) {
          // 'avems' is considered a valid, representative, interval
          }
        }
      }
    }

The key 'filter' (to find valid intervals) in this algorithm is the 'grouping' calculation.  Tight grouping can only happen when the interval is consistent.

Hopefully you can adapt or use this idea in the code?


Comment 4 by jer...@duckware.com, Aug 29 2017

I just realized that the D3DVsync experiment is using DWM vsync times (qpcVBlank) when running on the primary monitor (I thought you were using the D3DVsync loop wake up time only).

I have seen first hand in multiple monitor configurations where the time/period from DwmGetCompositionTimingInfo() was flat out wrong.  The period was a constant 16.667ms, but the qpcVBlank jumped around (deltas produced 16.667, 16.667, 26.xxx -- in a repeating pattern).  This is documented in the tests.cpp file in my D3DKMT test code.  Maybe that accounts for some strangeness seen?

If increasing the D3DVsync loop thread priority in r491528 makes those wake up times much closer to DWM values, that should allow you to just always use the wake up time (and never use DWM times, which can be wrong)?

Comment 5 by jer...@duckware.com, Aug 29 2017

Q: how is there a "< 0" histogram slot?

Comment 6 by jer...@duckware.com, Aug 29 2017

What is the ultimate goal of the UMA numbers?  To obtain a latency from the TRUE vsync time??

If so, DWM times won't provide that (same for D3DVsync wake up times), as both use 'wake up' times for the vsync time, which can actually be past true vsync (under load).

The only way to know the offset from 'true' vsync is to implement the algorithm (or similar) described at http://www.vsynctester.com/howtocomputevsync.html.  This is the "var HZ=" object at vsynctester.com.  The algorithm takes vsync wakeup times, and computes a very accurate 'true' vsync 'timebase' and 'interval' to use.

Comment 7 by jer...@duckware.com, Aug 30 2017

stanisc, I think there may be an error in how the GetAverageInterval() is calculated?  For debugging purposes, maybe add the interval value to the 'TRACE' in the loop?

The issue is that the interval is seeded from EnumDisplaySettings() in UpdateCurrentDisplayFrequency -- which is a integer rounded value in error, which is then used in SendGpuVSyncUpdate to find a vsync timestamp (use_dwm case) using that interval, which is then is feed back via AddTimestamp() to create more samples at that interval.

The average interval calculated must instead be based upon 'now' timestamps, instead of 'adjusted' time stamps -- so, in SendGpuVSyncUpdate(), remove AddTimestamp(timestamp) at the end and add a AddTimestamp(now) at the top of the function?
stanisc, I have a 'vsynctester like' algorithm for your D3Dvsync code working in prototype code.  It is very space/time efficient and is quite compact (not much code).  The only thing the algorithm needs is D3Dvsync wakeup times, and the algorithm passes back a very accurate vsync 'timebase' and 'interval'.  Let me know if you want that posted here or sent directly to you.

I have put the code through extensive testing and one thing that revealed is that the state of the processor greatly affects how far behind (or not) the D3Dvsync wakeup time is from the DWM times.  With the processor not busy, I am seeing 45 microsecond times.  With the processor busy, that DROPS to 15 microseconds.

Comment 9 by stanisc@google.com, Sep 2 2017

@7: I don't think there is an error. The actual problem is seeding the array of intervals with a potentially incorrect interval in the beginning AND making any further intervals out of adjusted timestamps. What happens in that the incorrect initial interval is used in the timestamp adjustment formula so if DWM timestamp happens to be far enough off that produces a significant mistake which further perpetuates the error.
I had an idea how to fix it just before going on vacation but if you have a working solution I'll gladly review/reuse it.

@8: Jerry, feel free to post here. I appreciate your help.
Just to clarify my comment @7. If there was an error if calculating the average it would be very noticeable when testing locally in local BeginFrame latency UMA distribution histograms. But the local data is very clean on two different types of hardware that I tested it with. I'll still double check it with longer runs.

stanisc/9: Sorry, I should have spelled out the issue in somewhat greater detail...

First, the issue will only be visible on systems that have an actual Hz that is NOT near an integer Hz value already.  For example 59.802 Hz yes.  But 59.995 Hz no.

Second, even with the problem, over the long term, the interval will converge on the correct interval value.  Errors are seen in the short term.  This can be seen when the first couple of passes in the algorithm are manually run through.

Third, the issue is greatly hidden by the fact that 'no' intervals are being thrown away.  The code allows for that, but the tight grouping of wakeup times means that in reality, very few (if any) intervals are being thrown away.  When no intervals are being thrown away, the intervals stack back to back, and the average interval is always the end time minus the begin time divided by the number of intervals (regardless of errors in those intervals).

So the bottom line is that the current code will always (I think) converge to the correct value, just slower than the algorithm should.
stanisc/9, attached is my prototype code (prototype.txt).  To see this code run 'live' in my prototype app, run http://www.duckware.com/test/chrome/dwm-vsync-tests.exe, and select the 'Extended Chromium test' from the 'Tests' menu (runs for about 90 seconds).  Exit after that (or early), and see a grid of results pop up.  See attached testcode.txt for a use example, and for help in decoding columns (provides raw D3Dvsync wake up time, then timebase/interval of that run through the algorithm, and for comparison, DWM times and the timebase/interval of that run through the algorithm).

The 'rolling interval' was just added a couple of hours ago (so still verifying that).  On my test systems, the interval converges to the correct interval (accurate to the microsecond) very quickly -- and eventually gets accurate to just nanoseconds.

A new bracketing method (out of a desire for very low overhead) is used to discover the vsync 'timebase' and to keep it in sync -- as compared to the 'line of best fit' method vsynctester.com uses (incredibly accurate, but expensive in cpu cycles).  The new method seems to work very well, but it requires more stress testing.

This prototype needs testing on NON-intel integrated graphics systems (timing and drift concerns).  So far all of my testing has been on intel systems, which appear to have no vsync drift at all, even after 1000000 intervals (appears cpu + gpu driven off same clock source explains).  [It that is an issue, I have an idea for a simple way to overcome that].

I have used TimeTicks and TimeDelta typedefs as hints to intent (on when they can be used -- if something makes its way into Chromium), but some usage may actually be wrong, but work in my code since everything is essentially a double.  Any remaining 'double' and 'long' in the code must remain to work around 'microsecond' accuracy of Chromium types (as the code needs nanoseconds+).

I hope this helps...
stanisc, is it possible the UMA distribution is sometimes all over the place because DWM's qpcVBlank stops advancing when there are no true screen updates?

I can replicate DWM's qpcVBlank halting (while using D3Dvsync) just by moving a rAF animation to a background tab or by minimizing the Chrome app running a rAF animation.  [Are Chrome internals  still running something and collecting stats?]

With "interval=GetAverageInterval()" only accurate to microsecond(s), that means that EACH frame DWM is behind adds this microsecond(s) error into the 'adjustment' formula result.

[if GetAverageInterval is off by 1 microsecond, and DWM qpcVBlank pauses for 17 seconds (at 60Hz), that results in an adjustment error of 1ms]

Another reason not to use DWM in the D3Dvsync code?
As a follow up to comment #8 (processor state affects latency), see the attached, which is the inter-frame wakeup times from the D3Dvsync loop in the prototype -- which are a LOT more stable after the heavy workload is added to the processor.
processor-state-vs-jitter.jpg
90.2 KB View Download
The prototype in comment #12 has been significantly improved -- it now locks onto the vsync interval to a high degree of accuracy very quickly (to nanosecond accuracy within seconds).  Just search the tests.cpp file for this issue number (751340) to find the new prototype code:

    www.duckware.com/test/chrome/467617-source-code.zip
Jerry, I re-read your comments and realized that some of your suggestions nailed it. I can now repro the bad latency pattern. That requires a page that requests v-sync but doesn't actually render any frames. A page with an off-screen animation does provide a consistent repro. The next step is to put Chrome in the background, although I am not sure if that is truly required. With that I can see a lot of v-sync latency in 5-15 ms range on the browser process side. Normally the latency is sub 1 ms.

I now have some data that proves that the current implementation has a timestamp calculation issue when all of the following conditions are true:
- Chrome is in background
- v-sync production is requested
- Chrome doesn't render any new frames

This happens when Chrome displays a static page that keeps requesting an rAF notification or if there is an off-screen animation.

What happens is that DWM stops advancing v-blank timestamps so Chrome makes new v-sync timestamps by adding a number of v-sync intervals to the last good DWM timestamp it has. If v-sync interval is even slightly inaccurate the error may accumulate over hundreds or even possibly thousands v-sync cycles. What is worst the new inaccurate timestamps further influence the average interval so the interval keeps drifting until DWM resumes advancing v-blank timestamps.

I think I can try the following strategy to avoid accumulating the error and drifting the timestamp and the interval:
1) We can detect when DWM timestamp stops advancing for more than a few frames.
2) Based on good frames where DWM timestamp does advance we can calculate the average adjustment - the latency between v-sync thread 'now' timestamp and DWM timestamp. Normally it is around 20-60 us.
3) When DWM stops advancing I think it should be OK to just use the last known average adjustment, so basically the reported timestamps will be the thread 'now' timestamps minus the last known average adjustment. This might introduce a bit more jitter in the timestamps but (a) that would be still way better than the current method where timestamps drift away and (b) with nothing presented on the screen the accuracy shouldn't matter as much. 
4) As soon as DWM timestamp starts advancing again the v-sync timestamp should instantly self-correct.

Project Member

Comment 18 by bugdroid1@chromium.org, Oct 2 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6c346dfd75ac21787938a1820527047231a308ad

commit 6c346dfd75ac21787938a1820527047231a308ad
Author: Stanislav Chiknavaryan <stanisc@chromium.org>
Date: Mon Oct 02 06:32:09 2017

Prevent drift of v-sync timestamp when DWM v-blank timestamp stops advancing

This fix affects only code that runs under D3DVsync experiment and addresses an
issue with calculation of v-sync timestamp that could be observed through a bad
distribution pattern of Scheduling.Browser.BeginImplFrameLatency2 and
Scheduling.Renderer.BeginImplFrameLatency2 UMA.

Apparently when Chrome is in backgroup and it doesn't render any new frames
DWM's qpcVBlank timestamp stops advancing.  Any consecutive calls to
DwmGetCompositionTimingInfo just keep returning the same timestamp. 
DWM's qpcVBlank timestamp is used in conjunction with the the v-sync thread
'now' time to calculate the v-sync time and v-sync interval returned
by GpuVSyncProviderWin. Normally the DWM timestamp can be a few frames in the
past and sometimes even 1-2 frames in the future. The v-sync provider code copes
with that by adjusting the DWM timestamp by translating it by a number of 
intervals so that the result is around the 'now' time. That however depends on
how accurate the interval is, and when DWM timestamp gets substantially stale
the adjustment may produce an arbitrary v-sync timestamp. 

This change introduces a hybrid approach which does the following:
- Detect when DWM qpcVBlank timestamp is too far in the past
- When DWM timestamp is OK, calculate the adjustment the same way as before and
  accumulate the last 60 adjustment values so that we can calculate an 
  average adjustment. 
- When DWM timestamp is too old, just use the average adjustment calculated above
- Either way the adjustment is subtracted from the 'now' time (which is v-sync
  thread v-blank wake-up time) to find the v-sync time. 

With this fix in place v-sync timing seems to be stable even when Chrome is in
background and no frames are produced, and no drifting is observed.
Distribution of local Scheduling.Browser.BeginImplFrameLatency2 and
Scheduling.Renderer.BeginImplFrameLatency2 UMA looks good too.

Bug: 751340
Cq-Include-Trybots: master.tryserver.chromium.android:android_optional_gpu_tests_rel;master.tryserver.chromium.linux:linux_optional_gpu_tests_rel;master.tryserver.chromium.mac:mac_optional_gpu_tests_rel;master.tryserver.chromium.win:win_optional_gpu_tests_rel
Change-Id: I18c3b5682bca86a4847dc9b03b124fc9ba7b82c7
Reviewed-on: https://chromium-review.googlesource.com/691153
Commit-Queue: Stanislav Chiknavaryan <stanisc@chromium.org>
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Cr-Commit-Position: refs/heads/master@{#505545}
[modify] https://crrev.com/6c346dfd75ac21787938a1820527047231a308ad/gpu/ipc/service/gpu_vsync_provider_win.cc

With the last fix the distribution has improved but looks somewhat abnormal

Scheduling.Browser.BeginImplFrameLatency2 values at the following percentiles with D3DVsync enabled:

50.00%	456.3
75.00%	1014
90.00%	5009
95.00%	11333

For comparison, Scheduling.Browser.BeginImplFrameLatency2 values at the same percentiles with the control group (D3DVsync disabled):

50.00%	709.0
75.00%	1161
90.00%	1898
95.00%	3053

For comparison, here is Scheduling.Browser.BeginImplFrameLatency2 values with D3DVsync enabled one week ago:

50.00%	455.2
75.00%	1146
90.00%	6617
95.00%	12008

So clearly there is some improvement but the distribution still looks bad compared to D3DVsync=disabled at 90 percentile and above.
Owner: ----
Status: Untriaged (was: Assigned)
I am moving to another team and won't be working on this anymore.
This issue is currently blocking enabling D3DVsync experiment to more users.
Blocking: 704794
Owner: briander...@chromium.org
Status: Assigned (was: Untriaged)
+brianderson for triage.
Owner: ----
Status: Available (was: Assigned)

Sign in to add a comment