New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 605089 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: May 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug-Regression

Blocked on:
issue 599932



Sign in to add a comment

tracing.tracing_with_debug_overhead flaky on Win 10 Perf (3) bot due to LowResolutionTimeError

Project Member Reported by petrcermak@chromium.org, Apr 20 2016

Issue description

Revision range first seen: 384522:384540 (pretty much the beginning of the bot's available history)
Bot: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%283%29?numbuilds=200
Link to failing step log: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%283%29/builds/2660/steps/tracing.tracing_with_debug_overhead/logs/stdio

I've seen the following error on at least two different URLs (google.com and amazon.com) in two separate builds:


Traceback (most recent call last):
  File "C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\telemetry\telemetry\value\failure.py", line 41, in _GetExcInfoFromMessage
    raise Exception(message)
Exception: LowResolutionTimeError: Trace doesn't have high resolution time, cannot map.
    at Object.mapSingleTrace (/perf_insights/map_single_trace.html:39:17)
    at eval (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:61:14)
    at Object.runAndConvertErrorsToFailures (/perf_insights/map_single_trace.html:26:10)
    at mapSingleTraceWithResult (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:52:8)
    at Object.mapSingleTraceMain (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:76:18)
    at eval (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:89:11)
    at Object.loadHTMLFile (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\third_party\vinn\vinn\html_imports_loader.js:140:5)
    at C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\third_party\vinn\vinn\d8_bootstrap.js:150:19

[  FAILED  ] https://www.google.com/#hl=en&q=barack+obama (10904 ms)


If the test is disabled, please downgrade to Pri-2.
 
Cc: nedngu...@google.com
This is also happening on Win 7 x64 Perf (3):

https://build.chromium.org/p/chromium.perf/builders/Win%207%20x64%20Perf%20%283%29?numbuilds=200
Cc: oysteine@chromium.org dtu@chromium.org fmea...@chromium.org
Oh wow, this has to do with the fact that some trace don't have high resolution time and would affect *all* the TBM v2 benchmarks.

+Oystein, Fadi: whether the trace is using highResTimer should depends only on hardware of the machine, is it correct?
Cc: charliea@chromium.org
+Charlie who also has some experience in this area (see #2).

Dave, I'm pretty sure we'll need to take timer resolutions into account when selecting perfbot hardware.
Yep, correct; this should be consistent on a given CPU. The test could be selectively disabled, of course, but that would lead to very noisy metrics :/.
Actually, this seems to be flaky on the same bot? There may be something racy in the way that metadata is generated then :/. Maybe this metadata key is sometimes missing, rather than false?
Hmhh, this seems like a tracing bug. I will add some logging so we know whether the metadata key is missing or false.
Here is the json version of the trace file using ./tracing/bin/html2trace
trace-file-id_0-2016-04-20_03-30-56-58527.json
12.6 MB View Download
Owner: oysteine@chromium.org
Assign to Oysteine to triage this bug. This affects all the tracing pipeline, so it would be great if you can find someone to work on this.
Ah, I think I found the issue.

If the trace metadata isn't set (which always seems to be the case here; no top-level metadata in the trace file?), the trace importer falls back to a heuristic: model.html:isTimeHighResolutionHeuristic_.

This heuristic basically test for >90% of the events being clustered on the same number for each 0-100ms range, as a signal for the trace most likely being from low-resolution timer hardware.

Since this'll happen randomly at times, especially for traces with very few events, I don't think we should assert on this for TBMv2 metrics.

My suggestion would be to just remove the assert from map_single_trace. We already block Slow Reports from being uploaded from hardware with no lowres timers, so I don't think there's much value in the check anymore (it made sense when we had a bunch of reports laying around from older hardware that the P-I reports should skip). WDYT?
Blockedon: 599932
Status: Assigned (was: Untriaged)
Thanks for your analysis, the trace metadata isn't set because of  issue 599932 . I think the html:isTimeHighResolutionHeuristic_ is probably problematic here, and maybe we should remove it & change the assert in map_single_trace to "assert isHighResTimer if metadata exists"
Labels: -Pri-1 Pri-2

Comment 13 by dtu@chromium.org, May 5 2016

Which CPUs do/don't have high-res timers? Is there an easy way to categorize them?
@Dave: the bug is not we have machines that don't have high-res timer, the bug is  issue 599932 , which devtool doesn't stream the trace metadata that includes the timing info. 

When chromium trace doesn't include timing info, there is a heuristic in tracing code (model.html:isTimeHighResolutionHeuristic_) that test whether the trace is being from low-resolution timer hardware, hence we have a flaky failure here.
The real fix is fixing  issue 599932 , the bandaid fix could be ignore the check for high res timer in model importing when there is no timing info. I think maybe we should just focus on doing the real fix here. I will pump the priority of  issue 599932 
(Sorry for not seeing the ping earlier, but it looks like the bug turned out to be unrelated to timers, so slowly stepping away)
Removing the check in https://codereview.chromium.org/1956073002. It makes sense as a warning in the trace-viewer (which is still there), but makes no sense as a mapSingleTrace assert anymore (it was originally there to filter out low-res Slow Reports traces, but we don't have those anymore).
Project Member

Comment 17 by bugdroid1@chromium.org, May 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7b287abf66c88a5ac187b301c7dc4c23fd27b792

commit 7b287abf66c88a5ac187b301c7dc4c23fd27b792
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Sun May 08 22:03:26 2016

Roll src/third_party/catapult/ 1bfce2f49..8d108d1cc (1 commit).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/1bfce2f492bd..8d108d1cc708

$ git log 1bfce2f49..8d108d1cc --date=short --no-merges --format='%ad %ae %s'

BUG= 605089 

TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/1956233002
Cr-Commit-Position: refs/heads/master@{#392279}

[modify] https://crrev.com/7b287abf66c88a5ac187b301c7dc4c23fd27b792/DEPS

Cc: -charliea@chromium.org
(un-ccing myself given that this turned out to be related to metadata, not timers. feel free to CC me again if necessary)

Sign in to add a comment