Issue metadata
Sign in to add a comment
|
tracing.tracing_with_debug_overhead flaky on Win 10 Perf (3) bot due to LowResolutionTimeError |
||||||||||||||||||||
Issue descriptionRevision range first seen: 384522:384540 (pretty much the beginning of the bot's available history) Bot: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%283%29?numbuilds=200 Link to failing step log: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%283%29/builds/2660/steps/tracing.tracing_with_debug_overhead/logs/stdio I've seen the following error on at least two different URLs (google.com and amazon.com) in two separate builds: Traceback (most recent call last): File "C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\telemetry\telemetry\value\failure.py", line 41, in _GetExcInfoFromMessage raise Exception(message) Exception: LowResolutionTimeError: Trace doesn't have high resolution time, cannot map. at Object.mapSingleTrace (/perf_insights/map_single_trace.html:39:17) at eval (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:61:14) at Object.runAndConvertErrorsToFailures (/perf_insights/map_single_trace.html:26:10) at mapSingleTraceWithResult (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:52:8) at Object.mapSingleTraceMain (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:76:18) at eval (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\perf_insights\perf_insights\map_single_trace_cmdline.html:89:11) at Object.loadHTMLFile (C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\third_party\vinn\vinn\html_imports_loader.js:140:5) at C:\b\build\slave\Win_10_Perf__3_\build\src\third_party\catapult\third_party\vinn\vinn\d8_bootstrap.js:150:19 [ FAILED ] https://www.google.com/#hl=en&q=barack+obama (10904 ms) If the test is disabled, please downgrade to Pri-2.
,
Apr 21 2016
Oh wow, this has to do with the fact that some trace don't have high resolution time and would affect *all* the TBM v2 benchmarks. +Oystein, Fadi: whether the trace is using highResTimer should depends only on hardware of the machine, is it correct?
,
Apr 21 2016
+Charlie who also has some experience in this area (see #2). Dave, I'm pretty sure we'll need to take timer resolutions into account when selecting perfbot hardware.
,
Apr 21 2016
Yep, correct; this should be consistent on a given CPU. The test could be selectively disabled, of course, but that would lead to very noisy metrics :/.
,
Apr 21 2016
Actually, this seems to be flaky on the same bot? There may be something racy in the way that metadata is generated then :/. Maybe this metadata key is sometimes missing, rather than false?
,
Apr 22 2016
Hmhh, this seems like a tracing bug. I will add some logging so we know whether the metadata key is missing or false.
,
Apr 22 2016
Actually, there is the trace that failed: " View generated trace files online at https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/trace-file-id_0-2016-04-20_03-30-56-58527.html for page https://www.google.com/#hl=en&q=barack+obama "
,
Apr 22 2016
Here is the json version of the trace file using ./tracing/bin/html2trace
,
Apr 22 2016
Assign to Oysteine to triage this bug. This affects all the tracing pipeline, so it would be great if you can find someone to work on this.
,
Apr 22 2016
Ah, I think I found the issue. If the trace metadata isn't set (which always seems to be the case here; no top-level metadata in the trace file?), the trace importer falls back to a heuristic: model.html:isTimeHighResolutionHeuristic_. This heuristic basically test for >90% of the events being clustered on the same number for each 0-100ms range, as a signal for the trace most likely being from low-resolution timer hardware. Since this'll happen randomly at times, especially for traces with very few events, I don't think we should assert on this for TBMv2 metrics. My suggestion would be to just remove the assert from map_single_trace. We already block Slow Reports from being uploaded from hardware with no lowres timers, so I don't think there's much value in the check anymore (it made sense when we had a bunch of reports laying around from older hardware that the P-I reports should skip). WDYT?
,
Apr 22 2016
Thanks for your analysis, the trace metadata isn't set because of issue 599932 . I think the html:isTimeHighResolutionHeuristic_ is probably problematic here, and maybe we should remove it & change the assert in map_single_trace to "assert isHighResTimer if metadata exists"
,
May 5 2016
,
May 5 2016
Which CPUs do/don't have high-res timers? Is there an easy way to categorize them?
,
May 5 2016
@Dave: the bug is not we have machines that don't have high-res timer, the bug is issue 599932 , which devtool doesn't stream the trace metadata that includes the timing info. When chromium trace doesn't include timing info, there is a heuristic in tracing code (model.html:isTimeHighResolutionHeuristic_) that test whether the trace is being from low-resolution timer hardware, hence we have a flaky failure here. The real fix is fixing issue 599932 , the bandaid fix could be ignore the check for high res timer in model importing when there is no timing info. I think maybe we should just focus on doing the real fix here. I will pump the priority of issue 599932
,
May 5 2016
(Sorry for not seeing the ping earlier, but it looks like the bug turned out to be unrelated to timers, so slowly stepping away)
,
May 6 2016
Removing the check in https://codereview.chromium.org/1956073002. It makes sense as a warning in the trace-viewer (which is still there), but makes no sense as a mapSingleTrace assert anymore (it was originally there to filter out low-res Slow Reports traces, but we don't have those anymore).
,
May 8 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7b287abf66c88a5ac187b301c7dc4c23fd27b792 commit 7b287abf66c88a5ac187b301c7dc4c23fd27b792 Author: catapult-deps-roller <catapult-deps-roller@chromium.org> Date: Sun May 08 22:03:26 2016 Roll src/third_party/catapult/ 1bfce2f49..8d108d1cc (1 commit). https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/1bfce2f492bd..8d108d1cc708 $ git log 1bfce2f49..8d108d1cc --date=short --no-merges --format='%ad %ae %s' BUG= 605089 TBR=catapult-sheriff@chromium.org Review-Url: https://codereview.chromium.org/1956233002 Cr-Commit-Position: refs/heads/master@{#392279} [modify] https://crrev.com/7b287abf66c88a5ac187b301c7dc4c23fd27b792/DEPS
,
May 10 2016
(un-ccing myself given that this turned out to be related to metadata, not timers. feel free to CC me again if necessary)
,
May 10 2016
This is fixed: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%283%29?numbuilds=200 |
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by petrcermak@chromium.org
, Apr 21 2016