Desktop bots system_health.common_desktop failing because of OOM errors
Issue description

https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf/builds/50/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-10-10240/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%208%20Perf/builds/52/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2012ServerR2-SP0/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%207%20Perf/builds/40/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%207%20x64%20Perf/builds/43/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
And others

A lot of the errors have stack traces like this:

Traceback (most recent call last):
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\chrome_tracing_agent.py", line 237, in CollectAgentTraceData
    client.CollectChromeTracingData(trace_data_builder)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py", line 370, in CollectChromeTracingData
    self._tracing_backend.CollectTraceData(trace_data_builder, timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 227, in CollectTraceData
    self._CollectTracingData(trace_data_builder, timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 248, in _CollectTracingData
    self._inspector_websocket.DispatchNotifications(timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 134, in DispatchNotifications
    self._Receive(timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 168, in _Receive
    self._HandleAsyncResponse(result)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 184, in _HandleAsyncResponse
    callback(result)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 80, in _GotChunkFromStream
    trace_string = ''.join(self._data)
MemoryError

I've seen these pages fail with this:

[ FAILED ] long_running:tools:gmail-foreground
[ FAILED ] browse:news:nytimes
[ FAILED ] browse:news:reddit

Might be a couple others as well.
,
Dec 7 2016
Also seen an exception like this (from https://build.chromium.org/p/chromium.perf/builders/Mac%2010.10%20Perf/builds/152/steps/system_health.common_desktop%20on%20Intel%20GPU%20on%20Mac%20on%20Mac-10.10/logs/stdio):

INFO:root:Trace sizes in bytes: {TraceDataPart("tabIds"): 40, TraceDataPart("telemetry"): 112832, TraceDataPart("traceEvents"): 410250244}
[ RUN ] /b/s/w/itSFPe4t/tmpySM3Y0.html
TraceImportError: Invalid string length
    at Array.join (native)
    at Object.f [as string] (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:20494)
    at Object.c.transformTo (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:22247)
    at Object.c.transformTo (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:6414)
    at Function.GzipImporter.transformToString (/tracing/extras/importer/gzip_importer.html:141:26)
    at Function.GzipImporter.inflateGzipData_ (/tracing/extras/importer/gzip_importer.html:116:31)
    at GzipImporter.extractSubtraces (/tracing/extras/importer/gzip_importer.html:177:36)
    at Import.createImports (/tracing/importer/import.html:138:40)
    at Task.run (/tracing/base/task.html:71:21)
    at Function.Task.RunSynchronously (/tracing/base/task.html:152:25)
[ FAILED ] /b/s/w/itSFPe4t/tmpySM3Y0.html (26693 ms)

Might be similar to the memory error.
,
Dec 7 2016
This is tracked in https://github.com/catapult-project/catapult/issues/2826
,
Dec 7 2016
,
Dec 7 2016
This is actually not tracked in that bug! That bug is a JS out-of-memory error, which is what #2 is referencing. The other error is a new error on Windows, which charliea@ and I looked at and hopefully fixed today. CL coming soon! (I think).
,
Dec 8 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/a2619f93849d0ccd022cbfe1c3e8382441525755

commit a2619f93849d0ccd022cbfe1c3e8382441525755
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Thu Dec 08 01:55:11 2016

Roll src/third_party/catapult/ 70578734a..11d3d44fb (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/70578734ad34..11d3d44fb9d2

$ git log 70578734a..11d3d44fb --date=short --no-merges --format='%ad %ae %s'
2016-12-07 charliea Reduce the memory required by Telemetry to stream back a Chrome trace
2016-12-07 eakuefner [Telemetry] Fix examples/run_benchmark

BUG= 672097

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2558203002
Cr-Commit-Position: refs/heads/master@{#437132}

[modify] https://crrev.com/a2619f93849d0ccd022cbfe1c3e8382441525755/DEPS
,
Dec 9 2016
An update! I haven't seen any more MemoryError exceptions on the Windows bots. They're now failing on the JavaScript exception reported in the catapult bug (https://github.com/catapult-project/catapult/issues/2826). Closing this, as it is actually fixed :)
,
Dec 12 2016
Ok, they're back :(

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_Zenbook_Perf%2F55%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Windows_on_Windows-10-10240%2F0%2Fstdout

Looks like it's failing when it's trying to do a json.loads on the trace.

  File "c:\b\s\w\ir5mijmn\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 84, in _GotChunkFromStream
    self._callback(trace_string)
  File "c:\b\s\w\ir5mijmn\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 285, in _ReceivedAllTraceDataFromStream
    trace = json.loads(data)
  File "c:\b\depot_tools\python276_bin\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "c:\b\depot_tools\python276_bin\lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\b\depot_tools\python276_bin\lib\json\decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError

Not sure what to do here :(
,
Dec 13 2016
The right thing to do here is to skip json.loads(data) and just keep the string form.
,
Dec 13 2016
It needs to load the data. It does stuff with the parsed data. Maybe we can use something like https://pypi.python.org/pypi/ijson ??
,
Dec 13 2016
martiniss: sorry for the lack of context here. In the past, we processed tracing data in Python, so calling json.loads was a requirement. However, with the TBMv2 architecture, all the work of processing trace data is delegated to the d8 binary, so all we need to do is pass a string to that binary. When we transitioned from Python processing to d8 processing, we were not able to make a complete move, so some call sites still rely on the parsed Python form. For the system health benchmark, however, the json.loads call is not needed at all, and it can be avoided by delaying json.loads to https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/timeline/model.py#L75
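To illustrate the idea of delaying json.loads, here is a minimal sketch, assuming a wrapper that keeps the raw trace string and parses it only when a caller actually asks for the structured form. The class name LazyTrace, its methods, and the parse_count counter are hypothetical and are not part of Telemetry.

```python
import json


class LazyTrace(object):
    """Hypothetical holder for a raw trace string; not a Telemetry class."""

    def __init__(self, raw_json):
        self._raw_json = raw_json
        self._parsed = None
        self.parse_count = 0  # Only here to make the laziness observable.

    @property
    def raw(self):
        # Callers that only forward the trace (e.g. to an external
        # processor) read the string and never trigger a parse.
        return self._raw_json

    def AsDict(self):
        # Parse on first use and cache the result for later callers.
        if self._parsed is None:
            self._parsed = json.loads(self._raw_json)
            self.parse_count += 1
        return self._parsed


trace = LazyTrace('{"traceEvents": [{"name": "a", "ph": "B"}]}')
```

With this shape, code paths that only hand the string onward never pay for the parsed object graph, while the remaining Python call sites still get a dict on demand.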
,
Jan 3 2017
This has been red for almost a month now on nearly all desktop bots. Should we disable this test until a real solution can be implemented?
,
Jan 3 2017
,
Jan 3 2017
From the last discussion, Charlie will drive fixing the trace size problem in JS with help from Andrey. I will help with the Python OOM.
,
Jan 3 2017
So we are just going to leave the bots red until a fix can be done for it? What is the ETA on that?
,
Jan 5 2017
chiniforooshan@ is working on https://github.com/catapult-project/catapult/issues/2826 for the JS side. I will be working on the Python side and can hopefully get it working this week.
,
Jan 6 2017
I gave up on reproducing the problem on my Windows machine, as Python always takes 300 MB at most there. I will proceed with the trace-on-disk solution in https://github.com/catapult-project/catapult/issues/3119
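For readers unfamiliar with the trace-on-disk idea, here is a minimal sketch, assuming each streamed chunk is appended to a temporary file instead of being collected in a list and joined in memory (the ''.join(self._data) call from the original traceback). The class TraceChunkFile and its methods are hypothetical; this is not the change made in the linked catapult issue.

```python
import os
import tempfile


class TraceChunkFile(object):
    """Hypothetical sink that spools streamed trace chunks to disk."""

    def __init__(self):
        # mkstemp returns an open OS-level handle plus the file path.
        handle, self.path = tempfile.mkstemp(suffix='.json')
        self._file = os.fdopen(handle, 'w')

    def AddChunk(self, chunk):
        # Only the current chunk is held in memory; earlier chunks
        # already live in the file.
        self._file.write(chunk)

    def Close(self):
        self._file.close()
        return self.path


sink = TraceChunkFile()
for chunk in ['{"traceEvents": [', '{"name": "a"}', ']}']:
    sink.AddChunk(chunk)
trace_path = sink.Close()
```

The resulting path can then be handed to whatever consumes the trace, so the full trace string never has to exist as a single Python object.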
,
Jan 10 2017
,
Jan 10 2017
Let's keep this bug about the trace size problem on the Python side. The blocking bug ( issue 679768 ) is for the JS side.
,
Jan 23 2017
After the memory improvement, local profiling shows that Telemetry's memory usage when running system_health's browsing stories is now stable at around 70MiB: http://pastebin.com/1n4amvzf Marking this as fixed.
,
Jan 23 2017
For the record, without the fix, the peak memory usage of just running a single browsing story (browse:news:nytimes) is 480MiB: http://pastebin.com/CrgV1i2Z
,
Feb 3 2017
Issue 688494 has been merged into this issue.
,
Apr 17 2017
,
May 30 2017
,
Aug 1 2017
,
Oct 14 2017
Comment 1 by martiniss@chromium.org
, Dec 7 2016