
Issue 672097

Starred by 3 users

Issue metadata

Status: Archived
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 679768




Desktop bots system_health.common_desktop failing because of OOM errors

Project Member Reported by martiniss@chromium.org, Dec 7 2016

Issue description

https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf/builds/50/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-10-10240/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%208%20Perf/builds/52/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2012ServerR2-SP0/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%207%20Perf/builds/40/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio
https://build.chromium.org/p/chromium.perf/builders/Win%207%20x64%20Perf/builds/43/steps/system_health.common_desktop%20on%20%28102b%29%20GPU%20on%20Windows%20on%20Windows-2008ServerR2-SP1/logs/stdio

And others

A lot of the errors have stack traces like this:
Traceback (most recent call last):
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\platform\tracing_agent\chrome_tracing_agent.py", line 237, in CollectAgentTraceData
    client.CollectChromeTracingData(trace_data_builder)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\devtools_client_backend.py", line 370, in CollectChromeTracingData
    self._tracing_backend.CollectTraceData(trace_data_builder, timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 227, in CollectTraceData
    self._CollectTracingData(trace_data_builder, timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 248, in _CollectTracingData
    self._inspector_websocket.DispatchNotifications(timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 134, in DispatchNotifications
    self._Receive(timeout)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 168, in _Receive
    self._HandleAsyncResponse(result)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_websocket.py", line 184, in _HandleAsyncResponse
    callback(result)
  File "c:\b\s\w\irwsbsne\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 80, in _GotChunkFromStream
    trace_string = ''.join(self._data)
MemoryError

I've seen these pages fail with this:
[  FAILED  ]  long_running:tools:gmail-foreground
[  FAILED  ]  browse:news:nytimes
[  FAILED  ]  browse:news:reddit

Might be a couple others as well.
 
The error seems to be flaky; on Win 8 Perf it happens about 50% of the time.
Also seen an exception like this (from https://build.chromium.org/p/chromium.perf/builders/Mac%2010.10%20Perf/builds/152/steps/system_health.common_desktop%20on%20Intel%20GPU%20on%20Mac%20on%20Mac-10.10/logs/stdio):

INFO:root:Trace sizes in bytes: {TraceDataPart("tabIds"): 40, TraceDataPart("telemetry"): 112832, TraceDataPart("traceEvents"): 410250244}
[ RUN      ] /b/s/w/itSFPe4t/tmpySM3Y0.html
TraceImportError: Invalid string length
    at Array.join (native)
    at Object.f [as string] (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:20494)
    at Object.c.transformTo (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:22247)
    at Object.c.transformTo (/b/s/w/irGWtjql/third_party/catapult/tracing/third_party/jszip/jszip.min.js:12:6414)
    at Function.GzipImporter.transformToString (/tracing/extras/importer/gzip_importer.html:141:26)
    at Function.GzipImporter.inflateGzipData_ (/tracing/extras/importer/gzip_importer.html:116:31)
    at GzipImporter.extractSubtraces (/tracing/extras/importer/gzip_importer.html:177:36)
    at Import.createImports (/tracing/importer/import.html:138:40)
    at Task.run (/tracing/base/task.html:71:21)
    at Function.Task.RunSynchronously (/tracing/base/task.html:152:25)
[  FAILED  ] /b/s/w/itSFPe4t/tmpySM3Y0.html (26693 ms)

Might be similar to the memory error.
Cc: nduca@chromium.org charliea@chromium.org
This is tracked in https://github.com/catapult-project/catapult/issues/2826


Cc: primiano@chromium.org
This is actually not tracked in that bug! That bug is a JS out of memory error, which is what #2 is referencing. The other error is a new error on Windows, which charliea@ and I looked at and hopefully fixed today.

CL coming soon! (I think).
Project Member

Comment 6 by bugdroid1@chromium.org, Dec 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a2619f93849d0ccd022cbfe1c3e8382441525755

commit a2619f93849d0ccd022cbfe1c3e8382441525755
Author: catapult-deps-roller <catapult-deps-roller@chromium.org>
Date: Thu Dec 08 01:55:11 2016

Roll src/third_party/catapult/ 70578734a..11d3d44fb (2 commits).

https://chromium.googlesource.com/external/github.com/catapult-project/catapult.git/+log/70578734ad34..11d3d44fb9d2

$ git log 70578734a..11d3d44fb --date=short --no-merges --format='%ad %ae %s'
2016-12-07 charliea Reduce the memory required by Telemetry to stream back a Chrome trace
2016-12-07 eakuefner [Telemetry] Fix examples/run_benchmark

BUG= 672097 

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, see:
http://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium#TOC-Failures-due-to-DEPS-rolls

CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.android:android_optional_gpu_tests_rel
TBR=catapult-sheriff@chromium.org

Review-Url: https://codereview.chromium.org/2558203002
Cr-Commit-Position: refs/heads/master@{#437132}

[modify] https://crrev.com/a2619f93849d0ccd022cbfe1c3e8382441525755/DEPS

Status: Fixed (was: Available)
An update!

I haven't seen any more MemoryError exceptions on the Windows bots. They're now failing on the JavaScript exception as reported in the catapult bug (https://github.com/catapult-project/catapult/issues/2826). Closing this, as it is actually fixed :)
Status: Available (was: Fixed)
Ok, they're back :(

https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FWin_Zenbook_Perf%2F55%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.common_desktop_on_Intel_GPU_on_Windows_on_Windows-10-10240%2F0%2Fstdout

Looks like it's failing when it's trying to do a json.loads on the trace.

  File "c:\b\s\w\ir5mijmn\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 84, in _GotChunkFromStream
    self._callback(trace_string)
  File "c:\b\s\w\ir5mijmn\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\tracing_backend.py", line 285, in _ReceivedAllTraceDataFromStream
    trace = json.loads(data)
  File "c:\b\depot_tools\python276_bin\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "c:\b\depot_tools\python276_bin\lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\b\depot_tools\python276_bin\lib\json\decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError

Not sure what to do here :(
The right thing to do here is not to do json.loads(data) & just keep the string form.
It needs to load the data. It does stuff with the parsed data.

Maybe we can use something like https://pypi.python.org/pypi/ijson ??
martiniss: sorry for the lack of context here. In the past, we processed tracing data in Python, so calling json.loads was a requirement. However, with the TBMv2 architecture, all the work of processing trace data is delegated to the d8 binary, so all we need to do is pass a string to that binary.

When we transitioned from Python processing to d8 processing, we were not able to make a complete move, so some call sites still rely on the parsed Python form. However, for the system health benchmarks, json.loads is not needed at all, and the call can be avoided by delaying json.loads to https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/timeline/model.py#L75
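To illustrate the idea of delaying json.loads, here is a minimal sketch, not Telemetry's actual code. The class name `LazyTraceData` and its methods are hypothetical; the only real API used is the standard `json` module. The trace stays in string form (which is all a d8-based pipeline needs) and is parsed only if a caller explicitly asks for the Python form.

```python
import json


class LazyTraceData(object):
  """Hypothetical holder that keeps a trace as a string until parsing is needed."""

  def __init__(self, raw_json):
    self._raw = raw_json      # The string form received from the browser.
    self._parsed = None       # Filled in only on the first AsDict() call.
    self.parse_count = 0      # For illustration: how many times we parsed.

  def AsString(self):
    # Cheap path: hand the raw string to whatever consumes it (e.g. d8).
    return self._raw

  def AsDict(self):
    # Expensive path: parse once, on demand, and cache the result.
    if self._parsed is None:
      self._parsed = json.loads(self._raw)
      self.parse_count += 1
    return self._parsed
```

With something like this, benchmarks that only forward the trace never pay for the parsed copy, while legacy call sites that still need the Python form keep working.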
Cc: martiniss@chromium.org nednguyen@chromium.org
This has been red for almost a month now on almost all desktop bots.

Should we disable this test until a real solution can be implemented?
Cc: eyaich@chromium.org
 Issue 676626  has been merged into this issue.
Cc: caseq@chromium.org
Owner: charliea@chromium.org
Status: Assigned (was: Available)
From the discussion last time, Charlie will drive fixing the trace size problem in JS with help from Andrey. I will help with the Python OOM.
So we are just going to leave the bots red until a fix can be done for it?  What is the ETA on that?
Owner: chiniforooshan@chromium.org
chiniforooshan@ is working on https://github.com/catapult-project/catapult/issues/2826 for the JS side. I will be working on the Python side and hope to get it working this week.
I gave up on reproducing the problem on my Windows machine, as Python always takes 300 MB at most there. I will proceed with the trace-on-disk solution in https://github.com/catapult-project/catapult/issues/3119
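A rough sketch of what a trace-on-disk approach could look like, assuming chunks arrive one at a time as they do in `_GotChunkFromStream`. The class `TraceChunkSpooler` and its method names are made up for illustration and are not taken from the catapult change; only `tempfile.mkstemp` and `os.fdopen` are real standard-library calls.

```python
import os
import tempfile


class TraceChunkSpooler(object):
  """Hypothetical helper: appends trace chunks to a temp file as they arrive."""

  def __init__(self):
    # mkstemp returns an open file descriptor and the path of the new file.
    fd, self.path = tempfile.mkstemp(suffix='.json')
    self._file = os.fdopen(fd, 'w', encoding='utf-8')

  def AddChunk(self, chunk):
    # Each chunk goes straight to disk, so no list of chunks is kept in
    # memory and no ''.join(...) of the whole trace is ever needed.
    self._file.write(chunk)

  def Close(self):
    # Flush and close the file; the caller owns the path (and its cleanup).
    self._file.close()
    return self.path
```

The returned path can then be handed to a later processing step instead of a multi-hundred-megabyte string.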
Blockedon: 679768
Owner: nedngu...@google.com
Keep this bug about the sizing problem on Python side. The blocking bug ( issue 679768 ) is for the JS side.
Status: Fixed (was: Assigned)
After the memory improvement, local profiling shows that Telemetry's memory usage when running system_health's browsing stories is now stable at around 70MiB: http://pastebin.com/1n4amvzf

Marking this as fixed.
For the record, without the fix, the peak memory usage of just running a single browsing story (browse:news:nytimes) is 480MiB: http://pastebin.com/CrgV1i2Z
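For anyone who wants to see the effect on a small scale, the sketch below compares peak Python allocations when chunks are collected and joined in memory versus written to disk as they arrive. This is an assumption-laden toy: the chunks are synthetic, all function names are hypothetical, and it uses `tracemalloc` (Python 3 only), which is not necessarily the profiler behind the numbers quoted above.

```python
import os
import tempfile
import tracemalloc


def MakeChunks(count=200, size=10000):
  """Yields synthetic chunks standing in for reads from the trace stream."""
  for i in range(count):
    yield str(i % 10) * size


def PeakBytes(func):
  """Returns the peak number of bytes Python allocated while func() ran."""
  tracemalloc.start()
  try:
    func()
    _, peak = tracemalloc.get_traced_memory()
  finally:
    tracemalloc.stop()
  return peak


def JoinInMemory():
  # Keeps every chunk alive, then builds one more full-size copy in join().
  data = []
  for chunk in MakeChunks():
    data.append(chunk)
  return len(''.join(data))


def StreamToDisk():
  # Only one chunk (plus the file buffer) is alive at any time.
  fd, path = tempfile.mkstemp()
  try:
    with os.fdopen(fd, 'w') as f:
      for chunk in MakeChunks():
        f.write(chunk)
  finally:
    os.remove(path)
```

Calling `PeakBytes(JoinInMemory)` and `PeakBytes(StreamToDisk)` and comparing the two results shows the gap; the exact figures depend on the interpreter and platform.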


Issue 688494 has been merged into this issue.

Comment 23 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 24 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 26 by dchan@chromium.org, Oct 14 2017

Status: Archived (was: Fixed)
