
Issue 800006


Issue metadata

Status: Fixed
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 2
Type: Bug




telemetry_perf_unittests failing on chromium.mac/Mac10.11 Tests

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Jan 8 2018

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of tmartino@google.com

telemetry_perf_unittests failing on chromium.mac/Mac10.11 Tests

Builders failed on: 
- Mac10.11 Tests: 
  https://build.chromium.org/p/chromium.mac/builders/Mac10.11%20Tests


 
Owner: nednguyen@chromium.org
Status: Assigned (was: Available)
Assigning to Ned at his suggestion.

For context, the last 3 builds on this builder have failed, all with System Health Smoke Test failures (but different tests failing each time).
Cc: perezju@chromium.org
Components: Speed>Telemetry
This is an OOM problem :-(

  Traceback (most recent call last):
    RunBenchmark at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:359
      max_num_values=benchmark.MAX_NUM_VALUES)
    Run at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:224
      _RunStoryAndProcessErrorIfNeeded(story, results, state, test)
    _RunStoryAndProcessErrorIfNeeded at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:118
      state.WillRunStory(story)
    traced_function at /b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:75
      return func(*args, **kwargs)
    WillRunStory at /b/s/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py:231
      archive_path, page.make_javascript_deterministic, self._extra_wpr_args)
    traced_function at /b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:75
      return func(*args, **kwargs)
    StartReplay at /b/s/w/ir/third_party/catapult/telemetry/telemetry/core/network_controller.py:53
      archive_path, make_javascript_deterministic, extra_wpr_args)
    StartReplay at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py:150
      local_ports = self._StartReplayServer()
    _StartReplayServer at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py:173
      return self._wpr_server.StartServer()
    StartServer at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py:236
      ''.join(self._LogLines()))
  ReplayNotStartedError: Web Page Replay failed to start. Log output:
  2018/01/08 08:50:49 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
  2018/01/08 08:50:49 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
  2018/01/08 08:50:49 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
  
  Locals:
    is_posix : True
    log_fh   : <closed file '/b/s/w/it86IArw/tmpP9UXCU', mode 'w' at 0x1121019c0>
  
  Used 15.9 GiB out of 16.0 GiB memory available.
Looks like we are leaking WPR processes?
https://chromium-swarm.appspot.com/task?id=3aef6fe2bc59fb10&refresh=10&show_raw=1


  Locals:
    is_posix : True
    log_fh   : <closed file '/b/s/w/it86IArw/tmpP9UXCU', mode 'w' at 0x1121019c0>
  
  Used 15.9 GiB out of 16.0 GiB memory available.
  Top 10 memory consumers:
  - 1.1 GiB (pid=287) Finder
  - 268.9 MiB (pid=1055) wpr
  - 238.7 MiB (pid=949) wpr
  - 155.1 MiB (pid=243) lsd
  - 80.3 MiB (pid=887) Python
  - 69.7 MiB (pid=1067) wpr
  - 66.7 MiB (pid=957) wpr
  - 66.6 MiB (pid=1099) wpr
  - 63.6 MiB (pid=1078) wpr
  - 63.5 MiB (pid=1132) wpr
  Current process:
  - 80.3 MiB (pid=887) Python
Cc: nedngu...@google.com
Owner: perezju@chromium.org
:O

Yep, it does look like that. I'll have a look.
Owner: nedngu...@google.com
Stealing this from Juan.. :-P
Hmm, looking at the log again: the first time the test fails, the OS's memory is full, but not because of Telemetry. We failed to clean up WPR in this case, which created lingering wpr processes later.

https://chromium-swarm.appspot.com/task?id=3aff541fddc11110&refresh=10&show_raw=1

  2018/01/11 10:44:42 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
  
  Locals:
    is_posix : True
    log_fh   : <closed file '/b/s/w/itMTubF7/tmpUsN1fL', mode 'w' at 0x109e26930>
  
  Used 15.9 GiB out of 16.0 GiB memory available.
  Top 10 memory consumers:
  - 1.1 GiB (pid=277) Finder
  - 205.8 MiB (pid=519) Python
  - 110.7 MiB (pid=243) lsd
  - 60.4 MiB (pid=592) wpr
  - 51.9 MiB (pid=270) Mail
  - 41.6 MiB (pid=296) com.apple.dock.extra
  - 41.3 MiB (pid=299) SpotlightNetHelper
  - 38.1 MiB (pid=480) Python
  - 37.5 MiB (pid=340) Python
  - 32.5 MiB (pid=247) identityservicesd
  Current process:
  - 205.8 MiB (pid=519) Python
  Traceback (most recent call last):
    File "/b/s/w/ir/tools/perf/benchmarks/system_health_smoke_test.py", line 126, in RunTest
      msg='Failed: %s' % benchmark_class)
  AssertionError: Failed: <class 'benchmarks.system_health.DesktopMemorySystemHealth'>
I had an idea when looking at these memory summaries. Since there is no single largest culprit, maybe we should aggregate by process name? e.g. something like:

  - 1.1 GiB Finder [pids: 277]
  - xxx GiB Python [pids: 519, 480, 340, ...]
  - xxx GiB wpr [pids: 592, ...]
  - ...
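
A minimal sketch of that grouping, assuming psutil (illustrative only; this is not the actual memory_debug.py change):

  from collections import defaultdict
  import psutil

  def LogTopMemoryConsumersByName(n=10):
    # Aggregate resident set size (RSS) by process name.
    groups = defaultdict(lambda: [0, []])  # name -> [total_rss, pids]
    for proc in psutil.process_iter():
      try:
        info = proc.as_dict(attrs=['pid', 'name', 'memory_info'])
      except psutil.Error:
        continue  # process exited (or became inaccessible) while iterating
      groups[info['name']][0] += info['memory_info'].rss
      groups[info['name']][1].append(info['pid'])
    ranked = sorted(groups.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (total, pids) in ranked[:n]:
      print('- %s - %.1f MiB - pids: %s' % (name, total / 2.0 ** 20, pids))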
Project Member Comment 8 by bugdroid1@chromium.org, Jan 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/d3d295104430f480cf3b44d27e4fe865a997c1ce

commit d3d295104430f480cf3b44d27e4fe865a997c1ce
Author: Nghia Nguyen <nednguyen@google.com>
Date: Fri Jan 12 11:29:53 2018

Ensure webpagereplay_go_server process is killed if there is exception in starting up WPR server

Bug:  chromium:800006 
Change-Id: I9e2ed586df8b16ca01f96dc597401765f31d93f9
Reviewed-on: https://chromium-review.googlesource.com/862365
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/d3d295104430f480cf3b44d27e4fe865a997c1ce/telemetry/telemetry/internal/util/webpagereplay_go_server_unittest.py
[modify] https://crrev.com/d3d295104430f480cf3b44d27e4fe865a997c1ce/telemetry/telemetry/internal/util/webpagereplay_go_server.py
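
The gist of the fix, as a hedged sketch (helper and callback names here are hypothetical, not the actual diff): if anything raises while waiting for WPR to come up, kill the subprocess before re-raising so no wpr process lingers.

  import subprocess

  def StartWprOrKill(cmd, wait_until_ready):
    # Hypothetical helper: start wpr, kill it if startup raises.
    process = subprocess.Popen(cmd)
    try:
      wait_until_ready(process)  # e.g. poll until the replay ports are up
    except Exception:
      process.kill()  # avoid leaving a lingering wpr process behind
      process.wait()
      raise
    return process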

Project Member Comment 9 by bugdroid1@chromium.org, Jan 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/d4489b13e59fc89e51c5c6ac7fd14827e516febe

commit d4489b13e59fc89e51c5c6ac7fd14827e516febe
Author: Nghia Nguyen <nednguyen@google.com>
Date: Fri Jan 12 13:18:23 2018

Change memory_debug to log total memory usage grouped by process names

Bug:  chromium:800006 
Change-Id: Ib70ea438507d37178a79410861e87b2c920e504b
Reviewed-on: https://chromium-review.googlesource.com/863906
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>

[modify] https://crrev.com/d4489b13e59fc89e51c5c6ac7fd14827e516febe/common/py_utils/py_utils/memory_debug.py

Happened again:
https://chromium-swarm.appspot.com/task?id=3b1476edc58c3e10&refresh=10&show_raw=1

  Used 15.9 GiB out of 16.0 GiB memory available.
  Memory usage of top 10 processes groups
  - Finder - 1.8 GiB - pids: ['282']
  - Python - 198.3 MiB - pids: ['336', '429', '430', '452', '453', '454', '455']
  - lsd - 59.0 MiB - pids: ['212', '246']
  - Mail - 52.0 MiB - pids: ['275']
  - SpotlightNetHelper - 42.2 MiB - pids: ['307']
  - com.apple.dock.extra - 41.7 MiB - pids: ['301']
  - identityservicesd - 33.8 MiB - pids: ['250']
  - python - 31.4 MiB - pids: ['345']
  - NotificationCenter - 29.6 MiB - pids: ['337']
  - Spotlight - 27.1 MiB - pids: ['300']
  Current process:
  80.2 MiB (pid=455)
  Traceback (most recent call last):

Not sure how this can add up to 15.9 GiB 

:-/
Yep, I don't see any of our processes using unreasonable amounts of memory. Maybe hand this off to infra so they can check if there is anything strange going on with the machine?
Cc: jbudorick@chromium.org bpastene@chromium.org mar...@chromium.org
+Ben, John, Maruel: can you help me route this to the right people? We suspect that the swarming bots are consuming way too much memory & interfering with tests.
Labels: OS-Mac Type-Bug
The two bots you linked to:
https://chromium-swarm.appspot.com/bot?id=vm987-m4&selected=1&show_all_events=true&show_state=true
https://chromium-swarm.appspot.com/bot?id=vm143-m1&selected=1&show_all_events=true&show_state=true

Both have "periodic_reboot_secs": 43200, so they reboot. They also reboot after a task failure.

Are the values resident memory or virtual memory? I think it's resident, but resident memory normally all gets used eventually, with the rest typically used by the OS disk cache.

Maybe there's a problem with one of the tests running on this fleet of VMs. The variety of tests run seems fairly low, so it should be possible to run each repeatedly to find the culprit.

The list of task failures may help diagnose:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&c=stepname&c=buildername&c=buildnumber&et=1516105260000&f=state%3ACOMPLETED_FAILURE&f=cpu%3Ax86-64&f=gpu%3Anone&f=os%3AMac-10.11&f=pool%3AChrome&l=500&n=true&s=created_ts%3Adesc&st=1515846000000
Cc: erikc...@chromium.org
" but resident memory normally eventually all get used, the rest normally used by the OS disk cache." --> interesting. WHat happen when resident memory all get used up & user try to launch a new process? Does the MacOS return the memory back?

Also is there a better way to log actual memory usage of all alive processes (discounting memory used by the OS disk cache)?
The OS will aggressively free up memory used by the disk cache if the user launches a new process. So depending on how you measure mem usage, it may look like you're oom but in reality the userspace still has a lot of ram to work with.

Our monitoring data collects memory usage across the fleet and ignores the page cache. (We record "available" mem, not "free" mem.) Here's the usage of one of the mac vms:
http://shortn/_uMXdUdjbke
Looks like it never breaks 70%.
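
For illustration, one way to see that difference yourself, assuming psutil (not necessarily what the fleet monitoring uses):

  import psutil

  vm = psutil.virtual_memory()
  # 'free' excludes the page cache; 'available' estimates what userspace
  # could actually claim, since the OS evicts cache on demand.
  print('free:      %.1f GiB' % (vm.free / 2.0 ** 30))
  print('available: %.1f GiB' % (vm.available / 2.0 ** 30))
  print('used:      %.1f%%' % vm.percent)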
LogHostMemoryUsage will not do what you want on macOS, as it only records "resident" memory. If you write a program that does this:
"""
while (true)
  LeakSomeMemory()
"""
The resident memory of the program will approach 0 [as all the memory is paged out], and the system memory will eventually run out.

To acquire total system memory stats, always use "vm_stat" on macOS. The interesting lines if you're leaking memory will be "pages stored in compressor" and "pages occupied by compressor".
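
A hedged sketch of pulling those lines out of vm_stat (the field names match vm_stat's output; the helper itself is made up):

  import re
  import subprocess

  def CompressorPages():
    out = subprocess.check_output(['vm_stat'], universal_newlines=True)
    stats = {}
    for line in out.splitlines():
      # Lines look like: 'Pages stored in compressor:      123456.'
      m = re.match(r'"?([\w -]+)"?:\s+(\d+)\.', line)
      if m:
        stats[m.group(1)] = int(m.group(2))
    return (stats.get('Pages stored in compressor'),
            stats.get('Pages occupied by compressor'))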

To acquire the "memory" usage of a program on macOS, I recommend the C API task_info(TASK_VM_INFO, ...) and pulling out phys_footprint.
For an example, see: https://cs.chromium.org/chromium/src/base/process/process_metrics_mac.cc?type=cs&q=GetTaskVMInfo&sq=package:chromium&l=305
And for those less familiar with mac-isms.

You want a C program that does something like

  #include <stdio.h>
  #include <stdlib.h>
  #include <mach/mach.h>
  int main(int argc, char** argv) {
    int pid = atoi(argv[1]);  // pid is the first command-line argument
    task_t task;
    if (task_for_pid(mach_task_self(), pid, &task) != KERN_SUCCESS)
      return 2;  // fails without root (SIP can also block it)
    task_vm_info_data_t info;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    task_info(task, TASK_VM_INFO, (task_info_t)&info, &count);
    printf("phys_footprint: %llu bytes\n", (unsigned long long)info.phys_footprint);
    return 0;
  }

and then you'll need to run it with root privileges to be able to use task_for_pid.
Thanks for the guidance, Erik! I will try to reproduce this problem with the C program locally first before proceeding with a bot solution.

Looks like I already have a test file that does this, along with some other stuff. Feel free to use it as a starting point.
Attachment: private_memory_footprint.cc (4.0 KB)
Thanks for the script, Erik! However, I found that it works on some processes & not others. Any idea why?

Example:

sudo ./private_memory_footprint 726
pid: 726
fast [phys_footprint]: 5160 [kb]
fast [internal + compressed]: 5160 [kb]
slow but accurate: 4044 [kb]
slow with dirty COW: 5036 [kb]
difference between fast [phys_footprint] and slow with dirty COW: 0.024031


sudo ./private_memory_footprint 21462
<Nothing got printed out>
What's the return value of the program? At a guess, it's (2) and task_for_pid is failing. You'll need to disable System Integrity Protection. I'll send you a link offline.
The memory usage is a red herring. Something else is really off about this bug: I tried to reproduce it locally by replacing the wprgo binary with a fake binary and still got the stdout log.

So the fact that we get a "ReplayNotStartedError: Web Page Replay failed to start. Log output:
(N/A)" log means Telemetry even fails to create the temp logfile (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py?rcl=0a20f3ce6c4d6802892cbaa7489cde394eaa6fac&l=224),

which happens before it tries to launch the wprgo process (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py?rcl=0a20f3ce6c4d6802892cbaa7489cde394eaa6fac&l=226).

I will add more logging to see what's going on there.
Project Member Comment 24 by bugdroid1@chromium.org, Jan 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/54db141e38b1cc886b5f24bffc5660aefef44e78

commit 54db141e38b1cc886b5f24bffc5660aefef44e78
Author: Nghia Nguyen <nednguyen@google.com>
Date: Mon Jan 22 21:58:28 2018

Make sure wpr stdout/stderr log is captured before the log file is cleaned up

Before this change, the stdout/stderr log of the wprgo process was lost upon a wprgo
startup crash because we cleaned up the log file before querying the log. This makes
sure that we capture all the logs before cleaning up the log file in that scenario.

Bug: chromium:800006 
Change-Id: I64a0348e599c7730cfe8016ab5a90ab5d3860f4b
TBR=perezju@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/879327
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/54db141e38b1cc886b5f24bffc5660aefef44e78/telemetry/telemetry/internal/util/webpagereplay_go_server.py
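
The pattern behind the fix, sketched with hypothetical names (not the exact diff): read the temp log's contents before removing the file, so a startup crash still has output to report.

  import os

  def DrainAndRemoveLog(log_path):
    try:
      with open(log_path) as f:
        contents = f.read()  # capture the log first...
    except IOError:
      contents = '(N/A)'  # the log file was never created
    if os.path.exists(log_path):
      os.remove(log_path)  # ...then clean it up
    return contents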

Cc: xunji...@chromium.org
Looks like wprgo got stuck after starting:

    File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py", line 236, in StartServer
      'Web Page Replay failed to start. Log output:\n%s' % log_output)
  ReplayNotStartedError: Web Page Replay failed to start. Log output:
  2018/01/22 22:44:11 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
  2018/01/22 22:44:11 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
  2018/01/22 22:44:11 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
  

We probably need to add more logs to web_page_replay_go to figure out why it gets stuck.
I highly suspect that this flakiness comes from some archive files being big, taking more than 30s for wprgo to process. I am adding more logging to verify that theory.
Project Member Comment 27 by bugdroid1@chromium.org, Jan 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/82173870d1bd9df5954e941aaa8b7781cf1f5b75

commit 82173870d1bd9df5954e941aaa8b7781cf1f5b75
Author: Nghia Nguyen <nednguyen@google.com>
Date: Tue Jan 23 14:58:48 2018

Make ReplayServer._CleanUpTempLogFilePath quit gracefully if the log file is already cleaned up

When ReplayServer.StartServer() fails, it will try to clean up the temp logfile.
This made subsequent calls to ReplayServer.StopServer() fail. See
https://chromium-swarm.appspot.com/task?id=3b3a8c33ae9d5410&refresh=10&show_raw=1
for an example of how this happens.

This change makes ReplayServer._CleanUpTempLogFilePath() quit gracefully, which
also makes ReplayServer.StopServer idempotent (subsequent calls are no-ops).

Bug:  chromium:800006 
Change-Id: I7d1180843cb03c808ac79bfcafeb6594781c9976
Reviewed-on: https://chromium-review.googlesource.com/881022
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/82173870d1bd9df5954e941aaa8b7781cf1f5b75/telemetry/telemetry/internal/util/webpagereplay_go_server.py
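
A sketch of what the idempotent cleanup can look like (hypothetical attribute names, not the real class): clearing the path on the first call makes every later call a no-op.

  import os

  class ReplayServerSketch(object):
    def __init__(self, temp_log_file_path):
      self._temp_log_file_path = temp_log_file_path

    def _CleanUpTempLogFilePath(self):
      if not self._temp_log_file_path:
        return  # already cleaned up; quit gracefully
      if os.path.exists(self._temp_log_file_path):
        os.remove(self._temp_log_file_path)
      self._temp_log_file_path = None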

Project Member Comment 28 by bugdroid1@chromium.org, Jan 23 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/296e7c3a900bec6123d48cda8436a31b382d0ffe

commit 296e7c3a900bec6123d48cda8436a31b382d0ffe
Author: Nghia Nguyen <nednguyen@google.com>
Date: Tue Jan 23 20:49:49 2018

Add logging when web_page_replay_go tries to read the archive

Bug:  chromium:800006 
Change-Id: I659fdcd6a4b8044f5c02f34c96b386c664f07dd5
Reviewed-on: https://chromium-review.googlesource.com/881023
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Helen Li <xunjieli@chromium.org>

[modify] https://crrev.com/296e7c3a900bec6123d48cda8436a31b382d0ffe/web_page_replay_go/src/wpr.go
[modify] https://crrev.com/296e7c3a900bec6123d48cda8436a31b382d0ffe/telemetry/telemetry/internal/binary_dependencies.json

The log confirms that the wprgo binary got stuck loading the archive file:

ReplayNotStartedError: Web Page Replay failed to start. Log output:
  2018/01/24 04:05:39 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
  2018/01/24 04:05:39 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
  2018/01/24 04:05:39 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
  2018/01/24 04:05:39 Loading archive file from /b/s/w/ir/tools/perf/page_sets/data/system_health_desktop_040.wprgo

I will bump up the timeout for starting wprgo replay.
Project Member Comment 30 by bugdroid1@chromium.org, Jan 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/catapult/+/c3373753de40a505241481ad2e74e2746f21551f

commit c3373753de40a505241481ad2e74e2746f21551f
Author: Nghia Nguyen <nednguyen@google.com>
Date: Wed Jan 24 15:22:16 2018

Increase the wait-time for starting wprgo replay

Bug:  chromium:800006 
Change-Id: I25360702d865c33a6eeaf761aff3aa0e10c12a54
Reviewed-on: https://chromium-review.googlesource.com/883762
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>

[modify] https://crrev.com/c3373753de40a505241481ad2e74e2746f21551f/telemetry/telemetry/internal/util/webpagereplay_go_server.py

Status: Fixed (was: Assigned)
This suite hasn't failed in the last 50 builds, so I think this is fixed.
🎉

Comment 33 by benhenry@google.com, Jan 16

Components: Test>Telemetry

Comment 34 by benhenry@google.com, Jan 16

Components: -Speed>Telemetry
