telemetry_perf_unittests failing on chromium.mac/Mac10.11 Tests
Issue description
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of tmartino@google.com

telemetry_perf_unittests failing on chromium.mac/Mac10.11 Tests
Builders failed on:
- Mac10.11 Tests: https://build.chromium.org/p/chromium.mac/builders/Mac10.11%20Tests
Jan 8 2018
This is an OOM problem :-(
Traceback (most recent call last):
RunBenchmark at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:359
max_num_values=benchmark.MAX_NUM_VALUES)
Run at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:224
_RunStoryAndProcessErrorIfNeeded(story, results, state, test)
_RunStoryAndProcessErrorIfNeeded at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:118
state.WillRunStory(story)
traced_function at /b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:75
return func(*args, **kwargs)
WillRunStory at /b/s/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py:231
archive_path, page.make_javascript_deterministic, self._extra_wpr_args)
traced_function at /b/s/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py:75
return func(*args, **kwargs)
StartReplay at /b/s/w/ir/third_party/catapult/telemetry/telemetry/core/network_controller.py:53
archive_path, make_javascript_deterministic, extra_wpr_args)
StartReplay at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py:150
local_ports = self._StartReplayServer()
_StartReplayServer at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/network_controller_backend.py:173
return self._wpr_server.StartServer()
StartServer at /b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py:236
''.join(self._LogLines()))
ReplayNotStartedError: Web Page Replay failed to start. Log output:
2018/01/08 08:50:49 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
2018/01/08 08:50:49 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
2018/01/08 08:50:49 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
Locals:
is_posix : True
log_fh : <closed file '/b/s/w/it86IArw/tmpP9UXCU', mode 'w' at 0x1121019c0>
Used 15.9 GiB out of 16.0 GiB memory available.
Jan 8 2018
Looks like we are leaking WPR processes? https://chromium-swarm.appspot.com/task?id=3aef6fe2bc59fb10&refresh=10&show_raw=1
Locals:
  is_posix : True
  log_fh : <closed file '/b/s/w/it86IArw/tmpP9UXCU', mode 'w' at 0x1121019c0>
Used 15.9 GiB out of 16.0 GiB memory available.
Top 10 memory consumers:
- 1.1 GiB (pid=287) Finder
- 268.9 MiB (pid=1055) wpr
- 238.7 MiB (pid=949) wpr
- 155.1 MiB (pid=243) lsd
- 80.3 MiB (pid=887) Python
- 69.7 MiB (pid=1067) wpr
- 66.7 MiB (pid=957) wpr
- 66.6 MiB (pid=1099) wpr
- 63.6 MiB (pid=1078) wpr
- 63.5 MiB (pid=1132) wpr
Current process:
- 80.3 MiB (pid=887) Python
Jan 9 2018
:O Yep, it does look like that. I'll have a look.
Jan 11 2018
Stealing this from Juan... :-P
Jan 11 2018
Hmhh, looking at the log again: the first time the test fails, the OS's memory is full, but not because of Telemetry. We failed to clean up WPR in this case, which created lingering wpr processes later. https://chromium-swarm.appspot.com/task?id=3aff541fddc11110&refresh=10&show_raw=1
2018/01/11 10:44:42 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
Locals:
  is_posix : True
  log_fh : <closed file '/b/s/w/itMTubF7/tmpUsN1fL', mode 'w' at 0x109e26930>
Used 15.9 GiB out of 16.0 GiB memory available.
Top 10 memory consumers:
- 1.1 GiB (pid=277) Finder
- 205.8 MiB (pid=519) Python
- 110.7 MiB (pid=243) lsd
- 60.4 MiB (pid=592) wpr
- 51.9 MiB (pid=270) Mail
- 41.6 MiB (pid=296) com.apple.dock.extra
- 41.3 MiB (pid=299) SpotlightNetHelper
- 38.1 MiB (pid=480) Python
- 37.5 MiB (pid=340) Python
- 32.5 MiB (pid=247) identityservicesd
Current process:
- 205.8 MiB (pid=519) Python
Traceback (most recent call last):
  File "/b/s/w/ir/tools/perf/benchmarks/system_health_smoke_test.py", line 126, in RunTest
    msg='Failed: %s' % benchmark_class)
AssertionError: Failed: <class 'benchmarks.system_health.DesktopMemorySystemHealth'>
Jan 12 2018
I had an idea when looking at these memory summaries. Since there is no single largest culprit, maybe we should aggregate by process name? E.g. something like:
- 1.1 GiB Finder [pids: 277]
- xxx GiB Python [pids: 519, 480, 340, ...]
- xxx GiB wpr [pids: 592, ...]
- ...
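A minimal sketch of that aggregation in Python, assuming psutil is available (memory_debug.py already uses it); the real change landed in py_utils/memory_debug.py, see the commit below:

  import collections
  import psutil

  def MemoryUsageByProcessName():
    # Sum resident set size (RSS) over all processes sharing a name.
    usage = collections.defaultdict(lambda: (0, []))
    for proc in psutil.process_iter():
      try:
        name, rss, pid = proc.name(), proc.memory_info().rss, proc.pid
      except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue  # Processes may exit or deny access mid-iteration.
      total, pids = usage[name]
      usage[name] = (total + rss, pids + [pid])
    return sorted(usage.items(), key=lambda kv: kv[1][0], reverse=True)

  for name, (rss, pids) in MemoryUsageByProcessName()[:10]:
    print('- %.1f MiB %s [pids: %s]' % (rss / 1024.0 ** 2, name, pids))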
Jan 12 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/d3d295104430f480cf3b44d27e4fe865a997c1ce

commit d3d295104430f480cf3b44d27e4fe865a997c1ce
Author: Nghia Nguyen <nednguyen@google.com>
Date: Fri Jan 12 11:29:53 2018

Ensure the webpagereplay_go_server process is killed if there is an exception in starting up the WPR server

Bug: chromium:800006
Change-Id: I9e2ed586df8b16ca01f96dc597401765f31d93f9
Reviewed-on: https://chromium-review.googlesource.com/862365
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
[modify] https://crrev.com/d3d295104430f480cf3b44d27e4fe865a997c1ce/telemetry/telemetry/internal/util/webpagereplay_go_server_unittest.py
[modify] https://crrev.com/d3d295104430f480cf3b44d27e4fe865a997c1ce/telemetry/telemetry/internal/util/webpagereplay_go_server.py
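Roughly, the pattern this commit adds (a sketch with a hypothetical readiness helper, not the actual patch):

  import subprocess

  def StartReplayServer(wpr_cmd):
    proc = subprocess.Popen(wpr_cmd)
    try:
      WaitUntilReplayIsReady(proc)  # Hypothetical readiness check.
    except Exception:
      # An exception during startup used to leak a live wpr process;
      # kill the child before re-raising.
      proc.kill()
      proc.wait()
      raise
    return proc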
Jan 12 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/d4489b13e59fc89e51c5c6ac7fd14827e516febe

commit d4489b13e59fc89e51c5c6ac7fd14827e516febe
Author: Nghia Nguyen <nednguyen@google.com>
Date: Fri Jan 12 13:18:23 2018

Change memory_debug to log total memory usage grouped by process names

Bug: chromium:800006
Change-Id: Ib70ea438507d37178a79410861e87b2c920e504b
Reviewed-on: https://chromium-review.googlesource.com/863906
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
[modify] https://crrev.com/d4489b13e59fc89e51c5c6ac7fd14827e516febe/common/py_utils/py_utils/memory_debug.py
Jan 16 2018
Happened again: https://chromium-swarm.appspot.com/task?id=3b1476edc58c3e10&refresh=10&show_raw=1
Used 15.9 GiB out of 16.0 GiB memory available.
Memory usage of top 10 process groups:
- Finder - 1.8 GiB - pids: ['282']
- Python - 198.3 MiB - pids: ['336', '429', '430', '452', '453', '454', '455']
- lsd - 59.0 MiB - pids: ['212', '246']
- Mail - 52.0 MiB - pids: ['275']
- SpotlightNetHelper - 42.2 MiB - pids: ['307']
- com.apple.dock.extra - 41.7 MiB - pids: ['301']
- identityservicesd - 33.8 MiB - pids: ['250']
- python - 31.4 MiB - pids: ['345']
- NotificationCenter - 29.6 MiB - pids: ['337']
- Spotlight - 27.1 MiB - pids: ['300']
Current process: 80.2 MiB (pid=455)
Traceback (most recent call last):
Not sure how this can add up to 15.9 GiB :-/
Jan 16 2018
Yep, I don't see any of our processes using unreasonable amounts of memory. Maybe hand this off to infra so they can check if there is anything strange going on with the machine?
Jan 16 2018
+Ben, John, Maruel: can you help me route this to the right people? We suspect that the swarming bots are consuming way too much memory and interfering with tests.
Jan 16 2018
The two bots you linked to:
https://chromium-swarm.appspot.com/bot?id=vm987-m4&selected=1&show_all_events=true&show_state=true
https://chromium-swarm.appspot.com/bot?id=vm143-m1&selected=1&show_all_events=true&show_state=true
both have "periodic_reboot_secs": 43200, so they reboot periodically. They also reboot after a task failure.
Are the values resident memory or virtual memory? I think it is resident, but resident memory normally all gets used eventually; the rest is normally used by the OS disk cache.
Maybe there's a problem with one of the tests running on this fleet of VMs. The variety of tests run seems fairly low, so it should be possible to run each repeatedly to find the culprit. The list of task failures may help diagnose:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&c=stepname&c=buildername&c=buildnumber&et=1516105260000&f=state%3ACOMPLETED_FAILURE&f=cpu%3Ax86-64&f=gpu%3Anone&f=os%3AMac-10.11&f=pool%3AChrome&l=500&n=true&s=created_ts%3Adesc&st=1515846000000
Jan 16 2018
" but resident memory normally eventually all get used, the rest normally used by the OS disk cache." --> interesting. WHat happen when resident memory all get used up & user try to launch a new process? Does the MacOS return the memory back? Also is there a better way to log actual memory usage of all alive processes (discounting memory used by the OS disk cache)?
Jan 16 2018
The OS will aggressively free up memory used by the disk cache if the user launches a new process. So depending on how you measure memory usage, it may look like you're OOM while in reality userspace still has plenty of RAM to work with. Our monitoring data collects memory usage across the fleet and ignores the page cache (we record "available" mem, not "free" mem). Here's the usage of one of the Mac VMs: http://shortn/_uMXdUdjbke. It looks like it never breaks 70%.
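For anyone poking at a bot by hand, the "available" vs "free" distinction can be seen with psutil (a sketch; the fleet monitoring has its own collectors):

  import psutil

  vm = psutil.virtual_memory()
  # 'free' excludes page-cache memory; 'available' includes it, since the
  # OS reclaims cache pages on demand. The monitoring records 'available'.
  print('free:      %.2f GiB' % (vm.free / 1024.0 ** 3))
  print('available: %.2f GiB' % (vm.available / 1024.0 ** 3))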
Jan 16 2018
LogHostMemoryUsage will not do what you want on macOS, as it only records "resident" memory. If you write a program that does

  while (true) LeakSomeMemory();

the resident memory of the program will approach 0 (as all the memory is paged out), while system memory eventually runs out.
To get total system memory stats, always use "vm_stat" on macOS. The interesting lines, if you're leaking memory, are "Pages stored in compressor" and "Pages occupied by compressor".
To get the "memory" usage of a single program on macOS, I recommend the C API task_info(TASK_VM_INFO, ...), pulling out phys_footprint. For an example, see:
https://cs.chromium.org/chromium/src/base/process/process_metrics_mac.cc?type=cs&q=GetTaskVMInfo&sq=package:chromium&l=305
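A sketch of pulling those compressor counters out of vm_stat output (parsing simplified to just the two lines of interest):

  import re
  import subprocess

  out = subprocess.check_output(['vm_stat']).decode()
  # The vm_stat header reads e.g. "... (page size of 4096 bytes)".
  page_size = int(re.search(r'page size of (\d+) bytes', out).group(1))
  for key in ('Pages stored in compressor', 'Pages occupied by compressor'):
    m = re.search(r'%s:\s+(\d+)' % key, out)
    if m:
      print('%s: %.1f MiB' % (key, int(m.group(1)) * page_size / 1024.0 ** 2))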
Jan 16 2018
And for those less familiar with mac-isms, you want a C program that does something like:

  #include <mach/mach.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char** argv) {
    int pid = atoi(argv[1]);  /* target pid is the first argument */
    task_t task;
    if (task_for_pid(mach_task_self(), pid, &task) != KERN_SUCCESS) return 2;
    task_vm_info_data_t info;
    mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
    if (task_info(task, TASK_VM_INFO, (task_info_t)&info, &count) != KERN_SUCCESS) return 3;
    printf("phys_footprint: %llu bytes\n", (unsigned long long)info.phys_footprint);
    return 0;
  }

and then you'll need to run it with root privileges to be able to use task_for_pid.
Jan 16 2018
Thanks for the guidance, Erik! I will try to reproduce this problem with the C program locally first before proceeding with a bot solution.
Jan 16 2018
Looks like I already have a test file that does this, along with some other stuff. Feel free to use it as a starting point.
Jan 17 2018
Thanks for the script, Erik! However, I found that it works on some processes and not others. Any idea why? Example:

sudo ./private_memory_footprint 726
pid: 726
fast [phys_footprint]: 5160 [kb]
fast [internal + compressed]: 5160 [kb]
slow but accurate: 4044 [kb]
slow with dirty COW: 5036 [kb]
difference between fast [phys_footprint] and slow with dirty COW: 0.024031

sudo ./private_memory_footprint 21462
<Nothing got printed out>
Jan 17 2018
What's the return value of the program? At a guess, it's (2) and task_for_pid is failing. You'll need to disable System Integrity Protection. I'll send you a link offline.
Jan 22 2018
The memory usage is a red herring; something else is really off about this bug. I tried to reproduce the bug locally by replacing the wprgo binary with a fake binary and still got the stdout log. So the fact that we get the "ReplayNotStartedError: Web Page Replay failed to start. Log output: (N/A)" log means Telemetry even fails to create the temp logfile (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py?rcl=0a20f3ce6c4d6802892cbaa7489cde394eaa6fac&l=224), which happens before it tries to launch the wprgo process (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py?rcl=0a20f3ce6c4d6802892cbaa7489cde394eaa6fac&l=226). I will add more logging to see what's going on there.
Jan 22 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/54db141e38b1cc886b5f24bffc5660aefef44e78

commit 54db141e38b1cc886b5f24bffc5660aefef44e78
Author: Nghia Nguyen <nednguyen@google.com>
Date: Mon Jan 22 21:58:28 2018

Make sure wpr stdout/stderr log is captured before the log file is cleaned up

Before this change, the stdout/stderr log of the wprgo process was lost upon a wprgo startup crash because we cleaned up the log file before querying the log. This makes sure that we capture all the logs before cleaning up the log file in such a scenario.

Bug: chromium:800006
Change-Id: I64a0348e599c7730cfe8016ab5a90ab5d3860f4b
TBR=perezju@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/879327
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
[modify] https://crrev.com/54db141e38b1cc886b5f24bffc5660aefef44e78/telemetry/telemetry/internal/util/webpagereplay_go_server.py
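The gist of the ordering fix, sketched with simplified names (the real code raises ReplayNotStartedError inside webpagereplay_go_server.py):

  import os

  def ReportReplayStartupFailure(log_file_path):
    # Read the wpr log *before* deleting the temp file; previously the
    # file was removed first, so the "Log output" section came up empty.
    with open(log_file_path) as f:
      log_output = f.read()
    os.remove(log_file_path)
    raise RuntimeError(
        'Web Page Replay failed to start. Log output:\n%s' % log_output)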
Jan 23 2018
Looks like wprgo got stuck after starting:
File "/b/s/w/ir/third_party/catapult/telemetry/telemetry/internal/util/webpagereplay_go_server.py", line 236, in StartServer
'Web Page Replay failed to start. Log output:\n%s' % log_output)
ReplayNotStartedError: Web Page Replay failed to start. Log output:
2018/01/22 22:44:11 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
2018/01/22 22:44:11 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
2018/01/22 22:44:11 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
We probably need to add more logs to web_page_replay_go to figure out why it got stuck.
Jan 23 2018
I highly suspect that this flakiness comes from the fact that some archive files are big and it takes more than 30s for wprgo to process them. I am adding more logging to verify that theory.
Jan 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/82173870d1bd9df5954e941aaa8b7781cf1f5b75

commit 82173870d1bd9df5954e941aaa8b7781cf1f5b75
Author: Nghia Nguyen <nednguyen@google.com>
Date: Tue Jan 23 14:58:48 2018

Make ReplayServer._CleanUpTempLogFilePath quit gracefully if the log file was already cleaned up

When ReplayServer.StartServer() fails, it will try to clean up the temp logfile. This makes a subsequent call to ReplayServer.StopServer() fail. See https://chromium-swarm.appspot.com/task?id=3b3a8c33ae9d5410&refresh=10&show_raw=1 for an example of how this happens. This change makes ReplayServer._CleanUpTempLogFilePath() quit gracefully, which also makes ReplayServer.StopServer idempotent (subsequent calls will be no-ops).

Bug: chromium:800006
Change-Id: I7d1180843cb03c808ac79bfcafeb6594781c9976
Reviewed-on: https://chromium-review.googlesource.com/881022
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
[modify] https://crrev.com/82173870d1bd9df5954e941aaa8b7781cf1f5b75/telemetry/telemetry/internal/util/webpagereplay_go_server.py
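The idempotency pattern from this commit, sketched as a free function instead of the real method:

  import os

  def CleanUpTempLogFilePath(log_file_path):
    # Quit gracefully if the log file is already gone, so calling
    # StopServer() again after a failed StartServer() is a no-op.
    if not log_file_path or not os.path.exists(log_file_path):
      return
    os.remove(log_file_path)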
Jan 23 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/296e7c3a900bec6123d48cda8436a31b382d0ffe

commit 296e7c3a900bec6123d48cda8436a31b382d0ffe
Author: Nghia Nguyen <nednguyen@google.com>
Date: Tue Jan 23 20:49:49 2018

Add logging when web_page_replay_go tries to read the archive

Bug: chromium:800006
Change-Id: I659fdcd6a4b8044f5c02f34c96b386c664f07dd5
Reviewed-on: https://chromium-review.googlesource.com/881023
Commit-Queue: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Helen Li <xunjieli@chromium.org>
[modify] https://crrev.com/296e7c3a900bec6123d48cda8436a31b382d0ffe/web_page_replay_go/src/wpr.go
[modify] https://crrev.com/296e7c3a900bec6123d48cda8436a31b382d0ffe/telemetry/telemetry/internal/binary_dependencies.json
Jan 24 2018
The log confirms that the wprgo binary got stuck loading the archive file:
ReplayNotStartedError: Web Page Replay failed to start. Log output:
2018/01/24 04:05:39 Loading cert from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_cert.pem
2018/01/24 04:05:39 Loading key from /b/s/w/ir/third_party/catapult/web_page_replay_go/wpr_key.pem
2018/01/24 04:05:39 Loading script from /b/s/w/ir/third_party/catapult/web_page_replay_go/deterministic.js
2018/01/24 04:05:39 Loading archive file from /b/s/w/ir/tools/perf/page_sets/data/system_health_desktop_040.wprgo
I will bump up the timeout for starting wprgo replay.
Jan 24 2018
The following revision refers to this bug:
https://chromium.googlesource.com/catapult/+/c3373753de40a505241481ad2e74e2746f21551f

commit c3373753de40a505241481ad2e74e2746f21551f
Author: Nghia Nguyen <nednguyen@google.com>
Date: Wed Jan 24 15:22:16 2018

Increase the wait-time for starting wprgo replay

Bug: chromium:800006
Change-Id: I25360702d865c33a6eeaf761aff3aa0e10c12a54
Reviewed-on: https://chromium-review.googlesource.com/883762
Reviewed-by: Juan Antonio Navarro Pérez <perezju@chromium.org>
Commit-Queue: Ned Nguyen <nednguyen@google.com>
[modify] https://crrev.com/c3373753de40a505241481ad2e74e2746f21551f/telemetry/telemetry/internal/util/webpagereplay_go_server.py
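Conceptually the change just gives the startup poll a more generous deadline; a generic version of such a wait (catapult ships its own py_utils.WaitFor helper for this):

  import time

  def WaitFor(condition, timeout_seconds):
    # Poll until condition() holds or the deadline passes. Large .wprgo
    # archives can take well over the old 30s limit to load.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
      if condition():
        return
      time.sleep(0.5)
    raise RuntimeError('Timed out after %s seconds' % timeout_seconds)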
Jan 26 2018
This suite hasn't failed in the last 50 builds, so I think this is fixed.
Jan 26 2018
🎉