Issue metadata
Sign in to add a comment
|
Soundcloud system health story flakily crashing on webview bots |
||||||||||||||||||
Issue descriptionIt is the primary cause of system health failures on the webview bots over the past 20 runs. It is more prominent on the nexus 6 than the 5x. https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf?numbuilds=200 https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf/builds/690 https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus6_WebView_Perf%2F690%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile_on_Android%2F0%2Fstdout Traceback (most recent call last): File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 105, in _RunStoryAndProcessErrorIfNeeded state.RunStory(results) File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 332, in RunStory self._current_page.Run(self) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/__init__.py", line 114, in Run self.RunPageInteractions(action_runner) File "/b/swarming/w/ir/tools/perf/page_sets/system_health/system_health_story.py", line 108, in RunPageInteractions self._Measure(action_runner) File "/b/swarming/w/ir/tools/perf/page_sets/system_health/system_health_story.py", line 91, in _Measure action_runner.MeasureMemory(deterministic_mode=True) File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py", line 160, in MeasureMemory self.ForceGarbageCollection() File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py", line 777, in ForceGarbageCollection self._tab.browser.ForceJavaHeapGarbageCollection() File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 316, in ForceJavaHeapGarbageCollection return self._browser_backend.ForceJavaHeapGarbageCollection() File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function return func(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_backend.py", line 226, in ForceJavaHeapGarbageCollection signum=device_signal.SIGUSR1) File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/decorators.py", line 57, in timeout_retry_wrapper retry_if_func=retry_if_func) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/timeout_retry.py", line 157, in Run error_log_func=error_log_func) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll self._JoinAll(watcher, timeout) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 158, in _JoinAll thread.ReraiseIfException() File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run self._ret = self._func(*self._args, **self._kwargs) File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/timeout_retry.py", line 150, in <lambda> child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs), File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl return f(*args, **kwargs) File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/device_utils.py", line 1142, in KillAll 'No process "%s"' % process_name, str(self)) CommandFailedError: (device: ZX1G422HK6) No process "org.chromium.webview_shell" (INFO) 2017-11-20 15:11:48,387 devtools_client_backend._IsDevToolsAgentAvailable:80 Devtools client not yet ready: '' (INFO) 2017-11-20 15:11:48,388 exception_formatter.PrintFormattedException:19 Try printing formatted exception: None None None Exception raised when cleaning story run: Traceback (most recent call last): _RunStoryAndProcessErrorIfNeeded at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:130 test.DidRunStory(state.platform, results) DidRunStory at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py:290 trace_result, _ = platform.tracing_controller.StopTracing() StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py:53 return self._tracing_controller_backend.StopTracing() StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:109 self._IssueClockSyncMarker() _IssueClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:215 self._RecordIssuerClockSyncMarker) RecordClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py:178 raise ChromeClockSyncError('Cannot issue clock sync. No devtools clients') ChromeClockSyncError: Cannot issue clock sync. No devtools clients Locals: devtools_clients : [] record_controller_clock_sync_marker_callback : <bound method TracingControllerBackend._RecordIssuerClockSyncMarker of <telemetry.internal.platform.tracing_controller_backend.TracingControllerBackend object at 0x7fc387943cd0>> sync_id : 'f86efe1f-3b5c-4228-9e6e-eb3ee433ae30' [ FAILED ] load:media:soundcloud@{'case': 'load', 'group': 'media'} (16167 ms) Screenshot: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/profiler-file-id_35-2017-11-20_10-15-4683844.png On the nexus 6 webview bot, the first instance of the soundcloud crash was: https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf/builds/667
,
Nov 20 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8962385949285990528
,
Nov 21 2017
=== Auto-CCing suspected CL author herb@google.com === Hi herb@google.com, the bisect results pointed to your CL, please take a look at the results. === BISECT JOB RESULTS === Test failure found with culprit Suspected Commit Author : Herb Derby Commit : b8b3086de30b3b0f09c6c5ffc176482a538e9337 Date : Thu Nov 16 21:39:19 2017 Subject: Fix Sk8b reading too many bytes Bisect Details Configuration: android_webview_arm64_aosp_perf_bisect Benchmark : system_health.memory_mobile Metric : memory:webview:all_processes:reported_by_chrome:cc:locked_size_avg/load_media/load_media_soundcloud Revision Exit Code N chromium@517139 0 +- N/A 20 good chromium@517283 0 +- N/A 20 good chromium@517319 0 +- N/A 20 good chromium@517324 0 +- N/A 20 good chromium@517326 0 +- N/A 20 good chromium@517327 0 +- N/A 20 good chromium@517327,skia@3bc00fe810 0 +- N/A 20 good chromium@517327,skia@a94970ba4b 0 +- N/A 20 good chromium@517327,skia@45b806eec4 0 +- N/A 20 good chromium@517327,skia@b8b3086de3 1 +- N/A 20 bad <-- chromium@517328 1 +- N/A 20 bad chromium@517337 1 +- N/A 20 bad chromium@517355 1 +- N/A 20 bad chromium@517426 1 +- N/A 20 bad chromium@517713 1 +- N/A 20 bad Please refer to the following doc on diagnosing memory regressions: https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md To Run This Test src/tools/perf/run_benchmark -v --browser=android-webview --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.media.soundcloud system_health.memory_mobile More information on addressing performance regressions: http://g.co/ChromePerformanceRegressions Debug information about this bisect: https://chromeperf.appspot.com/buildbucket_job_status/8962385983184100864 For feedback, file a bug with component Speed>Bisection
,
Nov 21 2017
It seems like this might be fixing a different bug so I'm a little hesitant to revert it.
,
Nov 21 2017
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8962279774346700800
,
Nov 22 2017
=== Auto-CCing suspected CL author herb@google.com === Hi herb@google.com, the bisect results pointed to your CL, please take a look at the results. === BISECT JOB RESULTS === Test failure found with culprit Suspected Commit Author : Herb Derby Commit : 35c556f33e2c9316055537e5683c28ea1645ff49 Date : Thu Nov 16 22:46:09 2017 Subject: Revert "Revert "Direct evaluation of gaussian"" Bisect Details Configuration: android_webview_nexus6_aosp_perf_bisect Benchmark : system_health.memory_mobile Metric : memory:webview:all_processes:dump_count_avg/load_media/load_media_soundcloud Revision Exit Code N chromium@517130 0 +- N/A 10 good chromium@517275 0 +- N/A 10 good chromium@517312 0 +- N/A 10 good chromium@517321 0 +- N/A 10 good chromium@517326 0 +- N/A 10 good chromium@517327 0 +- N/A 10 good chromium@517327,skia@3bc00fe810 0 +- N/A 10 good chromium@517327,skia@b8b3086de3 0 +- N/A 10 good chromium@517327,skia@1e09e461d2 0 +- N/A 10 good chromium@517327,skia@35c556f33e 1 +- N/A 10 bad <-- chromium@517328 1 +- N/A 10 bad chromium@517330 1 +- N/A 10 bad chromium@517348 1 +- N/A 10 bad chromium@517420 1 +- N/A 10 bad chromium@517710 1 +- N/A 10 bad Please refer to the following doc on diagnosing memory regressions: https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md To Run This Test src/tools/perf/run_benchmark -v --browser=android-webview --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.media.soundcloud system_health.memory_mobile More information on addressing performance regressions: http://g.co/ChromePerformanceRegressions Debug information about this bisect: https://chromeperf.appspot.com/buildbucket_job_status/8962279774346700800 For feedback, file a bug with component Speed>Bisection
,
Nov 22 2017
I don't think it is this CL. The code for this CL is not turned on for chrome. It is controlled by the skia configuration flag: SK_USE_LEGACY_INTERP_BLUR.
,
Nov 22 2017
Yeah, the test is failing flakily. I think whats going on is that its highly likely a CL in that roll, or in a CL immediately before the roll, is whats causing the flakiness. You just keep winning the lottery and having random CLs of yours getting flagged as the culprit. I'll fire off another bisect. dtu@ would pinpoint be able to better find the culprit, and if so do you know if there are any plans to move the webview bots to pinpoint soonish?
,
Nov 22 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/9f0a9da2e97d5bc7b1a75451994a074f2971c8f3 commit 9f0a9da2e97d5bc7b1a75451994a074f2971c8f3 Author: rnephew <rnephew@chromium.org> Date: Wed Nov 22 20:45:41 2017 [Telemetry] Disable flaky soundcloud benchmark on webview. TBR: nednguyen@google.com Bug: 787001 Change-Id: I9af0a363fd6e4c0e62086e62d1afdaef1e8852ce Reviewed-on: https://chromium-review.googlesource.com/786291 Reviewed-by: rnephew <rnephew@chromium.org> Commit-Queue: rnephew <rnephew@chromium.org> Cr-Commit-Position: refs/heads/master@{#518736} [modify] https://crrev.com/9f0a9da2e97d5bc7b1a75451994a074f2971c8f3/tools/perf/page_sets/system_health/expectations.py
,
Dec 13 2017
,
Jul 19
Going to kick off a perf try job to see if this story magically fixed itself in the last 8 months.
,
Jul 19
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/14ca6969a40000
,
Jul 19
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/15fc6e71a40000
,
Jul 19
📍 Job complete. See results below. https://pinpoint-dot-chromeperf.appspot.com/job/14ca6969a40000
,
Jul 19
📍 Job complete. See results below. https://pinpoint-dot-chromeperf.appspot.com/job/15fc6e71a40000
,
Jul 19
passing 👍
,
Jul 19
Actually, looks like it failed 5 out of 40 runs
,
Aug 14
There's no information in the logs about that failure, though. I'm going to try re-enabling.
,
Aug 14
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/cfdb39526529c855565e2507f558a557e44c6b08 commit cfdb39526529c855565e2507f558a557e44c6b08 Author: Annie Sullivan <sullivan@chromium.org> Date: Tue Aug 14 16:51:50 2018 Re-enable load:media:soundcloud for webview on system_health.common_mobile. It's passing in perf tryjobs. Bug: 787001 Change-Id: Ie9a74b141d6dd610a1bfab3d9d9ea79b40dafc48 TBR: tdresser@chromium.org Reviewed-on: https://chromium-review.googlesource.com/1174632 Reviewed-by: Annie Sullivan <sullivan@chromium.org> Commit-Queue: Annie Sullivan <sullivan@chromium.org> Cr-Commit-Position: refs/heads/master@{#582944} [modify] https://crrev.com/cfdb39526529c855565e2507f558a557e44c6b08/tools/perf/expectations.config
,
Aug 14
,
Aug 14
Setting a NextAction date to swing back around and see how it's doing on the waterfall.
,
Aug 16
The NextAction date has arrived: 2018-08-16
,
Aug 16
Looks like soundcloud is still failing flakily (~20% fail rate) on 6X Webview, the only platform where it's running: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/b9eb9fb6-a14b-11e8-9c69-0242ac110004 Stack trace: (ERROR) 2018-08-16 11:40:02,551 page_test_results.Fail:540 Failure recorded: TypeError: Cannot read property 'rendererHelpers' of undefined at new loadingMetric (/tracing/metrics/system_health/loading_metric.html:414:35) at runMetrics (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:61:16) at metricMapFunction (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:190:24) at Object.mapSingleTrace (/tracing/mre/map_single_trace.html:39:7) at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:61:18) at Object.runAndConvertErrorsToFailures (/tracing/mre/map_single_trace.html:24:10) at mapSingleTraceWithResult (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:52:12) at Object.mapSingleTraceMain (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:77:20) at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:90:15) at Object.loadHTMLFile (/b/swarming/w/ir/third_party/catapult/third_party/vinn/vinn/html_imports_loader.js:140:5) (ERROR) 2018-08-16 11:40:02,551 page_test_results.Fail:540 Failure recorded: TypeError: Cannot read property 'rendererHelpers' of undefined at new expectedQueueingTimeMetric (/tracing/metrics/system_health/expected_queueing_time_metric.html:83:55) at runMetrics (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:61:16) at metricMapFunction (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:190:24) at Object.mapSingleTrace (/tracing/mre/map_single_trace.html:39:7) at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:61:18) at Object.runAndConvertErrorsToFailures (/tracing/mre/map_single_trace.html:24:10) at mapSingleTraceWithResult (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:52:12) at Object.mapSingleTraceMain (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:77:20) at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:90:15) Deep, do you have any idea what's going wrong here? Meanwhile, I'm going to revert the reenable.
,
Aug 16
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/0adefb35a3a68819f526bf010aa78848e2283d31 commit 0adefb35a3a68819f526bf010aa78848e2283d31 Author: Charlie Andrews <charliea@chromium.org> Date: Thu Aug 16 17:16:11 2018 Revert "Re-enable load:media:soundcloud for webview on system_health.common_mobile." This reverts commit cfdb39526529c855565e2507f558a557e44c6b08. Reason for revert: Story is still failing Original change's description: > Re-enable load:media:soundcloud for webview on system_health.common_mobile. > > It's passing in perf tryjobs. > > Bug: 787001 > Change-Id: Ie9a74b141d6dd610a1bfab3d9d9ea79b40dafc48 > TBR: tdresser@chromium.org > Reviewed-on: https://chromium-review.googlesource.com/1174632 > Reviewed-by: Annie Sullivan <sullivan@chromium.org> > Commit-Queue: Annie Sullivan <sullivan@chromium.org> > Cr-Commit-Position: refs/heads/master@{#582944} NOTRY=true TBR=sullivan@chromium.org Bug: 787001 Change-Id: If734162c4815ec7c1a988343859b80556cce6044 Reviewed-on: https://chromium-review.googlesource.com/1178001 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#583705} [modify] https://crrev.com/0adefb35a3a68819f526bf010aa78848e2283d31/tools/perf/expectations.config
,
Aug 24
/bump here Deep, do you have any idea when you might be able to take a look at this?
,
Aug 24
Sorry for the delay - I just blocked off some time on my calendar to work on this today afternoon. Will have some update by EOD.
,
Aug 30
Hey Deep, Any update here?
,
Aug 30
When I tried last Friday I couldn't reproduce it using my local phone. I will try again this afternoon, and get back with updates.
,
Sep 5
/bump Any luck? Did you make sure to try to run the story on Webview, not just regular Chrome? (I'm actually not 100% sure how to do that, but it does seem to make a difference given. https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=performance_test_suite&tests=system_health.common_mobile%2Fload%3Amedia%3Asoundcloud shows that it's passing 100% of the time on Nexus 5X, but seems to be reliably failing on 5X Webview.)
,
Sep 5
(Even though I'm not 100% sure how to run a test on a webview browser, if you ping me tomorrow in the office I'm happy to look into it with you.)
,
Sep 5
Ah wow completely missed that I need webview. I just briefly searched for it, and looks like building webview is not straightforward. I looked closer at the error message on #23, and it seems to me it's failing on this line, which means chromeHelper is undefined: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/metrics/system_health/loading_metric.html?l=414&rcl=70783e3968e7bcda29fe37a70fc9a5218ae12b37 This in turn means ChromeModelHelper.supportsModel returned false for the model: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/model/helpers/chrome_model_helper.html?l=70&rcl=70783e3968e7bcda29fe37a70fc9a5218ae12b37 This in turn means catapult fails to find a single browser process or a renderer process in the trace. This could be because we're not recording the trace properly, or it could be because on webview the processes have different names (which would be very surprising - I thought we had other metrics that we run against webview.) I can't say for sure what is happening unless I can get a trace. Is there anyway you could help get me a trace? Would it be easy to hack the benchmark so a trace shows up on pinpoint and we get a trace from the failing bot? Otherwise I'll try by the end of this week to get webview running and repro this.
,
Sep 5
,
Sep 21
A similar issue cropped up in Issue 886621 . It had the same issue where catapult could not recognize the correct processes. The issue has likely just been fixed by oysteine. Can we check the flakiness of this again?
,
Jan 4
Hey Deep, it looks like this is still happening: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/aefead26-1056-11e9-80c9-0242ac110006 (from today) Any update here?
,
Jan 21
(2 days ago)
Huh the bug is still there - that's unfortunate. Getting a trace here would be super nice, but I'm guessing that's not straightforward. I have a few questions to help me debug this: (1) Where can I see the fails happening? I can't find this on the waterfall here: https://ci.chromium.org/p/chromium/g/chromium.perf/console (2) If I want to reproduce the fails on a bot, can I run a pinpoint job to see this? Is this a good config for the pinpoint job? https://screenshot.googleplex.com/ww6KHpvs0he (3) To get a trace out of pinpoint on the failing run, I'm thinking I can change the benchmark code to upload the trace and add it as a CL on pinpoint. Let me know if there's an easier way. |
|||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||
Comment 1 by 42576172...@developer.gserviceaccount.com
, Nov 20 2017