New issue
Advanced search Search tips

Issue 787001 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
EstimatedDays: ----
NextAction: 2018-08-16
OS: Android
Pri: 1
Type: Bug



Sign in to add a comment

Soundcloud system health story flakily crashing on webview bots

Project Member Reported by rnep...@chromium.org, Nov 20 2017

Issue description

It is the primary cause of system health failures on the webview bots over the past 20 runs. It is more prominent on the nexus 6 than the 5x.

https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus5X%20WebView%20Perf?numbuilds=200
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf

https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf/builds/690

https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus6_WebView_Perf%2F690%2F%2B%2Frecipes%2Fsteps%2Fsystem_health.memory_mobile_on_Android%2F0%2Fstdout


Traceback (most recent call last):
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py", line 105, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/shared_page_state.py", line 332, in RunStory
    self._current_page.Run(self)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/page/__init__.py", line 114, in Run
    self.RunPageInteractions(action_runner)
  File "/b/swarming/w/ir/tools/perf/page_sets/system_health/system_health_story.py", line 108, in RunPageInteractions
    self._Measure(action_runner)
  File "/b/swarming/w/ir/tools/perf/page_sets/system_health/system_health_story.py", line 91, in _Measure
    action_runner.MeasureMemory(deterministic_mode=True)
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py", line 160, in MeasureMemory
    self.ForceGarbageCollection()
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/actions/action_runner.py", line 777, in ForceGarbageCollection
    self._tab.browser.ForceJavaHeapGarbageCollection()
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/browser/browser.py", line 316, in ForceJavaHeapGarbageCollection
    return self._browser_backend.ForceJavaHeapGarbageCollection()
  File "/b/swarming/w/ir/third_party/catapult/common/py_trace_event/py_trace_event/trace_event_impl/decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/backends/chrome/android_browser_backend.py", line 226, in ForceJavaHeapGarbageCollection
    signum=device_signal.SIGUSR1)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/decorators.py", line 57, in timeout_retry_wrapper
    retry_if_func=retry_if_func)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/timeout_retry.py", line 157, in Run
    error_log_func=error_log_func)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 186, in JoinAll
    self._JoinAll(watcher, timeout)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 158, in _JoinAll
    thread.ReraiseIfException()
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/reraiser_thread.py", line 81, in run
    self._ret = self._func(*self._args, **self._kwargs)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/utils/timeout_retry.py", line 150, in <lambda>
    child_thread = reraiser_thread.ReraiserThread(lambda: func(*args, **kwargs),
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/decorators.py", line 47, in impl
    return f(*args, **kwargs)
  File "/b/swarming/w/ir/third_party/catapult/devil/devil/android/device_utils.py", line 1142, in KillAll
    'No process "%s"' % process_name, str(self))
CommandFailedError: (device: ZX1G422HK6) No process "org.chromium.webview_shell"
(INFO) 2017-11-20 15:11:48,387 devtools_client_backend._IsDevToolsAgentAvailable:80  Devtools client not yet ready: ''
(INFO) 2017-11-20 15:11:48,388 exception_formatter.PrintFormattedException:19  Try printing formatted exception: None None None
Exception raised when cleaning story run: 
Traceback (most recent call last):
  _RunStoryAndProcessErrorIfNeeded at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/story_runner.py:130
    test.DidRunStory(state.platform, results)
  DidRunStory at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/web_perf/timeline_based_measurement.py:290
    trace_result, _ = platform.tracing_controller.StopTracing()
  StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/core/tracing_controller.py:53
    return self._tracing_controller_backend.StopTracing()
  StopTracing at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:109
    self._IssueClockSyncMarker()
  _IssueClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_controller_backend.py:215
    self._RecordIssuerClockSyncMarker)
  RecordClockSyncMarker at /b/swarming/w/ir/third_party/catapult/telemetry/telemetry/internal/platform/tracing_agent/chrome_tracing_agent.py:178
    raise ChromeClockSyncError('Cannot issue clock sync. No devtools clients')
ChromeClockSyncError: Cannot issue clock sync. No devtools clients
Locals:
  devtools_clients                             : []
  record_controller_clock_sync_marker_callback : <bound method TracingControllerBackend._RecordIssuerClockSyncMarker of <telemetry.internal.platform.tracing_controller_backend.TracingControllerBackend object at 0x7fc387943cd0>>
  sync_id                                      : 'f86efe1f-3b5c-4228-9e6e-eb3ee433ae30'
[  FAILED  ] load:media:soundcloud@{'case': 'load', 'group': 'media'} (16167 ms)


Screenshot:
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/profiler-file-id_35-2017-11-20_10-15-4683844.png


On the nexus 6 webview bot, the first instance of the soundcloud crash was:
https://build.chromium.org/p/chromium.perf/builders/Android%20Nexus6%20WebView%20Perf/builds/667


 
Project Member

Comment 3 by 42576172...@developer.gserviceaccount.com, Nov 21 2017

Cc: herb@google.com
Owner: herb@google.com
Status: Assigned (was: Untriaged)

=== Auto-CCing suspected CL author herb@google.com ===

Hi herb@google.com, the bisect results pointed to your CL, please take a look at the
results.


=== BISECT JOB RESULTS ===
Test failure found with culprit

Suspected Commit
  Author : Herb Derby
  Commit : b8b3086de30b3b0f09c6c5ffc176482a538e9337
  Date   : Thu Nov 16 21:39:19 2017
  Subject: Fix Sk8b reading too many bytes

Bisect Details
  Configuration: android_webview_arm64_aosp_perf_bisect
  Benchmark    : system_health.memory_mobile
  Metric       : memory:webview:all_processes:reported_by_chrome:cc:locked_size_avg/load_media/load_media_soundcloud

Revision                             Exit Code      N
chromium@517139                      0 +- N/A       20      good
chromium@517283                      0 +- N/A       20      good
chromium@517319                      0 +- N/A       20      good
chromium@517324                      0 +- N/A       20      good
chromium@517326                      0 +- N/A       20      good
chromium@517327                      0 +- N/A       20      good
chromium@517327,skia@3bc00fe810      0 +- N/A       20      good
chromium@517327,skia@a94970ba4b      0 +- N/A       20      good
chromium@517327,skia@45b806eec4      0 +- N/A       20      good
chromium@517327,skia@b8b3086de3      1 +- N/A       20      bad       <--
chromium@517328                      1 +- N/A       20      bad
chromium@517337                      1 +- N/A       20      bad
chromium@517355                      1 +- N/A       20      bad
chromium@517426                      1 +- N/A       20      bad
chromium@517713                      1 +- N/A       20      bad

Please refer to the following doc on diagnosing memory regressions:
  https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md

To Run This Test
  src/tools/perf/run_benchmark -v --browser=android-webview --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.media.soundcloud system_health.memory_mobile

More information on addressing performance regressions:
  http://g.co/ChromePerformanceRegressions

Debug information about this bisect:
  https://chromeperf.appspot.com/buildbucket_job_status/8962385983184100864


For feedback, file a bug with component Speed>Bisection
It seems like this might be fixing a different bug so I'm a little hesitant to revert it.
Project Member

Comment 6 by 42576172...@developer.gserviceaccount.com, Nov 22 2017


=== Auto-CCing suspected CL author herb@google.com ===

Hi herb@google.com, the bisect results pointed to your CL, please take a look at the
results.


=== BISECT JOB RESULTS ===
Test failure found with culprit

Suspected Commit
  Author : Herb Derby
  Commit : 35c556f33e2c9316055537e5683c28ea1645ff49
  Date   : Thu Nov 16 22:46:09 2017
  Subject: Revert "Revert "Direct evaluation of gaussian""

Bisect Details
  Configuration: android_webview_nexus6_aosp_perf_bisect
  Benchmark    : system_health.memory_mobile
  Metric       : memory:webview:all_processes:dump_count_avg/load_media/load_media_soundcloud

Revision                             Exit Code      N
chromium@517130                      0 +- N/A       10      good
chromium@517275                      0 +- N/A       10      good
chromium@517312                      0 +- N/A       10      good
chromium@517321                      0 +- N/A       10      good
chromium@517326                      0 +- N/A       10      good
chromium@517327                      0 +- N/A       10      good
chromium@517327,skia@3bc00fe810      0 +- N/A       10      good
chromium@517327,skia@b8b3086de3      0 +- N/A       10      good
chromium@517327,skia@1e09e461d2      0 +- N/A       10      good
chromium@517327,skia@35c556f33e      1 +- N/A       10      bad       <--
chromium@517328                      1 +- N/A       10      bad
chromium@517330                      1 +- N/A       10      bad
chromium@517348                      1 +- N/A       10      bad
chromium@517420                      1 +- N/A       10      bad
chromium@517710                      1 +- N/A       10      bad

Please refer to the following doc on diagnosing memory regressions:
  https://chromium.googlesource.com/chromium/src/+/master/docs/memory-infra/memory_benchmarks.md

To Run This Test
  src/tools/perf/run_benchmark -v --browser=android-webview --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests --story-filter=load.media.soundcloud system_health.memory_mobile

More information on addressing performance regressions:
  http://g.co/ChromePerformanceRegressions

Debug information about this bisect:
  https://chromeperf.appspot.com/buildbucket_job_status/8962279774346700800


For feedback, file a bug with component Speed>Bisection

Comment 7 by herb@google.com, Nov 22 2017

I don't think it is this CL. The code for this CL is not turned on for chrome. It is controlled by the skia configuration flag: SK_USE_LEGACY_INTERP_BLUR.
Cc: dtu@chromium.org
Yeah, the test is failing flakily. I think whats going on is that its highly likely a CL in that roll, or in a CL immediately before the roll, is whats causing the flakiness. You just keep winning the lottery and having random CLs of yours getting flagged as the culprit. I'll fire off another bisect. 

dtu@ would pinpoint be able to better find the culprit, and if so do you know if there are any plans to move the webview bots to pinpoint soonish?
Project Member

Comment 9 by bugdroid1@chromium.org, Nov 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/9f0a9da2e97d5bc7b1a75451994a074f2971c8f3

commit 9f0a9da2e97d5bc7b1a75451994a074f2971c8f3
Author: rnephew <rnephew@chromium.org>
Date: Wed Nov 22 20:45:41 2017

[Telemetry] Disable flaky soundcloud benchmark on webview.

TBR: nednguyen@google.com
Bug: 787001
Change-Id: I9af0a363fd6e4c0e62086e62d1afdaef1e8852ce
Reviewed-on: https://chromium-review.googlesource.com/786291
Reviewed-by: rnephew <rnephew@chromium.org>
Commit-Queue: rnephew <rnephew@chromium.org>
Cr-Commit-Position: refs/heads/master@{#518736}
[modify] https://crrev.com/9f0a9da2e97d5bc7b1a75451994a074f2971c8f3/tools/perf/page_sets/system_health/expectations.py

Comment 10 by herb@google.com, Dec 13 2017

Owner: dtu@chromium.org
Owner: charliea@chromium.org
Going to kick off a perf try job to see if this story magically fixed itself in the last 8 months.
📍 Job complete. See results below.
https://pinpoint-dot-chromeperf.appspot.com/job/14ca6969a40000
📍 Job complete. See results below.
https://pinpoint-dot-chromeperf.appspot.com/job/15fc6e71a40000
passing 👍
Actually, looks like it failed 5 out of 40 runs
There's no information in the logs about that failure, though. I'm going to try re-enabling.
Project Member

Comment 19 by bugdroid1@chromium.org, Aug 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/cfdb39526529c855565e2507f558a557e44c6b08

commit cfdb39526529c855565e2507f558a557e44c6b08
Author: Annie Sullivan <sullivan@chromium.org>
Date: Tue Aug 14 16:51:50 2018

Re-enable load:media:soundcloud for webview on system_health.common_mobile.

It's passing in perf tryjobs.

Bug: 787001
Change-Id: Ie9a74b141d6dd610a1bfab3d9d9ea79b40dafc48
TBR: tdresser@chromium.org
Reviewed-on: https://chromium-review.googlesource.com/1174632
Reviewed-by: Annie Sullivan <sullivan@chromium.org>
Commit-Queue: Annie Sullivan <sullivan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#582944}
[modify] https://crrev.com/cfdb39526529c855565e2507f558a557e44c6b08/tools/perf/expectations.config

NextAction: 2018-08-16
Setting a NextAction date to swing back around and see how it's doing on the waterfall.
The NextAction date has arrived: 2018-08-16
Owner: dproy@chromium.org
Looks like soundcloud is still failing flakily (~20% fail rate) on 6X Webview, the only platform where it's running: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/b9eb9fb6-a14b-11e8-9c69-0242ac110004

Stack trace:

(ERROR) 2018-08-16 11:40:02,551 page_test_results.Fail:540  Failure recorded: TypeError: Cannot read property 'rendererHelpers' of undefined
    at new loadingMetric (/tracing/metrics/system_health/loading_metric.html:414:35)
    at runMetrics (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:61:16)
    at metricMapFunction (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:190:24)
    at Object.mapSingleTrace (/tracing/mre/map_single_trace.html:39:7)
    at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:61:18)
    at Object.runAndConvertErrorsToFailures (/tracing/mre/map_single_trace.html:24:10)
    at mapSingleTraceWithResult (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:52:12)
    at Object.mapSingleTraceMain (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:77:20)
    at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:90:15)
    at Object.loadHTMLFile (/b/swarming/w/ir/third_party/catapult/third_party/vinn/vinn/html_imports_loader.js:140:5)
(ERROR) 2018-08-16 11:40:02,551 page_test_results.Fail:540  Failure recorded: TypeError: Cannot read property 'rendererHelpers' of undefined
    at new expectedQueueingTimeMetric (/tracing/metrics/system_health/expected_queueing_time_metric.html:83:55)
    at runMetrics (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:61:16)
    at metricMapFunction (/b/swarming/w/ir/third_party/catapult/tracing/tracing/metrics/metric_map_function.html:190:24)
    at Object.mapSingleTrace (/tracing/mre/map_single_trace.html:39:7)
    at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:61:18)
    at Object.runAndConvertErrorsToFailures (/tracing/mre/map_single_trace.html:24:10)
    at mapSingleTraceWithResult (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:52:12)
    at Object.mapSingleTraceMain (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:77:20)
    at eval (/b/swarming/w/ir/third_party/catapult/tracing/tracing/mre/map_single_trace_cmdline.html:90:15)

Deep, do you have any idea what's going wrong here?

Meanwhile, I'm going to revert the reenable.
Project Member

Comment 24 by bugdroid1@chromium.org, Aug 16

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/0adefb35a3a68819f526bf010aa78848e2283d31

commit 0adefb35a3a68819f526bf010aa78848e2283d31
Author: Charlie Andrews <charliea@chromium.org>
Date: Thu Aug 16 17:16:11 2018

Revert "Re-enable load:media:soundcloud for webview on system_health.common_mobile."

This reverts commit cfdb39526529c855565e2507f558a557e44c6b08.

Reason for revert: Story is still failing

Original change's description:
> Re-enable load:media:soundcloud for webview on system_health.common_mobile.
> 
> It's passing in perf tryjobs.
> 
> Bug: 787001
> Change-Id: Ie9a74b141d6dd610a1bfab3d9d9ea79b40dafc48
> TBR: tdresser@chromium.org
> Reviewed-on: https://chromium-review.googlesource.com/1174632
> Reviewed-by: Annie Sullivan <sullivan@chromium.org>
> Commit-Queue: Annie Sullivan <sullivan@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#582944}

NOTRY=true
TBR=sullivan@chromium.org

Bug: 787001
Change-Id: If734162c4815ec7c1a988343859b80556cce6044
Reviewed-on: https://chromium-review.googlesource.com/1178001
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#583705}
[modify] https://crrev.com/0adefb35a3a68819f526bf010aa78848e2283d31/tools/perf/expectations.config

/bump here

Deep, do you have any idea when you might be able to take a look at this?
Sorry for the delay - I just blocked off some time on my calendar to work
on this today afternoon. Will have some update by EOD.
Hey Deep,

Any update here?
When I tried last Friday I couldn't reproduce it using my local phone. I will try again this afternoon, and get back with updates. 
/bump

Any luck? Did you make sure to try to run the story on Webview, not just regular Chrome? (I'm actually not 100% sure how to do that, but it does seem to make a difference given. https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=performance_test_suite&tests=system_health.common_mobile%2Fload%3Amedia%3Asoundcloud shows that it's passing 100% of the time on Nexus 5X, but seems to be reliably failing on 5X Webview.)
(Even though I'm not 100% sure how to run a test on a webview browser, if you ping me tomorrow in the office I'm happy to look into it with you.)
Ah wow completely missed that I need webview. I just briefly searched for it, and looks like building webview is not straightforward.

I looked closer at the error message on #23, and it seems to me it's failing on this line, which means chromeHelper is undefined: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/metrics/system_health/loading_metric.html?l=414&rcl=70783e3968e7bcda29fe37a70fc9a5218ae12b37

This in turn means ChromeModelHelper.supportsModel returned false for the model: https://cs.chromium.org/chromium/src/third_party/catapult/tracing/tracing/model/helpers/chrome_model_helper.html?l=70&rcl=70783e3968e7bcda29fe37a70fc9a5218ae12b37

This in turn means catapult fails to find a single browser process or a renderer process in the trace. This could be because we're not recording the trace properly, or it could be because on webview the processes have different names (which would be very surprising - I thought we had other metrics that we run against webview.) I can't say for sure what is happening unless I can get a trace.

Is there anyway you could help get me a trace? Would it be easy to hack the benchmark so a trace shows up on pinpoint and we get a trace from the failing bot? Otherwise I'll try by the end of this week to get webview running and repro this.
Cc: -herb@google.com
Cc: charliea@chromium.org
A similar issue cropped up in  Issue 886621 . It had the same issue where catapult could not recognize the correct processes. The issue has likely just been fixed by oysteine. Can we check the flakiness of this again? 
Hey Deep, it looks like this is still happening: https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/aefead26-1056-11e9-80c9-0242ac110006 (from today)

Any update here?

Comment 35 by dproy@chromium.org, Jan 21 (2 days ago)

Huh the bug is still there - that's unfortunate. Getting a trace here would be super nice, but I'm guessing that's not straightforward. I have a few questions to help me debug this:

(1) Where can I see the fails happening? I can't find this on the waterfall here: https://ci.chromium.org/p/chromium/g/chromium.perf/console

(2) If I want to reproduce the fails on a bot, can I run a pinpoint job to see this? Is this a good config for the pinpoint job? https://screenshot.googleplex.com/ww6KHpvs0he

(3) To get a trace out of pinpoint on the failing run, I'm thinking I can change the benchmark code to upload the trace and add it as a CL on pinpoint. Let me know if there's an easier way. 

Sign in to add a comment