Issue metadata
Sign in to add a comment
|
browse:news:nytimes in system_health.memory_desktop and v8.browsing_desktop failing on chromium.perf/Win 10 Perf |
||||||||||||||||||||||||
Issue descriptionFiled by sheriff-o-matic@appspot.gserviceaccount.com on behalf of charliea@chromium.org system_health.memory_desktop/browse:news:nytimes in system_health.memory_desktop failing on chromium.perf/Win 10 Perf Builders failed on: - Win 10 Perf: https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf This test turned from reliably passing to reliably failing around April 21. A solid failing run for the bisect is 552607 and a solid passing run is 551610. Going to disable and kick off a bisect.
,
May 11 2018
Log excerpt from the failing test:
Traceback (most recent call last):
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 123, in _RunStoryAndProcessErrorIfNeeded
state.RunStory(results)
File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
return func(*args, **kwargs)
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 313, in RunStory
self._current_page.Run(self)
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\__init__.py", line 97, in Run
self.RunPageInteractions(action_runner)
File "c:\b\s\w\ir\tools\perf\page_sets\system_health\system_health_story.py", line 106, in RunPageInteractions
self._DidLoadDocument(action_runner)
File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 82, in _DidLoadDocument
self._ReadNextArticle(action_runner)
File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 91, in _ReadNextArticle
action_runner.tab.WaitForDocumentReadyStateToBeComplete()
File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
return func(*args, **kwargs)
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 91, in WaitForDocumentReadyStateToBeComplete
'document.readyState == "complete"', timeout=timeout)
File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
return func(*args, **kwargs)
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 239, in WaitForJavaScriptCondition
return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs)
File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
return func(*args, **kwargs)
File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py", line 302, in WaitForJavaScriptCondition
e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 90s for IsJavaScriptExpressionTrue.
,
May 11 2018
Broadening this to v8.browsing_desktop, which is also failing
,
May 11 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/04b303b330e54f12f70efbe04bcfd525d07e3260 commit 04b303b330e54f12f70efbe04bcfd525d07e3260 Author: Charlie Andrews <charliea@chromium.org> Date: Fri May 11 15:20:51 2018 Disable browse:news:nytimes on Windows The test has been very flaky for some time. NOTRY=true TBR=perezju@chromium.org Bug: 842180 Change-Id: I86b81726d9e6c0c5f1cbe7956f102738f6cba823 Reviewed-on: https://chromium-review.googlesource.com/1055594 Commit-Queue: Charlie Andrews <charliea@chromium.org> Reviewed-by: Charlie Andrews <charliea@chromium.org> Cr-Commit-Position: refs/heads/master@{#557886} [modify] https://crrev.com/04b303b330e54f12f70efbe04bcfd525d07e3260/tools/perf/expectations.config
,
May 11 2018
📍 Found a significant difference after 1 commit. https://pinpoint-dot-chromeperf.appspot.com/job/118f4800240000 Make --site-per-process the default on ToT via fieldtrial_testing_config by lukasza@chromium.org https://chromium.googlesource.com/chromium/src/+/fb1ccf02ee8ca79e1404abfd3a3a7d540b7d2dbd Understanding performance regressions: http://g.co/ChromePerformanceRegressions
,
May 11 2018
Looking at the pinpoint job in #c5 I think it measures memory_desktop/memory:chrome:all_processes. If this is indeed the metric this bug is concerned about, then it is expected that site-per-process uses more memory - on average we expect to increase memory usage (e.g. as measured by Memory.Total.PrivateMemoryFootprint) by around 10%.
,
May 12 2018
(Reopening) This was a functional bisect (checking the return code of a particular Telemetry test case), not a performance bisect, and has nothing to do with memory. In order to launch the bisect, you just need to pick a random metric on the story, which you can ignore. The issue is that this patch has made several stories within system_health.memory_desktop and v8.browsing_desktop crash much more often than they were previously. You can see the effect this patch has had on v8.browsing_desktop here (https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=v8.browsing_desktop&builder=chromium.perf%3AWin%2010%20Perf). Scroll to the right and you can see that there's a point where several stories transition from passing (green) to failing (red). More recent runs are on the left. The same is true for system_health.memory_desktop, where the patch has had an even more pronounced impact on the nytimes story but seems to have had less impact on the other stories than in v8.browsing_desktop (https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=system_health.memory_desktop&builder=chromium.perf%3AWin%2010%20Perf). Adding nednguyen@ and sullivan@: I'm not really sure how to handle this. Obviously, the --site-per-process is a big and important change, but it seems to have dramatically increased the crash rate on several important tests. How do we handle this?
,
May 12 2018
Rereading my comment, I realize that the first part could have been more clear. This bisect identified your patch as the one responsible for a dramatic uptick in how often Chrome crashed for a particular test case (system_health.memory_desktop/browse:news:nytimes). You can safely ignore the "metric" that we bisected on: for these type of crash bisects, the metric should be ignored.
,
May 13 2018
Comment 8: Thanks for the clarification. How do we find out what crashes are occurring in these tests, causing them to fail? I've tried clicking through to a few of the failures, but it's hard to tell why they're actually failing. If there's particular crashes or bugs we can fix to get these green again, we're happy to do that.
,
May 14 2018
The frustrating thing about integration tests in Chrome is that there isn't great support for checking whether the page crashed and getting a crash stack. I looked at this log: https://chromium-swarm.appspot.com/task?id=3d698039da958210&refresh=10&show_raw=1 I see the following stack: Traceback (most recent call last): File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 123, in _RunStoryAndProcessErrorIfNeeded state.RunStory(results) File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function return func(*args, **kwargs) File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 311, in RunStory self._current_page.Run(self) File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\__init__.py", line 95, in Run self.RunPageInteractions(action_runner) File "c:\b\s\w\ir\tools\perf\page_sets\system_health\system_health_story.py", line 106, in RunPageInteractions self._DidLoadDocument(action_runner) File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 82, in _DidLoadDocument self._ReadNextArticle(action_runner) File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 91, in _ReadNextArticle action_runner.tab.WaitForDocumentReadyStateToBeComplete() File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function return func(*args, **kwargs) File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 91, in WaitForDocumentReadyStateToBeComplete 'document.readyState == "complete"', timeout=timeout) File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function return func(*args, **kwargs) File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 239, in WaitForJavaScriptCondition return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs) File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function return func(*args, **kwargs) File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py", line 302, in WaitForJavaScriptCondition e.message + '\n' + debug_message) TimeoutException: Timed out while waiting 90s for IsJavaScriptExpressionTrue. Which means the test started timing out waiting for document.readyState to be 'complete'. It could have timed out because of a crash, or because something in your change makes the document no longer load completely. It managed to upload a screenshot, https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/63c8e6b0-5552-11e8-ac38-14b31f28df11 Which looks like maybe something at the top isn't loading properly. I tried checking this by running the test command, `tools/perf/run_benchmark system_health.memory_desktop --story-filter browse.news.nytimes --pageset-repeat 1` on Linux (where the test is passing), and it looks like the carousel at the top lazy-loads there and does not block document.readyState from switching to complete. But maybe the next command, which goes back to nytimes homepage, crashed? I also see in the log: (INFO) 2018-05-11 12:32:39,556 desktop_browser_backend._GetAllCrashpadMinidumps:316 Found crashpad_database_util (INFO) 2018-05-11 12:32:40,694 desktop_browser_backend.GetStackTrace:492 Minidump found: c:\b\s\w\itgdohok\tmpvugs6c\reports\93d42d5a-cc55-4504-814e-1bf06ecce9fb.dmp (INFO) 2018-05-11 12:32:40,694 cloud_storage.Insert:378 Uploading c:\b\s\w\itgdohok\tmpvugs6c\reports\93d42d5a-cc55-4504-814e-1bf06ecce9fb.dmp to gs://chrome-telemetry-output/minidump-2018-05-11_12-32-40-460169.dmp That's available at https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/minidump-2018-05-11_12-32-40-460169.dmp Looking through other logs, they all have a different minidump link, so I think that's an artifact from the current test run and not something left over from a different test. Note that the metric shown on the pinpoint page is the failure rate: it goes from 0% before your change to 70% after. Maybe running it a bunch of times on Windows is the easiest way to repro?
,
May 14 2018
Oh, and see also: go/reproduce-telemetry
,
May 16 2018
Good news: I learned (from https://github.com/catapult-project/catapult/blob/master/telemetry/docs/run_benchmarks_locally.md) how to run a benchmark locally. D:\src\chromium\src>python tools/perf/run_benchmark --browser=exact --browser-executable=D:\src\chromium\src\out\rel\chrome.exe system_health.memory_desktop --story-filter=browse:news:nytimes --output-dir=D:\scratch\telemetry --show-stdout --extra-browser-args=--site-per-process Bad news: - The benchmark always passes / I cannot repro the failure on my local machine - I still don't know how to capture log output of the browser and/or renderer process I was hoping to use printf-style debugging to figure out why document.readyState might possibly be not set, but this won't work unless 1) I can repro the problem at a reasonable frequency dnd 2) I can capture the printf output... :-/
,
May 17 2018
lfg@ and wfg@ have been very helpful with analyzing the dump mentioned in #c10.
It seems that the following CHECK has been hit:
content::protocol::SystemInfoHandlerGpuObserver::ObserverWatchdogCallback+4
[C:\b\c\b\win_x64_builder_perf\src\content\browser\devtools\protocol\system_info_handler.cc @ 191]
void ObserverWatchdogCallback() {
DCHECK_CURRENTLY_ON(BrowserThread::UI);
#if defined(OS_CHROMEOS)
...
#else
CHECK(false) << "Gathering system GPU info took more than 5 seconds."; <- LINE 191 HERE
#endif
}
,
May 17 2018
+zmo@ (touched content/browser/devtools/protocol/system_info_handler.cc relatively recently and also present in content/browser/gpu/OWNERS that covers GpuDataManagerImpl which is supposed to be calling content::protocol::SystemInfoHandlerGpuObserver::OnGpuInfoUpdate)
,
May 17 2018
Same CHECK is hit in 1 out of 3 failing/after tasks underneath the pinjob at https://pinpoint-dot-chromeperf.appspot.com/job/118f4800240000: - https://chromium-swarm.appspot.com/task?id=3d698032fff27f10&refresh=10&show_raw=1 (minidump-2018-05-11_12-32-06-994973.dmp) But not in the other 2 tasks, which AFAICT do not refer to a minidump: - https://chromium-swarm.appspot.com/task?id=3d69484f82392010&refresh=10&show_raw=1 - https://chromium-swarm.appspot.com/task?id=3d68941c20709d10&refresh=10&show_raw=1 I am not sure how to find other minidumps from failed benchmark runs (I was hoping to use them to confirm if the CHECK in SystemInfoHandlerGpuObserver::ObserverWatchdogCallback is truly related to the observed, site-per-process-related regression in benchmark completion rates).
,
May 18 2018
,
May 18 2018
More bots time out on GPU info collection |
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by 42576172...@developer.gserviceaccount.com
, May 11 2018