New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 842180 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 842782
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: ----

Blocking:
issue 838940



Sign in to add a comment

browse:news:nytimes in system_health.memory_desktop and v8.browsing_desktop failing on chromium.perf/Win 10 Perf

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, May 11 2018

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of charliea@chromium.org

system_health.memory_desktop/browse:news:nytimes in system_health.memory_desktop failing on chromium.perf/Win 10 Perf

Builders failed on: 
- Win 10 Perf: 
  https://ci.chromium.org/buildbot/chromium.perf/Win%2010%20Perf

This test turned from reliably passing to reliably failing around April 21. A solid failing run for the bisect is 552607 and a solid passing run is 551610. Going to disable and kick off a bisect.
 
Log excerpt from the failing test:

Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 123, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 313, in RunStory
    self._current_page.Run(self)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\__init__.py", line 97, in Run
    self.RunPageInteractions(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\system_health_story.py", line 106, in RunPageInteractions
    self._DidLoadDocument(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 82, in _DidLoadDocument
    self._ReadNextArticle(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 91, in _ReadNextArticle
    action_runner.tab.WaitForDocumentReadyStateToBeComplete()
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 91, in WaitForDocumentReadyStateToBeComplete
    'document.readyState == "complete"', timeout=timeout)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 239, in WaitForJavaScriptCondition
    return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py", line 302, in WaitForJavaScriptCondition
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 90s for IsJavaScriptExpressionTrue.
Cc: u...@chromium.org mythria@chromium.org
Summary: browse:news:nytimes in system_health.memory_desktop and v8.browsing_desktop failing on chromium.perf/Win 10 Perf (was: system_health.memory_desktop/browse:news:nytimes in system_health.memory_desktop failing on chromium.perf/Win 10 Perf)
Broadening this to v8.browsing_desktop, which is also failing
Project Member

Comment 4 by bugdroid1@chromium.org, May 11 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/04b303b330e54f12f70efbe04bcfd525d07e3260

commit 04b303b330e54f12f70efbe04bcfd525d07e3260
Author: Charlie Andrews <charliea@chromium.org>
Date: Fri May 11 15:20:51 2018

Disable browse:news:nytimes on Windows

The test has been very flaky for some time.

NOTRY=true
TBR=perezju@chromium.org

Bug:  842180 
Change-Id: I86b81726d9e6c0c5f1cbe7956f102738f6cba823
Reviewed-on: https://chromium-review.googlesource.com/1055594
Commit-Queue: Charlie Andrews <charliea@chromium.org>
Reviewed-by: Charlie Andrews <charliea@chromium.org>
Cr-Commit-Position: refs/heads/master@{#557886}
[modify] https://crrev.com/04b303b330e54f12f70efbe04bcfd525d07e3260/tools/perf/expectations.config

Project Member

Comment 5 by 42576172...@developer.gserviceaccount.com, May 11 2018

Cc: lukasza@chromium.org
Owner: lukasza@chromium.org
Status: Assigned (was: Available)
📍 Found a significant difference after 1 commit.
https://pinpoint-dot-chromeperf.appspot.com/job/118f4800240000

Make --site-per-process the default on ToT via fieldtrial_testing_config by lukasza@chromium.org
https://chromium.googlesource.com/chromium/src/+/fb1ccf02ee8ca79e1404abfd3a3a7d540b7d2dbd

Understanding performance regressions:
  http://g.co/ChromePerformanceRegressions
Blocking: 838940
Components: Internals>Sandbox>SiteIsolation
Status: WontFix (was: Assigned)
Looking at the pinpoint job in #c5 I think it measures memory_desktop/memory:chrome:all_processes.  If this is indeed the metric this bug is concerned about, then it is expected that site-per-process uses more memory - on average we expect to increase memory usage (e.g. as measured by Memory.Total.PrivateMemoryFootprint) by around 10%.
Cc: nednguyen@chromium.org sullivan@chromium.org
Status: Assigned (was: WontFix)
(Reopening)

This was a functional bisect (checking the return code of a particular Telemetry test case), not a performance bisect, and has nothing to do with memory. In order to launch the bisect, you just need to pick a random metric on the story, which you can ignore.

The issue is that this patch has made several stories within system_health.memory_desktop and v8.browsing_desktop crash much more often than they were previously.

You can see the effect this patch has had on v8.browsing_desktop here (https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=v8.browsing_desktop&builder=chromium.perf%3AWin%2010%20Perf). Scroll to the right and you can see that there's a point where several stories transition from passing (green) to failing (red). More recent runs are on the left.

The same is true for system_health.memory_desktop, where the patch has had an even more pronounced impact on the nytimes story but seems to have had less impact on the other stories than in v8.browsing_desktop (https://test-results.appspot.com/dashboards/flakiness_dashboard.html#testType=system_health.memory_desktop&builder=chromium.perf%3AWin%2010%20Perf).

Adding nednguyen@ and sullivan@: I'm not really sure how to handle this. Obviously, the --site-per-process is a big and important change, but it seems to have dramatically increased the crash rate on several important tests. How do we handle this?
Rereading my comment, I realize that the first part could have been more clear.

This bisect identified your patch as the one responsible for a dramatic uptick in how often Chrome crashed for a particular test case (system_health.memory_desktop/browse:news:nytimes). You can safely ignore the "metric" that we bisected on: for these type of crash bisects, the metric should be ignored.

Comment 9 by creis@chromium.org, May 13 2018

Comment 8: Thanks for the clarification.  How do we find out what crashes are occurring in these tests, causing them to fail?  I've tried clicking through to a few of the failures, but it's hard to tell why they're actually failing.  If there's particular crashes or bugs we can fix to get these green again, we're happy to do that.
The frustrating thing about integration tests in Chrome is that there isn't great support for checking whether the page crashed and getting a crash stack. I looked at this log:

https://chromium-swarm.appspot.com/task?id=3d698039da958210&refresh=10&show_raw=1

I see the following stack:
Traceback (most recent call last):
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\story_runner.py", line 123, in _RunStoryAndProcessErrorIfNeeded
    state.RunStory(results)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\shared_page_state.py", line 311, in RunStory
    self._current_page.Run(self)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\page\__init__.py", line 95, in Run
    self.RunPageInteractions(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\system_health_story.py", line 106, in RunPageInteractions
    self._DidLoadDocument(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 82, in _DidLoadDocument
    self._ReadNextArticle(action_runner)
  File "c:\b\s\w\ir\tools\perf\page_sets\system_health\browsing_stories.py", line 91, in _ReadNextArticle
    action_runner.tab.WaitForDocumentReadyStateToBeComplete()
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 91, in WaitForDocumentReadyStateToBeComplete
    'document.readyState == "complete"', timeout=timeout)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\browser\web_contents.py", line 239, in WaitForJavaScriptCondition
    return self._inspector_backend.WaitForJavaScriptCondition(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\common\py_trace_event\py_trace_event\trace_event_impl\decorators.py", line 75, in traced_function
    return func(*args, **kwargs)
  File "c:\b\s\w\ir\third_party\catapult\telemetry\telemetry\internal\backends\chrome_inspector\inspector_backend.py", line 302, in WaitForJavaScriptCondition
    e.message + '\n' + debug_message)
TimeoutException: Timed out while waiting 90s for IsJavaScriptExpressionTrue.

Which means the test started timing out waiting for document.readyState to be 'complete'. It could have timed out because of a crash, or because something in your change makes the document no longer load completely.

It managed to upload a screenshot, 
https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/63c8e6b0-5552-11e8-ac38-14b31f28df11

Which looks like maybe something at the top isn't loading properly. I tried checking this by running the test command, `tools/perf/run_benchmark system_health.memory_desktop --story-filter browse.news.nytimes --pageset-repeat 1` on Linux (where the test is passing), and it looks like the carousel at the top lazy-loads there and does not block document.readyState from switching to complete. But maybe the next command, which goes back to nytimes homepage, crashed?

I also see in the log:
(INFO) 2018-05-11 12:32:39,556 desktop_browser_backend._GetAllCrashpadMinidumps:316  Found crashpad_database_util
(INFO) 2018-05-11 12:32:40,694 desktop_browser_backend.GetStackTrace:492  Minidump found: c:\b\s\w\itgdohok\tmpvugs6c\reports\93d42d5a-cc55-4504-814e-1bf06ecce9fb.dmp
(INFO) 2018-05-11 12:32:40,694 cloud_storage.Insert:378  Uploading c:\b\s\w\itgdohok\tmpvugs6c\reports\93d42d5a-cc55-4504-814e-1bf06ecce9fb.dmp to gs://chrome-telemetry-output/minidump-2018-05-11_12-32-40-460169.dmp
That's available at https://console.developers.google.com/m/cloudstorage/b/chrome-telemetry-output/o/minidump-2018-05-11_12-32-40-460169.dmp

Looking through other logs, they all have a different minidump link, so I think that's an artifact from the current test run and not something left over from a different test.

Note that the metric shown on the pinpoint page is the failure rate: it goes from 0% before your change to 70% after. Maybe running it a bunch of times on Windows is the easiest way to repro?



Oh, and see also: go/reproduce-telemetry
Good news: I learned (from https://github.com/catapult-project/catapult/blob/master/telemetry/docs/run_benchmarks_locally.md) how to run a benchmark locally.

D:\src\chromium\src>python tools/perf/run_benchmark --browser=exact --browser-executable=D:\src\chromium\src\out\rel\chrome.exe system_health.memory_desktop --story-filter=browse:news:nytimes --output-dir=D:\scratch\telemetry --show-stdout --extra-browser-args=--site-per-process

Bad news:
- The benchmark always passes / I cannot repro the failure on my local machine
- I still don't know how to capture log output of the browser and/or renderer process

I was hoping to use printf-style debugging to figure out why document.readyState might possibly be not set, but this won't work unless 1) I can repro the problem at a reasonable frequency dnd 2) I can capture the printf output... :-/
lfg@ and wfg@ have been very helpful with analyzing the dump mentioned in #c10.

It seems that the following CHECK has been hit:

    content::protocol::SystemInfoHandlerGpuObserver::ObserverWatchdogCallback+4
    [C:\b\c\b\win_x64_builder_perf\src\content\browser\devtools\protocol\system_info_handler.cc @ 191]

      void ObserverWatchdogCallback() {
        DCHECK_CURRENTLY_ON(BrowserThread::UI);
    #if defined(OS_CHROMEOS)
    ...
    #else
        CHECK(false) << "Gathering system GPU info took more than 5 seconds.";  <- LINE 191 HERE
    #endif
      }
Cc: zmo@chromium.org
+zmo@ (touched content/browser/devtools/protocol/system_info_handler.cc relatively recently and also present in content/browser/gpu/OWNERS that covers GpuDataManagerImpl which is supposed to be calling content::protocol::SystemInfoHandlerGpuObserver::OnGpuInfoUpdate)
Same CHECK is hit in 1 out of 3 failing/after tasks underneath the pinjob at https://pinpoint-dot-chromeperf.appspot.com/job/118f4800240000:

- https://chromium-swarm.appspot.com/task?id=3d698032fff27f10&refresh=10&show_raw=1 (minidump-2018-05-11_12-32-06-994973.dmp)

But not in the other 2 tasks, which AFAICT do not refer to a minidump:

- https://chromium-swarm.appspot.com/task?id=3d69484f82392010&refresh=10&show_raw=1

- https://chromium-swarm.appspot.com/task?id=3d68941c20709d10&refresh=10&show_raw=1


I am not sure how to find other minidumps from failed benchmark runs (I was hoping to use them to confirm if the CHECK in SystemInfoHandlerGpuObserver::ObserverWatchdogCallback is truly related to the observed, site-per-process-related regression in benchmark completion rates).

Comment 16 by zmo@chromium.org, May 18 2018

Mergedinto: 842782
Status: Duplicate (was: Assigned)

Comment 17 by zmo@chromium.org, May 18 2018

More bots time out on GPU info collection

Sign in to add a comment