New issue
Advanced search Search tips

Issue 923437 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression



Sign in to add a comment

100.2% regression in loading.desktop.network_service at 623494:623545

Project Member Reported by jam@chromium.org, Jan 18 (4 days ago)

Issue description

Network service got slower after  Per-message dispatch tasks

https://chromium.googlesource.com/chromium/src/+/471aa794bbb9e3a98542ee2135ea61ddf4bf87a3
 
Project Member

Comment 1 by 42576172...@developer.gserviceaccount.com, Jan 18 (4 days ago)

All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=923437

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=bdb7e5ae096453c8079d9517c2e84326f0f96fb7d3bfda2b112e55b6a9381bbc


Bot(s) for this bug's original alert(s):

linux-perf-fyi
Project Member

Comment 2 by 42576172...@developer.gserviceaccount.com, Jan 18 (4 days ago)

Comment 3 by rockot@google.com, Jan 18 (4 days ago)

Cc: altimin@chromium.org
+altimin FYI. This regression is clearly the result of changing Mojo bindings dispatch to schedule one task per message rather than flushing available messages in a single task per endpoint, as we discussed doing like... forever ago.

I know that's the behavior we want since it gives the scheduler more task granularity and it also prevents a busy interface from starving its sequence, but we should be aware that it's going to cause real performance regressions in the wild.

I am disabling the behavior temporarily until M73 branch, but we'll probably have some perf trouble on our hands once I flip it back on in a week.
Project Member

Comment 4 by 42576172...@developer.gserviceaccount.com, Jan 18 (4 days ago)

😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/15e50a74540000

All of the runs failed. The most common error (12/20 runs) was:
  Run at /b/s/w/ir/thirdTraceback (most recent call last):
Project Member

Comment 5 by bugdroid1@chromium.org, Jan 18 (4 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/492f7c2f6f96f59f8aab79317faeeaa5136de83c

commit 492f7c2f6f96f59f8aab79317faeeaa5136de83c
Author: Ken Rockot <rockot@google.com>
Date: Fri Jan 18 20:09:26 2019

Temporarily revert Mojo back to batch dispatch

This is to avoid any surprises in M73 since we're so close to branch.

Bug: 866708, 923437
Change-Id: Ia063bd130f9124e598651c9a385ac6871cfbe729
Reviewed-on: https://chromium-review.googlesource.com/c/1422663
Reviewed-by: John Abd-El-Malek <jam@chromium.org>
Commit-Queue: Ken Rockot <rockot@google.com>
Cr-Commit-Position: refs/heads/master@{#624251}
[modify] https://crrev.com/492f7c2f6f96f59f8aab79317faeeaa5136de83c/mojo/public/cpp/bindings/connector.h

Comment 6 by jam@chromium.org, Jan 18 (4 days ago)

Given that this is controlled by a boolean now, can it be Finched so that we can see the performance from the field?

If scheduling team wants this functionality because it'll give them more flexibility, that's great, but we should also ensure it doesn't regress performance in the meantime.

Comment 7 by rockot@google.com, Jan 18 (4 days ago)

I really don't think this is something we should Finch.

Also I would not treat this as something the "scheduling team" wants. It's something we should want in Chrome in general for better scheduling granularity. It makes a lot of sense and I consider it a bug that we don't do it already.

We should be willing to tolerate some benchmark regressions to land this. If there are any major and obvious regressions, we can hook up some particularly sensitive bindings (e.g. network ones) to higher-priority task runners.

Comment 8 by rockot@google.com, Jan 18 (4 days ago)

Owner: rockot@google.com

Comment 9 by altimin@chromium.org, Jan 18 (4 days ago)

Looking at the history of the metric, it seems that the distribution is bimodal and sometimes switches back and forth. Most likely that there are two scenarios (e.g. "test end" event being triggered at two different points) and some scheduling changes might cause a change in probability distribution, which may look as a regression. If I'm correct, that's a problem with a benchmark (unfortunately, almost all benchmarks are prone to this).

I think that jam@'s suggestion is a good one — I would expect to see no change (if not an improvement) in top-level metrics. If it is the case, it will be much easier to agree that it's a problem with a benchmark rather than with the patch.

WDYT?

Comment 10 by jam@chromium.org, Jan 18 (4 days ago)

@Ken: I don't think anyone cares about benchmark regressions if they're not representative of the field. But how do we know that this is only going to cause benchmark regressions, and not real world regressions, without Finching it?

Comment 11 by rockot@google.com, Jan 19 (4 days ago)

What about landing it on ToT and Finching the old behavior for comparison data? My concern is leaving it off ToT for too long, because nearly every week, new broken test code is landed which I then have to either fix or disable before trying to land this again.

Also, do we have a reliable set of UMA metrics that we can actually monitor for meaningful performance monitoring in the wild? I remember doing the same thing when switching over to Mojo IPC in general and found all of our metrics to be really noisy and the whole effort kind of felt like a waste of time.

Comment 12 by jam@google.com, Yesterday (24 hours ago)

@Ken: that sounds like a good idea; so would it land with a feature that's disabled before 73 branch, and then next thurs it would turn to enabled by default? That way you'd get data from 73 stable but it would be on trunk sooner.

For stats, you can look at the hearbeat metrics. These and other high-level metrics are listed in https://docs.google.com/document/d/1GQLrftTK2xSseb030-xkWQdB5LJlo2dSm4lpF2Be0Us/edit?pli=1# which we're tracking for network service.

Comment 13 by altimin@chromium.org, Today (18 hours ago)

Speed Launch Metrics Survey contains the high-level metrics that we should care about: https://docs.google.com/document/d/1Ww487ZskJ-xBmJGwPO-XPz_QcJvw-kSNffm0nPhVpj8

Sign in to add a comment