
Issue 883806

Issue metadata

Status: Fixed
Owner:
Closed: Nov 12
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression




1.1%-16.3% regression in system_health.memory_mobile at 589850:589880

Project Member Reported by npm@chromium.org, Sep 13

Issue description

See the link to graphs below.
 
All graphs for this bug:
  https://chromeperf.appspot.com/group_report?bug_id=883806

(For debugging:) Original alerts at time of bug-filing:
  https://chromeperf.appspot.com/group_report?sid=5be09a3e3ddf2662712a37603e47095183cab33b936389e1717f6042908b8f8a


Bot(s) for this bug's original alert(s):

android-nexus5x-perf

system_health.memory_mobile - Benchmark documentation link:
  https://bit.ly/system-health-benchmarks
Cc: alexilin@chromium.org
Owner: alexilin@chromium.org
Status: Assigned (was: Untriaged)
📍 Found a significant difference after 1 commit.
https://pinpoint-dot-chromeperf.appspot.com/job/12316ac3640000

predictors: Request a proxy lookup in parallel with host resolve request by alexilin@chromium.org
https://chromium.googlesource.com/chromium/src/+/248c267fc1c931ba1863d72bb1ddbd45f8680218
5.142e+07 → 6.064e+07 (+9.22e+06)

Understanding performance regressions:
  http://g.co/ChromePerformanceRegressions

Benchmark documentation link:
  https://bit.ly/system-health-benchmarks
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/1017e12ee40000

All of the runs failed. The most common error (20/20 runs) was:
ReadValueError: Could not find values matching: {'story': u'browse:news:cnn', 'tir_label': u'browse_news', 'histogram': u'memory:chrome:all_processes:reported_by_chrome:malloc:effective_size_avg'}
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/1411520ae40000

All of the runs failed. The most common error (20/20 runs) was:
ReadValueError: Could not find values matching: {'story': u'browse:news:cnn', 'tir_label': u'browse_news', 'histogram': u'memory:chrome:all_processes:reported_by_chrome:malloc:effective_size_avg'}
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/14ce1a72e40000

Buildbucket says the build completed successfully, but Pinpoint can't find the isolate hash.
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/17216e7ae40000

Buildbucket says the build completed successfully, but Pinpoint can't find the isolate hash.
Cc: lizeb@chromium.org mmenke@chromium.org ericorth@chromium.org
I was able to reproduce this issue locally. Adding some net/ experts to help interpret the data. The description of the problem is below:

The predictor maintains a queue of preresolve requests and limits the number of active requests to 3. The predictor proceeds to preconnect only after a host has been successfully resolved. This is done to avoid congestion on dns worker threads, since a preconnect request may also trigger dns resolution.
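The queueing scheme described above can be illustrated with a minimal stand-alone C++ sketch of a bounded preresolve queue (not Chromium code; all names are illustrative):

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Sketch: a queue that keeps at most `max_in_flight` resolves active,
// starting the next pending host only when a slot frees up.
class PreresolveQueue {
 public:
  explicit PreresolveQueue(size_t max_in_flight)
      : max_in_flight_(max_in_flight) {}

  // Enqueue a host; it starts immediately if a slot is free.
  void Add(const std::string& host) {
    pending_.push_back(host);
    MaybeStartMore();
  }

  // Called when a resolve finishes (or is fully cancelled); frees a slot.
  void OnResolveComplete() {
    assert(in_flight_ > 0);
    --in_flight_;
    MaybeStartMore();
  }

  size_t in_flight() const { return in_flight_; }
  const std::vector<std::string>& started() const { return started_; }

 private:
  void MaybeStartMore() {
    while (in_flight_ < max_in_flight_ && !pending_.empty()) {
      started_.push_back(pending_.front());  // stands in for issuing the request
      pending_.pop_front();
      ++in_flight_;
    }
  }

  size_t max_in_flight_;
  size_t in_flight_ = 0;
  std::deque<std::string> pending_;
  std::vector<std::string> started_;  // recorded for inspection in this sketch
};
```

The invariant the predictor relies on is that `in_flight_` never exceeds 3; the regression below is essentially this invariant being violated because a "freed" slot did not correspond to a truly finished request.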

This works poorly if a proxy is enabled, because dns resolution isn't needed in that case. The CL https://crrev.com/c/1202222 makes the predictor issue a proxy lookup request in parallel with the host resolution. If the predictor determines that a proxy is enabled, it cancels the dns request and preconnects to the proxy server right away. Cancellation is implemented by destroying the mojo::Binding<network::mojom::ResolveHostClient>.

This CL caused a big 9 MiB regression in the system_health.memory_mobile benchmark (see c#3). The benchmark uses the Web Page Replay tool, which effectively means the browser loads resources through a proxy. I noticed that memory usage grows only if the device is *not connected to the network*. This makes me think that dns requests issued by the predictor still occupy memory for some time after being cancelled, and this time is longer when there is no connectivity.

I wrote a patch https://crrev.com/c/1292880 that switches the predictor to use ResolveHostHandle::Cancel() instead of just closing the pipe to cancel a request. I confirmed locally that this patch helps, though it doesn't recover 100% of the regression.
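The difference between the two cancellation styles can be sketched with a toy single-threaded model (plain C++, no mojo; all names are illustrative): closing the pipe is only observed later, when the service's connection error handler runs, while an explicit Cancel() acknowledged by OnComplete() lets the caller wait before moving on.

```cpp
#include <deque>
#include <functional>

// Toy message loop standing in for mojo's async delivery.
struct Loop {
  std::deque<std::function<void()>> tasks;
  void Post(std::function<void()> t) { tasks.push_back(std::move(t)); }
  void RunAll() {
    while (!tasks.empty()) {
      auto t = tasks.front();
      tasks.pop_front();
      t();
    }
  }
};

// Toy resolver: `active` counts requests the service still considers alive.
struct Resolver {
  int active = 0;
  Loop* loop = nullptr;

  void Start() { ++active; }

  // Closing the client pipe: the service only notices later, when its
  // connection error handler runs on the loop.
  void CloseClientPipe() {
    loop->Post([this] { --active; });
  }

  // Explicit Cancel(): the service cancels and acknowledges via OnComplete
  // before the caller proceeds (modeled synchronously here).
  void CancelAndWait(std::function<void()> on_complete) {
    --active;
    on_complete();
  }
};
```

In the fire-and-forget variant, rapid start/close cycles let `active` pile up until the loop drains; in the cancel-and-wait variant it never exceeds one, which matches the behavior the patch is after.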

ericorth@, mmenke@ do you think this is a plausible explanation?
We watch the |response_client_| pipe for errors, not the ResolveHostHandle.  Since we call it the handle, I think we should probably watch it for errors, too, when the consumer asks for one.  Alternatively, we could rename it.

[Eric]:  WDYT?
The predictor currently doesn't ask for ResolveHostHandle at all. I want to introduce the ResolveHostHandle in https://crrev.com/c/1292880 since it helps mitigate this regression. I'm going to call ResolveHostHandle::Cancel() and wait for ResolveHostClient::OnComplete().

I guess the |response_client_| pipe is checked for errors only when the ResolveHostRequest tries to write into/read from it, so the predictor has time to create a lot of new requests before the old ones are deleted.
No, we set up a connection error listener for it, so we should cancel the request as soon as there's an error.  If you're closing the client pipe, I'm not sure why that's any different from calling Cancel().
Sounds like we're already closing the correct pipe for cancellation (ResolveHostClient).  We deliberately made closing the ResolveHostHandle pipe not cancel anything, because with mojo timing there is no good way to ensure responses won't still come afterwards on the ResolveHostClient pipe, and we were worried about users thinking of pipe closure as a full hard cancel.  Any ideas on a different name for ResolveHostHandle that would make things clearer?

Regarding ResolveHostHandle::Cancel() vs closing ResolveHostClient, I'll investigate a bit, but the only practical difference should be whether or not we send ResolveHostClient::OnComplete.  The underlying process cancellation should be the same, so I have no idea how it would cause the memory difference you are seeing.
Just double-checked the code.  The connection error listener we set up on the ResolveHostClient pipe simply calls the implementation method of ResolveHostHandle::Cancel() (which exists even if there isn't a ResolveHostHandle pipe bound to it) with ERR_FAILED.  It then cancels the internal request and sends the ERR_FAILED response out on the ResolveHostClient pipe (which we don't expect anything to happen with, since the client has closed their end of the pipe) just before we close the service's side of the ResolveHostClient pipe.

So within the layer of the service implementation, both cancellation methods are *exactly* the same.  The only room I can think of for a difference is something within the mojo infrastructure.  Maybe sending that response on the closed ResolveHostClient pipe holds some stuff open for a bit while it waits to see if the client rebinds the pipe?
But when does the connection error handler get called? I have a trace showing that there could be 40ms delay between closing a pipe and calling the connection error handler.

I've added some custom trace events and ran this benchmark with and without the patch https://crrev.com/c/1292880.

I logged the lifetime of each ResolveHostRequest with is_speculative=true (so it's issued by the predictor). We can see in the "browser_news_cnn.html" trace that there could be up to 47 requests alive simultaneously.

After the patch, we always wait for ResolveHostClient::OnComplete() event before starting a new request. The number of simultaneous requests is limited to 3, as we wanted.

Another question is why there is an 8 MiB difference in browser malloc memory between these two traces. This difference reproduces consistently.


Attachments: browser_news_cnn_with_patch.html (3.9 MB), browser_news_cnn.html (3.9 MB)
How does that 40ms delay compare to the delay between calling ResolveHostHandle::Cancel() and invocation of the Cancel() implementation? If there's a significant difference, I'd say that's our main culprit: connection error handling taking much longer than we expect.

And is your measured 8 MB difference normalized for the number of simultaneous requests? If one trace is only making 3 simultaneous requests while the other has 47 due to the delay between calling cancel and it actually getting cancelled, I don't know how the difference would be as big as 8 MB, but I would expect some difference.
I don't start a new request until I get ResolveHostClient::OnComplete() (triggered by ResolveHostHandle::Cancel()), that's different from what you've described. I think this is the right thing to do. We should receive a confirmation that the previous request was cancelled before starting a new one.

No, the 8 MB difference is not normalized in any way. It's just the amount of memory consumed by the browser process. I'm surprised mainly because the memory dump is triggered after the page is loaded (it's 5s after the last predictor action) and I'd assume that all requests are gone at this point.

Also, 8 MB does sound like too much (roughly 160 KB per request).
Right.  For your specific usecase, I think not starting a new request until the old one is fully cancelled is the way to go, and whether the two methods of cancellation take different time or not, the best way to wait for cancellation to finish is ResolveHostHandle::Cancel() and wait for ResolveHostClient::OnComplete().

I'm still curious about the timing of the two different ways of cancellation, if you've been able to measure it.  If the time from calling the explicit Cancel() to the service's Cancel() implementation being invoked is significantly shorter than 40ms, then that shows explicit messages go through faster than connection error listeners, and we should then encourage calling Cancel() even in usecases unlike yours that don't care about waiting for cancellation to go through.

I agree that 160 KB/req sounds high, but I honestly have no hypotheses about what's taking all that memory, especially if that's after all requests are complete and gone.
If we're using the system resolver, each request needs its own thread.  I have no idea how much memory is associated with creating more threads, though.  Also, if that's the case, cancelling requests doesn't actually free up the memory (since we can't cancel getaddrinfo).
But if it's 5s after page load is complete, I would expect all those getaddrinfo calls to be done along with their threads.

The threads for calling the system resolver are all created by a TaskRunner.  Maybe it's creating new threads for most of these requests and not immediately cleaning up the threads on completion? I'm not very familiar with how Chrome manages its threadpools.  Overall, probably fine as long as it cleans them up eventually.  If something is intelligently managing Chrome threads and thread cleanup, 8 MB is insignificant compared to everything going on, so I could see things waiting a bit to clean it up.  So yeah, if making more host resolution requests results in a few more threads being created wherever Chrome is managing threads, 8 MB sounds very reasonable.  Thread stacks take a lot of memory.
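The thread-growth effect discussed here can be modeled with a toy worker pool (a sketch, not Chromium's TaskScheduler; names are made up): if a "cancelled" blocking call keeps its worker busy, rapid cancel-and-retry inflates the pool, while waiting for actual completion keeps it at one worker.

```cpp
// Toy model of a growing pool: a new worker is created whenever a task is
// submitted and every existing worker is still blocked. "Cancelling" a
// getaddrinfo-style blocking call does NOT unblock its worker; only the
// blocking call actually returning frees it.
struct Pool {
  int busy = 0;   // workers currently blocked in a system call
  int total = 0;  // workers ever created (thread stacks allocated)

  void Submit() {
    if (busy == total) ++total;  // all workers blocked -> grow the pool
    ++busy;
  }

  // Only actual completion of the blocking call frees a worker;
  // cancellation from the caller's side changes nothing here.
  void TaskFinished() { --busy; }
};
```

Under this model, firing 40 cancel-and-retry resolves back to back creates 40 workers, while a wait-for-completion policy reuses a single worker — roughly the 27-vs-8 thread difference reported in the traces below.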
Don't we have a fixed-size pool dedicated to performing getaddrinfo? I thought we could have a maximum of 6 threads in this pool.

I'll collect a trace with more enabled categories tomorrow to see if the browser really creates more threads.
When we cancel, we don't wait for completion.  Also, if a job takes too long, we'll issue another request in parallel, I believe.
I collected traces displaying the number of TaskSchedulerForegroundWorker threads and got *27 threads vs 8* for ToT and after the patch https://crrev.com/c/1292880, respectively. It's not clear from the trace whether some threads are cleaned up on completion, but I think this is our culprit.

Regarding the timing of the two different ways of cancellation. I compared two versions of the predictor with only one difference in them: one of them calls ResolveHostHandle::Cancel() before closing the pipe and the other one doesn't. I haven't noticed any difference between these two versions. I think the timing heavily depends on the pipe congestion. The more messages there are waiting in the queue, the higher the latency.
Thanks for investigating the timing.  From the sound of it, we don't need to recommend changes to other usecases.

Matt is correct.  We allow a limited number of concurrent lookups in the HostResolver (not strictly a fixed threadpool), but retries don't take slots and a lookup stops occupying a slot as soon as it's "cancelled".  In the case of resolves done via system call (which are not cancellable calls), "cancellation" doesn't actually do anything except kill the WeakPtr used for it to complete, thus orphaning the thread.

I don't think we have a good general solution to minimizing thread use other than getting more stuff off the system resolver onto the built-in resolver (that we can actually cancel).  That's a long-term solution for another day.

Specific to your usecase, where I believe this is the only code that will issue new resolve requests after cancelling (every other cancellation is the usecase where something is going away and won't be around to care about the response anymore), what we might want is a stronger indication of when the cancelled resolve is completely done/cancelled.  Currently the internal request supports no indication of cancellation completion.  The service code responds ResolveHostClient::OnComplete() once cancellation has been triggered in the internal request.  We could wire up a more elegant cancellation in the internal request and avoid sending OnComplete() until it is fully cancelled (which in the case of system calls basically means actual resolve completion).
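A minimal sketch of the "defer OnComplete() until full cancellation" idea above (plain C++ with illustrative names; the real wiring would live in the service's internal request): Cancel() only marks the request, and OnComplete() is withheld until the uncancellable underlying call actually returns.

```cpp
#include <functional>

// Sketch: a request whose OnComplete callback fires only once the underlying
// system call (e.g. getaddrinfo, which can't be interrupted) has returned,
// regardless of when Cancel() was requested.
struct DeferredCancelRequest {
  bool cancelled = false;
  bool system_call_done = false;
  bool on_complete_sent = false;
  std::function<void()> on_complete;

  void Cancel() {
    cancelled = true;
    MaybeSendOnComplete();  // no-op while the system call is still running
  }

  void OnSystemCallFinished() {
    system_call_done = true;
    MaybeSendOnComplete();
  }

 private:
  void MaybeSendOnComplete() {
    if (system_call_done && !on_complete_sent) {
      on_complete_sent = true;
      if (on_complete) on_complete();
    }
  }
};
```

With this shape, a caller like the predictor that waits for OnComplete() before starting the next request would never start one while a worker thread is still pinned by the old call.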
Project Member

Comment 28 by bugdroid1@chromium.org, Oct 24

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/512b3dda649469c65d311056792c219cca3360c3

commit 512b3dda649469c65d311056792c219cca3360c3
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Wed Oct 24 20:00:17 2018

predictors: Always wait for dns job before starting a new one

PreconnectManager is designed with restriction on the maximum number of
simultaneous dns requests. There are instances of violation of this
constraint.

Each dns job has an alternative proxy resolve job that can finish first,
cancel pending dns job and schedule a new one. However, this cancelling
is asynchronous and wouldn't have an effect if a dns job was already
scheduled in the network stack.

This CL makes the PreconnectManager wait for dns job being cancelled
before scheduling a new one.

Bug:  883806 
Change-Id: I8e90ad0dce55b569a4c77b202435f70a6b776231
Reviewed-on: https://chromium-review.googlesource.com/c/1292880
Reviewed-by: Benoit L <lizeb@chromium.org>
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602409}
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager.h
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager_unittest.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/resolve_host_client_impl.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/resolve_host_client_impl.h

Surprisingly, the CL from c#28 didn't help recover the regression, even partially.

The next thing I'm going to try is sending proxy and dns requests sequentially instead of in parallel.

Re c#27: I agree it would be nice to avoid sending OnComplete() until the internal request is fully cancelled. But if my code is the only one that has this kind of problem, I don't know if it's worth fixing. I'm going to try to fix this from the predictor's side.
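The sequential scheme could look roughly like this (a sketch, not the actual CL; all names are made up, and the async lookups are stubbed synchronously): resolve the proxy first, and only issue a dns request when no proxy applies, so there is never a dns request to cancel.

```cpp
#include <functional>
#include <string>
#include <vector>

// Sketch: proxy lookup first, dns second, never both in flight.
struct SequentialPreconnector {
  std::vector<std::string> log;  // records actions for inspection

  void Preresolve(const std::string& host, bool proxy_enabled) {
    LookupProxy(host, proxy_enabled, [this, host](bool has_proxy) {
      if (has_proxy) {
        // Proxy in use: no dns needed, preconnect to the proxy directly.
        log.push_back("preconnect-proxy:" + host);
      } else {
        // No proxy: only now issue the dns request.
        ResolveDns(host, [this, host] { log.push_back("preconnect:" + host); });
      }
    });
  }

 private:
  // Synchronous stand-ins for the async proxy lookup and host resolution.
  void LookupProxy(const std::string&, bool proxy_enabled,
                   std::function<void(bool)> cb) {
    cb(proxy_enabled);
  }
  void ResolveDns(const std::string&, std::function<void()> cb) { cb(); }
};
```

The trade-off versus the parallel version is a bit of added latency in the no-proxy case (the dns request waits for the proxy lookup), in exchange for never creating dns work that has to be cancelled.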

😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/10adfb79e40000

All of the runs failed. The most common error (15/20 runs) was:
CommandTimeoutError: Timed out waiting for 1 of 1 threads.
📍 Job complete. See results below.
https://pinpoint-dot-chromeperf.appspot.com/job/149b5b91e40000
😿 Pinpoint job stopped with an error.
https://pinpoint-dot-chromeperf.appspot.com/job/17526cd9e40000

All of the runs failed. The most common error (20/20 runs) was:
BuildError: Build failed: BUILD_FAILURE
Project Member

Comment 37 by bugdroid1@chromium.org, Nov 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35

commit 24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Wed Nov 07 18:31:07 2018

predictors: Issue a dns request only after a proxy lookup is complete

Active cancellation of dns requests may lead to creation of too many background
threads performing getaddrinfo. It happens because we can't cancel getaddrinfo,
and the task runner will allocate a new thread even though a thread with the
cancelled getaddrinfo is still alive.

This reverts commit 512b3dda649469c65d311056792c219cca3360c3 and applies new
patches on top of this revert.

Bug:  883806 
Change-Id: Ie731783b86300a312de56b3e281a640fb23829fb
Reviewed-on: https://chromium-review.googlesource.com/c/1307445
Reviewed-by: Benoit L <lizeb@chromium.org>
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606088}
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager.h
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager_unittest.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/resolve_host_client_impl.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/resolve_host_client_impl.h

alexilin: The CL in c#37 probably should not have passed the CQ. It looks like it causes a predictor test to fail with >90% probability.

For details, see https://bugs.chromium.org/p/chromium/issues/detail?id=900370#c1

Is it safe to revert?
I believe I've fixed this in
https://bugs.chromium.org/p/chromium/issues/detail?id=902963#c1. Are you
still seeing predictor test failures?

Cc: jbudorick@chromium.org
📍 Found significant differences after each of 3 commits.
https://pinpoint-dot-chromeperf.appspot.com/job/103380a7e40000

predictors: Issue a dns request only after a proxy lookup is complete by alexilin@chromium.org
https://chromium.googlesource.com/chromium/src/+/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: 5.485e+07 → 5.271e+07 (-2.131e+06)

Build internal android targets in telemetry_chrome_test. by jbudorick@chromium.org
https://chromium.googlesource.com/chromium/src/+/e0c7bc0e869c70995d8b53d9a22e68294f07c7f2
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: 5.115e+07 → No values

Revert "Build internal android targets in telemetry_chrome_test." by jbudorick@chromium.org
https://chromium.googlesource.com/chromium/src/+/c7539ef6291836eff22077b38ae55bb62f536b99
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: No values → 4.862e+07

Understanding performance regressions:
  http://g.co/ChromePerformanceRegressions

Benchmark documentation link:
  https://bit.ly/system-health-benchmarks
Status: Fixed (was: Assigned)
There was a significant drop after the CL in c#37 landed that brought all the graphs back to normal. The bisect from c#41 confirms that my change is indeed responsible.

Closing this issue as Fixed.
