Issue metadata
1.1%-16.3% regression in system_health.memory_mobile at 589850:589880
Issue description: See the link to graphs below.
,
Sep 13
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/12316ac3640000
,
Sep 14
📍 Found a significant difference after 1 commit. https://pinpoint-dot-chromeperf.appspot.com/job/12316ac3640000

predictors: Request a proxy lookup in parallel with host resolve request by alexilin@chromium.org
https://chromium.googlesource.com/chromium/src/+/248c267fc1c931ba1863d72bb1ddbd45f8680218
5.142e+07 → 6.064e+07 (+9.22e+06)

Understanding performance regressions: http://g.co/ChromePerformanceRegressions
Benchmark documentation link: https://bit.ly/system-health-benchmarks
,
Oct 15
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/1017e12ee40000
,
Oct 15
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/1411520ae40000
,
Oct 15
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/1017e12ee40000 All of the runs failed. The most common error (20/20 runs) was: ReadValueError: Could not find values matching: {'story': u'browse:news:cnn', 'tir_label': u'browse_news', 'histogram': u'memory:chrome:all_processes:reported_by_chrome:malloc:effective_size_avg'}
,
Oct 15
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/1411520ae40000 All of the runs failed. The most common error (20/20 runs) was: ReadValueError: Could not find values matching: {'story': u'browse:news:cnn', 'tir_label': u'browse_news', 'histogram': u'memory:chrome:all_processes:reported_by_chrome:malloc:effective_size_avg'}
,
Oct 22
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/14ce1a72e40000
,
Oct 22
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/14ce1a72e40000 Buildbucket says the build completed successfully, but Pinpoint can't find the isolate hash.
,
Oct 22
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/17216e7ae40000
,
Oct 22
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/17216e7ae40000 Buildbucket says the build completed successfully, but Pinpoint can't find the isolate hash.
,
Oct 22
I was able to reproduce this issue locally. Adding some net/ experts to help interpret the data. The description of the problem is below.

The predictor maintains a queue of preresolve requests and limits the number of active requests to 3. The predictor proceeds to preconnect only if a host was successfully resolved. This is done to avoid congestion on DNS worker threads, since a preconnect request may also trigger DNS resolution. This works poorly when a proxy is enabled, because DNS resolution isn't needed in that case. The CL https://crrev.com/c/1202222 makes the predictor issue a proxy lookup request in parallel with the host resolution. If the predictor determines that a proxy is enabled, it cancels the DNS request and preconnects to the proxy server right away. Cancellation is implemented by destructing the mojo::Binding<network::mojom::ResolveHostClient>.

This CL caused a big 9 MiB regression in the system_health.memory_mobile benchmark (see c#3). The benchmark uses the Web Page Replay tool, which effectively means the browser uses a proxy to load resources. I noticed that memory usage grows only if a device is *not connected to the network*. This makes me think that DNS requests issued by the predictor still occupy memory for some time after being cancelled, and that this time is longer when there is no connectivity.

I wrote a patch https://crrev.com/c/1292880 that switches the predictor to use ResolveHostHandle::Cancel() instead of just closing a pipe for request cancellation. I confirmed locally that this patch helps, though it doesn't recover 100% of the regression. ericorth@, mmenke@, do you think this is a plausible explanation?
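To make the difference between the two cancellation styles concrete, here is a minimal standalone model; it is illustrative only (FakeResolveRequest and its methods are invented names, not the actual Chromium classes). An explicit Cancel() is acknowledged immediately through the OnComplete-style callback, while dropping the client end of the pipe is only noticed once the connection-error handler eventually runs.

```cpp
// Standalone model (illustrative names, not the actual Chromium classes) of
// the two cancellation styles discussed above.
#include <functional>
#include <iostream>
#include <utility>

// Models the service side of one speculative host-resolution request.
class FakeResolveRequest {
 public:
  using CompleteCallback = std::function<void(int result)>;

  explicit FakeResolveRequest(CompleteCallback on_complete)
      : on_complete_(std::move(on_complete)) {}

  // Models an explicit Cancel(): the request is torn down now and the client
  // still receives an OnComplete()-style acknowledgement.
  void Cancel(int error) {
    if (done_) return;
    done_ = true;
    on_complete_(error);  // The caller can safely start its next request here.
  }

  // Models closing the client end of the pipe: nothing happens until the
  // service's connection-error handler eventually runs, so the request
  // lingers in the meantime.
  void CloseClientPipe() { client_pipe_open_ = false; }

  // Called later, when the disconnect is finally noticed.
  void OnConnectionError() {
    if (!client_pipe_open_) Cancel(/*error=*/-2);  // ERR_FAILED-like value.
  }

 private:
  CompleteCallback on_complete_;
  bool client_pipe_open_ = true;
  bool done_ = false;
};

int main() {
  // Explicit cancellation: acknowledged immediately.
  FakeResolveRequest explicit_cancel(
      [](int result) { std::cout << "OnComplete(" << result << ")\n"; });
  explicit_cancel.Cancel(-2);

  // Pipe-closure cancellation: the acknowledgement arrives only after the
  // error handler runs, so many such requests can be alive at once.
  FakeResolveRequest pipe_close(
      [](int result) { std::cout << "OnComplete(" << result << ")\n"; });
  pipe_close.CloseClientPipe();
  // ... time passes; other requests may be issued meanwhile ...
  pipe_close.OnConnectionError();
  return 0;
}
```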
,
Oct 22
We watch the |response_client_| pipe for errors, not the ResolveHostHandle. Since we call it the handle, I think we should probably watch it for errors too, when the consumer asks for one. Alternatively, we could rename it. Eric, WDYT?
,
Oct 22
The predictor currently doesn't ask for a ResolveHostHandle at all. I want to introduce the ResolveHostHandle in https://crrev.com/c/1292880 since it helps mitigate this regression: I'm going to call ResolveHostHandle::Cancel() and wait for ResolveHostClient::OnComplete(). I guess the |response_client_| pipe is checked for errors only when the ResolveHostRequest tries to write to or read from it, so the predictor has time to create a lot of new requests before the old ones are deleted.
,
Oct 22
No, we set up a connection error listener for it, so we should cancel the request as soon as there's an error. If you're closing the client pipe, I'm not sure why that's any different from calling Cancel().
,
Oct 22
Sounds like we're already closing the correct pipe for cancellation (ResolveHostClient). We deliberately made closing the ResolveHostHandle pipe not cancel anything, because with mojo timing there is no good way to ensure responses won't still come afterwards on the ResolveHostClient pipe, and we were worried about users thinking of pipe closure as a full hard cancel. Any ideas on a different name for ResolveHostHandle that would make things clearer?

Regarding ResolveHostHandle::Cancel() vs closing ResolveHostClient, I'll investigate a bit, but the only practical difference should be whether or not we send ResolveHostClient::OnComplete(). The underlying cancellation should be the same, so I have no idea how it would cause the memory difference you are seeing.
,
Oct 22
Just double-checked the code. The connection error listener we set up on the ResolveHostClient pipe simply calls the implementation method of ResolveHostHandle::Cancel() (which exists even if there isn't a ResolveHostHandle pipe bound to it) with ERR_FAILED. It then cancels the internal request and sends the ERR_FAILED response out on the ResolveHostClient pipe (which we don't expect anything to happen with, since the client has closed their end of the pipe) just before we close the service's side of the ResolveHostClient pipe. So within the layer of the service implementation, both cancellation methods are *exactly* the same. The only room I can think of for a difference is something within the mojo infrastructure. Maybe sending that response on the closed ResolveHostClient pipe holds some stuff open for a bit while it waits to see if the client rebinds the pipe?
,
Oct 22
But when does the connection error handler get called? I have a trace showing that there can be a 40 ms delay between closing a pipe and the connection error handler being called. I added some custom trace events and ran this benchmark with and without the patch https://crrev.com/c/1292880, logging the lifetime of each ResolveHostRequest with is_speculative=true (i.e. issued by the predictor). We can see in the "browser_news_cnn.html" trace that there can be up to 47 requests alive simultaneously. After the patch, we always wait for the ResolveHostClient::OnComplete() event before starting a new request, and the number of simultaneous requests is limited to 3, as we wanted. Another question is why there is an 8 MiB difference in browser malloc memory between these two traces. This difference is stably reproduced.
,
Oct 22
How does that 40 ms delay compare to the delay between calling ResolveHostHandle::Cancel() and the invocation of the Cancel() implementation? If there is a significant difference, I'd say that's our main culprit: connection error handling taking much longer than we expect. And is your measured 8 MiB difference normalized for the number of simultaneous requests? If one trace is only making 3 simultaneous requests while the other has 47 due to the delay between calling cancel and the request actually getting cancelled, I don't know how the difference could be as big as 8 MiB, but I would expect some difference.
,
Oct 22
I don't start a new request until I get ResolveHostClient::OnComplete() (triggered by ResolveHostHandle::Cancel()), which is different from what you've described. I think this is the right thing to do: we should receive confirmation that the previous request was cancelled before starting a new one. No, the 8 MiB difference is not normalized in any way; it's just the amount of memory consumed by the browser process. I'm surprised mainly because the memory dump is triggered after the page is loaded (5 s after the last predictor action) and I'd assume that all requests are gone at this point. Also, 8 MiB does sound like too much (roughly 160 KiB per request).
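For illustration, here is a hedged sketch of the bookkeeping described in this thread, under the assumption of a hypothetical BoundedPreresolveQueue rather than the real PreconnectManager: at most three speculative lookups are in flight, and a slot is reused only after the completion callback (the analogue of ResolveHostClient::OnComplete()) has been delivered for the previous, possibly cancelled, request.

```cpp
// Hypothetical sketch (BoundedPreresolveQueue is an invented name, not the
// real PreconnectManager) of the policy described above: at most three
// speculative lookups in flight, and a slot is reused only once the
// completion callback for the previous, possibly cancelled, request arrives.
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>

class BoundedPreresolveQueue {
 public:
  // Starts one lookup and invokes on_complete with its result code.
  using StartRequest =
      std::function<void(const std::string& host,
                         std::function<void(int)> on_complete)>;

  explicit BoundedPreresolveQueue(StartRequest start)
      : start_(std::move(start)) {}

  void Enqueue(const std::string& host) {
    pending_.push(host);
    MaybeStartNext();
  }

 private:
  void MaybeStartNext() {
    while (in_flight_ < kMaxInflight && !pending_.empty()) {
      std::string host = pending_.front();
      pending_.pop();
      ++in_flight_;
      // The slot is released only in the completion callback, so a cancelled
      // request keeps its slot until its acknowledgement is delivered.
      start_(host, [this](int /*result*/) {
        --in_flight_;
        MaybeStartNext();
      });
    }
  }

  static constexpr std::size_t kMaxInflight = 3;

  StartRequest start_;
  std::queue<std::string> pending_;
  std::size_t in_flight_ = 0;
};
```

The design point is simply that the acknowledgement, not the cancellation call, is what frees a slot; if cancellation is merely requested but not acknowledged, nothing bounds the number of live requests.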
,
Oct 22
Right. For your specific use case, I think not starting a new request until the old one is fully cancelled is the way to go, and whether the two methods of cancellation take different time or not, the best way to wait for cancellation to finish is to call ResolveHostHandle::Cancel() and wait for ResolveHostClient::OnComplete(). I'm still curious about the timing of the two different ways of cancellation, if you've been able to measure it. If the time from calling explicit Cancel() to the service's Cancel() implementation being invoked is significantly faster than 40 ms, that shows explicit messages go through faster than connection error listeners, and we should then encourage calling Cancel() even in use cases unlike yours that don't care about waiting for cancellation to go through. I agree that 160 KiB per request sounds high, but I honestly have no hypotheses about what's taking all that memory, especially if that's after all requests are complete and gone.
,
Oct 22
If we're using the system resolver, each request needs its own thread. I have no idea how much memory is associated with creating more threads, though. Also, if that's the case, cancelling requests doesn't actually free up the memory (since we can't cancel getaddrinfo).
,
Oct 22
But if it's 5 s after page load is complete, I would expect all those getaddrinfo calls to be done, along with their threads. The threads for calling the system resolver are all created by a TaskRunner. Maybe it's creating new threads for most of these requests and not immediately cleaning up the threads on completion? I'm not very familiar with how Chrome manages its thread pools. Overall, it's probably fine as long as it cleans them up eventually. If something is intelligently managing Chrome threads and thread cleanup, 8 MiB is insignificant compared to everything else going on, so I could see things waiting a bit to clean it up. So yeah, if making more host resolution requests results in a few more threads being created wherever Chrome is managing threads, 8 MiB sounds very reasonable. Thread stacks take a lot of memory.
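A small standalone program (plain C++ and POSIX getaddrinfo, not Chrome's task scheduler) illustrates the point about uncancellable system resolves: once the blocking call is running, "cancelling" can only mean ignoring its result, and the worker thread and its stack stay occupied until the call returns on its own.

```cpp
// Minimal illustration (plain C++ and POSIX, not Chrome's task scheduler) of
// why a "cancelled" system resolve still ties up a worker thread: the
// blocking getaddrinfo() call cannot be interrupted, so cancellation can only
// mean ignoring its result once it eventually returns.
#include <netdb.h>
#include <sys/socket.h>
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
  std::atomic<bool> cancelled{false};

  std::thread worker([&cancelled] {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;
    addrinfo* result = nullptr;
    // This call blocks; without network connectivity it can take a while.
    int rv = getaddrinfo("example.invalid", nullptr, &hints, &result);
    if (result) freeaddrinfo(result);
    if (cancelled.load()) {
      // The caller gave up long ago; all we can do is drop the result. The
      // thread and its stack were occupied the whole time regardless.
      return;
    }
    std::printf("getaddrinfo returned %d\n", rv);
  });

  // "Cancel" the request: the worker keeps running until getaddrinfo()
  // returns on its own, and a fresh lookup would need yet another thread.
  cancelled.store(true);

  worker.join();
  return 0;
}
```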
,
Oct 22
Don't we have a fixed-size pool dedicated to performing getaddrinfo? I thought we could have a maximum of 6 threads in this pool. I'll collect a trace with more categories enabled tomorrow to see whether the browser really creates more threads.
,
Oct 22
When we cancel, we don't wait for completion. Also, if a job takes too long, we'll issue another request in parallel, I believe.
,
Oct 23
I collected traces displaying the number of TaskSchedulerForegroundWorker threads and got *27 threads vs 8* for ToT and with the patch https://crrev.com/c/1292880, respectively. It's not clear from the trace whether some threads are cleaned up on completion, but I think this is our culprit. Regarding the timing of the two different ways of cancellation: I compared two versions of the predictor with only one difference between them, one calls ResolveHostHandle::Cancel() before closing the pipe and the other doesn't. I haven't noticed any difference between these two versions. I think the timing heavily depends on pipe congestion: the more messages there are waiting in the queue, the higher the latency.
,
Oct 23
Thanks for investigating the timing. From the sound of it, we don't need to recommend changes to other use cases.

Matt is correct. We allow a limited number of concurrent lookups in the HostResolver (not strictly a fixed thread pool), but retries don't take slots and a lookup stops occupying a slot as soon as it's "cancelled". In the case of resolves done via system call (which are not cancellable), "cancellation" doesn't actually do anything except kill the WeakPtr used for it to complete, thus orphaning the thread. I don't think we have a good general solution to minimizing thread use other than getting more stuff off the system resolver and onto the built-in resolver (which we can actually cancel). That's a long-term solution for another day.

Specific to your use case, where I believe this is the only code that will issue new resolve requests after cancelling (every other cancellation is the case where something is going away and won't be around to care about the response anymore), what we might want is a stronger indication of when the cancelled resolve is completely done/cancelled. Currently the internal request supports no indication of cancellation completion; the service code responds with ResolveHostClient::OnComplete() once cancellation has been triggered in the internal request. We could wire up a more elegant cancellation in the internal request and avoid sending OnComplete() until it is fully cancelled (which in the case of system calls basically means actual resolve completion).
,
Oct 24
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/512b3dda649469c65d311056792c219cca3360c3

commit 512b3dda649469c65d311056792c219cca3360c3
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Wed Oct 24 20:00:17 2018

predictors: Always wait for dns job before starting a new one

PreconnectManager is designed with restriction on the maximum number of simultaneous dns requests. There are instances of violation of this constraint. Each dns job has an alternative proxy resolve job that can finish first, cancel pending dns job and schedule a new one. However, this cancelling is asynchronous and wouldn't have an effect if a dns job was already scheduled in the network stack. This CL makes the PreconnectManager wait for dns job being cancelled before scheduling a new one.

Bug: 883806
Change-Id: I8e90ad0dce55b569a4c77b202435f70a6b776231
Reviewed-on: https://chromium-review.googlesource.com/c/1292880
Reviewed-by: Benoit L <lizeb@chromium.org>
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#602409}

[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager.h
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/preconnect_manager_unittest.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/resolve_host_client_impl.cc
[modify] https://crrev.com/512b3dda649469c65d311056792c219cca3360c3/chrome/browser/predictors/resolve_host_client_impl.h
,
Oct 30
Surprisingly, the CL from c#28 didn't help recover the regression, even partially. The next thing I'm going to try is to send proxy and dns requests sequentially instead of in parallel. Re c#27: I agree it would be nice to avoid sending OnComplete() until the internal request is fully cancelled, but if my code is the only one that has this kind of problem, I don't know whether this is worth fixing. I'm going to try to fix this from the predictor's side.
,
Oct 30
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/10adfb79e40000
,
Oct 31
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/10adfb79e40000 All of the runs failed. The most common error (15/20 runs) was: CommandTimeoutError: Timed out waiting for 1 of 1 threads.
,
Oct 31
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/144ff441e40000
,
Oct 31
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/149b5b91e40000
,
Nov 1
📍 Job complete. See results below. https://pinpoint-dot-chromeperf.appspot.com/job/149b5b91e40000
,
Nov 5
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/17526cd9e40000
,
Nov 5
😿 Pinpoint job stopped with an error. https://pinpoint-dot-chromeperf.appspot.com/job/17526cd9e40000 All of the runs failed. The most common error (20/20 runs) was: BuildError: Build failed: BUILD_FAILURE
,
Nov 7
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35

commit 24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35
Author: Alexandr Ilin <alexilin@chromium.org>
Date: Wed Nov 07 18:31:07 2018

predictors: Issue a dns request only after a proxy lookup is complete

Active cancellation of dns requests may lead to creation of too many background threads performing getaddrinfo. It happens because we can't cancel getaddrinfo, and the task runner will allocate a new thread even though a thread with the cancelled getaddrinfo is still alive. This reverts commit 512b3dda649469c65d311056792c219cca3360c3 and applies new patches on top of this revert.

Bug: 883806
Change-Id: Ie731783b86300a312de56b3e281a640fb23829fb
Reviewed-on: https://chromium-review.googlesource.com/c/1307445
Reviewed-by: Benoit L <lizeb@chromium.org>
Commit-Queue: Alexandr Ilin <alexilin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606088}

[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager.h
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/preconnect_manager_unittest.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/resolve_host_client_impl.cc
[modify] https://crrev.com/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35/chrome/browser/predictors/resolve_host_client_impl.h
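As an illustration only, here is a hypothetical callback-based model of the sequential flow this commit describes (PreconnectStep and its members are invented names, not the actual PreconnectManager API): the proxy lookup runs first, and a DNS pre-resolve is issued only when the connection would not use a proxy, so there is never an in-flight DNS request to cancel.

```cpp
// Hypothetical callback-based model of the sequential flow (PreconnectStep
// and its members are invented names, not the actual PreconnectManager API):
// the proxy lookup runs first, and a DNS pre-resolve is issued only when no
// proxy would be used, so there is never an in-flight DNS request to cancel.
#include <functional>
#include <string>

struct PreconnectStep {
  using ProxyLookup =
      std::function<void(const std::string& url,
                         std::function<void(bool uses_proxy)> done)>;
  using DnsPreresolve =
      std::function<void(const std::string& host, std::function<void()> done)>;
  using Preconnect = std::function<void(const std::string& url)>;

  ProxyLookup lookup_proxy;
  DnsPreresolve preresolve;
  Preconnect preconnect;

  void Run(const std::string& url, const std::string& host) {
    lookup_proxy(url, [this, url, host](bool uses_proxy) {
      if (uses_proxy) {
        // The proxy handles name resolution, so connect to it right away.
        preconnect(url);
        return;
      }
      // Only now start the DNS work; nothing runs in parallel, so nothing has
      // to be cancelled and no extra getaddrinfo threads pile up.
      preresolve(host, [this, url] { preconnect(url); });
    });
  }
};
```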
,
Nov 9
alexilin: The CL in c#37 probably should not have passed the CQ. It looks like it causes a predictor test to fail with >90% probability. For details, see https://bugs.chromium.org/p/chromium/issues/detail?id=900370#c1. Is it safe to revert?
,
Nov 9
I believe I've fixed this in https://bugs.chromium.org/p/chromium/issues/detail?id=902963#c1. Are you still seeing predictor test failures?
,
Nov 12
📍 Pinpoint job started. https://pinpoint-dot-chromeperf.appspot.com/job/103380a7e40000
,
Nov 12
📍 Found significant differences after each of 3 commits. https://pinpoint-dot-chromeperf.appspot.com/job/103380a7e40000

predictors: Issue a dns request only after a proxy lookup is complete by alexilin@chromium.org
https://chromium.googlesource.com/chromium/src/+/24a26cf8c72d2b9d0faf43d21bf20ea6e8de5a35
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: 5.485e+07 → 5.271e+07 (-2.131e+06)

Build internal android targets in telemetry_chrome_test. by jbudorick@chromium.org
https://chromium.googlesource.com/chromium/src/+/e0c7bc0e869c70995d8b53d9a22e68294f07c7f2
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: 5.115e+07 → No values

Revert "Build internal android targets in telemetry_chrome_test." by jbudorick@chromium.org
https://chromium.googlesource.com/chromium/src/+/c7539ef6291836eff22077b38ae55bb62f536b99
memory:chrome:all_processes:reported_by_chrome:malloc:effective_size: No values → 4.862e+07

Understanding performance regressions: http://g.co/ChromePerformanceRegressions
Benchmark documentation link: https://bit.ly/system-health-benchmarks
,
Nov 12
There was a significant drop after the CL in c#37 landed that brought all the graphs back to normal. The bisect from c#41 confirms that my change is indeed involved. Closing this issue as Fixed.