New issue
Advanced search Search tips

Issue 907263 link

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

Excessive SetCookie activity, leading to lots of CPU time and GB of RAM consumed

Project Member Reported by brucedaw...@chromium.org, Nov 20

Issue description

Excessive memory usage in Chrome's browser process was noticed, together with poor performance. A 157.5 s ETW trace showed that the CPU usage (and presumably also the memory) was associated with hundreds of thousands of calls to content::RenderFrameMessageFilter::SetCookie.

The hundreds of thousands of calls were inferred because there were 242,123 context switches where the browser process' main thread was woken up by one of the browser-process threads (18528) from this stack:

  |- ntdll.dll!RtlUserThreadStart
  |    kernel32.dll!BaseThreadInitThunk
  |    chrome.dll!base::`anonymous namespace'::ThreadFunc
  |    chrome.dll!base::Thread::ThreadMain
  |    chrome.dll!content::BrowserProcessSubThread::IOThreadRun
  |    chrome.dll!base::RunLoop::Run
  |    chrome.dll!base::MessagePumpWin::Run
  |    chrome.dll!base::MessagePumpForIO::DoRunLoop
  |    |- chrome.dll!base::MessageLoop::DoWork
  |    |    chrome.dll!base::MessageLoop::RunTask
  |    |    chrome.dll!base::debug::TaskAnnotator::RunTask
  |    |    |- chrome.dll!mojo::SimpleWatcher::OnHandleReady
  |    |    |    |- chrome.dll!mojo::Connector::ReadAllAvailableMessages
  |    |    |    |    chrome.dll!mojo::Connector::ReadSingleMessage
  |    |    |    |    |- chrome.dll!IPC::`anonymous namespace'::ChannelAssociatedGroupController::Accept
  |    |    |    |    |    |- chrome.dll!mojo::InterfaceEndpointClient::HandleValidatedMessage
  |    |    |    |    |    |    chrome.dll!content::mojom::RenderFrameMessageFilterStub<mojo::RawPtrImplRefTraits<content::mojom::RenderFrameMessageFilter> >::AcceptWithResponder
  |    |    |    |    |    |    chrome.dll!content::mojom::RenderFrameMessageFilterStubDispatch::AcceptWithResponder
  |    |    |    |    |    |    |- chrome.dll!content::RenderFrameMessageFilter::SetCookie
  |    |    |    |    |    |    |    |- chrome.dll!ChromeContentBrowserClient::AllowSetCookie
  |    |    |    |    |    |    |    |    chrome.dll!ChromeContentBrowserClient::OnCookieChange
  |    |    |    |    |    |    |    |    chrome.dll!content::BrowserThread::PostTask
  |    |    |    |    |    |    |    |    chrome.dll!content::`anonymous namespace'::PostTaskHelper
  |    |    |    |    |    |    |    |    chrome.dll!base::internal::MessageLoopTaskRunner::PostDelayedTask
  |    |    |    |    |    |    |    |    chrome.dll!base::internal::MessageLoopTaskRunner::AddToIncomingQueue
  |    |    |    |    |    |    |    |    chrome.dll!base::MessageLoop::Controller::DidQueueTask
  |    |    |    |    |    |    |    |    chrome.dll!base::MessagePumpForUI::ScheduleWork
  |    |    |    |    |    |    |    |    user32.dll!PostMessageW


That is, a mojo message was received and processed and that triggered a call to SetCookie and on to PostMessageW, thus telling the main thread to wake up and do cookie work.

Given that all of the renderer processes combined woke up the browser process fewer than 12,000 times it seems impossible that these cookies could be coming from a renderer process, even one that was hopelessly out of control. Therefore it seems necessary that the same IPC message is being processed multiple times, or ???

TL;DR - It seems like a problem that we are calling SetCookie more than 1,500 times per second.

 
Cc: -costan@google.com pwnall@chromium.org
Cc: jam@chromium.org cduvall@chromium.org
Per the thread, we're seeing this in 70, correct?

No concrete ideas, but looping in a few people that touched code on the stack in the 70.

Do we know if the Network Service was enabled or not?
Correct, this was M70.

See possibly related bug crbug.com/809199 - maybe this bug is a duplicate and therefore fixed, but ??? (just making sure they are explicitly linked)

@jsbell: network service isn't enabled in 70 stable.
@brucedawson: Any information relevant to the identity of thread 18528? 

@jam: since you're in the neighborhood, could you comment on if the calls could be coming from the renderer given the wake up pattern described at the end of original report?

@morlovich: these are all IPCs from the renderer, RenderFrameMessageFilter::SetCookie is the dispatching method for that mojo call.
That was my understanding as well, but brucedawson@'s interpretation of the ETW data in the original report is that the renderer isn't waking up enough to produce that many IPCs, so it must be something within the browser process. I don't know enough about how Mojo works internally (could calls get batched up somehow while waiting for some thread to wake up or for some buffers to free up or something?) or of what exactly ETW is reporting there to evaluate whether that's the case or not, but you probably do.



The bug is reproducing on sorin@'s machine again now - enough to make it practical to run diagnostics. What data should be recorded? From the email thread I have:

Cookie.CommitProblem, Cookie.LoadProblem, and Sqlite.Error.Cookie from
chrome://histograms might provide some leads (or might not...)

I also looked at Cookie.Count, Cookie.DBSizeInKB, Cookie.TimeLoad, Cookie.NumberOfLoadedCookies, and Cookie.BackingStoreUpdateResults and shared results from my non-misbehaving machine to see if any anomalies jump out.

I assume that a chrome://tracing trace would be helpful - any requests for particular flags?

From the misbehaving Chrome instance:
* I don't have any record of Cookie.CommitProblem, Cookie.LoadProblem, and Sqlite.Error.Cookie
* The histograms referred by #8 are below:

Histogram: Cookie.Count recorded 551 samples, mean = 2366.6 (flags = 0x41)
0     ------------------------O                                                 (63 = 11.4%)
1     ------------------------------------------------------------------------O (187 = 33.9%) {11.4%}
2     ... 
3435  -O                                                                        (19 = 3.4%) {45.4%}
4000  ----------------------O                                                   (282 = 51.2%) {48.8%}

Histogram: Cookie.DBSizeInKB recorded 2 samples, mean = 1700.5 (flags = 0x41)
0     ... 
9     ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
12    ... 
3077  -------------------------------------------O                              (1 = 50.0%) {50.0%}
4053  ... 

Histogram: Cookie.TimeLoad recorded 2 samples, mean = 1520.5 (flags = 0x41)
0     ... 
40    ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
50    ... 
2964  ------------------------------------------------------------------------O (1 = 50.0%) {50.0%}
3674  ... 


Histogram: Cookie.NumberOfLoadedCookies recorded 2 samples, mean = 1946.5 (flags = 0x41)
0     O                                                                         (0 = 0.0%)
1     ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
2     ... 
3553  --------------O                                                           (1 = 50.0%) {50.0%}
4222  ... 

Histogram: Cookie.BackingStoreUpdateResults recorded 25384 samples, mean = 0.0 (flags = 0x41)
0  ------------------------------------------------------------------------O (25384 = 100.0%)
1  ... 

My histograms (details in the email) have generally lower numbers but I don't see anything dramatically (orders of magnitude) different.
Yeah, the numbers in comment #9 pretty much look like the backend/store is working properly and writing things out just fine, so it sure looks like it's just not keeping up. 
SQLite.CommitTime.Cookie might also say something, though it got through 25,000+ commits, so it's probably not taking half an hour each time to do so.

(No suggestions on tracing besides whatever gets you threads + tasks --- I don't really use tracing much, so can't get most possible out of it. I think there was something that recorded IPCs edges, even?)

@morlovich: re comment 7, the mojo call is synchronous [1] from the renderer, so it wouldn't be possible that the renderer makes 10 requests, goes to sleep, and then the browser wakes up 10 different times.

One histogram to get the number of times that Blink called SetCookie is Blink.CookieJar.SyncCookiesSetTime

[1] https://cs.chromium.org/chromium/src/content/common/render_frame_message_filter.mojom?rcl=d1f4835cdca00550a19c58bc937111c7eab77675&l=14
@jam - the fundamental mystery, then, is how the browser can call SetCookie 242,000+ times when there is no sign of there being more than 12,000 IPCs going from all renderers combined to the browser.

It looks to me like the busiest renderer process calls chrome_child.dll!IPC::ChannelProxy::Context::OnSendMessage just 1,450 times during the trace.

This is based on context switch data and there are a few ways that could be misleading, but the discrepancy is so huge that I tend to trust it.

If true then it means that the browser process is somehow getting into a loop where it reprocesses the same message. Perhaps a failure triggers a retry? I don't know - I can't really speculate about causes, I just have what seems like evidence that the browser process is stuck in a loop.

I don't know how (reliable) the data above is, so I can't comment on that. I would suggest looking at the uma i quoted above, as that would tell you how many times the renderer called it. If you see a mismatch between that and the callstack in the first comment, let me know.


Bruce suggested that I shared the following histogram:

Histogram: Blink.CookieJar.SyncCookiesTime recorded 98759 samples, mean = 813.9 (flags = 0x41)
0       ... 
76      O                                                                         (2 = 0.0%) {0.0%}
105     --O                                                                       (670 = 0.7%) {0.0%}
146     -----------O                                                              (4645 = 4.7%) {0.7%}
203     ---------------O                                                          (6447 = 6.5%) {5.4%}
282     -------------------------O                                                (10914 = 11.1%) {11.9%}
391     ------------------------------------------------------------------------O (31686 = 32.1%) {23.0%}
542     ---------------------------------------------O                            (19599 = 19.8%) {55.0%}
752     ----------------------------O                                             (12511 = 12.7%) {74.9%}
1043    ---------------O                                                          (6383 = 6.5%) {87.6%}
1447    -------O                                                                  (3195 = 3.2%) {94.0%}
2008    ----O                                                                     (1716 = 1.7%) {97.3%}
2786    -O                                                                        (472 = 0.5%) {99.0%}
3865    O                                                                         (163 = 0.2%) {99.5%}
5362    O                                                                         (79 = 0.1%) {99.6%}
7439    O                                                                         (87 = 0.1%) {99.7%}
10321   O                                                                         (36 = 0.0%) {99.8%}
14319   O                                                                         (20 = 0.0%) {99.8%}
19866   O                                                                         (29 = 0.0%) {99.9%}
27562   O                                                                         (47 = 0.0%) {99.9%}
38240   O                                                                         (11 = 0.0%) {99.9%}
53055   O                                                                         (6 = 0.0%) {100.0%}
73609   O                                                                         (4 = 0.0%) {100.0%}
102126  O                                                                         (7 = 0.0%) {100.0%}
141691  O                                                                         (5 = 0.0%) {100.0%}
196583  O                                                                         (8 = 0.0%) {100.0%}
272741  O                                                                         (6 = 0.0%) {100.0%}
378403  O                                                                         (7 = 0.0%) {100.0%}
525000  O                                                                         (4 = 0.0%) {100.0%}
728390  ... 

And the browser start time at: 11/21/18, 11:26:00AM. The histogram above was captured at 15:00.
So, that's a steady 7.7 cookie syncs per second for ~3.5 hours. That's high, but nowhere near high enough to account for the ~1,500 content::RenderFrameMessageFilter::SetCookie calls per second that the initial ETW trace suggested.

It would be good to get another ETW trace (with chrome://histograms data from before and after the trace) to correlate the different sources of data. And a chrome://tracing trace.

Sorin's browser has been up for ~6 hours and here's the latest histogram data (note that this is a different counter from what was posted in comment #15 - this is the one that jam@ requested):

Histogram: Blink.CookieJar.SyncCookiesSetTime recorded 2057827 samples, mean = 8.8 (flags = 0x41)

2057827 samples in six hours seems like a lot - that's ~95/s.

Sorin shared an ETW trace which seems to show 94 calls to RenderFrameMessageFilter::SetCookie which matches well with the 118 change in the number of samples (he grabbed the histogram before and after - a perfect match would be improbable but that is very close) in the ~30 s time span covered.

So, ~3-4 cookie sets per second during the trace, 95/s over the last six hours, FWIW. I guess the cookie setting rate is highly variable and we caught it during a slow period.

And with that, I'm off to eat turkey.

ok, so I have this happening again in 70.0.3538.102 (Official Build) (64-bit).
The browser was mostly idle overnight but with several tabs open. When I took a peek at it in the morning, the lag was back, and the memory usage in the browser process was ~17GB. After being idle for 4 more hours, the memory usage climbed up to 37GB.

Browser process start time 11/27/18, 7:42 AM
Memory footprint: 37GB
Context switches: 211444678

I have a renderer process with 130mil context switches and the following histogram:

Histogram: Blink.CookieJar.SyncCookiesSetTime recorded 121104396 samples, mean = 6.6 (flags = 0x41)
0      ... 
3      O                                                                         (80565 = 0.1%) {0.0%}
4      ----------O                                                               (12577001 = 10.4%) {0.1%}
6      ------------------------------------------------------------------------O (92919808 = 76.7%) {10.5%}
8      -------O                                                                  (13059252 = 10.8%) {87.2%}
11     O                                                                         (1105284 = 0.9%) {98.0%}
15     O                                                                         (966323 = 0.8%) {98.9%}
21     O                                                                         (215561 = 0.2%) {99.7%}
29     O                                                                         (111182 = 0.1%) {99.9%}
40     O                                                                         (33074 = 0.0%) {99.9%}
55     O                                                                         (14120 = 0.0%) {100.0%}
76     O                                                                         (15046 = 0.0%) {100.0%}
105    O                                                                         (5144 = 0.0%) {100.0%}
146    O                                                                         (438 = 0.0%) {100.0%}
203    O                                                                         (1343 = 0.0%) {100.0%}
282    O                                                                         (206 = 0.0%) {100.0%}
391    O                                                                         (14 = 0.0%) {100.0%}
542    O                                                                         (10 = 0.0%) {100.0%}
752    O                                                                         (1 = 0.0%) {100.0%}
1043   O                                                                         (3 = 0.0%) {100.0%}
1447   O                                                                         (2 = 0.0%) {100.0%}
2008   O                                                                         (2 = 0.0%) {100.0%}
2786   O                                                                         (4 = 0.0%) {100.0%}
3865   O                                                                         (3 = 0.0%) {100.0%}
5362   O                                                                         (2 = 0.0%) {100.0%}
7439   O                                                                         (3 = 0.0%) {100.0%}
10321  O                                                                         (2 = 0.0%) {100.0%}
14319  O                                                                         (2 = 0.0%) {100.0%}
19866  O                                                                         (1 = 0.0%) {100.0%}
27562  ... 

Sign in to add a comment