Excessive SetCookie activity, leading to lots of CPU time and GB of RAM consumed |
||
Issue descriptionExcessive memory usage in Chrome's browser process was noticed, together with poor performance. A 157.5 s ETW trace showed that the CPU usage (and presumably also the memory) was associated with hundreds of thousands of calls to content::RenderFrameMessageFilter::SetCookie. The hundreds of thousands of calls were inferred because there were 242,123 context switches where the browser process' main thread was woken up by one of the browser-process threads (18528) from this stack: |- ntdll.dll!RtlUserThreadStart | kernel32.dll!BaseThreadInitThunk | chrome.dll!base::`anonymous namespace'::ThreadFunc | chrome.dll!base::Thread::ThreadMain | chrome.dll!content::BrowserProcessSubThread::IOThreadRun | chrome.dll!base::RunLoop::Run | chrome.dll!base::MessagePumpWin::Run | chrome.dll!base::MessagePumpForIO::DoRunLoop | |- chrome.dll!base::MessageLoop::DoWork | | chrome.dll!base::MessageLoop::RunTask | | chrome.dll!base::debug::TaskAnnotator::RunTask | | |- chrome.dll!mojo::SimpleWatcher::OnHandleReady | | | |- chrome.dll!mojo::Connector::ReadAllAvailableMessages | | | | chrome.dll!mojo::Connector::ReadSingleMessage | | | | |- chrome.dll!IPC::`anonymous namespace'::ChannelAssociatedGroupController::Accept | | | | | |- chrome.dll!mojo::InterfaceEndpointClient::HandleValidatedMessage | | | | | | chrome.dll!content::mojom::RenderFrameMessageFilterStub<mojo::RawPtrImplRefTraits<content::mojom::RenderFrameMessageFilter> >::AcceptWithResponder | | | | | | chrome.dll!content::mojom::RenderFrameMessageFilterStubDispatch::AcceptWithResponder | | | | | | |- chrome.dll!content::RenderFrameMessageFilter::SetCookie | | | | | | | |- chrome.dll!ChromeContentBrowserClient::AllowSetCookie | | | | | | | | chrome.dll!ChromeContentBrowserClient::OnCookieChange | | | | | | | | chrome.dll!content::BrowserThread::PostTask | | | | | | | | chrome.dll!content::`anonymous namespace'::PostTaskHelper | | | | | | | | chrome.dll!base::internal::MessageLoopTaskRunner::PostDelayedTask | | | | | | | | chrome.dll!base::internal::MessageLoopTaskRunner::AddToIncomingQueue | | | | | | | | chrome.dll!base::MessageLoop::Controller::DidQueueTask | | | | | | | | chrome.dll!base::MessagePumpForUI::ScheduleWork | | | | | | | | user32.dll!PostMessageW That is, a mojo message was received and processed and that triggered a call to SetCookie and on to PostMessageW, thus telling the main thread to wake up and do cookie work. Given that all of the renderer processes combined woke up the browser process fewer than 12,000 times it seems impossible that these cookies could be coming from a renderer process, even one that was hopelessly out of control. Therefore it seems necessary that the same IPC message is being processed multiple times, or ??? TL;DR - It seems like a problem that we are calling SetCookie more than 1,500 times per second.
,
Nov 20
Per the thread, we're seeing this in 70, correct? No concrete ideas, but looping in a few people that touched code on the stack in the 70. Do we know if the Network Service was enabled or not?
,
Nov 20
Correct, this was M70. See possibly related bug crbug.com/809199 - maybe this bug is a duplicate and therefore fixed, but ??? (just making sure they are explicitly linked)
,
Nov 21
@jsbell: network service isn't enabled in 70 stable.
,
Nov 21
@brucedawson: Any information relevant to the identity of thread 18528? @jam: since you're in the neighborhood, could you comment on if the calls could be coming from the renderer given the wake up pattern described at the end of original report?
,
Nov 21
@morlovich: these are all IPCs from the renderer, RenderFrameMessageFilter::SetCookie is the dispatching method for that mojo call.
,
Nov 21
That was my understanding as well, but brucedawson@'s interpretation of the ETW data in the original report is that the renderer isn't waking up enough to produce that many IPCs, so it must be something within the browser process. I don't know enough about how Mojo works internally (could calls get batched up somehow while waiting for some thread to wake up or for some buffers to free up or something?) or of what exactly ETW is reporting there to evaluate whether that's the case or not, but you probably do.
,
Nov 21
The bug is reproducing on sorin@'s machine again now - enough to make it practical to run diagnostics. What data should be recorded? From the email thread I have: Cookie.CommitProblem, Cookie.LoadProblem, and Sqlite.Error.Cookie from chrome://histograms might provide some leads (or might not...) I also looked at Cookie.Count, Cookie.DBSizeInKB, Cookie.TimeLoad, Cookie.NumberOfLoadedCookies, and Cookie.BackingStoreUpdateResults and shared results from my non-misbehaving machine to see if any anomalies jump out. I assume that a chrome://tracing trace would be helpful - any requests for particular flags?
,
Nov 21
From the misbehaving Chrome instance:
* I don't have any record of Cookie.CommitProblem, Cookie.LoadProblem, and Sqlite.Error.Cookie
* The histograms referred by #8 are below:
Histogram: Cookie.Count recorded 551 samples, mean = 2366.6 (flags = 0x41)
0 ------------------------O (63 = 11.4%)
1 ------------------------------------------------------------------------O (187 = 33.9%) {11.4%}
2 ...
3435 -O (19 = 3.4%) {45.4%}
4000 ----------------------O (282 = 51.2%) {48.8%}
Histogram: Cookie.DBSizeInKB recorded 2 samples, mean = 1700.5 (flags = 0x41)
0 ...
9 ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
12 ...
3077 -------------------------------------------O (1 = 50.0%) {50.0%}
4053 ...
Histogram: Cookie.TimeLoad recorded 2 samples, mean = 1520.5 (flags = 0x41)
0 ...
40 ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
50 ...
2964 ------------------------------------------------------------------------O (1 = 50.0%) {50.0%}
3674 ...
Histogram: Cookie.NumberOfLoadedCookies recorded 2 samples, mean = 1946.5 (flags = 0x41)
0 O (0 = 0.0%)
1 ------------------------------------------------------------------------O (1 = 50.0%) {0.0%}
2 ...
3553 --------------O (1 = 50.0%) {50.0%}
4222 ...
Histogram: Cookie.BackingStoreUpdateResults recorded 25384 samples, mean = 0.0 (flags = 0x41)
0 ------------------------------------------------------------------------O (25384 = 100.0%)
1 ...
,
Nov 21
My histograms (details in the email) have generally lower numbers but I don't see anything dramatically (orders of magnitude) different.
,
Nov 21
Yeah, the numbers in comment #9 pretty much look like the backend/store is working properly and writing things out just fine, so it sure looks like it's just not keeping up. SQLite.CommitTime.Cookie might also say something, though it got through 25,000+ commits, so it's probably not taking half an hour each time to do so. (No suggestions on tracing besides whatever gets you threads + tasks --- I don't really use tracing much, so can't get most possible out of it. I think there was something that recorded IPCs edges, even?)
,
Nov 21
@morlovich: re comment 7, the mojo call is synchronous [1] from the renderer, so it wouldn't be possible that the renderer makes 10 requests, goes to sleep, and then the browser wakes up 10 different times. One histogram to get the number of times that Blink called SetCookie is Blink.CookieJar.SyncCookiesSetTime [1] https://cs.chromium.org/chromium/src/content/common/render_frame_message_filter.mojom?rcl=d1f4835cdca00550a19c58bc937111c7eab77675&l=14
,
Nov 21
@jam - the fundamental mystery, then, is how the browser can call SetCookie 242,000+ times when there is no sign of there being more than 12,000 IPCs going from all renderers combined to the browser. It looks to me like the busiest renderer process calls chrome_child.dll!IPC::ChannelProxy::Context::OnSendMessage just 1,450 times during the trace. This is based on context switch data and there are a few ways that could be misleading, but the discrepancy is so huge that I tend to trust it. If true then it means that the browser process is somehow getting into a loop where it reprocesses the same message. Perhaps a failure triggers a retry? I don't know - I can't really speculate about causes, I just have what seems like evidence that the browser process is stuck in a loop.
,
Nov 21
I don't know how (reliable) the data above is, so I can't comment on that. I would suggest looking at the uma i quoted above, as that would tell you how many times the renderer called it. If you see a mismatch between that and the callstack in the first comment, let me know.
,
Nov 21
Bruce suggested that I shared the following histogram:
Histogram: Blink.CookieJar.SyncCookiesTime recorded 98759 samples, mean = 813.9 (flags = 0x41)
0 ...
76 O (2 = 0.0%) {0.0%}
105 --O (670 = 0.7%) {0.0%}
146 -----------O (4645 = 4.7%) {0.7%}
203 ---------------O (6447 = 6.5%) {5.4%}
282 -------------------------O (10914 = 11.1%) {11.9%}
391 ------------------------------------------------------------------------O (31686 = 32.1%) {23.0%}
542 ---------------------------------------------O (19599 = 19.8%) {55.0%}
752 ----------------------------O (12511 = 12.7%) {74.9%}
1043 ---------------O (6383 = 6.5%) {87.6%}
1447 -------O (3195 = 3.2%) {94.0%}
2008 ----O (1716 = 1.7%) {97.3%}
2786 -O (472 = 0.5%) {99.0%}
3865 O (163 = 0.2%) {99.5%}
5362 O (79 = 0.1%) {99.6%}
7439 O (87 = 0.1%) {99.7%}
10321 O (36 = 0.0%) {99.8%}
14319 O (20 = 0.0%) {99.8%}
19866 O (29 = 0.0%) {99.9%}
27562 O (47 = 0.0%) {99.9%}
38240 O (11 = 0.0%) {99.9%}
53055 O (6 = 0.0%) {100.0%}
73609 O (4 = 0.0%) {100.0%}
102126 O (7 = 0.0%) {100.0%}
141691 O (5 = 0.0%) {100.0%}
196583 O (8 = 0.0%) {100.0%}
272741 O (6 = 0.0%) {100.0%}
378403 O (7 = 0.0%) {100.0%}
525000 O (4 = 0.0%) {100.0%}
728390 ...
And the browser start time at: 11/21/18, 11:26:00AM. The histogram above was captured at 15:00.
,
Nov 21
So, that's a steady 7.7 cookie syncs per second for ~3.5 hours. That's high, but nowhere near high enough to account for the ~1,500 content::RenderFrameMessageFilter::SetCookie calls per second that the initial ETW trace suggested. It would be good to get another ETW trace (with chrome://histograms data from before and after the trace) to correlate the different sources of data. And a chrome://tracing trace.
,
Nov 22
Sorin's browser has been up for ~6 hours and here's the latest histogram data (note that this is a different counter from what was posted in comment #15 - this is the one that jam@ requested): Histogram: Blink.CookieJar.SyncCookiesSetTime recorded 2057827 samples, mean = 8.8 (flags = 0x41) 2057827 samples in six hours seems like a lot - that's ~95/s. Sorin shared an ETW trace which seems to show 94 calls to RenderFrameMessageFilter::SetCookie which matches well with the 118 change in the number of samples (he grabbed the histogram before and after - a perfect match would be improbable but that is very close) in the ~30 s time span covered. So, ~3-4 cookie sets per second during the trace, 95/s over the last six hours, FWIW. I guess the cookie setting rate is highly variable and we caught it during a slow period. And with that, I'm off to eat turkey.
,
Nov 28
ok, so I have this happening again in 70.0.3538.102 (Official Build) (64-bit).
The browser was mostly idle overnight but with several tabs open. When I took a peek at it in the morning, the lag was back, and the memory usage in the browser process was ~17GB. After being idle for 4 more hours, the memory usage climbed up to 37GB.
Browser process start time 11/27/18, 7:42 AM
Memory footprint: 37GB
Context switches: 211444678
I have a renderer process with 130mil context switches and the following histogram:
Histogram: Blink.CookieJar.SyncCookiesSetTime recorded 121104396 samples, mean = 6.6 (flags = 0x41)
0 ...
3 O (80565 = 0.1%) {0.0%}
4 ----------O (12577001 = 10.4%) {0.1%}
6 ------------------------------------------------------------------------O (92919808 = 76.7%) {10.5%}
8 -------O (13059252 = 10.8%) {87.2%}
11 O (1105284 = 0.9%) {98.0%}
15 O (966323 = 0.8%) {98.9%}
21 O (215561 = 0.2%) {99.7%}
29 O (111182 = 0.1%) {99.9%}
40 O (33074 = 0.0%) {99.9%}
55 O (14120 = 0.0%) {100.0%}
76 O (15046 = 0.0%) {100.0%}
105 O (5144 = 0.0%) {100.0%}
146 O (438 = 0.0%) {100.0%}
203 O (1343 = 0.0%) {100.0%}
282 O (206 = 0.0%) {100.0%}
391 O (14 = 0.0%) {100.0%}
542 O (10 = 0.0%) {100.0%}
752 O (1 = 0.0%) {100.0%}
1043 O (3 = 0.0%) {100.0%}
1447 O (2 = 0.0%) {100.0%}
2008 O (2 = 0.0%) {100.0%}
2786 O (4 = 0.0%) {100.0%}
3865 O (3 = 0.0%) {100.0%}
5362 O (2 = 0.0%) {100.0%}
7439 O (3 = 0.0%) {100.0%}
10321 O (2 = 0.0%) {100.0%}
14319 O (2 = 0.0%) {100.0%}
19866 O (1 = 0.0%) {100.0%}
27562 ...
|
||
►
Sign in to add a comment |
||
Comment 1 by pwnall@chromium.org
, Nov 20