
Issue 716187

Starred by 4 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug

Blocked on:
issue 653394
issue 684640
issue 918987




Too many ConnectJob::OnTimeout tasks in delayed queue

Project Member Reported by ssid@chromium.org, Apr 27 2017

Issue description

When loading some websites in telemetry, there was 300KB of memory usage from the delayed_work_queue_ having too many tasks on the browser IO thread.

When the queue hit 1000 tasks, they had been posted from:

Main:
StartBackupJobTimer :   net/socket/client_socket_pool_base.cc : 5
ResetTimer :   net/socket/client_socket_pool_base.cc : 70
Connect :   net/socket/client_socket_pool_base.cc : 894

Other:
ScheduleBrokenAlternateProtocolMappingsExpiration :   net/http/http_server_properties_impl.cc : 1
RunCompleted :   content/browser/appcache/appcache_storage_impl.cc : 1
RecordAudioThreadStatus :   media/audio/audio_manager.cc : 1
WillStartRequest :   content/browser/loader/power_save_block_resource_throttle.cc : 2
Detach :   content/browser/loader/detachable_resource_handler.cc : 2
ScheduleUpdatePrefsOnNetworkThread :   net/http/http_server_properties_manager.cc : 1
CheckUrl :   components/safe_browsing/base_resource_throttle.cc : 1
PostponeWritingToDisk :   net/disk_cache/simple/simple_index.cc : 1
RecomputeOutstanding :   net/base/network_throttle_manager_impl.cc : 1


Can we avoid posting this many tasks?
 
Components: Internals>Network
Thanks, Sid!

I think 894 Backup Jobs sound excessive.
We should only start one backup job per socket group (scheme/host/port triplet), if the first connection is taking longer than ClientSocketPool::kMaxConnectRetryIntervalMs (250 msec).
ClientSocketPoolBaseHelper::Group::StartBackupJobTimer() is called from ClientSocketPoolBaseHelper::RequestSocketInternal() when connect_backup_jobs_enabled_ && group->IsEmpty().
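Roughly, paraphrasing that call site from the description above (not quoting the tree, and the exact arguments are elided):

// Paraphrased sketch: the backup-job timer is only armed for a group that
// currently has no connect jobs or sockets.
if (connect_backup_jobs_enabled_ && group->IsEmpty()) {
  // Fires after ClientSocketPool::kMaxConnectRetryIntervalMs (250 ms) if the
  // first connect attempt hasn't completed by then.
  group->StartBackupJobTimer(/* ... */);
}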

I doubt there are ~1k distinct origins in the telemetry test, so it's possible that there's something going on. 

Could you point me to the telemetry test?

Comment 2 by ssid@chromium.org, Apr 27 2017

This happens on multiple stories that I ran.
You could try:
tools/perf/run_benchmark system_health.memory_mobile --story-filter=browser:news

One of these stories should trigger the issue. I added CHECK(delayed_work_queue_.size() < 1000) and got it crashing in half the runs.

Comment 3 by ssid@chromium.org, Apr 27 2017

I think there are 940 ConnectJob::OnTimeout tasks, and 5 OnBackupJobTimerFired.

Comment 4 by ssid@chromium.org, Apr 27 2017

Summary: Too many ConnectJob::OnTimeout tasks in delayed queue (was: Too many Group::OnBackupJobTimerFired and ConnectJob::OnTimeout tasks in delayed queue)
Status: Assigned (was: Untriaged)
Ah, that makes more sense. I think we have at least one ConnectJob::OnTimeout per socket connect, because we want to terminate the job when the timeout happens.

I will look into it.

Comment 6 by ssid@chromium.org, Apr 27 2017

Attached trace file showing memory usage from delayed_work_queue_.
browse_news_cricbuzz22017-04-13_00-22-25.html
4.2 MB

Comment 7 by ssid@chromium.org, Apr 28 2017

Okay, I am able to repro this consistently:
I added CHECK(delayed_work_queue_.size() < 1000) in MessageLoop::AddToDelayedWorkQueue().
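Roughly, the instrumentation looked like this (the exact signature of MessageLoop::AddToDelayedWorkQueue() here is approximate, from memory):

void MessageLoop::AddToDelayedWorkQueue(PendingTask pending_task) {
  // Local-only diagnostic: crash once the backlog of not-yet-due delayed
  // tasks grows past 1000, so the repro shows up as a browser crash.
  CHECK(delayed_work_queue_.size() < 1000u);
  delayed_work_queue_.push(std::move(pending_task));
}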

I run:
tools/perf/run_benchmark system_health.memory_mobile --browser=exact --browser-executable=out_android_native/Release/apks/Chrome.apk --pageset-repeat=1 --use-live-sites --story-filter=browse:news:cricbuzz

The browser crashes every time.
There's a netlog dump. I don't see anything obvious at a glance. We are getting many ConnectJobs to various origins. When using "--use-live-sites", requests are still going through the SOCKS proxy, which is surprising.


netlog_file_live.json
12.4 MB
I annotated the NetLog with CONNECT_JOB_ON_TIMEOUT when we posted an OnTimeout delayed task, and CONNECT_JOB_ON_TIMEOUT_RESET when we reset the timer and posted another OnTimeout task. Everything looks normal.

When a base::OneShotTimer goes out of scope (when the ConnectJobs are gone), will those tasks remain in the task queue until they are processed?
netlog_file_live_with_extra.json
15.5 MB
It could be that, because our timeouts are on the order of seconds, we have a large backlog. I'm not sure exactly how the delayed tasks are processed; maybe we could improve the scheduling logic to detect canceled tasks in a more timely manner and remove them from the queue.
Since tasks aren't actually cancellable, a great many things in the backlog are probably for ConnectJobs that have already completed connecting.
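To make the lifetime issue concrete, here is a minimal standalone C++ model (not Chromium code) of that behavior: destroying the object that posted a delayed task does not remove the task from a non-cancellable delayed queue.

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct DelayedTask {
  long run_at_ms;
  std::function<void()> task;  // Not cancellable once enqueued.
};

struct EarliestFirst {
  bool operator()(const DelayedTask& a, const DelayedTask& b) const {
    return a.run_at_ms > b.run_at_ms;  // Min-heap on run time.
  }
};

using DelayedWorkQueue =
    std::priority_queue<DelayedTask, std::vector<DelayedTask>, EarliestFirst>;

class OneShotTimerModel {
 public:
  OneShotTimerModel(DelayedWorkQueue* queue, long now_ms, long delay_ms) {
    // Post the "timeout" task; the queue entry outlives this object.
    queue->push({now_ms + delay_ms, [] { std::puts("timeout fired"); }});
  }
  // Destroying the timer does nothing to the queued entry in this model.
};

int main() {
  DelayedWorkQueue queue;
  for (int i = 0; i < 1000; ++i) {
    OneShotTimerModel timer(&queue, /*now_ms=*/0, /*delay_ms=*/240000);
    // |timer| is destroyed here (think: connect succeeded), but its task
    // stays queued until its run time is reached.
  }
  std::printf("queued delayed tasks: %zu\n", queue.size());  // 1000
  return 0;
}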

Comment 12 by ssid@chromium.org, Apr 28 2017

Cc: mmenke@chromium.org
So IIUC, we have thousands of connection requests from the website, and we have 1-second-delayed connect tasks queued up on the task queue. What would happen if we reduced the delay so that the queue is smaller? Or could we keep a list of jobs and post a single task for all of them together?

Comment 13 by ssid@chromium.org, Apr 28 2017

Cc: fdoray@chromium.org tzik@chromium.org
So, to fix this issue we need a notion of invalid tasks in the delayed task queue, and the task queue should support dropping invalid tasks.
With shorter timers, we'd have more backup connect jobs.  With just one timer (or timers shared by a group of jobs), accounting becomes more complicated - we'd either have to go through every single group looking for the ones that have been stalled on a single socket with an active request, or we'd need to keep track of all such groups and just have a set of them around.  If we don't update that set on job removal...seems like we'd potentially just be moving where we accumulate cruft.  If we do remove them, it's yet more explicit bookkeeping on the many paths that lead to cancelling a request...Hrm.

These are certainly solvable problems, but I'm wondering if we could make the task queue provide some way to better support this behavior (which may require active consumer participation), because I wouldn't be surprised if other things have similar behavior, if just to a lesser extent.
It's frustrating, in that the interface to timers is fine--in theory, it should be possible for the task to be cancelled if the timer is destroyed.  However, the PostTask implementation doesn't have support for that.  You could imagine timers being implemented by coordination on a central object that was notified when a timer was destroyed and had only zero or one task outstanding at any given time--that would be pretty efficient in memory without requiring any change in the network code.  It would require a pretty major rewrite of timers, though.
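A rough standalone sketch of that central-object idea (illustration only, not an existing //base class):

#include <functional>
#include <map>

class TimerCoordinator {
 public:
  using TimerId = int;

  TimerId Schedule(long run_at_ms, std::function<void()> callback) {
    TimerId id = next_id_++;
    timers_[id] = {run_at_ms, std::move(callback)};
    // A real implementation would (re)post a single delayed wake-up task for
    // the earliest |run_at_ms| here; at most one such task is ever pending.
    return id;
  }

  // Called from a timer's destructor: the entry disappears immediately, so no
  // per-timer task can linger in any queue.
  void Cancel(TimerId id) { timers_.erase(id); }

  // Body of the single outstanding wake-up task.
  void OnWakeUp(long now_ms) {
    for (auto it = timers_.begin(); it != timers_.end();) {
      if (it->second.run_at_ms <= now_ms) {
        it->second.callback();
        it = timers_.erase(it);
      } else {
        ++it;
      }
    }
    // ...then repost the wake-up for the next-earliest deadline, if any.
  }

 private:
  struct Entry {
    long run_at_ms;
    std::function<void()> callback;
  };
  TimerId next_id_ = 0;
  std::map<TimerId, Entry> timers_;
};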


One more wrinkle: with shorter timers, we'd start more backup connect jobs, particularly on higher-latency connections - though I guess that's just for the backup timers, and this issue covers timeout timers as well. And with just one timer (per pool, or whatever), or timers shared by a group of jobs, accounting gets more complicated.
Blockedon: 653394
Cc: dcheng@chromium.org
Components: Internals>TaskScheduler
scheduler-dev@ thread:
https://groups.google.com/a/chromium.org/d/msg/scheduler-dev/HLuCdnfy4Fs/vBiH4JIQAwAJ

Unless someone has an idea for a short-term fix, let's wait until cancellation is built into base::Timer/MessageLoop/TaskScheduler.

Comment 18 by gab@chromium.org, May 11 2017

Yes, I think base::Timer/MessageLoop/TaskScheduler need to support this. Good news is it's almost there.

@dcheng: I think you had a pending CL to clear cancelled tasks in MessageLoop? I think you were waiting to see if it was worthwhile from a power point of view (avoiding unnecessary wakeups), but it looks like it's clearly needed for memory.

I can take care of making sure base::Timer cancels its internal delayed task (uses WeakPtr).

Those 2 changes should fix the issue.

@net folks, just to make sure: the base::Timers leaving stray tasks have been destroyed, not just stopped, right? (The current impl intentionally leaves delayed tasks for stopped timers in place, in case the timer is restarted, to avoid a pile of tasks after quick start/stop -- though I guess that also wouldn't be a problem anymore once cleanup is supported.)
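A hedged sketch of the WeakPtr approach mentioned above (//base header paths and the task-runner accessor are assumed from a recent checkout; CancellableOneShot is just an illustrative name, not a real class):

#include <utility>

#include "base/functional/bind.h"
#include "base/functional/callback.h"
#include "base/location.h"
#include "base/memory/weak_ptr.h"
#include "base/task/sequenced_task_runner.h"
#include "base/time/time.h"

class CancellableOneShot {
 public:
  void Start(base::TimeDelta delay, base::OnceClosure user_task) {
    user_task_ = std::move(user_task);
    // The pending task only holds a WeakPtr. If |this| is destroyed before
    // |delay| elapses, the WeakPtr is invalidated: the task becomes a no-op,
    // and a queue that checks for cancelled tasks can drop it early.
    base::SequencedTaskRunner::GetCurrentDefault()->PostDelayedTask(
        FROM_HERE,
        base::BindOnce(&CancellableOneShot::Fire, weak_factory_.GetWeakPtr()),
        delay);
  }

 private:
  void Fire() { std::move(user_task_).Run(); }

  base::OnceClosure user_task_;
  base::WeakPtrFactory<CancellableOneShot> weak_factory_{this};
};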

Comment 19 by gab@chromium.org, May 11 2017

Cc: gab@chromium.org
> @net folks, just to make sure: the base::Timers leaving stray tasks have been destroyed, not just stopped, right? 

The base::OneShotTimers (which posted those tasks) have been destroyed. We would like to remove those invalid tasks from |delayed_work_queue_| as well.


net/socket/client_socket_pool_base.cc:

int ConnectJob::Connect() {
  ...
  // |timer_| is owned by |this|. |this| is destroyed very quickly, but the
  // |ConnectJob::OnTimeout| tasks accumulate in
  // MessageLoop::|delayed_work_queue_|.
  timer_.Start(FROM_HERE, timeout_duration_, this, &ConnectJob::OnTimeout);
}

Yes, according to ssid@'s measurement, this has a 300KB impact on the IO thread's MessageLoop delayed_work_queue_. Once the support is in, I think it will have a sizable impact.


Comment 21 by gab@chromium.org, Jun 1 2017

+dcheng just landed https://chromium-review.googlesource.com/c/505274/ to strip cancelled tasks from the front of the queue but opted not to sweep the whole queue [1] each time.

What kind of timeouts are we talking about here? Searching around I found that these are mostly 30 seconds? I don't think it's dramatic to have dead tasks in the queue for 30 seconds -- that's essentially equivalent to a GC...

[1] https://chromium-review.googlesource.com/c/505274/7/base/message_loop/message_loop.h#368
We did talk about sweeping dead tasks more aggressively, but that has real costs.

One possibility is to refactor the callbacks to weakly own the data, so that we can drop the data without waiting for the callbacks to clear from the queue. Not sure how hard that'd be here.
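For illustration, the difference between a pending task that owns its data and one that only weakly references it (hypothetical types; //base header paths and the task-runner accessor are assumed from a recent checkout):

#include <memory>
#include <vector>

#include "base/functional/bind.h"
#include "base/location.h"
#include "base/memory/weak_ptr.h"
#include "base/task/sequenced_task_runner.h"
#include "base/time/time.h"

struct BigState {
  std::vector<char> buffers;  // Stand-in for per-connect heavy state.
};

void UseState(std::unique_ptr<BigState> state) {}

class Owner {
 public:
  void Post(base::TimeDelta delay) {
    auto runner = base::SequencedTaskRunner::GetCurrentDefault();

    // Strong ownership: the BigState stays alive inside the pending task
    // until the task runs (or is finally destroyed by the queue).
    runner->PostDelayedTask(
        FROM_HERE, base::BindOnce(&UseState, std::make_unique<BigState>()),
        delay);

    // Weak ownership: the pending task holds only a WeakPtr; |state_| is
    // freed as soon as |this| is destroyed, regardless of what is queued.
    runner->PostDelayedTask(
        FROM_HERE,
        base::BindOnce(&Owner::UseOwnState, weak_factory_.GetWeakPtr()),
        delay);
  }

 private:
  void UseOwnState() {}

  BigState state_;
  base::WeakPtrFactory<Owner> weak_factory_{this};
};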

Comment 23 Deleted

Comment 24 by gab@chromium.org, Jun 1 2017

I don't think this is worth it. This seems like a red herring, i.e. it's not real memory usage, it's < 1MB hogged for 30s. This isn't Chrome's memory problem (which the ongoing memory investigations are leading me to believe is on the order of GB over days).
I think the connection timeout is actually more like 4 minutes, rather than 30 seconds?  https://cs.chromium.org/chromium/src/net/socket/transport_client_socket_pool.cc?l=81

Comment 26 by gab@chromium.org, Jun 2 2017

Ah interesting, I'd found:

30s: https://cs.chromium.org/chromium/src/net/socket/socks_client_socket_pool.cc?rcl=6a2e2f088160fab1a354dee3d0b3bb96493d65b4&l=40

30s+: https://cs.chromium.org/chromium/src/net/socket/ssl_client_socket_pool.cc?rcl=6a2e2f088160fab1a354dee3d0b3bb96493d65b4&l=565

but digging through the other 2 callers, I do indeed find what you report:

4 minutes: https://cs.chromium.org/chromium/src/net/socket/transport_client_socket_pool.cc?rcl=0d1c49bdeaa45f0b1658f734e8a2270a05f661f9&l=81

When do these typically get cancelled? Is there some kind of signal you could provide to MessageLoop to tell it it's time to do a full pass over its delayed queue?
Note that those other timeouts are additional timeouts, on top of the timeout for just establishing a connection.  Each timeout is cancelled on connect success (at which point the timer itself is generally destroyed).  Note that each connection attempt has its own success event, so if there are 200 timeout timers, there are 200 distinct points at which we destroy each of those timers.

Comment 28 by gab@chromium.org, Jun 2 2017

I see; perhaps a solution is to coalesce timers on the same sequence (thread, in //net's case). The tasks left in the queue aren't the actual ConnectJob::OnTimeout tasks but rather base::BaseTimerTaskInternal::Run() tasks, which have very little associated data (2 raw pointers). There's no reason (other than simplicity, and potentially allowing timer coalescing at the MessageLoop level) for base::Timer to have multiple instances of base::BaseTimerTaskInternal::Run() pending in the same queue.
Cc: dskiba@chromium.org
Pfft, 894 tasks - one of our EM stories causes 3342 timer tasks to be added, pushing MessageLoop::delayed_work_queue to 288 KiB: see browser's "message_loop" in go/xscxr

All those tasks were started by ConnectJob::Connect().

Note that creating that many timers also allocates 100KiB in Timer::PostNewScheduledTask().

I hope those thousands of timers don't represent thousands of connections we're attempting? Because there is also the file descriptor exhaustion issue 733718.
There's a cap of 256 connections per socket pool (barring proxies, websockets, incognito, the system URLRequestContext, QUIC, and a couple of other cases, there's mostly only one socket pool).
Re #28: gab@, could you elaborate on how the coalescing of timers would work? Is it a change in //net or //base/timer?

I think this is still worth solving, given how much this contributes to the browser I/O thread's delayed_work_queue.

Re #30, we do not make that many connections at the same time. But our socket pools are layered. Each layer (e.g. TransportConnectJob, SocksConnectJob, SSLConnectJob) can post a ConnectJob::OnTimeout delayed task. Since there's no mechanism to clean up those dead delayed tasks (until the timeout has elapsed), we can have a large |delayed_work_queue|.
We could look into not running the timer at all layers concurrently, but then we'd still be running just as many, cumulatively, and we still couldn't destroy the posted tasks for timers we no longer cared about, so I'm not sure it would give us much.
As usual, the problem can be solved by adding one more layer of indirection: define a NetTimer, which would coalesce tasks, support cancellation, etc., and use that instead of the standard timer.
And any other modules that do the same should create their own FooTimer class?  I wonder if it would be better to solve this problem in general, rather than just for net/.

Comment 36 by gab@chromium.org, Jul 1 2017

Cc: xunji...@chromium.org
Owner: gab@chromium.org
We can solve it in base::Timer by posting a single delayed task per sequence (thread) for the soonest task, and keeping a list of pending timers on that sequence in a SequenceLocalStorageSlot in timer.cc. Since this is in timer.cc, and thus aware of timers being destroyed, cleanup is easy.

I can own that.
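A standalone model of that mechanism (not the actual timer.cc change; a thread_local registry stands in for the per-sequence SequenceLocalStorageSlot, and firing/cancellation would work like the coordinator sketched earlier in the thread):

#include <functional>
#include <limits>
#include <map>

struct PerSequenceTimers {
  struct Pending {
    long deadline_ms;
    std::function<void()> on_fire;
  };
  std::map<int, Pending> pending;  // All live timers on this sequence.
  long posted_wakeup_ms = std::numeric_limits<long>::max();
  int next_id = 0;
};

inline PerSequenceTimers& TimersForCurrentSequence() {
  // The real design would key this off a SequenceLocalStorageSlot in
  // timer.cc; thread_local is a simplification for the sketch.
  thread_local PerSequenceTimers timers;
  return timers;
}

// Called by a timer's Start(): record the deadline, and (re)post the single
// per-sequence wake-up task only if this deadline is sooner than the one
// already scheduled. N started timers => at most one pending delayed task.
inline int AddTimer(long deadline_ms, std::function<void()> on_fire) {
  PerSequenceTimers& timers = TimersForCurrentSequence();
  int id = timers.next_id++;
  timers.pending[id] = {deadline_ms, std::move(on_fire)};
  if (deadline_ms < timers.posted_wakeup_ms) {
    timers.posted_wakeup_ms = deadline_ms;
    // PostDelayedTask(RunDueTimersOnCurrentSequence, deadline_ms - now_ms)
    // would go here.
  }
  return id;
}

// Called by a timer's destructor/Stop(): since the registry lives on the same
// sequence as the timers, cleanup is immediate and no per-timer task is left
// behind in the delayed queue.
inline void RemoveTimer(int id) {
  TimersForCurrentSequence().pending.erase(id);
}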
Awesome!  Thanks, gab!

Comment 38 by gab@chromium.org, Jul 19 2017

It turned out to be a bit trickier than I expected, so I put together a design doc: https://docs.google.com/a/chromium.org/document/d/11gcfpL-orj8SWY8bUBa1_ZGe8kPLUcjtGf2_WD7nBKw/edit (go/timer-delay-coalescing).

I'm sending it to chromium-dev now and will begin resolving work items shortly, barring any comments.

Comment 39 by gab@chromium.org, Mar 13 2018

Labels: -Pri-2 M-67 Pri-1
Status: Started (was: Assigned)
Assuming same root cause as issue 786597, P1 for me in M67.

Comment 40 by gab@chromium.org, Mar 13 2018

Blockedon: 684640
Blocked on issue 684640 (WIP), as enabling cancellation of Timer tasks requires its usage to be sequence-affine (and a few tests are breaking the rules).
Blockedon: 918987
