New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 612358 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug-Regression

Blocking:
issue 612944



Sign in to add a comment

Google Docs (Sheets, etc.) take forever to load in Canary

Project Member Reported by kbr@chromium.org, May 17 2016

Issue description

Version: 52.0.2738.0 (Official Build) canary (64-bit)
OS: OS X 10.11.4

What steps will reproduce the problem?
(1) Open the following document: https://docs.google.com/document/d/1Y7wd-ZZJGn6SqZXQNUGzT0kF2jLm8klMjK35MrZBV1Y/edit (Google only)

What is the expected output?

Expect the document to load.


What do you see instead?

The page hangs for minutes, if not forever, waiting for the CSS and JS of the target to load.

This is a recent regression. I first noticed it last Friday but didn't think anything of it. Only today did I realize it's a bug in the browser. It's still happening in Canary today after updates over the weekend. The bug's probably in the networking stack or loader. Needs a bisect, which could be painful due to the need to log in to each revision. Maybe someone can verify that this happens with a publicly visible document on chromium.org.

 

Comment 1 by falken@chromium.org, May 17 2016

Cc: jkarlin@chromium.org
Components: -Blink>Network -Internals>Network Blink>ServiceWorker Blink>Storage>CacheStorage
Labels: -OS-Mac OS-All
I see this problem too (on Linux and Chrome OS). It looks like it has to do with service worker and cache storage. kbr@ can you check if you have a SW installed: from docs.google.com page, open DevTools -> Resources (or Application) -> Service Workers and see if there's an entry for https://docs.google.com/document/

I *think* this could be related to  Issue 605663 , which should be fixed in Canary. When I've seen this bug there is both an new worker stuck in INSTALLING and the activated worker running. Maybe it's some deadlock when both workers are using same Cache Storage cache?

Once the INSTALLING worker somehow successfully installed, I no longer see this bug.

Comment 2 by falken@chromium.org, May 17 2016

BTW I'm on 52.0.2729.3 dev so the fix for  Issue 605663  is in.

Comment 3 by falken@chromium.org, May 17 2016

Also, when I view Cache Storage in DevTools, it's empty. Right click -> Refresh Caches still results in emptiness.
Cc: -jkarlin@chromium.org
Owner: jkarlin@chromium.org
Status: Started (was: Untriaged)
Can reproduce. It's definitely CacheStorage. Looks like something in CacheStorageCache::Put is failing to run its callback.

Comment 5 by kbr@chromium.org, May 17 2016

Labels: ReleaseBlock-Beta M-52
Very good. Please treat this with urgency. Docs is unusable in Canary, even across browser restarts.

Cc: michaeln@chromium.org dmu...@chromium.org
Components: Blink>Storage
Narrowed it down to this:

BLobURLRequestJob::DidStart calls blob_reader_->CalculateSize() which returns IO_PENDING and BlobURLRequestJob::DidCalculateSize never gets called. Seems related to https://codereview.chromium.org/1234813004 so cc'ing dmurph@ and michaeln@.
Cc: jsb...@chromium.org
+cc jsbell

Comment 8 by dmu...@chromium.org, May 17 2016

This might be related to https://codereview.chromium.org/1967683003/
Looks related, but I still have the same issue when I patch that CL in.
Investigating now.
I'm heading out for the day, will check back again in the morning. It looks like any time I run into IO_PENDING on CalculateSize() from a request created by CacheStorageBLobToDiskCache::StreamBlobToCache it fails to call the callback. Not sure if the thing creating the blob never finishes or what. I'm not terribly familiar with the blob code.
I can't repro on linux. I'll try rebasing to ToT. Based on that it seems like the blob is either
1. Not constructed yet, or
2. Files.

If I can't repro I'll start poking with tests.

Comment 13 by kbr@chromium.org, May 17 2016

Please try on Mac OS X. This happens 100% of the time on my Retina MBP with Canary.

Narrowing this down. Simpler repro steps:

(after opt-in)
* Load the drive settings page here: https://drive.google.com/corp/drive/#settings
* Decheck the 'offline' checkbox
* recheck the 'offline' checkbox
* load any docs or presentation page
* after about 10 seconds, it'll be saving a bunch of stuff to cache storage. In release builds, some of the blobs are still being built. These never 'complete', and thus those jobs don't finish.
Ok, behavior I'm seeing:

1. When I re-enable offline, and let it load a little bit, the serviceworker (or extension, something) starts saving a bunch of data (from blobs) to cache storage.
2. Some of these blobs haven't finished constructing yet, so the CacheStorageBLobToDiskCache's -> BlobURLRequestJob -> BlobReader waits on the blob to finish before continuing.
3. These blobs NEVER finish, which suspends this whole process.
4. To find the origin of these blobs, I try to kill renderers. This can either reveal the renderer that made the blob, or a renderer that is holding the last reference.
5. Killing the docs extension finishes the blobs (investigating now whether they created the blobs or not) as broken.
Confirming that the blob is being constructed from the extension process.
I have a patch that check fails in blob_transport_controller when it gets a message to send blob data for a blob it doesn't contain.
https://codereview.chromium.org/1985383002

It also include's falken's patch for fixing a separate issue. I'm continuing to debug, but basically it seems like we're requesting data from a renderer (the extension) that doesn't have a consolidation stored for that blob.
Got it.

Ok.

Problem happens in BlobTransportController, right around here:
https://code.google.com/p/chromium/codesearch#chromium/src/content/child/blob_storage/blob_transport_controller.cc&q=blobtransportcontroller&l=81

We're getting the response to our IPC messages before that task is run. That task stores the blob consolidation so we can use it in responses. This seems incredibly weird to me. But maybe IPC messages are scheduled at a higher priority on the IO thread?

We're given the messages from our BlobMessageFilter:
https://code.google.com/p/chromium/codesearch#chromium/src/content/child/blob_storage/blob_message_filter.h

Are we not receiving messages on the IO thread? Or are we scheduling IPC messages at a higher priority than regular scheduling?
Cc: gab@chromium.org
Alright, well my current hypothesis is that we have two different task runners for the io thread (or at least more than one).

I'm using this one to schedule my 'storage' task, which stores the blob data for transport:
https://code.google.com/p/chromium/codesearch#chromium/src/content/renderer/renderer_blink_platform_impl.cc&l=268&rcl=1463527138

But it looks like I'm getting my response IPC before that task gets executed, even when they're both on the IO thread.

The IPC message from the renderer, which in this case is being sent from the serviceworker thread, should be eventually using the task runner here just to send that message:
https://code.google.com/p/chromium/codesearch#chromium/src/ipc/ipc_sync_message_filter.cc&q=ipc_sync_message_fil&sq=package:chromium&l=81

Even if that's different than the first one, that's just for sending. My received message from the browser is being dispatched to my blob_message_filter on the IO thread, but apparently somehow being scheduled before my storage task.

If you patch my CL, here is what you would expect in the output for the crash:
... other messages ...
[1:33:0517/205132:ERROR:blob_transport_controller.cc(72)] 2efdfb97-1c91-4120-be77-9c20b750c906 is about to be saved and send messages to browser. Instance: 0x7f29818e08f0, ServiceWorker Thread
[13021:13058:0517/205132:ERROR:blob_async_builder_host.cc(400)] 2efdfb97-1c91-4120-be77-9c20b750c906 sending shared memory request
[1:2:0517/205132:ERROR:blob_transport_controller.cc(229)] 2efdfb97-1c91-4120-be77-9c20b750c906 storing shared memory consolidation on Chrome_ChildIOThread
[1:33:0517/205132:ERROR:blob_transport_controller.cc(72)] d93fa5a7-3d61-4bf1-84cb-e714b5d7fbc6 is about to be saved and send messages to browser. Instance: 0x7f29818e08f0, ServiceWorker Thread
[1:2:0517/205132:ERROR:blob_transport_controller.cc(276)] 2efdfb97-1c91-4120-be77-9c20b750c906 got shared memory request on Chrome_ChildIOThread
[13021:13058:0517/205132:ERROR:blob_async_builder_host.cc(400)] d93fa5a7-3d61-4bf1-84cb-e714b5d7fbc6 sending shared memory request
... other messages ...
[1:2:0517/205132:FATAL:blob_transport_controller.cc(115)] Check failed: false. Could not find d93fa5a7-3d61-4bf1-84cb-e714b5d7fbc6 Instance: 0x7f29818e08f0, Chrome_ChildIOThread


Blob 2efdfb97-1c91-4120-be77-9c20b750c906 was constructed successfully. But blob d93fa5a7-3d61-4bf1-84cb-e714b5d7fbc6 received the 'OnMemoryRequest' before it could store the data, even though the task to store the data was scheduled before IPC messages were sent.

Elliot told me to ask Gabriel for help.
+gab, do you have any insight or ideas?
To abstract a bit, the mystery is:

[Renderer(Service Worker thread)]: PostTask("storage") to IO thread 
[Renderer(Service Worker thread)]: Sends request IPC to browser
[Browser]: Sends response IPC
[Renderer(IO thread)]: Receives response IPC from browser, before storage task runs

One thing I noticed is the request IPC is sent by SyncMessageFilter::Send here:
 if (sender_ && is_channel_send_thread_safe_) { ... }

rather than on the io_task_runner. Perhaps the channel is not really thread safe? But as comment 19 notes, that is just for sending; it's still odd that that response is handled on the IO thread before the storage task is run.

Comment 21 by tzik@chromium.org, May 18 2016

Cc: tzik@chromium.org
IPCs are sequenced and PostTask are also sequenced, but once they are mixed, they seems racy...
task_ordering.png
212 KB View Download
Cc: roc...@chromium.org
Thanks tzik.

I'm unclear if this was always the case. It seems like a recent regression. Since ChannelMojo appears used, could this be related to "MojoChannel"? https://codereview.chromium.org/1950513002
(To explain the screenshot: the arrows emanating from Service Worker thread come from three boxes: the Register IPC, the Build IPC, and the PostTask. We see the IPC response from the browser comes back before the posted task runs on the IO thread.)
By locally reverting the MojoChannel change, it looks like the IPC sends are done from the IO thread, rather than directly from Service Worker thread, and the bug doesn't occur.

My thinking now:
- There's no guarantee that IPC message handling receiver-side is sequenced by the task runner.
- The MojoChannel change somehow allowed the IPC sending to bypass the IO thread via ThreadSafeSender.

If that's accurate, this bug would only happen since the MojoChannel change in M52.
etc.) take forever to load in Canary

are done from the IO thread, rather than directly from Service Worker
thread, and the bug doesn't occur.

sequenced by the task runner.

thread via ThreadSafeSender.

change in M52.

This is accurate. We changed ChannelProxy to send directly on the pipe
instead of hopping to the IO thread first.
Right, it's incorrect to assume that Send() will hop to the IO thread. See https://bugs.chromium.org/p/chromium/issues/detail?id=516464 which fixed an eerily similar race.
MojoChannel returns true for IsSendThreadSafe() (I believe it's the only override that does) and therefore skips the IO jump. 

CL https://codereview.chromium.org/1988983003 moves BlobTransportController::InitiateBlobTransfer to the IO thread, and I no longer hit the case where the render process doesn't know about the blob uuid that the browser is asking for.

The service worker's still aren't loading however. The waitUntil promise is rejecting. They're erroring out with "Error: At least one cache failed to update" but I can't debug the minified js. I'm still also seeing lots of IO_PENDING from CalculateSize that never winds up calling BlobReader::AsyncCalculateSize. 

Comment 27 by gab@chromium.org, May 18 2016

Cc: fdoray@chromium.org
Re. #19-20:

The IO thread runs a MessageLoopForIO which interleaves async IO (IPCs) and tasks from its task runner. So the IPC may indeed come back and be executed before the posted task runs in the task runner's (MessageLoop) queue.

ref. message_pump_win.cc : https://code.google.com/p/chromium/codesearch#chromium/src/base/message_loop/message_pump_win.cc&l=505
Labels: -ReleaseBlock-Beta ReleaseBlock-Dev
M52 dev is branching on Thursday - 05/18, It would be great if we have a fix before today night's canary.
I'm taking over for Josh.

Here's the fix:
https://codereview.chromium.org/1988293002

Tests & better patch (which doesn't eliminate the fast path) will come later.
Going through commit queue now.
Project Member

Comment 31 by bugdroid1@chromium.org, May 18 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/65a8fb95a8c71a7be8515f5e1c8125b04a477716

commit 65a8fb95a8c71a7be8515f5e1c8125b04a477716
Author: dmurph <dmurph@chromium.org>
Date: Wed May 18 21:13:57 2016

[BlobAsync] Fixed race between IPC messages and IO task queue

The message responses to our blob start message are dispatched on the IO
thread but they're not added to the task queue. So a race can happen
where OnMemoryRequest is called before we store the consolidation in our
map.

I moved the StartBuildingBlob message - which signals the browser to
start requesting memory from the renderer - to our IO task which stores
our blob data. This eliminates the race case with a small sacrifice of
construction speed.

In the future, I'd like to be able to still dispatch both IPC messages
right away to keep the fast path, and correctly handle the race case.

BUG= 612358 

Review-Url: https://codereview.chromium.org/1988293002
Cr-Commit-Position: refs/heads/master@{#394549}

[modify] https://crrev.com/65a8fb95a8c71a7be8515f5e1c8125b04a477716/content/child/blob_storage/blob_transport_controller.cc
[modify] https://crrev.com/65a8fb95a8c71a7be8515f5e1c8125b04a477716/content/child/blob_storage/blob_transport_controller.h
[modify] https://crrev.com/65a8fb95a8c71a7be8515f5e1c8125b04a477716/content/child/blob_storage/blob_transport_controller_unittest.cc

Hey Gabriel,
Was this always the behavior? Or is this just from the Mojo switch? We need to know whether this needs a stable merge.

Daniel
Project Member

Comment 33 by bugdroid1@chromium.org, May 19 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/723cf1bf74560eb41120c1bf94eaaa38115203d0

commit 723cf1bf74560eb41120c1bf94eaaa38115203d0
Author: skyostil <skyostil@chromium.org>
Date: Thu May 19 11:41:31 2016

Revert of [BlobAsync] Fixed race between IPC messages and IO task queue (patchset #1 id:1 of https://codereview.chromium.org/1988293002/ )

Reason for revert:
The oortonline benchmark started failing to load blobs around the time this landed, so I'll do a speculative revert. Failure log:

https://build.chromium.org/p/chromium.perf/builders/Mac%20Retina%20Perf%20%285%29/builds/3008/steps/oortonline/logs/stdio/text

Original issue's description:
> [BlobAsync] Fixed race between IPC messages and IO task queue
>
> The message responses to our blob start message are dispatched on the IO
> thread but they're not added to the task queue. So a race can happen
> where OnMemoryRequest is called before we store the consolidation in our
> map.
>
> I moved the StartBuildingBlob message - which signals the browser to
> start requesting memory from the renderer - to our IO task which stores
> our blob data. This eliminates the race case with a small sacrifice of
> construction speed.
>
> In the future, I'd like to be able to still dispatch both IPC messages
> right away to keep the fast path, and correctly handle the race case.
>
> BUG= 612358 
>
> Committed: https://crrev.com/65a8fb95a8c71a7be8515f5e1c8125b04a477716
> Cr-Commit-Position: refs/heads/master@{#394549}

TBR=michaeln@chromium.org,dmurph@chromium.org
# Skipping CQ checks because original CL landed less than 1 days ago.
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true
BUG= 612358 

Review-Url: https://codereview.chromium.org/1997623002
Cr-Commit-Position: refs/heads/master@{#394740}

[modify] https://crrev.com/723cf1bf74560eb41120c1bf94eaaa38115203d0/content/child/blob_storage/blob_transport_controller.cc
[modify] https://crrev.com/723cf1bf74560eb41120c1bf94eaaa38115203d0/content/child/blob_storage/blob_transport_controller.h
[modify] https://crrev.com/723cf1bf74560eb41120c1bf94eaaa38115203d0/content/child/blob_storage/blob_transport_controller_unittest.cc

https://codereview.chromium.org/1991913003/ landed which causes mojo channel to return false for IsSendThreadSafe. So this should be fixed for now. 

Comment 35 by gab@chromium.org, May 19 2016

I think questions addressed my way were answered here and/or on chromium-dev thread, but just to make sure:

Turns out I had always assumed IPCs and IO thread tasks could be interleaved but they weren't in practice before Mojo as it's the first implementer which returned true for IPC::Channel::IsSendThreadSafe (if it returns false: an explicit IO thread hop is introduced for the IPC, implicitly sequencing it with other IO thread tasks).
Owner: dmu...@chromium.org
Confirmed that docs works fine on ToT after mojo channel change. Assigning to dmurph to handle the upcoming mojo channel changes.
Cc: -dmu...@chromium.org jkarlin@chromium.org
Is there anything pending here to be taken care?
I'm not sure. What mojo channel changes did you mean Josh?
I reverted the change which bypassed the IO thread hop for Send, so any related bugs should be gone as of r394568.
Status: Fixed (was: Started)
I'll close this as fixed then. Thanks everyone!

Comment 42 by kbr@chromium.org, May 23 2016

Blocking: 612944

Sign in to add a comment