New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 814031 link

Starred by 6 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: 2018-02-26
OS: Linux , Android , Windows , Chrome , Mac
Pri: 0
Type: Bug

Blocking:
issue 806661
issue 815360



Sign in to add a comment

OffscreenCanvas can deadlock when navigating or refreshing the page

Project Member Reported by esprehn@chromium.org, Feb 21 2018

Issue description

Google Chrome	64.0.3282.167 (Official Build) (64-bit)
Revision	bf44778c9ce98ecff1128b225a165728044bbdeb-refs/branch-heads/3282@{#671}

We're getting reports from a partner that OffscreenCanvas (which they enabled with a flag) causes a deadlock when refreshing the page or navigating. I've attached a stack trace.

I think this was caused by:
https://chromium.googlesource.com/chromium/src/+/d5389f4dd66f040f4e585732c200e6572e840c24

I think what happens is that in the refresh scenario the TaskQueueManager is already gone and the canvas code tries to PostTask inside another PostTask call because it's doing so inside a destructor (~MailboxTextureHolder)... which is a deadlock on the task queue lock introduced in change d5389f4d

More detail:

MailboxTextureHolder has a destructor that tries to PostTask to the scheduler.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/graphics/MailboxTextureHolder.cpp?rcl=bc8ef3f79ae198d7527e10dedbae19a76772826c&l=142

Accessing the scheduler acquires a lock which is held while calling PostTask. This was added in change d5389f4d.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc?rcl=a0c712b7b66ab84e4cc0be73cc804cf543a29e56&l=76

The sequence of events is:
1. Inside OffscreenCanvasPlaceholder::ReleasePlaceholderFrame it attempts to posts a task with a bound closure, this acquires the lock.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc?rcl=a0c712b7b66ab84e4cc0be73cc804cf543a29e56&l=76

2. The closure takes ownership of the PlaceholderFrame and associated MailboxTextureHolder.
3. Normally the task queue takes ownership of the closure (and transitively the MailboxTextureHolder) and the stack unwinds releasing the lock.
ex. here:
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=3e667d451a63406a511c46b063eef146840d36b0&l=211

4. BUT if the TaskQueueManager is gone then the PostTask early returns before passing ownership of the closure.
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=8c0ebdef31450502b1b631ea596838cc48e7d2fa&l=184
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=3e667d451a63406a511c46b063eef146840d36b0&l=203

5. The closure then has its destructor run *inside* the call to PostTask, *under* the lock.
6. The closure being destroyed attempts to destroy the MailboxTextureHolder, which then attempts to post another task.
7. Posting another task tries to reentrantly acquire the scheduler lock again which is a deadlock.


 
canvas deadlock stacktrace.txt
16.8 KB View Download
Description: Show this description
Labels: Needs-Triage-M64
Components: Blink>Scheduling
Labels: -Pri-2 Pri-1
Status: Available (was: Untriaged)
Thanks for the in-depth analysis. Something strange must be happening here because TaskQueueManager doesn't go away between navigations -- it matches the lifetime of the renderer process. However destroying the task while holding any scheduler locks is clearly wrong.
Labels: -Pri-1 Pri-0
Owner: altimin@chromium.org
Status: Started (was: Available)
Very well spotted, thanks!

It's not TaskQueueManager which goes away, it's TaskQueue itself.
Project Member

Comment 5 by bugdroid1@chromium.org, Feb 21 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e

commit fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e
Author: Alexander Timin <altimin@chromium.org>
Date: Wed Feb 21 17:43:06 2018

[scheduler] Prevent deadlock when posting a task from task destructor.

When TaskQueueManager was shutdown, posted closure will be destroyed
synchronously. When a closure owns an object, the destructor of this object
may try to post a task. If the task is destroyed while holding scheduler locks,
it may result in deadlocks.

R=alexclarke@chromium.org,skyostil@chromium.org

Bug:  814031 
Change-Id: I7550b767d094fbd99543dd6b7827d58bf1c2bee9
Reviewed-on: https://chromium-review.googlesource.com/928705
Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
Commit-Queue: Alexander Timin <altimin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#538147}
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.h
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_manager_unittest.cc

Labels: M-64
What are the exact user facing impact of this bug? What OS's is this impacting?

Comment 7 by gov...@chromium.org, Feb 21 2018

Labels: M-65 ReleaseBlock-Stable
Assuming this is applicable to M65 as well. Also adding "ReleaseBlock-Stable" label, pls remove if not needed.

Pls apply applicable OSs label. Thank you.
Labels: OS-Android OS-Chrome OS-Linux OS-Mac OS-Windows
User facing impact is that renderer may freeze.

One known scenario is when page refresh/navigation happens when OffscreenCanvas is used. There are probably more scenarios resulting in renderer freeze.

Comment 9 by gov...@chromium.org, Feb 21 2018

NextAction: 2018-02-22
Pls verify and update the bug with canary result tomorrow. Thank you.
The NextAction date has arrived: 2018-02-22
Labels: Target-65
Discussed with altimin@ and we should target this fix for M65. It's way too late for M64 now, and this issue is probably related to the increase in WAIT_TIMEOUTS and JS hangs in hangouts that we're seeing.  
Labels: Merge-Request-65
I want to merge it to Beta 65. The change should be safe, but there is some strange data coming from the latest Canary, so I'll merge when the situation becomes clearer.
Project Member

Comment 13 by sheriffbot@chromium.org, Feb 22 2018

Labels: -Merge-Request-65 Merge-Review-65 Hotlist-Merge-Review
This bug requires manual review: We are only 11 days from stable.
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -M-64
Removing M64
NextAction: 2018-02-26
 altimin@, pls update the bug with canary result on Monday morning. So by then change will be well baked in canary. If canary result looks good, I will approve the merge to M65. Thank you.
Blocking: 806661
Blocking: 813218
Seems like blocking bug 813218 still repros on latest canary per this update - https://bugs.chromium.org/p/chromium/issues/detail?id=813218#c68.

Comment 19 by q...@chromium.org, Feb 23 2018

Cc: q...@chromium.org
Even so, could we get a DI of this into M64? We have a different P0 bug b/73500763 on our hands.
The NextAction date has arrived: 2018-02-26
The last canaries seem fine, therefore requesting a merge to M65.
Blocking: -813218
crbug.com/813218 was assigned speculatively, removing as it turns out that's not the case.
Note: OffscreenCanvas bug warrants a merge to 65 even without 813218 being fixed.
Labels: -Merge-Review-65 Merge-Approved-65
Approving merge to M65 branch 3325 based on comments #21 and #23. Please merge ASAP. Thank you.

Comment 25 by cmasso@google.com, Feb 26 2018

altimin@ please merge this today.
Project Member

Comment 26 by bugdroid1@chromium.org, Feb 26 2018

Labels: -merge-approved-65 merge-merged-3325
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/86e709882064d66eb4f9746626eedd2d4f891fdb

commit 86e709882064d66eb4f9746626eedd2d4f891fdb
Author: Alexander Timin <altimin@chromium.org>
Date: Mon Feb 26 19:49:19 2018

[scheduler] Prevent deadlock when posting a task from task destructor.

When TaskQueueManager was shutdown, posted closure will be destroyed
synchronously. When a closure owns an object, the destructor of this object
may try to post a task. If the task is destroyed while holding scheduler locks,
it may result in deadlocks.

R=alexclarke@chromium.org, skyostil@chromium.org
TBR=altimin@chromium.org

(cherry picked from commit 021a4bf395642491d7734138f06326b178a11e9a)

Bug:  814031 
Change-Id: I7550b767d094fbd99543dd6b7827d58bf1c2bee9
Reviewed-on: https://chromium-review.googlesource.com/928705
Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
Commit-Queue: Alexander Timin <altimin@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#538147}
Reviewed-on: https://chromium-review.googlesource.com/937466
Reviewed-by: Alexander Timin <altimin@chromium.org>
Cr-Commit-Position: refs/branch-heads/3325@{#595}
Cr-Branched-From: bc084a8b5afa3744a74927344e304c02ae54189f-refs/heads/master@{#530369}
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.h
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_manager_unittest.cc

Comment 27 by q...@chromium.org, Feb 26 2018

Would this help with issue 815360? I'm suggesting that we pick this into M64 as well.
Status: Fixed (was: Started)
Re #27: This may or may not help with 815360. I think that it does not warrant spinning a new version, but may be worthwhile to piggyback if we spin up a new build due to another reason.

I'd defer the decision to release owners.
qaz@, M65 version 65.0.3325.103 includes this fix. Could you pls test/verify whether it fix issue 815360 or not? 

Comment 31 by q...@chromium.org, Feb 27 2018

We haven't been able to repro the can't join 2nd time issue in the first place, therefore won't be able to verify any fix ourselves. We've asked the customer contact to let the customer try the canary build.

Comment 32 by q...@chromium.org, Feb 27 2018

In b/73500763 we have some positive news from initial tests with the canary build. However, since enterprise Chrome rollouts may (often?) lag behind, could we get this into M64 to address the immediate customer issue?

Comment 33 by q...@chromium.org, Feb 27 2018

Blocking: 815360
Let's target this for M65. 

Comment 35 by wfh@chromium.org, Feb 28 2018

Cc: wfh@chromium.org
other than the anedotal data from bug reports and/or user feedback, what metrics are being used here to determine whether this fix has had any positive effect?

I looked at WAIT_TIMEOUT and compared before/after 66.0.3352.0 (where initial fix landed) and could see no meaningful difference:

https://uma.googleplex.com/p/chrome/timeline_v2?sid=adad9f68d7a787cac520df857da53156

admittedly canaries after 3352 have a large STATUS_BREAKPOINT issue going on, so it's possible this has polluted the data, but I think I would have still seen some change.

Alternatively, WAIT_TIMEOUT was fixed some time before this CL, and so this fix made no difference?

What other data is being looked at? Do you need help with the analysis?

Comment 36 by wfh@chromium.org, Feb 28 2018

One thing to test would be to run the latest Beta and not Canary when doing the tests in b/73500763 - i.e. use builds of M65 beyond the merge (65.0.3325.103 and onwards) rather than just testing canary which might contain other fixes (e.g. whatever caused WAIT_TIMEOUT to vanish on or around the beginning of Feb).

Happy to help with obtaining these builds manually and with installing them, but I think according to the schedule, there should be an M65 beta going out today (Wednesday).
Today's beta #65.0.3325.106 just went out which includes merge listed at #26. Thank you.

Comment 38 by q...@chromium.org, Feb 28 2018

We can use the query in b/73500763#40 as an approximation. I don't think we know the exact signal to help query for the symptom there, though, and we do not see a repro ourselves therefore no way to verify if this is the right fix.

Sign in to add a comment