OffscreenCanvas can deadlock when navigating or refreshing the page
Issue description

Google Chrome 64.0.3282.167 (Official Build) (64-bit)
Revision bf44778c9ce98ecff1128b225a165728044bbdeb-refs/branch-heads/3282@{#671}

We're getting reports from a partner that OffscreenCanvas (which they enabled with a flag) causes a deadlock when refreshing the page or navigating. I've attached a stack trace.

I think this was caused by:
https://chromium.googlesource.com/chromium/src/+/d5389f4dd66f040f4e585732c200e6572e840c24

I think what happens is that in the refresh scenario the TaskQueueManager is already gone, and the canvas code tries to PostTask inside another PostTask call because it is doing so inside a destructor (~MailboxTextureHolder). That is a deadlock on the task queue lock introduced in change d5389f4d.

More detail: MailboxTextureHolder has a destructor that tries to PostTask to the scheduler:
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/graphics/MailboxTextureHolder.cpp?rcl=bc8ef3f79ae198d7527e10dedbae19a76772826c&l=142

Accessing the scheduler acquires a lock which is held while calling PostTask. This was added in change d5389f4d:
https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc?rcl=a0c712b7b66ab84e4cc0be73cc804cf543a29e56&l=76

The sequence of events is:
1. Inside OffscreenCanvasPlaceholder::ReleasePlaceholderFrame, a task is posted with a bound closure; this acquires the lock. https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc?rcl=a0c712b7b66ab84e4cc0be73cc804cf543a29e56&l=76
2. The closure takes ownership of the PlaceholderFrame and the associated MailboxTextureHolder.
3. Normally the task queue takes ownership of the closure (and transitively the MailboxTextureHolder) and the stack unwinds, releasing the lock. For example: https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=3e667d451a63406a511c46b063eef146840d36b0&l=211
4. BUT if the TaskQueueManager is gone, PostTask returns early before passing ownership of the closure. https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=8c0ebdef31450502b1b631ea596838cc48e7d2fa&l=184 https://cs.chromium.org/chromium/src/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc?rcl=3e667d451a63406a511c46b063eef146840d36b0&l=203
5. The closure then has its destructor run *inside* the call to PostTask, *under* the lock.
6. The closure being destroyed attempts to destroy the MailboxTextureHolder, which in turn attempts to post another task.
7. Posting another task tries to reentrantly acquire the scheduler lock, which is a deadlock.
Feb 21 2018
Thanks for the in-depth analysis. Something strange must be happening here, because TaskQueueManager doesn't go away between navigations -- it matches the lifetime of the renderer process. However, destroying the task while holding any scheduler locks is clearly wrong.
Feb 21 2018
Very well spotted, thanks! It's not TaskQueueManager which goes away, it's TaskQueue itself.
Feb 21 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e

commit fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e
Author: Alexander Timin <altimin@chromium.org>
Date: Wed Feb 21 17:43:06 2018

    [scheduler] Prevent deadlock when posting a task from task destructor.

    When TaskQueueManager was shutdown, posted closure will be destroyed
    synchronously. When a closure owns an object, the destructor of this
    object may try to post a task. If the task is destroyed while holding
    scheduler locks, it may result in deadlocks.

    R=alexclarke@chromium.org,skyostil@chromium.org

    Bug: 814031
    Change-Id: I7550b767d094fbd99543dd6b7827d58bf1c2bee9
    Reviewed-on: https://chromium-review.googlesource.com/928705
    Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
    Commit-Queue: Alexander Timin <altimin@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#538147}

[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.h
[modify] https://crrev.com/fe8c817d474d9dbc05ac28f36ca2a15a9556dc0e/third_party/WebKit/Source/platform/scheduler/base/task_queue_manager_unittest.cc
Feb 21 2018
What is the exact user-facing impact of this bug? Which OSs does it affect?
Feb 21 2018
Assuming this is applicable to M65 as well. Also adding "ReleaseBlock-Stable" label, pls remove if not needed. Pls apply applicable OSs label. Thank you.
Feb 21 2018
The user-facing impact is that the renderer may freeze. One known scenario is a page refresh or navigation while OffscreenCanvas is in use. There are probably more scenarios that result in a renderer freeze.
Feb 21 2018
Pls verify and update the bug with canary result tomorrow. Thank you.
Feb 22 2018
The NextAction date has arrived: 2018-02-22
Feb 22 2018
Discussed with altimin@ and we should target this fix for M65. It's way too late for M64 now, and this issue is probably related to the increase in WAIT_TIMEOUTs and JS hangs in Hangouts that we're seeing.
Feb 22 2018
I want to merge it to Beta 65. The change should be safe, but there is some strange data coming from the latest Canary, so I'll merge when the situation becomes clearer.
Feb 22 2018
This bug requires manual review: We are only 11 days from stable. Please contact the milestone owner if you have questions. Owners: cmasso@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Feb 22 2018
Removing M64
Feb 22 2018
altimin@, pls update the bug with the canary result on Monday morning; by then the change will be well baked in canary. If the canary result looks good, I will approve the merge to M65. Thank you.
Feb 23 2018
Seems like blocking bug 813218 still repros on latest canary per this update - https://bugs.chromium.org/p/chromium/issues/detail?id=813218#c68.
Feb 23 2018
Even so, could we get a DI of this into M64? We have a different P0 bug b/73500763 on our hands.
Feb 26 2018
The NextAction date has arrived: 2018-02-26
Feb 26 2018
The last canaries seem fine, therefore requesting a merge to M65.
Feb 26 2018
crbug.com/813218 was marked as related speculatively; removing it, as it turns out that's not the case.
Feb 26 2018
Note: OffscreenCanvas bug warrants a merge to 65 even without 813218 being fixed.
Feb 26 2018
Approving merge to M65 branch 3325 based on comments #21 and #23. Please merge ASAP. Thank you.
Feb 26 2018
altimin@ please merge this today.
Feb 26 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/86e709882064d66eb4f9746626eedd2d4f891fdb

commit 86e709882064d66eb4f9746626eedd2d4f891fdb
Author: Alexander Timin <altimin@chromium.org>
Date: Mon Feb 26 19:49:19 2018

    [scheduler] Prevent deadlock when posting a task from task destructor.

    When TaskQueueManager was shutdown, posted closure will be destroyed
    synchronously. When a closure owns an object, the destructor of this
    object may try to post a task. If the task is destroyed while holding
    scheduler locks, it may result in deadlocks.

    R=alexclarke@chromium.org, skyostil@chromium.org
    TBR=altimin@chromium.org

    (cherry picked from commit 021a4bf395642491d7734138f06326b178a11e9a)

    Bug: 814031
    Change-Id: I7550b767d094fbd99543dd6b7827d58bf1c2bee9
    Reviewed-on: https://chromium-review.googlesource.com/928705
    Reviewed-by: Sami Kyöstilä <skyostil@chromium.org>
    Commit-Queue: Alexander Timin <altimin@chromium.org>
    Cr-Original-Commit-Position: refs/heads/master@{#538147}
    Reviewed-on: https://chromium-review.googlesource.com/937466
    Reviewed-by: Alexander Timin <altimin@chromium.org>
    Cr-Commit-Position: refs/branch-heads/3325@{#595}
    Cr-Branched-From: bc084a8b5afa3744a74927344e304c02ae54189f-refs/heads/master@{#530369}

[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue.cc
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.cc
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_impl.h
[modify] https://crrev.com/86e709882064d66eb4f9746626eedd2d4f891fdb/third_party/WebKit/Source/platform/scheduler/base/task_queue_manager_unittest.cc
Feb 26 2018
Would this help with issue 815360? I'm suggesting that we pick this into M64 as well.
Feb 27 2018
Re #27: This may or may not help with 815360. I don't think it warrants spinning a new version on its own, but it may be worthwhile to piggyback if we spin up a new build for another reason. I'd defer the decision to the release owners.
Feb 27 2018
qaz@, M65 version 65.0.3325.103 includes this fix. Could you pls test/verify whether it fixes issue 815360 or not?
Feb 27 2018
We haven't been able to reproduce the "can't join a second time" issue in the first place, so we won't be able to verify any fix ourselves. We've asked the customer contact to have the customer try the canary build.
Feb 27 2018
In b/73500763 we have some positive news from initial tests with the canary build. However, since enterprise Chrome rollouts may (often?) lag behind, could we get this into M64 to address the immediate customer issue?
Feb 28 2018
Let's target this for M65.
Feb 28 2018
Other than the anecdotal data from bug reports and/or user feedback, what metrics are being used here to determine whether this fix has had any positive effect? I looked at WAIT_TIMEOUT and compared before/after 66.0.3352.0 (where the initial fix landed) and could see no meaningful difference: https://uma.googleplex.com/p/chrome/timeline_v2?sid=adad9f68d7a787cac520df857da53156 Admittedly, canaries after 3352 have a large STATUS_BREAKPOINT issue going on, so it's possible this has polluted the data, but I think I would still have seen some change. Alternatively, was WAIT_TIMEOUT fixed some time before this CL, so that this fix made no difference? What other data is being looked at? Do you need help with the analysis?
Feb 28 2018
One thing to test would be to run the latest Beta rather than Canary when doing the tests in b/73500763 -- i.e. use builds of M65 beyond the merge (65.0.3325.103 and onwards) rather than just testing Canary, which might contain other fixes (e.g. whatever caused WAIT_TIMEOUT to vanish on or around the beginning of February). Happy to help with obtaining these builds manually and with installing them; according to the schedule, there should be an M65 beta going out today (Wednesday).
Feb 28 2018
Today's beta, 65.0.3325.106, just went out; it includes the merge listed in #26. Thank you.
Feb 28 2018
We can use the query in b/73500763#40 as an approximation. I don't think we know the exact signal to query for the symptom there, though, and since we cannot reproduce it ourselves, we have no way to verify that this is the right fix.
Comment 1 by esprehn@chromium.org, Feb 21 2018