Calling WebMemoryCoordinator::OnPurgeMemory in RenderThreadImpl::ReleaseFreeMemory causes WebGL test failures |
|||||||||||||
Issue descriptionIn https://chromium-review.googlesource.com/1052987 there was an attempt to call WebMemoryCoordinator::OnPurgeMemory from RenderThreadImpl::ReleaseFreeMemory. It turns out that this introduced flaky timeouts in the WebGL conformance tests; see Issue 840988 . It was difficult to track down this CL as the cause because the flakiness was highly intermittent. One of the bots in particular was failing at least one test on almost every run, and that helped identify the culprit. It's unclear why this change caused that sort of flakiness. Does that call cause Oilpan GCs? Could there be bugs in WebGL's tracing code (or elsewhere – perhaps in the implementation of requestAnimationFrame?) that are causing objects to be inadvertently GCd at the wrong time? That might explain these and other flaky timeouts. Studying the code for: https://cs.chromium.org/chromium/src/third_party/blink/renderer/platform/memory_coordinator.cc?q=OnPurgeMemory&sq=package:chromium&l=103&dr=CSs seems unlikely. However, some other clients look suspicious, like this one: https://cs.chromium.org/chromium/src/third_party/blink/renderer/core/script/classic_pending_script.cc?q=OnPurgeMemory&sq=package:chromium&l=381&dr=CSs Does this mean that if ClassicPendingScript::OnPurgeMemory is called, a script tag won't be executed? That would explain the random failures across a range of tests. From Issue 840988 , here were some of the random failures: WebglConformance_deqp_functional_gles3_shadermatrix_add_assign WebglConformance_deqp_functional_gles3_texturespecification_basic_teximage3d_2d_array_02 WebglConformance_conformance_more_conformance_constants WebglConformance_conformance_glsl_functions_glsl_function_clamp_float WebglConformance_deqp_functional_gles3_shaderoperator_angle_and_trigonometry_03 WebglConformance_deqp_functional_gles3_shadermatrix_add_const WebglConformance_conformance_glsl_functions_glsl_function_clamp_float Gyuyoung, could you please investigate why your CL caused these failures? It seems that there's some underlying bug which needs to be tracked down and fixed. Thanks.
,
May 11 2018
The assigned owner "gyuyoung.kim@lge.com" is not able to receive e-mails, please re-triage. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
May 11 2018
crbug is not happy with the bug assigned to gyuyoung.kim@lge.com. gyuyoung@, do you have a chromium account we should use instead?
,
May 11 2018
Assigning to haraken@ in the meanwhile to track progress.
,
May 12 2018
,
May 12 2018
,
May 12 2018
>crbug is not happy with the bug assigned to gyuyoung.kim@lge.com. gyuyoung@, do you have a chromium account we should use instead? ah, I registered my @chromium.org account. Thanks.
,
May 14 2018
I started to investigate this issue today. Once I verified |blink::MemoryCoordinator::OnPurgeMemory()| was called by my CL during the webgl test. I'm checking it continually.
,
May 16 2018
Once I found that my CL made us call blink::DecommitFreeableMemory() twice. But, I'm not sure if this caused the problem yet. Still investigating it.
if (blink_platform_impl_) {
// Purge Skia font cache, resource cache, and image filter.
SkGraphics::PurgeAllCaches();
blink::DecommitFreeableMemory();
blink::WebMemoryCoordinator::OnPurgeMemory();
}
void MemoryCoordinator::OnPurgeMemory() {
for (auto& client : clients_)
client->OnPurgeMemory();
...
ImageDecodingStore::Instance().Clear();
WTF::Partitions::DecommitFreeableMemory();
,
May 18 2018
,
May 25 2018
Let me share the result of my testing about this issue for last week. Calling |blink::DecommitFreeableMemory()| twice caused the timeout failure more often, but not 100% caused. And, the timeout test failure happened less when I call |blink::DecommitFreeableMemory()| once. So I don't 100% convince that calling the function twice caused the timeout failure yet. But, I feel like that we need to check how is the result of the bots when I land a new CL that calls |blink::DecommitFreeableMemory()| once.
,
May 25 2018
gyuyoung.kim@: thanks for continuing to investigate this. I suspect that one of the classes that purges caches when MemoryCoordinator::OnPurgeMemory is called is doing the wrong thing, and that it is currently buggy, and the cause of intermittent timeouts in these tests. Your CL probably just exacerbated an existing problem. When I glanced through the code earlier: https://cs.chromium.org/chromium/src/third_party/blink/renderer/platform/memory_coordinator.cc?type=cs&q=MemoryCoordinator::OnPurgeMemory&g=0&l=103 one of the clients affected by this is ClassicPendingScript: https://cs.chromium.org/chromium/src/third_party/blink/renderer/core/script/classic_pending_script.cc?type=cs&g=0&l=385 I don't know the semantics of these methods, but to me, having ClassicPendingScript::OnPurgeMemory call ClassicPendingScript::CancelStreaming seems wrong. Doesn't this mean that if we purge memory, then any script nodes which are still in the process of being loaded will never be loaded? This could easily explain random timeouts of these tests, if the script nodes sometimes aren't executed at all.
,
May 25 2018
+vogelheim@ for script streaming, +tasak@ for the long-standing testing issue around memory coordinator. Expectation is that when we cancel the streaming there, still ClassicPendingScript is finished via non-streaming path, but this path is not well tested. I'll try to add unit tests and see whether the expectation holds.
,
May 25 2018
FYI testing issues: Issue 671502 and Issue 682662.
,
May 25 2018
The current implementation of ClassicPendingScript::OnPurgeMemory() is wrong: PendingScriptClient is not notified of finish (i.e. the expectation in Comment #13 doesn't hold). See the unit test CancellingStreamingByPurgeAfterResourceFinish in https://chromium-review.googlesource.com/#/c/chromium/src/+/1073990 (which fails).
,
May 25 2018
As this is a bug that affects functional behavior (not performance-only bug), I feel it would be helpful to merge a fix to beta/stable. I feel disabling the streaming cancel in OnPurgeMemory() (i.e. remove ClassicPendingScript::OnPurgeMemory()) is the simplest fix. tasak@, do you expect performance/memory regressions caused by this? Fixing this so that we can correctly cancel the streaming will be more complicated. IIUC OnPurgeMemory() is the only code path that cancels the streaming from ClassicPendingScript, either before or after resource load finish. Also, this is the only caller of ClassicPendingScript::CancelStreaming() that needs finish notification to PendingScriptClient after CancelStreaming(), because other callers are for disposing PendingScript and thus we no longer need any action after CancelStreaming(). I feel we should cleanup/simplify the semantics of script streaming cancellation, as we have similar but different cancel-like methods like: - ClassicPendingScript::CancelStreaming() - ScriptStreamer::Cancel() - SourceStream::Cancel() - ScriptStreamer::SuppressStreaming() vogelheim@, WDYT?
,
May 25 2018
hiroshige@: this issue has probably been there for a long time so I'm not convinced that we should decide up front to merge a fix to beta, and especially not stable. First we should figure out what the fix is and what sort of risk it carries. Should we file a new bug about this, or should we make fixes in this area under this bug ID?
,
May 25 2018
Filed Issue 846951 for the specific ClassicPendingScript issue.
,
May 25 2018
,
May 30 2018
hiroshige@, FYI, I tested the webGL tests after disabling |CancelSreaming| with my CL yesterday. I ran it 3 times and there was no timeout during the 3 times. It seems to me that CancelStreaming() might cause the timeout intermittently.
void ClassicPendingScript::OnPurgeMemory() {
CheckState();
// CancelStreaming();
}
,
Jun 4 2018
kbr@, It seems to me that the failures of webgl have been disappeared since #1637 build. - https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Retina%20Release%20%28NVIDIA%29/1637
,
Jun 4 2018
Yes, a suppression was landed for the failing test.
,
Jun 5 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f3ab72bc34a678e7c8e164869a52b951c2cf051b commit f3ab72bc34a678e7c8e164869a52b951c2cf051b Author: Gyuyoung Kim <gyuyoung.kim@lge.com> Date: Tue Jun 05 01:26:01 2018 Reland: Call blink::WebMemoryCoordinator::OnPurgeMemory in RenderThreadImpl::ReleaseFreeMemory Orignal commit: f4544427fd7e40abba353339146780a2a2b90051 This CL was reverted because it caused the timeout test failure on fyi bots. But, it looks like the issue was solved. So now this CL lands original CL without a duplicated function invocation. Bug: 842019 Test: Covered by RenderThreadImplDiscardableMemoryBrowserTest.ReleaseFreeMemory Change-Id: I52a4bafc1688822a11fa2408ad7422331fa62a46 Reviewed-on: https://chromium-review.googlesource.com/1084369 Reviewed-by: Kentaro Hara <haraken@chromium.org> Reviewed-by: Kenneth Russell <kbr@chromium.org> Commit-Queue: Gyuyoung Kim <gyuyoung.kim@lge.com> Cr-Commit-Position: refs/heads/master@{#564331} [modify] https://crrev.com/f3ab72bc34a678e7c8e164869a52b951c2cf051b/content/renderer/render_thread_impl.cc
,
Jun 5 2018
After re-landing this CL, the bots are still fine. So I'd like to close this bug. If regressions happen again, I'll open this bug again.
,
Jun 12 2018
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by gyuyoung...@lge.com
, May 11 2018