New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 871430 link

Starred by 10 users

Issue metadata

Status: Fixed
Owner:
Closed: Aug 13
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug



Sign in to add a comment

Mac: Memory corruption

Project Member Reported by ccameron@chromium.org, Aug 6

Issue description

There are a handful of bugs that have been filed lately that are OOMs (AFAICT), but are not being tagged as OOMs.

Example bugs (there may be more): issue 854992, issue 866611, issue 862108, issue 862105.

These all have the following stack top:
    __pthread_kill
    abort
    szone_error
    tiny_malloc_from_free_list
    szone_malloc_should_clear
    base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:177)
    malloc_zone_malloc
    malloc

This is contrast with crashes that are correctly tagged as OOMs, which have the stack top (does stack top sound like a dessert to anyone else?):
    logging::LogMessage::~LogMessage()
    base::(anonymous namespace)::OnNoMemory(unsigned long)
    base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:60)
    base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:177)
    malloc_zone_malloc
    malloc

Maybe there's a failure mode that we're missing?
 
Issue 862105 has been merged into this issue.
Issue 866611 has been merged into this issue.
Issue 854992 has been merged into this issue.
Issue 862108 has been merged into this issue.
Issue 812517 has been merged into this issue.
Issue 709895 has been merged into this issue.
Labels: -Pri-3 ReleaseBlock-Stable M-69 Pri-1
Summary: Mac: Memory corruption (was: Mac: Some OOM crashes are not being tagged as OOM)
The first crash I checked: 07ff2d98f47a8b88

from: https://bugs.chromium.org/p/chromium/issues/detail?id=854992

Is either use-after-free or some other form of memory corruption:
"""
abort() called
*** error for object 0x60400062c6c1: Invalid pointer dequeued from free list
"""

This is not OOM. A recent surge in crashes suggests a recent, fairly bad bug.
The crash I listed in c#7 occurs in nano-malloc:

"""
0x00007fff7f453b6e	(libsystem_kernel.dylib + 0x0001cb6e )	__pthread_kill
0x00007fff7f3af1ad	(libsystem_c.dylib + 0x0005d1ad )	abort
0x00007fff7f4bedc0	(libsystem_malloc.dylib + 0x00014dc0 )	nanozone_error
0x00007fff7f4b2d59	(libsystem_malloc.dylib + 0x00008d59 )	_nano_malloc_check_clear
0x00007fff7f4b2b9e	(libsystem_malloc.dylib + 0x00008b9e )	nano_malloc
"""
Yeah, the one in #7 doesn't have szone_error (rather, nanozone_error). Amongst the reports in #8, all of the ones I've sampled so far have:

"""
*** error for object (address): incorrect checksum for freed object - object was probably modified after being freed.
"""

Owner: ellyjo...@chromium.org
Status: Assigned (was: Available)
Over to Elly to find an owner.
WRT nano_error vs szone_error, it looks as though these started appearing ~June 21, first as nanozone_error, then shifting to szone_error with nano being removed ~July 14.
szone-nano.png
127 KB View Download
Cc: kbr@chromium.org piman@chromium.org vmi...@chromium.org
Components: -Internals>Instrumentation>Memory Internals>GPU
This memory corruption seems to be happening almost always during GPU process re-launch after crash (+ some GPU folks).

In beta at the moment (69.0.3497.23)
- 189 browser crash reports with szone_error on the stack [1]. Of these...
  - 172 have content::GpuProcessHost::Get on the stack. Of these...
  - 167 have content::GpuProcessHost::OnProcessCrashed on the stack
  - 17 have neither (and are a totally mixed bag)

Some related (but ultimately irrelevant) context. We have elevated GPU process crash rates due to issue 863817. But, that doesn't account for the increased crash rate here, because even Canary 70.0.3511.0 (which has the GPU process crash fixed) still exhibits increased crash rates of these signatures.
Cc: sdy@chromium.org
Restricting my attention to just content::GpuProcessHost::OnProcessCrashed, crashes start at 69.0.3455.0, and there is one for every Canary version thereafter. The only exceptions are days with multiple canaries, listed below:
  70.0.3501.0 but not 70.0.3501.2
  69.0.3497.0 but not 69.0.3497.3 or 69.0.3497.4
  69.0.3468.0 but not 69.0.3468.2
  69.0.3469.2 but not 69.0.3469.0
  69.0.3461.2 but not 69.0.3461.0

With that in hand, we can be fairly confident that this issue appeared between 69.0.3453.3 and 69.0.3455.0:
https://chromium.googlesource.com/chromium/src/+log/69.0.3453.0..69.0.3455.0?pretty=fuller&n=10000

There are two substantial changes to GpuProcessHost there:
- r565739: Improve GPU mode fallback on GPU crash
- r565687: De-jank window resizing in MacViews (part 2)

I will try disabling CATransactionGPUCoordinator (this is just the GPU-process-side-part) for a Canary build to see if these reports go away.
Project Member

Comment 16 by bugdroid1@chromium.org, Aug 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f7549422b853e68118fbaf9e4db5f09a770fc5d5

commit f7549422b853e68118fbaf9e4db5f09a770fc5d5
Author: Christopher Cameron <ccameron@chromium.org>
Date: Tue Aug 07 03:17:27 2018

Mac: Disable CATransactionGPUCoordinator to test crash theory

Crashes with content::GpuProcessHost::OnProcessCrashed on the stack
have become prevalent in Chrome 69.

CATransactionGPUCoordinator was introduced in the first Canary to
exhibit these crashes (https://crrev.com/565687 in 69.0.3455.0).

Disable CATransactionGPUCoordinator to test the theory that it is
responsible for the crashes. This will make window resize slightly
jankier (like -- the window's contents will detach from the window
during resize) while disabled. Note that not all window resize logic
is disabled by this -- just the GPU side (so resize will be reasonable
most of the time, but there will be occaisonal glitches now).

Bug:  871430 
Change-Id: Ifec51e511712b9843f15e9b8c43b66d852232b89
Reviewed-on: https://chromium-review.googlesource.com/1164555
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: ccameron <ccameron@chromium.org>
Cr-Commit-Position: refs/heads/master@{#581117}
[modify] https://crrev.com/f7549422b853e68118fbaf9e4db5f09a770fc5d5/content/browser/gpu/gpu_process_host.cc

M69 Stable promotion is coming VERY soon. Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and request a merge into the release branch ASAP. Thank you.
Cc: pnangunoori@chromium.org
Just to update the latest behavior of the magic signature - content::ChildConnection::IOThreadContext::BindInterface, this could be related to this crash:

So far seeing 5 crashes from 5 clients so far on latest beta - 69.0.3497.23 on Mac OS. This crash is ranked as number #9 in 'Browser' beta crashes. 

Crash data:
-----------
70.0.3514.2	2.56%	1 - Dev
69.0.3497.23	20.51%	8 - Beta
So far no crashes are observed on Stable and Canary latest builds.

Link to the list of builds:
-------------------------
https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Mac%27+AND+expanded_custom_data.ChromeCrashProto.channel%3D%27beta%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27content%3A%3AChildConnection%3A%3AIOThreadContext%3A%3ABindInterface%27

Thanks!
Re #18, yes, that signature is the same bug.

I see zero instances of this bug post-70.0.3516.0, so this is definitely the root cause. piman@ suggested that taking a reference in the ctor was the bug, so I'll try removing that and putting this back in.
Labels: Target-69
Owner: ccameron@chromium.org
Project Member

Comment 22 by bugdroid1@chromium.org, Aug 10

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/62892558b716d1560264e42929e2be01ead07e71

commit 62892558b716d1560264e42929e2be01ead07e71
Author: Christopher Cameron <ccameron@chromium.org>
Date: Fri Aug 10 23:10:49 2018

CATransactionGPUCoordinator: Re-enable and make lifetime explicit

Disalbing the CATransactionGPUCoordinator caused memory corruption bugs
to go away.

Fix two potential bugs:

1.  Don't post tasks (and potentially change the reference count) inside
    CATransactionGPUCoordinator's constructor. Do this in a separate
    explicit Init function.

1a. Move this initialization to the end of GpuProcessHost:Init (instead
    of being at the beginning of the constructor).

2.  Make CATransactionCoordinator explicitly retain PostCommitObserver
    (which includes CATransactionGPUCoordinator). This fixes a bug
    whereby at shutdown, destroying not-yet-executed posted tasks caused
    the CATransactionGPUCoordinator to be destroyed while the
    CATransactionCoordinator still had a raw pointer to it.

Bug:  871430 
Change-Id: Ie144071cce9ce48e0187cdaf1fcf32df7b62ed75
Reviewed-on: https://chromium-review.googlesource.com/1171657
Commit-Queue: ccameron <ccameron@chromium.org>
Reviewed-by: Antoine Labour <piman@chromium.org>
Cr-Commit-Position: refs/heads/master@{#582378}
[modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/ca_transaction_gpu_coordinator.cc
[modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/ca_transaction_gpu_coordinator.h
[modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/ui/accelerated_widget_mac/ca_transaction_observer.h
[modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/ui/accelerated_widget_mac/ca_transaction_observer.mm

Project Member

Comment 24 by sheriffbot@chromium.org, Aug 13

Labels: -Merge-Request-69 Merge-Review-69 Hotlist-Merge-Review
This bug requires manual review: M69 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Which CL you're requesting a merge for?
Also CL listed at #22 landed 2 days back but per comment #23 last was in 70.0.3511.0 & current canary version in 70.0.3520.0, so how can we sure this CL indeed fix the crash?
Both #16 and #22 -- we still see /exactly zero/ crashes with this signature after 70.0.3511.0, which strongly suggests that the issue is fixed (we usually get 5+ crash reports per day).

Both changes are very low risk.
Issue 873680 has been merged into this issue.
Labels: -Merge-Review-69 Merge-Approved-69
Approving merge to M69 branch 3497 based on comment #26. Please merge and mark bug as fixed if nothing else is pending. Thank you.
Project Member

Comment 29 by bugdroid1@chromium.org, Aug 13

Labels: -merge-approved-69 merge-merged-3497
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/bd57bba80b4e8ee9c4e38779851588905d954966

commit bd57bba80b4e8ee9c4e38779851588905d954966
Author: Christopher Cameron <ccameron@chromium.org>
Date: Mon Aug 13 18:03:09 2018

Mac: Disable CATransactionGPUCoordinator to test crash theory

Crashes with content::GpuProcessHost::OnProcessCrashed on the stack
have become prevalent in Chrome 69.

CATransactionGPUCoordinator was introduced in the first Canary to
exhibit these crashes (https://crrev.com/565687 in 69.0.3455.0).

Disable CATransactionGPUCoordinator to test the theory that it is
responsible for the crashes. This will make window resize slightly
jankier (like -- the window's contents will detach from the window
during resize) while disabled. Note that not all window resize logic
is disabled by this -- just the GPU side (so resize will be reasonable
most of the time, but there will be occaisonal glitches now).

TBR=ccameron@chromium.org

(cherry picked from commit f7549422b853e68118fbaf9e4db5f09a770fc5d5)

Bug:  871430 
Change-Id: Ifec51e511712b9843f15e9b8c43b66d852232b89
Reviewed-on: https://chromium-review.googlesource.com/1164555
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Commit-Queue: ccameron <ccameron@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#581117}
Reviewed-on: https://chromium-review.googlesource.com/1173058
Reviewed-by: ccameron <ccameron@chromium.org>
Cr-Commit-Position: refs/branch-heads/3497@{#569}
Cr-Branched-From: 271eaf50594eb818c9295dc78d364aea18c82ea8-refs/heads/master@{#576753}
[modify] https://crrev.com/bd57bba80b4e8ee9c4e38779851588905d954966/content/browser/gpu/gpu_process_host.cc

Project Member

Comment 30 by bugdroid1@chromium.org, Aug 13

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899

commit d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899
Author: Christopher Cameron <ccameron@chromium.org>
Date: Mon Aug 13 20:15:22 2018

CATransactionGPUCoordinator: Re-enable and make lifetime explicit

Disalbing the CATransactionGPUCoordinator caused memory corruption bugs
to go away.

Fix two potential bugs:

1.  Don't post tasks (and potentially change the reference count) inside
    CATransactionGPUCoordinator's constructor. Do this in a separate
    explicit Init function.

1a. Move this initialization to the end of GpuProcessHost:Init (instead
    of being at the beginning of the constructor).

2.  Make CATransactionCoordinator explicitly retain PostCommitObserver
    (which includes CATransactionGPUCoordinator). This fixes a bug
    whereby at shutdown, destroying not-yet-executed posted tasks caused
    the CATransactionGPUCoordinator to be destroyed while the
    CATransactionCoordinator still had a raw pointer to it.

TBR=ccameron@chromium.org

(cherry picked from commit 62892558b716d1560264e42929e2be01ead07e71)

Bug:  871430 
Change-Id: Ie144071cce9ce48e0187cdaf1fcf32df7b62ed75
Reviewed-on: https://chromium-review.googlesource.com/1171657
Commit-Queue: ccameron <ccameron@chromium.org>
Reviewed-by: Antoine Labour <piman@chromium.org>
Cr-Original-Commit-Position: refs/heads/master@{#582378}
Reviewed-on: https://chromium-review.googlesource.com/1173060
Reviewed-by: ccameron <ccameron@chromium.org>
Cr-Commit-Position: refs/branch-heads/3497@{#581}
Cr-Branched-From: 271eaf50594eb818c9295dc78d364aea18c82ea8-refs/heads/master@{#576753}
[modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/ca_transaction_gpu_coordinator.cc
[modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/ca_transaction_gpu_coordinator.h
[modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/gpu_process_host.cc
[modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/ui/accelerated_widget_mac/ca_transaction_observer.h
[modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/ui/accelerated_widget_mac/ca_transaction_observer.mm

Status: Fixed (was: Assigned)
Cc: ccameron@chromium.org
 Issue 866194  has been merged into this issue.

Sign in to add a comment