Mac: Memory corruption |
|||||||||||||
Issue description
There are a handful of bugs that have been filed lately that are OOMs (AFAICT), but are not being tagged as OOMs.
Example bugs (there may be more): issue 854992, issue 866611, issue 862108, issue 862105.
These all have the following stack top:
__pthread_kill
abort
szone_error
tiny_malloc_from_free_list
szone_malloc_should_clear
base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:177)
malloc_zone_malloc
malloc
This is contrast with crashes that are correctly tagged as OOMs, which have the stack top (does stack top sound like a dessert to anyone else?):
logging::LogMessage::~LogMessage()
base::(anonymous namespace)::OnNoMemory(unsigned long)
base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:60)
base::allocator::MallocZoneFunctionsToReplaceDefault (allocator_shim.cc:177)
malloc_zone_malloc
malloc
Maybe there's a failure mode that we're missing?
,
Aug 6
Issue 866611 has been merged into this issue.
,
Aug 6
Issue 854992 has been merged into this issue.
,
Aug 6
Issue 862108 has been merged into this issue.
,
Aug 6
Issue 812517 has been merged into this issue.
,
Aug 6
Issue 709895 has been merged into this issue.
,
Aug 6
The first crash I checked: 07ff2d98f47a8b88 from: https://bugs.chromium.org/p/chromium/issues/detail?id=854992 Is either use-after-free or some other form of memory corruption: """ abort() called *** error for object 0x60400062c6c1: Invalid pointer dequeued from free list """ This is not OOM. A recent surge in crashes suggests a recent, fairly bad bug.
,
Aug 6
(oops, shows what I know...) The surge started ~July 14, after crrev.com/575108 landed. Is this now-not-nano-malloc related? https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Mac%27+AND+expanded_custom_data.ChromeCrashProto.channel%3D%27canary%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28CrashedStackTrace.StackFrame%29+WHERE+FunctionName%3D%27szone_error%27%29#+samplereports,-productname:1000,magicsignature:50,-magicsignature2:50,-stablesignature:50,magicsignaturesorted:50
,
Aug 6
The crash I listed in c#7 occurs in nano-malloc: """ 0x00007fff7f453b6e (libsystem_kernel.dylib + 0x0001cb6e ) __pthread_kill 0x00007fff7f3af1ad (libsystem_c.dylib + 0x0005d1ad ) abort 0x00007fff7f4bedc0 (libsystem_malloc.dylib + 0x00014dc0 ) nanozone_error 0x00007fff7f4b2d59 (libsystem_malloc.dylib + 0x00008d59 ) _nano_malloc_check_clear 0x00007fff7f4b2b9e (libsystem_malloc.dylib + 0x00008b9e ) nano_malloc """
,
Aug 6
Yeah, the one in #7 doesn't have szone_error (rather, nanozone_error). Amongst the reports in #8, all of the ones I've sampled so far have: """ *** error for object (address): incorrect checksum for freed object - object was probably modified after being freed. """
,
Aug 6
Over to Elly to find an owner.
,
Aug 6
WRT nano_error vs szone_error, it looks as though these started appearing ~June 21, first as nanozone_error, then shifting to szone_error with nano being removed ~July 14.
,
Aug 6
Query that merges nano and szone: https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Mac%27+AND+expanded_custom_data.ChromeCrashProto.channel%3D%27canary%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+%28EXISTS+%28SELECT+1+FROM+UNNEST%28CrashedStackTrace.StackFrame%29+WHERE+FunctionName%3D%27nanozone_error%27%29+OR+EXISTS+%28SELECT+1+FROM+UNNEST%28CrashedStackTrace.StackFrame%29+WHERE+FunctionName%3D%27szone_error%27%29%29 The first appearance is in 69.0.3446.0
,
Aug 6
This memory corruption seems to be happening almost always during GPU process re-launch after crash (+ some GPU folks). In beta at the moment (69.0.3497.23) - 189 browser crash reports with szone_error on the stack [1]. Of these... - 172 have content::GpuProcessHost::Get on the stack. Of these... - 167 have content::GpuProcessHost::OnProcessCrashed on the stack - 17 have neither (and are a totally mixed bag) Some related (but ultimately irrelevant) context. We have elevated GPU process crash rates due to issue 863817. But, that doesn't account for the increased crash rate here, because even Canary 70.0.3511.0 (which has the GPU process crash fixed) still exhibits increased crash rates of these signatures.
,
Aug 7
Restricting my attention to just content::GpuProcessHost::OnProcessCrashed, crashes start at 69.0.3455.0, and there is one for every Canary version thereafter. The only exceptions are days with multiple canaries, listed below: 70.0.3501.0 but not 70.0.3501.2 69.0.3497.0 but not 69.0.3497.3 or 69.0.3497.4 69.0.3468.0 but not 69.0.3468.2 69.0.3469.2 but not 69.0.3469.0 69.0.3461.2 but not 69.0.3461.0 With that in hand, we can be fairly confident that this issue appeared between 69.0.3453.3 and 69.0.3455.0: https://chromium.googlesource.com/chromium/src/+log/69.0.3453.0..69.0.3455.0?pretty=fuller&n=10000 There are two substantial changes to GpuProcessHost there: - r565739: Improve GPU mode fallback on GPU crash - r565687: De-jank window resizing in MacViews (part 2) I will try disabling CATransactionGPUCoordinator (this is just the GPU-process-side-part) for a Canary build to see if these reports go away.
,
Aug 7
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f7549422b853e68118fbaf9e4db5f09a770fc5d5 commit f7549422b853e68118fbaf9e4db5f09a770fc5d5 Author: Christopher Cameron <ccameron@chromium.org> Date: Tue Aug 07 03:17:27 2018 Mac: Disable CATransactionGPUCoordinator to test crash theory Crashes with content::GpuProcessHost::OnProcessCrashed on the stack have become prevalent in Chrome 69. CATransactionGPUCoordinator was introduced in the first Canary to exhibit these crashes (https://crrev.com/565687 in 69.0.3455.0). Disable CATransactionGPUCoordinator to test the theory that it is responsible for the crashes. This will make window resize slightly jankier (like -- the window's contents will detach from the window during resize) while disabled. Note that not all window resize logic is disabled by this -- just the GPU side (so resize will be reasonable most of the time, but there will be occaisonal glitches now). Bug: 871430 Change-Id: Ifec51e511712b9843f15e9b8c43b66d852232b89 Reviewed-on: https://chromium-review.googlesource.com/1164555 Reviewed-by: Kenneth Russell <kbr@chromium.org> Commit-Queue: ccameron <ccameron@chromium.org> Cr-Commit-Position: refs/heads/master@{#581117} [modify] https://crrev.com/f7549422b853e68118fbaf9e4db5f09a770fc5d5/content/browser/gpu/gpu_process_host.cc
,
Aug 7
M69 Stable promotion is coming VERY soon. Your bug is labelled as Stable ReleaseBlock, pls make sure to land the fix and request a merge into the release branch ASAP. Thank you.
,
Aug 8
Just to update the latest behavior of the magic signature - content::ChildConnection::IOThreadContext::BindInterface, this could be related to this crash: So far seeing 5 crashes from 5 clients so far on latest beta - 69.0.3497.23 on Mac OS. This crash is ranked as number #9 in 'Browser' beta crashes. Crash data: ----------- 70.0.3514.2 2.56% 1 - Dev 69.0.3497.23 20.51% 8 - Beta So far no crashes are observed on Stable and Canary latest builds. Link to the list of builds: ------------------------- https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Mac%27+AND+expanded_custom_data.ChromeCrashProto.channel%3D%27beta%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27content%3A%3AChildConnection%3A%3AIOThreadContext%3A%3ABindInterface%27 Thanks!
,
Aug 10
Re #18, yes, that signature is the same bug. I see zero instances of this bug post-70.0.3516.0, so this is definitely the root cause. piman@ suggested that taking a reference in the ctor was the bug, so I'll try removing that and putting this back in.
,
Aug 10
,
Aug 10
,
Aug 10
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/62892558b716d1560264e42929e2be01ead07e71 commit 62892558b716d1560264e42929e2be01ead07e71 Author: Christopher Cameron <ccameron@chromium.org> Date: Fri Aug 10 23:10:49 2018 CATransactionGPUCoordinator: Re-enable and make lifetime explicit Disalbing the CATransactionGPUCoordinator caused memory corruption bugs to go away. Fix two potential bugs: 1. Don't post tasks (and potentially change the reference count) inside CATransactionGPUCoordinator's constructor. Do this in a separate explicit Init function. 1a. Move this initialization to the end of GpuProcessHost:Init (instead of being at the beginning of the constructor). 2. Make CATransactionCoordinator explicitly retain PostCommitObserver (which includes CATransactionGPUCoordinator). This fixes a bug whereby at shutdown, destroying not-yet-executed posted tasks caused the CATransactionGPUCoordinator to be destroyed while the CATransactionCoordinator still had a raw pointer to it. Bug: 871430 Change-Id: Ie144071cce9ce48e0187cdaf1fcf32df7b62ed75 Reviewed-on: https://chromium-review.googlesource.com/1171657 Commit-Queue: ccameron <ccameron@chromium.org> Reviewed-by: Antoine Labour <piman@chromium.org> Cr-Commit-Position: refs/heads/master@{#582378} [modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/ca_transaction_gpu_coordinator.cc [modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/ca_transaction_gpu_coordinator.h [modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/content/browser/gpu/gpu_process_host.cc [modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/ui/accelerated_widget_mac/ca_transaction_observer.h [modify] https://crrev.com/62892558b716d1560264e42929e2be01ead07e71/ui/accelerated_widget_mac/ca_transaction_observer.mm
,
Aug 13
Last browser crash during GPU process launch was in 70.0.3511.0, so I think we can declare this fixed. Requesting M69 merge. https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_Mac%27+AND+expanded_custom_data.ChromeCrashProto.channel%3D%27canary%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27content%3A%3ABrowserMessageFilter%3A%3AGetFilter%27+AND+EXISTS+%28SELECT+1+FROM+UNNEST%28CrashedStackTrace.StackFrame%29+WHERE+FunctionName%3D%27content%3A%3AGpuProcessHost%3A%3AGet%28content%3A%3AGpuProcessHost%3A%3AGpuProcessKind%2C+bool%29%27%29#samplereports
,
Aug 13
This bug requires manual review: M69 has already been promoted to the beta branch, so this requires manual review Please contact the milestone owner if you have questions. Owners: amineer@(Android), kariahda@(iOS), cindyb@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Aug 13
Which CL you're requesting a merge for? Also CL listed at #22 landed 2 days back but per comment #23 last was in 70.0.3511.0 & current canary version in 70.0.3520.0, so how can we sure this CL indeed fix the crash?
,
Aug 13
Both #16 and #22 -- we still see /exactly zero/ crashes with this signature after 70.0.3511.0, which strongly suggests that the issue is fixed (we usually get 5+ crash reports per day). Both changes are very low risk.
,
Aug 13
Issue 873680 has been merged into this issue.
,
Aug 13
Approving merge to M69 branch 3497 based on comment #26. Please merge and mark bug as fixed if nothing else is pending. Thank you.
,
Aug 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/bd57bba80b4e8ee9c4e38779851588905d954966 commit bd57bba80b4e8ee9c4e38779851588905d954966 Author: Christopher Cameron <ccameron@chromium.org> Date: Mon Aug 13 18:03:09 2018 Mac: Disable CATransactionGPUCoordinator to test crash theory Crashes with content::GpuProcessHost::OnProcessCrashed on the stack have become prevalent in Chrome 69. CATransactionGPUCoordinator was introduced in the first Canary to exhibit these crashes (https://crrev.com/565687 in 69.0.3455.0). Disable CATransactionGPUCoordinator to test the theory that it is responsible for the crashes. This will make window resize slightly jankier (like -- the window's contents will detach from the window during resize) while disabled. Note that not all window resize logic is disabled by this -- just the GPU side (so resize will be reasonable most of the time, but there will be occaisonal glitches now). TBR=ccameron@chromium.org (cherry picked from commit f7549422b853e68118fbaf9e4db5f09a770fc5d5) Bug: 871430 Change-Id: Ifec51e511712b9843f15e9b8c43b66d852232b89 Reviewed-on: https://chromium-review.googlesource.com/1164555 Reviewed-by: Kenneth Russell <kbr@chromium.org> Commit-Queue: ccameron <ccameron@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#581117} Reviewed-on: https://chromium-review.googlesource.com/1173058 Reviewed-by: ccameron <ccameron@chromium.org> Cr-Commit-Position: refs/branch-heads/3497@{#569} Cr-Branched-From: 271eaf50594eb818c9295dc78d364aea18c82ea8-refs/heads/master@{#576753} [modify] https://crrev.com/bd57bba80b4e8ee9c4e38779851588905d954966/content/browser/gpu/gpu_process_host.cc
,
Aug 13
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899 commit d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899 Author: Christopher Cameron <ccameron@chromium.org> Date: Mon Aug 13 20:15:22 2018 CATransactionGPUCoordinator: Re-enable and make lifetime explicit Disalbing the CATransactionGPUCoordinator caused memory corruption bugs to go away. Fix two potential bugs: 1. Don't post tasks (and potentially change the reference count) inside CATransactionGPUCoordinator's constructor. Do this in a separate explicit Init function. 1a. Move this initialization to the end of GpuProcessHost:Init (instead of being at the beginning of the constructor). 2. Make CATransactionCoordinator explicitly retain PostCommitObserver (which includes CATransactionGPUCoordinator). This fixes a bug whereby at shutdown, destroying not-yet-executed posted tasks caused the CATransactionGPUCoordinator to be destroyed while the CATransactionCoordinator still had a raw pointer to it. TBR=ccameron@chromium.org (cherry picked from commit 62892558b716d1560264e42929e2be01ead07e71) Bug: 871430 Change-Id: Ie144071cce9ce48e0187cdaf1fcf32df7b62ed75 Reviewed-on: https://chromium-review.googlesource.com/1171657 Commit-Queue: ccameron <ccameron@chromium.org> Reviewed-by: Antoine Labour <piman@chromium.org> Cr-Original-Commit-Position: refs/heads/master@{#582378} Reviewed-on: https://chromium-review.googlesource.com/1173060 Reviewed-by: ccameron <ccameron@chromium.org> Cr-Commit-Position: refs/branch-heads/3497@{#581} Cr-Branched-From: 271eaf50594eb818c9295dc78d364aea18c82ea8-refs/heads/master@{#576753} [modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/ca_transaction_gpu_coordinator.cc [modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/ca_transaction_gpu_coordinator.h [modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/content/browser/gpu/gpu_process_host.cc [modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/ui/accelerated_widget_mac/ca_transaction_observer.h [modify] https://crrev.com/d9f5d0bc07ef2528def2b420dbbcc9b9f77f1899/ui/accelerated_widget_mac/ca_transaction_observer.mm
,
Aug 13
,
Aug 31
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by ccameron@chromium.org
, Aug 6