New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 661194 link

Starred by 5 users

Issue metadata

Status: Archived
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug-Regression



Sign in to add a comment

v8.infinite_scroll_tbmv2 failure on chromium.perf at 428764:428823

Project Member Reported by rbyers@chromium.org, Nov 1 2016

Issue description

Revision range first seen: 428810:428823
Link to failing step log:
https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%281%29/builds/4873/steps/v8.infinite_scroll_tbmv2/logs/stdio

Failing with a crash in base::trace_event::ProcessMemoryTotals::AsValueInto

I suspect https://chromium.googlesource.com/chromium/src/+/c624f3d71afc0077b16ac8ca9d54296050b5d5d8
"Add BACKGROUND dump mode to various GPU/CC MemoryDumpProviders"

Eric can you please take a look ASAP?

 
Summary: v8.infinite_scroll_tbmv2 failure on chromium.perf at 428764:428823 (was: v8.infinite_scroll_tbmv2 failure on chromium.perf at 428810:428823)
Revert landed: crrev.com/660929
Cc: rbyers@chromium.org
Shoot - looks like I guess wrong.  Still crashing with the same stack after the revert:
https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%281%29/builds/4883

Sorry Eric, feel free to reland!  I don't see anything else in the revision range that's related to the callstack.  I'll kick off a bisect.  Since you work on the memory dump infrastructure, should this stay assigned to you?  If not, feel free to reassign to the current perf bot health sheriff for routing.

For the record, here's the crashing callstack:

chrome_child!base::trace_event::ProcessMemoryTotals::AsValueInto+0x835
chrome_child!base::trace_event::MallocDumpProvider::OnMemoryDump+0x60
chrome_child!base::trace_event::MemoryDumpManager::InvokeOnMemoryDump+0x21a
chrome_child!base::trace_event::MemoryDumpManager::SetupNextMemoryDump+0x2ce
chrome_child!base::trace_event::MemoryDumpManager::InvokeOnMemoryDump+0x27f

Cc: picksi@chromium.org
/cc picksi@ current bot health sheriff

===== BISECT JOB RESULTS =====
Status: completed


===== TESTED REVISIONS =====
Revision         Exit Code  Std Dev  N   Good?
chromium@428810  1          N/A      20  good
chromium@428823  1          N/A      20  bad

Bisect job ran on: winx64_10_perf_bisect
Bug ID: 661194

Test Command: src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests v8.infinite_scroll_tbmv2
Test Metric: memory:chrome:all_processes:reported_by_chrome:v8:allocated_by_malloc:effective_size_avg/memory:chrome:all_processes:reported_by_chrome:v8:allocated_by_malloc:effective_size_avg
Relative Change: 0.00%
Score: 0

Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/winx64_10_perf_bisect/builds/769
Job details: https://chromeperf.appspot.com/buildbucket_job_status/8997081748590615056


Not what you expected? We'll investigate and get back to you!
  https://chromeperf.appspot.com/bad_bisect?try_job_id=5891297460617216

| O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq
|  X  | for more information addressing perf regression bugs. For feedback,
| / \ | file a bug with component Tests>AutoBisect.  Thank you!
Cc: primiano@chromium.org ericrk@chromium.org
 Issue 662973  has been merged into this issue.
This is pretty weird - I tried to bisect locally but found that I could repro the crash quite a ways back.

From stepping through the code, it appears that we ask windows for all heaps for the current process, then iterate through them dumping info. Unfortunately, iterating heaps on Windows is inherently racy, and it looks like the last heap enumerated may be invalid by the time we try to dump its info, leading to a crash.

From the MSDN docs, the function we use, GetProcessHeaps, is racy. You should only call functions on heaps returned by GetProcessHeaps if you created the heap and if you know that it has not been deleted. https://msdn.microsoft.com/en-us/library/windows/desktop/aa366571(v=vs.85).aspx

One other possibility is that the heap where we encounter a crash has not been deleted, but was created as lock-free. If someone creates a heap as lock-free it will crash if you attempt to lock it (which our code does). I need to keep investigating to see if this might be the case.
Cc: siggi@chromium.org chrisha@chromium.org sebmarchand@chromium.org
Adding a bunch of people who are familiar with that Winheap dump code and helped developing that, my Win32 skills are really limited.

For the records, the code ericrk@ is talking about here is base/trace_event/malloc_dump_provider.cc . Any clues?

The only odd thing is that IIRC that code has been there untouched for ~years but I guess we might have introduced some new code that tickles the race.

About the reland, unrelated with this crash, I've seen memory metrics going up and down  with this, so there might be something else there if this is not wai:
see https://chromeperf.appspot.com/group_report?bug_id=661948
 
Yeah, this kind of heap enumeration is racy on Windows. It might be possible to make it robust against crashing by running the enumeration loop inside a __try block.
Note that in the Windows browser, you can expect foreign DLLs to load and unload at more or less any time. Each of these DLLs will generally instantiate their own private heap.
I don't know what you're using this enumeration for, but maybe it's better to have it only account for Chrome's own, known heaps? That way you can eliminate the crashes, as well as noise from OS and third-party things sharing the process?
On the downside, you won't account for OS and third-party heap memory usage.
Cc: manzagop@chromium.org
CC @manzagop, who may have a renderer hang due to corruption caused by this code.
I left my canary open overnight (56.0.2912.0 64bit) and got a "pages unresponsive" dialog in the morning.

We found a thread waiting on a critical section:

0:000> kv 10
 # Child-SP          RetAddr           : Args to Child                                                           : Call Site
00 00000031`0475ad58 00007ffb`9d24017f : 00000178`3e8844d8 00000178`0018ae10 00000178`0018ae88 00000178`0018afa8 : ntdll!NtWaitForAlertByThreadId+0x14
01 00000031`0475ad60 00007ffb`9d24007a : 00000000`00000000 00000000`00000000 00000031`0475ae38 00000178`0018ae10 : ntdll!RtlpWaitOnAddressWithTimeout+0x43
02 00000031`0475ad90 00007ffb`9d23febc : 00000178`3e790298 00000031`0475ae69 00000000`00000000 00000178`0018a880 : ntdll!RtlpWaitOnAddress+0xae
03 00000031`0475ae00 00007ffb`9d23ea0b : 00000178`0018bfa8 00000178`0018c050 00000178`3e790298 00007ffb`5e03b9ed : ntdll!RtlpWaitOnCriticalSection+0x144
04 00000031`0475aed0 00007ffb`9d237def : 00000178`3e790000 00000000`000425d0 00000000`00000000 00000000`0000425d : ntdll!RtlpEnterCriticalSectionContended+0xcb
05 00000031`0475af00 00007ffb`9d235c5f : 00000178`3e790000 00000000`00000002 00000000`000425c8 00000000`000425d0 : ntdll!RtlpAllocateHeap+0x20f
06 00000031`0475b200 00007ffb`5e5a0fcd : 00000034`00000003 00000000`000425c8 00000000`00000000 00000178`47730000 : ntdll!RtlpAllocateHeapInternal+0x64f
07 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!base::allocator::WinHeapMalloc+0x1c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\winheap_stubs_win.cc @ 34]
08 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!?A0x370bf134::DefaultWinHeapMallocImpl+0x1c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim_default_dispatch_to_winheap.cc @ 15]
09 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!ShimMalloc+0x3c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim.cc @ 177]
0a (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!malloc+0x3c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim_override_ucrt_symbols_win.h @ 51]
0b 00000031`0475b290 00007ffb`5de12d87 : 00000000`000000a8 00000178`001235d0 00000178`48896940 00000178`001926d0 : chrome_child!v8::internal::AccountingAllocator::AllocateSegment+0x51 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\accounting-allocator.cc @ 82]
0c 00000031`0475b2c0 00007ffb`5de12c5d : 00000000`000000a8 00000031`0475b450 00000178`001235d0 00000178`48896940 : chrome_child!v8::internal::AccountingAllocator::GetSegment+0xc3 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\accounting-allocator.cc @ 74]
0d (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!v8::internal::Zone::NewSegment+0xa (Inline Function @ 00007ffb`5de12c5d) [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 120]
0e 00000031`0475b2f0 00007ffb`5de1f5a9 : 00000178`48896940 00000178`001923a0 00000178`00121698 00000000`00000000 : chrome_child!v8::internal::Zone::NewExpand+0x51 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 168]
0f (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!v8::internal::Zone::New+0x18a (Inline Function @ 00007ffb`5de1f5a9) [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 83]

A bunch of locks: one is corrupt (the before last), the others are held by thread 3344. 
The second lock is the one we're trying to acquire.

0:000> !locks

CritSec ntdll!RtlpProcessHeapsListLock+0 at 00007ffb9d353a40
WaiterWoken        No
LockCount          0
RecursionCount     1
OwningThread       3344
EntryCount         0
ContentionCount    0
*** Locked

CritSec +3e790298 at 000001783e790298
WaiterWoken        No
LockCount          2
RecursionCount     1
OwningThread       3344
EntryCount         0
ContentionCount    1e
*** Locked

CritSec +3e750298 at 000001783e750298
WaiterWoken        No
LockCount          0
RecursionCount     1
OwningThread       3344
EntryCount         0
ContentionCount    0
*** Locked

CritSec +3e610298 at 000001783e610298
WaiterWoken        Yes
LockCount          -478910537
RecursionCount     1634496613
OwningThread       735c5b5e2f286563
EntryCount         0
ContentionCount    0
*** Locked

CritSec +3eec0298 at 000001783eec0298
WaiterWoken        No
LockCount          0
RecursionCount     1
OwningThread       3344
EntryCount         0
ContentionCount    1
*** Locked

No trace of thread 3344:

0:000> ~
.  0  Id: 2c8c.3540 Suspend: 1 Teb: 00000031`048cc000 Unfrozen
   1  Id: 2c8c.4498 Suspend: 1 Teb: 00000031`048ce000 Unfrozen
   2  Id: 2c8c.45e0 Suspend: 1 Teb: 00000031`048d0000 Unfrozen
   3  Id: 2c8c.3e6c Suspend: 1 Teb: 00000031`048d2000 Unfrozen
   4  Id: 2c8c.4174 Suspend: 1 Teb: 00000031`048d4000 Unfrozen
   5  Id: 2c8c.1e94 Suspend: 1 Teb: 00000031`048d6000 Unfrozen
   6  Id: 2c8c.3644 Suspend: 1 Teb: 00000031`048d8000 Unfrozen
   7  Id: 2c8c.329c Suspend: 1 Teb: 00000031`048da000 Unfrozen
   8  Id: 2c8c.1190 Suspend: 1 Teb: 00000031`048dc000 Unfrozen
   9  Id: 2c8c.21ac Suspend: 1 Teb: 00000031`048de000 Unfrozen
  10  Id: 2c8c.3ca4 Suspend: 1 Teb: 00000031`048e0000 Unfrozen
  11  Id: 2c8c.46c4 Suspend: 1 Teb: 00000031`048e6000 Unfrozen
  12  Id: 2c8c.13e0 Suspend: 1 Teb: 00000031`048f8000 Unfrozen
# 13  Id: 2c8c.45cc Suspend: 1 Teb: 00000031`04968000 Unfrozen


Cc: ssid@chromium.org
Hmm unless you had tracing enabled I don't think you could have hit the one in trace_event (yet, doesn't make it the race less scary)
+ssid IIRC the slow reports finch experiment is Linux/Android only?
Note that the missing thread 3344 is *totally weird*, but it's holding what appears to be both the heap list lock, as well as four different heap locks. Note that all the locks held have an address ending in 0x0298, and as heaps (===heap handles) have an allocation granularity of 0x10000, this indicates they're all of a kind.

Where does the background heap enumeration code execute? Is it a posted task, and then to which renderer thread?
Siggi identified the corrupt heap as the CsrPortHeap.

0:013> !heap 1783e610000
Index   Address  Name      Debugging options enabled
  3:   1783e610000 
    Unable to read _HEAP_SEGMENT structure at 7c7c747069726353

0:013> s-[w]q 0 L?FFFFFFFFFFFF 1783e610000 
00000178`000a6818  00000178`3e610000 00000000`00000001
00007ffb`9d353ad0  00000178`3e610000 00000178`3eec0000
00007ffb`9d355a18  00000178`3e610000 00000000`00000118
0:013> ln 00007ffb`9d353ad0  
Browse module
Set bu breakpoint

(00007ffb`9d353ac0)   ntdll!RtlpProcessHeapsListBuffer+0x10   |  (00007ffb`9d353b40)   ntdll!RegistryInfoCritSect
0:013> ln 00007ffb`9d355a18  
Browse module
Set bu breakpoint

(00007ffb`9d355a18)   ntdll!CsrPortHeap   |  (00007ffb`9d355a20)   ntdll!CsrProcessId
Exact matches:

Just in case, I'll mention the context of "unresponsive page": 2 open issues dating back to August. Initially a big spike which subsided mostly but not completely and was never explained.
https://crbug.com/638417
https://crbug.com/632157
Here is another potential culprit: https://cs.chromium.org/chromium/src/content/renderer/render_thread_impl.cc?q=GetProcessHeaps&dr=C&l=1822. Another case of GetProcessHeaps followed by enumerating over them.
This might be related to bug 464430.
Here's the module list:

0:013> lm
start             end                 module name
00000178`3e6c0000 00000178`3e72b000   ws2_32     (deferred)             
00000178`3e950000 00000178`3ea6c000   rpcrt4     (deferred)             
00007ff7`e24b0000 00007ff7`e25c9000   chrome     (deferred)             
00007ffb`5dcf0000 00007ffb`617ba000   chrome_child   (deferred)             
00007ffb`8a660000 00007ffb`8a6f0000   chrome_elf   (deferred)             
00007ffb`8b660000 00007ffb`8b8bf000   DWrite     (deferred)             
00007ffb`8d010000 00007ffb`8d031000   fontsub    (deferred)             
00007ffb`91d40000 00007ffb`91d58000   usp10      (deferred)             
00007ffb`95790000 00007ffb`9579a000   version    (deferred)             
00007ffb`95820000 00007ffb`959ac000   dbghelp    (deferred)             
00007ffb`959d0000 00007ffb`959dc000   secur32    (deferred)             
00007ffb`96320000 00007ffb`9632c000   davhlpr    (deferred)             
00007ffb`96370000 00007ffb`963f4000   winspool   (deferred)             
00007ffb`96470000 00007ffb`966e4000   comctl32   (deferred)             
00007ffb`96970000 00007ffb`969a8000   IPHLPAPI   (deferred)             
00007ffb`97340000 00007ffb`97408000   winhttp    (deferred)             
00007ffb`97eb0000 00007ffb`97ec3000   wtsapi32   (deferred)             
00007ffb`97f50000 00007ffb`97f7c000   WINMMBASE   (deferred)             
00007ffb`97fe0000 00007ffb`98003000   winmm      (deferred)             
00007ffb`985b0000 00007ffb`985e2000   fwbase     (deferred)             
00007ffb`98f90000 00007ffb`98faf000   userenv    (deferred)             
00007ffb`99330000 00007ffb`9933b000   CRYPTBASE   (deferred)             
00007ffb`99510000 00007ffb`9953d000   sspicli    (deferred)             
00007ffb`99760000 00007ffb`99789000   bcrypt     (deferred)             
00007ffb`99830000 00007ffb`99840000   msasn1     (deferred)             
00007ffb`99840000 00007ffb`99854000   profapi    (deferred)             
00007ffb`99860000 00007ffb`9986f000   kernel_appcore   (deferred)             
00007ffb`99870000 00007ffb`998bb000   powrprof   (deferred)             
00007ffb`998c0000 00007ffb`99946000   FirewallAPI   (deferred)             
00007ffb`99950000 00007ffb`99993000   cfgmgr32   (deferred)             
00007ffb`999a0000 00007ffb`99a55000   SHCore     (deferred)             
00007ffb`99a60000 00007ffb`99a77000   netapi32   (deferred)             
00007ffb`99a80000 00007ffb`99c68000   KERNELBASE   (deferred)             
00007ffb`99c70000 00007ffb`99cc5000   wintrust   (deferred)             
00007ffb`99cd0000 00007ffb`99e98000   crypt32    (deferred)             
00007ffb`99f50000 00007ffb`99fba000   bcryptPrimitives   (deferred)             
00007ffb`99fc0000 00007ffb`9a605000   windows_storage   (deferred)             
00007ffb`9a610000 00007ffb`9a64b000   imm32      (deferred)             
00007ffb`9a800000 00007ffb`9a8ad000   kernel32   (pdb symbols)          c:\temp\symbols\microsoft\kernel32.pdb\C9590ED0315E454C8CCA1D94EAEF1D911\kernel32.pdb
00007ffb`9a9d0000 00007ffb`9aa2b000   sechost    (deferred)             
00007ffb`9aac0000 00007ffb`9ab86000   oleaut32   (deferred)             
00007ffb`9abf0000 00007ffb`9abf8000   psapi      (deferred)             
00007ffb`9ac00000 00007ffb`9aca7000   advapi32   (deferred)             
00007ffb`9b240000 00007ffb`9b292000   shlwapi    (deferred)             
00007ffb`9b2b0000 00007ffb`9b406000   user32     (deferred)             
00007ffb`9b480000 00007ffb`9b606000   gdi32      (deferred)             
00007ffb`9b6c0000 00007ffb`9b75d000   msvcrt     (deferred)             
00007ffb`9b760000 00007ffb`9b9dd000   combase    (deferred)             
00007ffb`9b9e0000 00007ffb`9baeb000   comdlg32   (deferred)             
00007ffb`9bb60000 00007ffb`9bca3000   ole32      (deferred)             
00007ffb`9bcb0000 00007ffb`9d20c000   shell32    (deferred)             
00007ffb`9d210000 00007ffb`9d3d1000   ntdll      (pdb symbols)          c:\temp\symbols\microsoft\ntdll.pdb\BDAB16703FB64772AA5FE54FEBDE6F3F1\ntdll.pdb

Project Member

Comment 21 by bugdroid1@chromium.org, Nov 8 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/316a0c0633b857b49e469037517c09d811dc4703

commit 316a0c0633b857b49e469037517c09d811dc4703
Author: wfh <wfh@chromium.org>
Date: Tue Nov 08 21:55:44 2016

Revert of Enable ALPC lockdown on >= win10 (patchset #2 id:20001 of https://codereview.chromium.org/2246543002/ )

Reason for revert:
seems to be causing crashes with Avast! and issues with SharedMemory/HandleVerifier.

Original issue's description:
> Enable ALPC lockdown on >= win10 to further tighten up the sandbox on Windows.
>
> This is already enabled in sandbox tests, but this CL will enable it for all tests, and for all chrome renderers.
>
> BUG=464430
>
> Committed: https://crrev.com/c9f56c97ccec49da955caf9b7e548f1bdd66144e
> Cr-Commit-Position: refs/heads/master@{#428799}

TBR=liamjm@chromium.org
# Not skipping CQ checks because original CL landed more than 1 days ago.
BUG=464430,661315, 661194 

Review-Url: https://codereview.chromium.org/2488803002
Cr-Commit-Position: refs/heads/master@{#430727}

[modify] https://crrev.com/316a0c0633b857b49e469037517c09d811dc4703/content/common/sandbox_win.cc

Owner: wfh@chromium.org
Assigning to wfh@, as this seems related to his change.

Comment 23 by ssid@chromium.org, Nov 9 2016

reply to comment #14: Slow reports are enabled for memory-infra on Windows as well on Canary and Dev, but Windows very rarely gets memory pressure signals, I have got 500 hits till now. I didn't understand how it's related here.
> I didn't understand how it's related here.
I just wanted to check if this bug in memory-infra could have impacted windows and explained the crash. Seems that there is a possibility it might.
ssid: no action required on your side, I just needed to know that bit.

Issue 662836 has been merged into this issue.
Project Member

Comment 26 by sheriffbot@chromium.org, Nov 10 2016

Labels: FoundIn-M-56 Fracas
Users experienced this crash on the following builds:

Win Dev 56.0.2906.0 -  0.40 CPM, 340 reports, 318 clients (signature [Renderer hang] v8::internal::AccountingAllocator::AllocateSegment)

If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates.

- Go/Fracas

Comment 27 by zh...@chromium.org, Nov 15 2016

Is this fixed after the revert in #21? The bot is green.

Downgrade to pri-2.

Comment 28 by zh...@chromium.org, Nov 15 2016

Labels: -Pri-1 Pri-2
Issue 663408 has been merged into this issue.
See ericrik@ comment in  crbug.com/665516#c9 
Control Flow Guard seems to be the culprit on the memory-infra crashes.

manzagop@ do you have a way to tell whether crrev.com/2412983006 was in in those crashes?
Looks like it was:
- renderer hang was on 56.0.2912.0 so branch position 430205.
- crrev.com/2412983006 is branch position 425226
Cc: penny...@chromium.org
+pennymac
Thanks for the CC.

Took me a few minutes to digest this chain.  

Yes, crbug/665516#c11 is possibly related.  ericrk@ has done some excellent debugging, and discovered that the perf-on-Windows harness is trying to lock third-party (operating system in this case) heaps that should not be locked (HEAP_NO_SERIALIZATION).

FYI: related bug/symptoms seen on one Zen Win perf bot with the landing of CFG: crbug/656040. 

Let's see what happens when Eric lands a change to the heap locking.  It could impact this ticket.
kraynov@ fixed that in https://codereview.chromium.org/2528053002/ and removed the lock of other heaps.

The flakiness and crashes on memory tests show be over. Can somebody verify that?
Components: Speed>Tracing
Components: -Internals>Tracing
Status: Archived (was: Assigned)
This issue was created > 6 months ago. The perf waterfall has changed significantly since then. If this bug is still relevant, please re-open.

Sign in to add a comment