Issue metadata
Sign in to add a comment
|
v8.infinite_scroll_tbmv2 failure on chromium.perf at 428764:428823 |
||||||||||||||||||||||
Issue descriptionRevision range first seen: 428810:428823 Link to failing step log: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%281%29/builds/4873/steps/v8.infinite_scroll_tbmv2/logs/stdio Failing with a crash in base::trace_event::ProcessMemoryTotals::AsValueInto I suspect https://chromium.googlesource.com/chromium/src/+/c624f3d71afc0077b16ac8ca9d54296050b5d5d8 "Add BACKGROUND dump mode to various GPU/CC MemoryDumpProviders" Eric can you please take a look ASAP?
,
Nov 1 2016
Revert landed: crrev.com/660929
,
Nov 2 2016
Shoot - looks like I guess wrong. Still crashing with the same stack after the revert: https://build.chromium.org/p/chromium.perf/builders/Win%2010%20Perf%20%281%29/builds/4883 Sorry Eric, feel free to reland! I don't see anything else in the revision range that's related to the callstack. I'll kick off a bisect. Since you work on the memory dump infrastructure, should this stay assigned to you? If not, feel free to reassign to the current perf bot health sheriff for routing.
,
Nov 2 2016
Started bisect job https://chromeperf.appspot.com/buildbucket_job_status/8997081748590615056
,
Nov 2 2016
For the record, here's the crashing callstack: chrome_child!base::trace_event::ProcessMemoryTotals::AsValueInto+0x835 chrome_child!base::trace_event::MallocDumpProvider::OnMemoryDump+0x60 chrome_child!base::trace_event::MemoryDumpManager::InvokeOnMemoryDump+0x21a chrome_child!base::trace_event::MemoryDumpManager::SetupNextMemoryDump+0x2ce chrome_child!base::trace_event::MemoryDumpManager::InvokeOnMemoryDump+0x27f
,
Nov 2 2016
/cc picksi@ current bot health sheriff
,
Nov 2 2016
===== BISECT JOB RESULTS ===== Status: completed ===== TESTED REVISIONS ===== Revision Exit Code Std Dev N Good? chromium@428810 1 N/A 20 good chromium@428823 1 N/A 20 bad Bisect job ran on: winx64_10_perf_bisect Bug ID: 661194 Test Command: src/tools/perf/run_benchmark -v --browser=release_x64 --output-format=chartjson --upload-results --pageset-repeat=1 --also-run-disabled-tests v8.infinite_scroll_tbmv2 Test Metric: memory:chrome:all_processes:reported_by_chrome:v8:allocated_by_malloc:effective_size_avg/memory:chrome:all_processes:reported_by_chrome:v8:allocated_by_malloc:effective_size_avg Relative Change: 0.00% Score: 0 Buildbot stdio: http://build.chromium.org/p/tryserver.chromium.perf/builders/winx64_10_perf_bisect/builds/769 Job details: https://chromeperf.appspot.com/buildbucket_job_status/8997081748590615056 Not what you expected? We'll investigate and get back to you! https://chromeperf.appspot.com/bad_bisect?try_job_id=5891297460617216 | O O | Visit http://www.chromium.org/developers/speed-infra/perf-bug-faq | X | for more information addressing perf regression bugs. For feedback, | / \ | file a bug with component Tests>AutoBisect. Thank you!
,
Nov 7 2016
,
Nov 7 2016
This is pretty weird - I tried to bisect locally but found that I could repro the crash quite a ways back. From stepping through the code, it appears that we ask windows for all heaps for the current process, then iterate through them dumping info. Unfortunately, iterating heaps on Windows is inherently racy, and it looks like the last heap enumerated may be invalid by the time we try to dump its info, leading to a crash. From the MSDN docs, the function we use, GetProcessHeaps, is racy. You should only call functions on heaps returned by GetProcessHeaps if you created the heap and if you know that it has not been deleted. https://msdn.microsoft.com/en-us/library/windows/desktop/aa366571(v=vs.85).aspx One other possibility is that the heap where we encounter a crash has not been deleted, but was created as lock-free. If someone creates a heap as lock-free it will crash if you attempt to lock it (which our code does). I need to keep investigating to see if this might be the case.
,
Nov 7 2016
Adding a bunch of people who are familiar with that Winheap dump code and helped developing that, my Win32 skills are really limited. For the records, the code ericrk@ is talking about here is base/trace_event/malloc_dump_provider.cc . Any clues? The only odd thing is that IIRC that code has been there untouched for ~years but I guess we might have introduced some new code that tickles the race. About the reland, unrelated with this crash, I've seen memory metrics going up and down with this, so there might be something else there if this is not wai: see https://chromeperf.appspot.com/group_report?bug_id=661948
,
Nov 7 2016
Yeah, this kind of heap enumeration is racy on Windows. It might be possible to make it robust against crashing by running the enumeration loop inside a __try block. Note that in the Windows browser, you can expect foreign DLLs to load and unload at more or less any time. Each of these DLLs will generally instantiate their own private heap. I don't know what you're using this enumeration for, but maybe it's better to have it only account for Chrome's own, known heaps? That way you can eliminate the crashes, as well as noise from OS and third-party things sharing the process? On the downside, you won't account for OS and third-party heap memory usage.
,
Nov 8 2016
CC @manzagop, who may have a renderer hang due to corruption caused by this code.
,
Nov 8 2016
I left my canary open overnight (56.0.2912.0 64bit) and got a "pages unresponsive" dialog in the morning. We found a thread waiting on a critical section: 0:000> kv 10 # Child-SP RetAddr : Args to Child : Call Site 00 00000031`0475ad58 00007ffb`9d24017f : 00000178`3e8844d8 00000178`0018ae10 00000178`0018ae88 00000178`0018afa8 : ntdll!NtWaitForAlertByThreadId+0x14 01 00000031`0475ad60 00007ffb`9d24007a : 00000000`00000000 00000000`00000000 00000031`0475ae38 00000178`0018ae10 : ntdll!RtlpWaitOnAddressWithTimeout+0x43 02 00000031`0475ad90 00007ffb`9d23febc : 00000178`3e790298 00000031`0475ae69 00000000`00000000 00000178`0018a880 : ntdll!RtlpWaitOnAddress+0xae 03 00000031`0475ae00 00007ffb`9d23ea0b : 00000178`0018bfa8 00000178`0018c050 00000178`3e790298 00007ffb`5e03b9ed : ntdll!RtlpWaitOnCriticalSection+0x144 04 00000031`0475aed0 00007ffb`9d237def : 00000178`3e790000 00000000`000425d0 00000000`00000000 00000000`0000425d : ntdll!RtlpEnterCriticalSectionContended+0xcb 05 00000031`0475af00 00007ffb`9d235c5f : 00000178`3e790000 00000000`00000002 00000000`000425c8 00000000`000425d0 : ntdll!RtlpAllocateHeap+0x20f 06 00000031`0475b200 00007ffb`5e5a0fcd : 00000034`00000003 00000000`000425c8 00000000`00000000 00000178`47730000 : ntdll!RtlpAllocateHeapInternal+0x64f 07 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!base::allocator::WinHeapMalloc+0x1c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\winheap_stubs_win.cc @ 34] 08 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!?A0x370bf134::DefaultWinHeapMallocImpl+0x1c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim_default_dispatch_to_winheap.cc @ 15] 09 (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!ShimMalloc+0x3c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim.cc @ 177] 0a (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!malloc+0x3c (Inline Function @ 00007ffb`5e5a0fcd) [c:\b\build\slave\win64-pgo\build\src\base\allocator\allocator_shim_override_ucrt_symbols_win.h @ 51] 0b 00000031`0475b290 00007ffb`5de12d87 : 00000000`000000a8 00000178`001235d0 00000178`48896940 00000178`001926d0 : chrome_child!v8::internal::AccountingAllocator::AllocateSegment+0x51 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\accounting-allocator.cc @ 82] 0c 00000031`0475b2c0 00007ffb`5de12c5d : 00000000`000000a8 00000031`0475b450 00000178`001235d0 00000178`48896940 : chrome_child!v8::internal::AccountingAllocator::GetSegment+0xc3 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\accounting-allocator.cc @ 74] 0d (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!v8::internal::Zone::NewSegment+0xa (Inline Function @ 00007ffb`5de12c5d) [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 120] 0e 00000031`0475b2f0 00007ffb`5de1f5a9 : 00000178`48896940 00000178`001923a0 00000178`00121698 00000000`00000000 : chrome_child!v8::internal::Zone::NewExpand+0x51 [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 168] 0f (Inline Function) --------`-------- : --------`-------- --------`-------- --------`-------- --------`-------- : chrome_child!v8::internal::Zone::New+0x18a (Inline Function @ 00007ffb`5de1f5a9) [c:\b\build\slave\win64-pgo\build\src\v8\src\zone\zone.cc @ 83] A bunch of locks: one is corrupt (the before last), the others are held by thread 3344. The second lock is the one we're trying to acquire. 0:000> !locks CritSec ntdll!RtlpProcessHeapsListLock+0 at 00007ffb9d353a40 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 3344 EntryCount 0 ContentionCount 0 *** Locked CritSec +3e790298 at 000001783e790298 WaiterWoken No LockCount 2 RecursionCount 1 OwningThread 3344 EntryCount 0 ContentionCount 1e *** Locked CritSec +3e750298 at 000001783e750298 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 3344 EntryCount 0 ContentionCount 0 *** Locked CritSec +3e610298 at 000001783e610298 WaiterWoken Yes LockCount -478910537 RecursionCount 1634496613 OwningThread 735c5b5e2f286563 EntryCount 0 ContentionCount 0 *** Locked CritSec +3eec0298 at 000001783eec0298 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 3344 EntryCount 0 ContentionCount 1 *** Locked No trace of thread 3344: 0:000> ~ . 0 Id: 2c8c.3540 Suspend: 1 Teb: 00000031`048cc000 Unfrozen 1 Id: 2c8c.4498 Suspend: 1 Teb: 00000031`048ce000 Unfrozen 2 Id: 2c8c.45e0 Suspend: 1 Teb: 00000031`048d0000 Unfrozen 3 Id: 2c8c.3e6c Suspend: 1 Teb: 00000031`048d2000 Unfrozen 4 Id: 2c8c.4174 Suspend: 1 Teb: 00000031`048d4000 Unfrozen 5 Id: 2c8c.1e94 Suspend: 1 Teb: 00000031`048d6000 Unfrozen 6 Id: 2c8c.3644 Suspend: 1 Teb: 00000031`048d8000 Unfrozen 7 Id: 2c8c.329c Suspend: 1 Teb: 00000031`048da000 Unfrozen 8 Id: 2c8c.1190 Suspend: 1 Teb: 00000031`048dc000 Unfrozen 9 Id: 2c8c.21ac Suspend: 1 Teb: 00000031`048de000 Unfrozen 10 Id: 2c8c.3ca4 Suspend: 1 Teb: 00000031`048e0000 Unfrozen 11 Id: 2c8c.46c4 Suspend: 1 Teb: 00000031`048e6000 Unfrozen 12 Id: 2c8c.13e0 Suspend: 1 Teb: 00000031`048f8000 Unfrozen # 13 Id: 2c8c.45cc Suspend: 1 Teb: 00000031`04968000 Unfrozen
,
Nov 8 2016
Hmm unless you had tracing enabled I don't think you could have hit the one in trace_event (yet, doesn't make it the race less scary) +ssid IIRC the slow reports finch experiment is Linux/Android only?
,
Nov 8 2016
Note that the missing thread 3344 is *totally weird*, but it's holding what appears to be both the heap list lock, as well as four different heap locks. Note that all the locks held have an address ending in 0x0298, and as heaps (===heap handles) have an allocation granularity of 0x10000, this indicates they're all of a kind. Where does the background heap enumeration code execute? Is it a posted task, and then to which renderer thread?
,
Nov 8 2016
Siggi identified the corrupt heap as the CsrPortHeap.
0:013> !heap 1783e610000
Index Address Name Debugging options enabled
3: 1783e610000
Unable to read _HEAP_SEGMENT structure at 7c7c747069726353
0:013> s-[w]q 0 L?FFFFFFFFFFFF 1783e610000
00000178`000a6818 00000178`3e610000 00000000`00000001
00007ffb`9d353ad0 00000178`3e610000 00000178`3eec0000
00007ffb`9d355a18 00000178`3e610000 00000000`00000118
0:013> ln 00007ffb`9d353ad0
Browse module
Set bu breakpoint
(00007ffb`9d353ac0) ntdll!RtlpProcessHeapsListBuffer+0x10 | (00007ffb`9d353b40) ntdll!RegistryInfoCritSect
0:013> ln 00007ffb`9d355a18
Browse module
Set bu breakpoint
(00007ffb`9d355a18) ntdll!CsrPortHeap | (00007ffb`9d355a20) ntdll!CsrProcessId
Exact matches:
,
Nov 8 2016
Just in case, I'll mention the context of "unresponsive page": 2 open issues dating back to August. Initially a big spike which subsided mostly but not completely and was never explained. https://crbug.com/638417 https://crbug.com/632157
,
Nov 8 2016
Here is another potential culprit: https://cs.chromium.org/chromium/src/content/renderer/render_thread_impl.cc?q=GetProcessHeaps&dr=C&l=1822. Another case of GetProcessHeaps followed by enumerating over them.
,
Nov 8 2016
This might be related to bug 464430.
,
Nov 8 2016
Here's the module list: 0:013> lm start end module name 00000178`3e6c0000 00000178`3e72b000 ws2_32 (deferred) 00000178`3e950000 00000178`3ea6c000 rpcrt4 (deferred) 00007ff7`e24b0000 00007ff7`e25c9000 chrome (deferred) 00007ffb`5dcf0000 00007ffb`617ba000 chrome_child (deferred) 00007ffb`8a660000 00007ffb`8a6f0000 chrome_elf (deferred) 00007ffb`8b660000 00007ffb`8b8bf000 DWrite (deferred) 00007ffb`8d010000 00007ffb`8d031000 fontsub (deferred) 00007ffb`91d40000 00007ffb`91d58000 usp10 (deferred) 00007ffb`95790000 00007ffb`9579a000 version (deferred) 00007ffb`95820000 00007ffb`959ac000 dbghelp (deferred) 00007ffb`959d0000 00007ffb`959dc000 secur32 (deferred) 00007ffb`96320000 00007ffb`9632c000 davhlpr (deferred) 00007ffb`96370000 00007ffb`963f4000 winspool (deferred) 00007ffb`96470000 00007ffb`966e4000 comctl32 (deferred) 00007ffb`96970000 00007ffb`969a8000 IPHLPAPI (deferred) 00007ffb`97340000 00007ffb`97408000 winhttp (deferred) 00007ffb`97eb0000 00007ffb`97ec3000 wtsapi32 (deferred) 00007ffb`97f50000 00007ffb`97f7c000 WINMMBASE (deferred) 00007ffb`97fe0000 00007ffb`98003000 winmm (deferred) 00007ffb`985b0000 00007ffb`985e2000 fwbase (deferred) 00007ffb`98f90000 00007ffb`98faf000 userenv (deferred) 00007ffb`99330000 00007ffb`9933b000 CRYPTBASE (deferred) 00007ffb`99510000 00007ffb`9953d000 sspicli (deferred) 00007ffb`99760000 00007ffb`99789000 bcrypt (deferred) 00007ffb`99830000 00007ffb`99840000 msasn1 (deferred) 00007ffb`99840000 00007ffb`99854000 profapi (deferred) 00007ffb`99860000 00007ffb`9986f000 kernel_appcore (deferred) 00007ffb`99870000 00007ffb`998bb000 powrprof (deferred) 00007ffb`998c0000 00007ffb`99946000 FirewallAPI (deferred) 00007ffb`99950000 00007ffb`99993000 cfgmgr32 (deferred) 00007ffb`999a0000 00007ffb`99a55000 SHCore (deferred) 00007ffb`99a60000 00007ffb`99a77000 netapi32 (deferred) 00007ffb`99a80000 00007ffb`99c68000 KERNELBASE (deferred) 00007ffb`99c70000 00007ffb`99cc5000 wintrust (deferred) 00007ffb`99cd0000 00007ffb`99e98000 crypt32 (deferred) 00007ffb`99f50000 00007ffb`99fba000 bcryptPrimitives (deferred) 00007ffb`99fc0000 00007ffb`9a605000 windows_storage (deferred) 00007ffb`9a610000 00007ffb`9a64b000 imm32 (deferred) 00007ffb`9a800000 00007ffb`9a8ad000 kernel32 (pdb symbols) c:\temp\symbols\microsoft\kernel32.pdb\C9590ED0315E454C8CCA1D94EAEF1D911\kernel32.pdb 00007ffb`9a9d0000 00007ffb`9aa2b000 sechost (deferred) 00007ffb`9aac0000 00007ffb`9ab86000 oleaut32 (deferred) 00007ffb`9abf0000 00007ffb`9abf8000 psapi (deferred) 00007ffb`9ac00000 00007ffb`9aca7000 advapi32 (deferred) 00007ffb`9b240000 00007ffb`9b292000 shlwapi (deferred) 00007ffb`9b2b0000 00007ffb`9b406000 user32 (deferred) 00007ffb`9b480000 00007ffb`9b606000 gdi32 (deferred) 00007ffb`9b6c0000 00007ffb`9b75d000 msvcrt (deferred) 00007ffb`9b760000 00007ffb`9b9dd000 combase (deferred) 00007ffb`9b9e0000 00007ffb`9baeb000 comdlg32 (deferred) 00007ffb`9bb60000 00007ffb`9bca3000 ole32 (deferred) 00007ffb`9bcb0000 00007ffb`9d20c000 shell32 (deferred) 00007ffb`9d210000 00007ffb`9d3d1000 ntdll (pdb symbols) c:\temp\symbols\microsoft\ntdll.pdb\BDAB16703FB64772AA5FE54FEBDE6F3F1\ntdll.pdb
,
Nov 8 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/316a0c0633b857b49e469037517c09d811dc4703 commit 316a0c0633b857b49e469037517c09d811dc4703 Author: wfh <wfh@chromium.org> Date: Tue Nov 08 21:55:44 2016 Revert of Enable ALPC lockdown on >= win10 (patchset #2 id:20001 of https://codereview.chromium.org/2246543002/ ) Reason for revert: seems to be causing crashes with Avast! and issues with SharedMemory/HandleVerifier. Original issue's description: > Enable ALPC lockdown on >= win10 to further tighten up the sandbox on Windows. > > This is already enabled in sandbox tests, but this CL will enable it for all tests, and for all chrome renderers. > > BUG=464430 > > Committed: https://crrev.com/c9f56c97ccec49da955caf9b7e548f1bdd66144e > Cr-Commit-Position: refs/heads/master@{#428799} TBR=liamjm@chromium.org # Not skipping CQ checks because original CL landed more than 1 days ago. BUG=464430,661315, 661194 Review-Url: https://codereview.chromium.org/2488803002 Cr-Commit-Position: refs/heads/master@{#430727} [modify] https://crrev.com/316a0c0633b857b49e469037517c09d811dc4703/content/common/sandbox_win.cc
,
Nov 8 2016
Assigning to wfh@, as this seems related to his change.
,
Nov 9 2016
reply to comment #14: Slow reports are enabled for memory-infra on Windows as well on Canary and Dev, but Windows very rarely gets memory pressure signals, I have got 500 hits till now. I didn't understand how it's related here.
,
Nov 10 2016
> I didn't understand how it's related here. I just wanted to check if this bug in memory-infra could have impacted windows and explained the crash. Seems that there is a possibility it might. ssid: no action required on your side, I just needed to know that bit.
,
Nov 10 2016
Issue 662836 has been merged into this issue.
,
Nov 10 2016
Users experienced this crash on the following builds: Win Dev 56.0.2906.0 - 0.40 CPM, 340 reports, 318 clients (signature [Renderer hang] v8::internal::AccountingAllocator::AllocateSegment) If this update was incorrect, please add "Fracas-Wrong" label to prevent future updates. - Go/Fracas
,
Nov 15 2016
Is this fixed after the revert in #21? The bot is green. Downgrade to pri-2.
,
Nov 15 2016
,
Nov 16 2016
Issue 663408 has been merged into this issue.
,
Nov 22 2016
See ericrik@ comment in crbug.com/665516#c9 Control Flow Guard seems to be the culprit on the memory-infra crashes. manzagop@ do you have a way to tell whether crrev.com/2412983006 was in in those crashes?
,
Nov 22 2016
Looks like it was: - renderer hang was on 56.0.2912.0 so branch position 430205. - crrev.com/2412983006 is branch position 425226
,
Nov 22 2016
+pennymac
,
Nov 22 2016
Thanks for the CC. Took me a few minutes to digest this chain. Yes, crbug/665516#c11 is possibly related. ericrk@ has done some excellent debugging, and discovered that the perf-on-Windows harness is trying to lock third-party (operating system in this case) heaps that should not be locked (HEAP_NO_SERIALIZATION). FYI: related bug/symptoms seen on one Zen Win perf bot with the landing of CFG: crbug/656040. Let's see what happens when Eric lands a change to the heap locking. It could impact this ticket.
,
Nov 30 2016
kraynov@ fixed that in https://codereview.chromium.org/2528053002/ and removed the lock of other heaps. The flakiness and crashes on memory tests show be over. Can somebody verify that?
,
Feb 10 2017
,
Feb 10 2017
,
Aug 4 2017
This issue was created > 6 months ago. The perf waterfall has changed significantly since then. If this bug is still relevant, please re-open. |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by rbyers@chromium.org
, Nov 1 2016