
Issue 762677


Issue metadata

Status: Fixed
Owner:
Closed: Sep 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 1
Type: Bug-Regression




Flaky crash in v8::internal::RememberedSetUpdatingItem::CheckAndUpdateOldToNewSlot during webgl_conformance_tests on Win10 Debug (NVIDIA)

Project Member Reported by ynovikov@chromium.org, Sep 6 2017

Issue description

In https://build.chromium.org/p/chromium.gpu.fyi/builders/Win10%20Debug%20%28NVIDIA%29/builds/1846
WebglConformance_conformance_ogles_GL_swizzlers_swizzlers_057_to_064 failed during webgl_conformance_d3d11_passthrough_tests
WebglConformance_conformance_ogles_GL_swizzlers_swizzlers_017_to_024 failed during webgl_conformance_tests

Same crash in both:
	v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::CheckAndUpdateOldToNewSlot<1> [0x189DBC67+39]
	v8::internal::SlotSet::Iterate<<lambda_ad1b48bfd15399d01548b8de55983cc2> > [0x189DE8B9+121]
	v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::UpdateUntypedPointers [0x189FE92A+58]
	v8::internal::RememberedSetUpdatingItem<v8::internal::MajorNonAtomicMarkingState>::Process [0x189F9179+25]
	v8::internal::PointersUpatingTask::RunInParallel [0x189FC51A+26]
	v8::internal::ItemParallelJob::Task::RunInternal [0x189B7E88+8]
	v8::internal::ItemParallelJob::Run [0x189B7C60+256]
	v8::internal::MarkCompactCollector::UpdatePointersAfterEvacuation [0x189FDB1F+1119]
	v8::internal::MarkCompactCollector::Evacuate [0x189F080B+603]
	v8::internal::MarkCompactCollector::CollectGarbage [0x189EDD3B+219]
	v8::internal::Heap::MarkCompact [0x189AFA75+133]
	v8::internal::Heap::PerformGarbageCollection [0x189B1794+644]
	v8::internal::Heap::CollectGarbage [0x189A02EE+478]
	v8::internal::Heap::FinalizeIncrementalMarkingIfComplete [0x189A904E+350]
	v8::internal::IncrementalMarkingJob::Task::RunInternal [0x189CADB5+261]
	v8::internal::CancelableTask::Run [0x1846A6F3+51]
	??$Invoke@PAVTask@v8@@$$V@?$FunctorTraits@P8Task@v8@@AEXXZX@internal@base@@SAXP8Task@v8@@AEXXZ$$QAPAV34@@Z [0x183D8E0B+11]
	base::internal::InvokeHelper<0,void>::MakeItSo<void (__thiscall v8::Task::*const &)(void),v8::Task *> [0x183D8EE4+36]
	base::internal::Invoker<base::internal::BindState<void (__thiscall v8::Task::*)(void),base::internal::OwnedWrapper<v8::Task> >,void __cdecl(void)>::RunImpl<void (__thiscall v8::Task::*const &)(void),std::tuple<base::internal::OwnedWrapper<v8::Task> > cons [0x183D9099+137]
	base::internal::Invoker<base::internal::BindState<void (__thiscall v8::Task::*)(void),base::internal::OwnedWrapper<v8::Task> >,void __cdecl(void)>::Run [0x183DDB04+36]
	base::OnceCallback<void __cdecl(void)>::Run [0x1004BB25+53]
	base::debug::TaskAnnotator::RunTask [0x100B6AA7+519]
	blink::scheduler::TaskQueueManager::ProcessTaskFromWorkQueue [0x1C674008+1400]
	blink::scheduler::TaskQueueManager::DoWork [0x1C672109+1049]
	base::internal::FunctorTraits<void (__thiscall blink::scheduler::TaskQueueManager::*)(bool),void>::Invoke<base::WeakPtr<blink::scheduler::TaskQueueManager> const &,bool const &> [0x1C663FF5+37]
	base::internal::InvokeHelper<1,void>::MakeItSo<void (__thiscall blink::scheduler::TaskQueueManager::*const &)(bool),base::WeakPtr<blink::scheduler::TaskQueueManager> const &,bool const &> [0x1C664256+70]
	base::internal::Invoker<base::internal::BindState<void (__thiscall blink::scheduler::TaskQueueManager::*)(bool),base::WeakPtr<blink::scheduler::TaskQueueManager>,bool>,void __cdecl(void)>::RunImpl<void (__thiscall blink::scheduler::TaskQueueManager::*cons [0x1C6643C0+160]
	base::internal::Invoker<base::internal::BindState<void (__thiscall blink::scheduler::TaskQueueManager::*)(bool),base::WeakPtr<blink::scheduler::TaskQueueManager>,bool>,void __cdecl(void)>::Run [0x1C674A94+36]
	base::OnceCallback<void __cdecl(void)>::Run [0x1004BB25+53]
	base::debug::TaskAnnotator::RunTask [0x100B6AA7+519]
	base::internal::IncomingTaskQueue::RunTask [0x1012E825+37]
	base::MessageLoop::RunTask [0x10138320+512]
	base::MessageLoop::DeferOrRunPendingTask [0x10136972+50]
	base::MessageLoop::DoWork [0x10136FE6+278]
	base::MessagePumpDefault::Run [0x1013D148+40]
	base::MessageLoop::Run [0x1013801F+191]
	base::RunLoop::Run [0x101FC76A+186]
	content::RendererMain [0x136085EA+730]
	content::RunNamedProcessTypeMain [0x13B290F7+135]
	content::ContentMainRunnerImpl::Run [0x13B28FCE+414]
	content::ContentServiceManagerMainDelegate::RunEmbedderProcess [0x13B26B74+36]
	service_manager::Main [0x0C3C9157+823]
	content::ContentMain [0x13B270C9+41]
	ChromeMain [0x04592CB5+277]
	MainDllLoader::Launch [0x00432544+836]
	wWinMain [0x0042D2AB+747]
	invoke_main [0x004F278E+30] (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:118)
	__scrt_common_main_seh [0x004F25F0+336] (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:253)
	__scrt_common_main [0x004F248D+13] (f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl:296)
	wWinMainCRTStartup [0x004F27A8+8] (f:\dd\vctools\crt\vcstartup\src\startup\exe_wwinmain.cpp:17)
	BaseThreadInitThunk [0x763238F4+36]
	RtlUnicodeStringToInteger [0x77975DE3+595]
	RtlUnicodeStringToInteger [0x77975DAE+542]

Two v8 rolls in blamelist:
https://chromium-review.googlesource.com/651491
https://chromium-review.googlesource.com/652106

Will continue observing to see if all WebglConformance_conformance_ogles_GL_swizzlers_swizzlers need to be disabled.
 
Looks like the problem has started earlier than 1846
https://build.chromium.org/p/chromium.gpu.fyi/builders/Win10%20Debug%20%28NVIDIA%29/builds/1839 - WebglConformance_conformance_ogles_GL_swizzlers_swizzlers_089_to_096

https://build.chromium.org/p/chromium.gpu.fyi/builders/Win10%20Debug%20%28NVIDIA%29/builds/1806 - WebglConformance_conformance_ogles_GL_struct_struct_025_to_032

I don't like the idea of having to mark all WebglConformance_conformance_ogles_GL_* as Flaky.

Comment 2 by kbr@chromium.org, Sep 6 2017

Cc: hablich@chromium.org mlippautz@chromium.org u...@chromium.org verwa...@chromium.org machenb...@chromium.org
V8 folks: who can take this? It's an urgent flaky regression. Thanks.

Cc: jgruber@chromium.org petermarshall@chromium.org
+ mem sheriff, primary and secondary
Cc: -mlippautz@chromium.org
Owner: mlippautz@chromium.org
Status: Assigned (was: Untriaged)
Guessing at a possible candidate:

[heap] Avoid fences during pointer updating
Reviewed-on: https://chromium-review.googlesource.com/539642

Touches the related area, and committed shortly before the first crashes with this signature.
Cc: -u...@chromium.org mlippautz@chromium.org
Owner: u...@chromium.org
Reassigning to ulan@ as per request. PTAL, any ideas?

Comment 7 by u...@chromium.org, Sep 14 2017

Back from vacation. I agree with #4. Looks like heap corruption; hard to say more without a local repro.

The crash did not occur in the last ~40 builds after 1846, so there is a chance that the regression was fixed in the meantime.

Comment 8 by kbr@chromium.org, Sep 14 2017

OK. Feel free to close as WontFix if not reproducible. Thanks.

Comment 10 by u...@chromium.org, Sep 15 2017

Thanks, ynovikov@.

The log contains "Found Minidump: True". Does this mean that the minidump is available somewhere? If so, it would be very useful.

Comment 11 by kbr@chromium.org, Sep 15 2017

The log says:

  Minidump found: c:\b\s\w\itinbguh\tmpxs_0to\reports\e6b81fc2-947a-4570-aebb-50dbc27d0cf3.dmp
  Uploading c:\b\s\w\itinbguh\tmpxs_0to\reports\e6b81fc2-947a-4570-aebb-50dbc27d0cf3.dmp to gs://chrome-telemetry-output/minidump-2017-09-13_20-49-11-484977.dmp
 
Please try to find it there. Thanks.

Comment 12 by u...@chromium.org, Sep 18 2017

Thanks a lot, Ken. Based on the minidump I have a theory of what happened before the crash.

The minidump shows that the crash happens during the page flag check for a dead heap object at 0x2ec84101.
Memory region around the slot that contains the bogus pointer:
3fffdee0: 271cc669  <= map?
3fffdee4: 2428412d  <= empty fixed array? (based on the 0x412d offset)
3fffdee8: 2ec84101  <= dead object
3fffdeec: 00024000  <= tagged integer with value 0x12000.
3fffdef0: 271cc669  <= next object map?
3fffdef4: 2428412d  <= empty fixed array?
3fffdef8: 29f84101  
3fffdefc: 00018000  <= tagged integer with value 0x0C000.

So the object containing the bogus pointer is 4 words in size and contains a pointer to the empty fixed array and an integer.
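
The annotations on the dump rely on V8's pointer tagging. A minimal sketch of that decoding, assuming the 32-bit scheme where a small integer is stored as value << 1 with the low bit clear and heap object pointers have the low bit set (the helper names are made up for illustration and are not V8's):

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32-bit V8 tagging. A small integer (Smi) is stored as
// value << 1 with the low bit clear; heap object pointers have the low bit set.
// Helper names are hypothetical, not V8's API.
constexpr bool IsSmi32(uint32_t word) { return (word & 1u) == 0; }

constexpr bool LooksLikeHeapPointer32(uint32_t word) {
  return (word & 1u) == 1;
}

// Decodes a non-negative 32-bit Smi by dropping the tag bit.
constexpr uint32_t SmiValue32(uint32_t word) { return word >> 1; }

// Words taken from the dump above.
static_assert(IsSmi32(0x00024000u) && SmiValue32(0x00024000u) == 0x12000u, "");
static_assert(IsSmi32(0x00018000u) && SmiValue32(0x00018000u) == 0x0C000u, "");
static_assert(LooksLikeHeapPointer32(0x2ec84101u), "");
```

Under these assumptions the words at 3fffdeec and 3fffdefc decode to 0x12000 and 0x0C000, and 0x2ec84101 is read as a pointer rather than an integer.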

I ran the WebGL test with logging of all objects that match these criteria. This found a JsArray object (my build is 64-bit and the minidump is 32-bit, so offsets do not match exactly):
0x00001368eff02f11  <= JsArray map
0x0000129551a02251  <= empty fixed array
0x000017621b182201  <= pointer to the array backing store.
0x0001200000000000  <= length of the JsArray.

0x1368eff02f11: [Map]
 - type: JS_ARRAY_TYPE
 - instance size: 32
 - inobject properties: 0
 - elements kind: PACKED_DOUBLE_ELEMENTS

The backing store contains doubles so its size is 0x12000 * 8 = 589824, which is larger than 512K (the old space page size).
So the backing store is in the large object space.
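
The size argument can be checked with a few lines of arithmetic. This is a simplified sketch: it assumes the 64-bit build keeps the Smi value in the upper 32 bits of the word, ignores the backing store's header, and uses the 512K page size quoted above as the threshold. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: on the 64-bit build a Smi keeps its value in the upper 32 bits.
constexpr uint64_t SmiValue64(uint64_t word) { return word >> 32; }

// Page size quoted in the comment above (512K).
constexpr uint64_t kRegularPageSize = 512u * 1024u;
constexpr uint64_t kDoubleSize = 8;

// Payload size of a backing store of doubles; the header is ignored here.
constexpr uint64_t BackingStoreBytes(uint64_t length) {
  return length * kDoubleSize;
}

constexpr bool ExceedsRegularPage(uint64_t bytes) {
  return bytes > kRegularPageSize;
}

// Length word of the logged JsArray.
static_assert(SmiValue64(0x0001200000000000ull) == 0x12000u, "");
static_assert(BackingStoreBytes(0x12000u) == 589824u, "");
static_assert(ExceedsRegularPage(BackingStoreBytes(0x12000u)), "");
```

The second array in the dump has length 0x0C000, which gives 0x0C000 * 8 = 393216 bytes and stays below 512K under the same simplification.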

The theory:
1) JsArray is created and then promoted to the old space.
2) The array grows and a new backing store is allocated in the new space. This records an old-to-new slot in the remembered set.
3) The array grows again. This time the backing store is allocated in the large object space.
4) Mark-compact GC runs:
   a) The array backing store is dead (this means that the array itself must be dead)
   b) The large page containing the backing store is unmapped.
   c) During evacuation, iteration of the old-to-new remembered set crashes while trying to check page flags of the unmapped page.
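
Step 4.c can be illustrated with a toy model. This is not V8 code: all names are hypothetical, and it assumes pages are aligned to 512K so that the page header is found by masking the pointer. The real collector reads the header directly and faults, whereas the model reports the unmapped page through a return value.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy model only: every name here is hypothetical, not V8's API.
// Assumption: pages are aligned to 512K, so masking yields the page base.
constexpr uint32_t kToyPageAlignment = 512u * 1024u;

constexpr uint32_t ToyPageBase(uint32_t address) {
  return address & ~(kToyPageAlignment - 1u);
}

class ToyAddressSpace {
 public:
  void Map(uint32_t page_base, uint32_t flags) { flags_[page_base] = flags; }
  void Unmap(uint32_t page_base) { flags_.erase(page_base); }

  // Reads the flags of the page holding `target`. Returns false when the page
  // is unmapped; the real collector has no such check and faults instead.
  bool TryReadPageFlags(uint32_t target, uint32_t* flags) const {
    auto it = flags_.find(ToyPageBase(target));
    if (it == flags_.end()) return false;
    *flags = it->second;
    return true;
  }

 private:
  std::unordered_map<uint32_t, uint32_t> flags_;
};
```

With the pointer from the minidump, ToyPageBase(0x2ec84101) is 0x2ec80000. Once that page is unmapped, a slot that still holds 0x2ec84101 can no longer be checked, which corresponds to the crash in CheckAndUpdateOldToNewSlot.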

Now the question is how 4.b could happen, as we are careful to unmap pages only after evacuation is done.

For some reason dead large object pages are enqueued for unmapping before evacuation (https://cs.chromium.org/chromium/src/v8/src/heap/mark-compact.cc?rcl=ec37390b2ba2b4051f46f153a8cc179ed4656f5d&l=4594).

That should be OK because the unmapper task starts after evacuation (https://cs.chromium.org/chromium/src/v8/src/heap/mark-compact.cc?rcl=ec37390b2ba2b4051f46f153a8cc179ed4656f5d&l=3889).

However, MarkCompactCollector::EnsureSweepingCompleted can also start the unmapper task (https://cs.chromium.org/chromium/src/v8/src/heap/mark-compact.cc?rcl=ec37390b2ba2b4051f46f153a8cc179ed4656f5d&l=764).

EnsureSweepingCompleted can be called during slow allocation, which can happen during evacuation.

I think the bug is in enqueuing large object pages for unmapping before evacuation. We should do that after evacuation.
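
The ordering problem can be sketched as a small simulation. It is a simplified model under stated assumptions, not the actual mark-compact code: one dead large page, one unmapper queue, and a stand-in for EnsureSweepingCompleted that flushes the queue. RunToyCollection and its parameters are hypothetical names.

```cpp
#include <cassert>

// Toy model only: hypothetical names, not V8's API.
// Returns true if pointer updating saw the dead large page still mapped.
bool RunToyCollection(bool enqueue_before_evacuation,
                      bool slow_allocation_during_evacuation) {
  bool large_page_mapped = true;
  bool large_page_queued = false;

  // Flushes the unmapper queue, as the unmapper task would.
  auto flush_unmapper_queue = [&]() {
    if (large_page_queued) {
      large_page_mapped = false;
      large_page_queued = false;
    }
  };

  // Ordering described as buggy: the dead large page is queued up front.
  if (enqueue_before_evacuation) large_page_queued = true;

  // Evacuation: a slow allocation may reach the stand-in for
  // EnsureSweepingCompleted, which starts the unmapper early.
  if (slow_allocation_during_evacuation) flush_unmapper_queue();

  // Pointer updating walks the old-to-new remembered set and reads the
  // page flags of each target, including the dead large page.
  const bool page_was_mapped = large_page_mapped;

  // Proposed ordering: queue the page only after evacuation is done.
  if (!enqueue_before_evacuation) large_page_queued = true;
  flush_unmapper_queue();

  return page_was_mapped;
}
```

In this model only the combination of early enqueuing and a slow allocation during evacuation makes the page disappear before pointer updating, which would fit a flaky crash. Enqueuing after evacuation keeps the page mapped in both cases.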

Comment 13 by bugdroid1@chromium.org, Sep 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/v8/v8.git/+/75877ddb7b3d5b135d5a1ae565e4f1df8b458175

commit 75877ddb7b3d5b135d5a1ae565e4f1df8b458175
Author: Ulan Degenbaev <ulan@chromium.org>
Date: Mon Sep 18 09:39:16 2017

[heap] Do not unmap large pages before evacuation.

See https://bugs.chromium.org/p/chromium/issues/detail?id=762677#c12 for
the description of the bug.

Bug:  chromium:762677 
TBR: mlippautz@chromium.org
Change-Id: If5c4c2c15f2403d336edf34d10679521397db75c
Reviewed-on: https://chromium-review.googlesource.com/670823
Commit-Queue: Ulan Degenbaev <ulan@chromium.org>
Reviewed-by: Ulan Degenbaev <ulan@chromium.org>
Cr-Commit-Position: refs/heads/master@{#48061}
[modify] https://crrev.com/75877ddb7b3d5b135d5a1ae565e4f1df8b458175/src/heap/mark-compact.cc

Comment 14 by kbr@chromium.org, Sep 19 2017

Awesome analysis Ulan!!! Thanks for getting to the bottom of this!


Comment 15 by u...@chromium.org, Sep 25 2017

Status: Fixed (was: Assigned)
