New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.
Starred by 12 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Mar 2011
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 72336

Restricted
  • Only users with EditIssue permission may comment.



Sign in to add a comment

WebKit roll caused reliability failures

Project Member Reported by finnur@chromium.org, Jan 4 2011

Issue description

Reliability bot is red at the time of this writing. Going back in time I see that it had three green runs and then started failing at r70323. However, I believe the first three failures are intermittent failures (flakiness) and are not necessarily related to that changelist. 

What is more pressing is that it seems that after the WebKit roll from 74791 to 74816 (r70328) we started seeing more consistent failures, which still happen after many Webkit rolls.

Here is where the failures started (we can probably ignore the first three for now, as those are the intermittent ones): 

1) Reliability bot seemed to be doing fine until r70323 when it started failing (browser crash in Notify during TabContents::UpdateTitleForEntry):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1292/steps/reliability%3A%20complete%20result%20of%20previous%20build/logs/stdio

2) Next run was green, though, but then failures occurred again (r70325) (this time DelayedLowerToken):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1294/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

3) Next run failed (memcpy):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1295/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

4) And then we at r70328 get a WebKit roll (from 74791 to 74816), which produced WebCore failures, which seem to be more consistent:
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1296/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

5) This is the most recent failure log (it looks even scarier than the previous log):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1342/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

Webkit has been rolled so many times I'm not sure it is safe to revert this change and I'm not sure who is the best person to investigate, but I'm assigning this to mihaip who performed the WebKit roll to at least get the ball rolling.
 
Status: Started
Only the most recent run has so many scary crashes, and it seems to be getting results as of r70413, which was a V8 roll that was immediately reverted (http://crrev.com/70414).

Runs immediately preceding that are similar to the initial set of crashes (after the WebKit roll) that you mentioned:

http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1341/steps/reliability:%20partial%20result%20of%20current%20build/logs/stdio (at r70412)
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1340/steps/reliability:%20partial%20result%20of%20current%20build/logs/stdio (at r70411)
As far as I can tell, the browser crashes are caused by http://trac.webkit.org/changeset/74807/ (verified by rolling it back locally). It ends up creating an infinite recursive loop and blowing the stack (you can see part of the stack at http://chromebot/details?id=buildbot_70419_ext&key=http%3A%2F%2Fwww.deezer.com%2Ffr%2F%23music%2Fqueue).

I'll see about creating a reduction and fixing this upstream.
Nevermind the previous comment.  r74807 is not responsible for that crash, that's already known and was caused by an earlier change ( bug 66741 ).
I tried to reproduce this locally (by running automated_ui_tests with the sequence of steps given), but had no luck.

All of the stack traces seem to end up in tcmalloc trying to allocate:

chrome_2590000!tcmalloc::ThreadCache::Allocate+0x43 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\thread_cache.h @ 349]
chrome_2590000!`anonymous namespace'::do_malloc+0xc3 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\tcmalloc.cc @ 987]
chrome_2590000!malloc+0x31 [c:\b\build\slave\win\build\src\base\allocator\allocator_shim.cc @ 111]
chrome_2590000!WTF::fastMalloc+0xa [c:\b\build\slave\win\build\src\third_party\webkit\javascriptcore\wtf\fastmalloc.cpp @ 250]
chrome_2590000!WebCore::RenderStyle::setBorderTopWidth+0x2b [c:\b\build\slave\win\build\src\third_party\webkit\webcore\rendering\style\renderstyle.h @ 869]

chrome_2590000!tcmalloc::ThreadCache::Allocate+0x43 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\thread_cache.h @ 349]
chrome_2590000!`anonymous namespace'::do_malloc+0xc3 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\tcmalloc.cc @ 987]
chrome_2590000!malloc+0x31 [c:\b\build\slave\win\build\src\base\allocator\allocator_shim.cc @ 111]
chrome_2590000!WTF::fastMalloc+0xa [c:\b\build\slave\win\build\src\third_party\webkit\javascriptcore\wtf\fastmalloc.cpp @ 250]
chrome_2590000!WebCore::RenderStyle::create+0x8 [c:\b\build\slave\win\build\src\third_party\webkit\webcore\rendering\style\renderstyle.cpp @ 47]

chrome_2590000!tcmalloc::ThreadCache::Allocate+0x43 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\thread_cache.h @ 349]
chrome_2590000!`anonymous namespace'::do_malloc+0xc3 [c:\b\build\slave\win\build\src\third_party\tcmalloc\chromium\src\tcmalloc.cc @ 987]
chrome_2590000!malloc+0x31 [c:\b\build\slave\win\build\src\base\allocator\allocator_shim.cc @ 111]
chrome_2590000!generic_cpp_alloc+0xa [c:\b\build\slave\win\build\src\base\allocator\generic_allocators.cc @ 16]
chrome_2590000!operator new+0xc [c:\b\build\slave\win\build\src\base\allocator\generic_allocators.cc @ 28]

Mike, Dimitri mentioned that you might be able to provide some interpretation (of how tcmalloc could get a bogus free list). Any ideas?

The heap getting sprayed with junk is a possibility, but none of the memory bots went red with that change:

http://build.chromium.org/p/chromium.memory/builders/Linux%20Tests%20%28valgrind%29%281%29/builds/891
http://build.chromium.org/p/chromium.memory/builders/Windows%20Tests%20%28tsan%29/builds/1409
http://build.chromium.org/p/chromium.memory/builders/Linux%20Tests%20%28valgrind%29%282%29/builds/1024
http://build.chromium.org/p/chromium.memory/builders/Linux%20Tests%20%28valgrind%29%283%29/builds/932
http://build.chromium.org/p/chromium.memory/builders/Linux%20Tests%20%28valgrind%29%284%29/builds/938
http://build.chromium.org/p/chromium.memory/builders/Linux%20Tests%20%28tsan%29/builds/1085
http://build.chromium.org/p/chromium.memory/builders/Webkit%20Linux%20%28valgrind%29/builds/3555

Should I still add a suppression to known_crashes.txt while this is investigated? It would have to be pretty broad (basically, anything with tcmalloc::ThreadCache::Allocate at the top of the stack), so I'm worried about masking other issues.
Labels: Memory
I spent some time trying to reproduce these crashes locally as well, but no luck.

The crashes in tcmalloc generally mean there is some heap corruption happening, which is an important bug to fix.

Reproducing the crashes from ui reliability bots is usually pretty hard, since the timings are so much different on the bots than locally.

It is likely that the corruption is happening shortly before this, in one of the frames on the crashed callstack.

Timurrr might have some tips on how to proceed with this.
In case it helps with investigating, now that branch 628 is on the developer channel, these crashes are showing up in the wild too:

http://crash/search?query=product%3A%22Chrome%22+version%3A%2210.0.628.0%22+stack_signature.contains%3Atcmalloc%3A%3AThreadCache%3A%3AAllocate
Tony suggests (and has offered to help with) doing a custom reliability bot run before the roll and at the roll, to make sure that the roll is indeed responsible. Once we have that confirmed we can do more-finegrained WebKit rolls within the 74791 -> 74816 range and see which WebKit revision is responsible.

Tony, the WebKit roll was r70328, so doing a run at that revision and one at r70327 makes the most sense.

Comment 8 by karen@chromium.org, Jan 10 2011

Labels: Mstone-10

Comment 9 by twiz@google.com, Jan 10 2011

The reliability tests are still very red.  Is there a recommended set of prefixes for these crashes that we could add to known_crashes.txt to help clear up the reliability bot?
I asked in comment 4: "Should I still add a suppression to known_crashes.txt while this is investigated? It would have to be pretty broad (basically, anything with tcmalloc::ThreadCache::Allocate at the top of the stack), so I'm worried about masking other issues."

Someone with more reliability bot experience may want to chime in.

Comment 11 by twiz@google.com, Jan 11 2011

I went through some of the more recent crashes from today, and gathered the following set, which seems at least partially representative:

PREFIX : tcmalloc::threadcache::releasetocentralcache___tcmalloc::threadcache::scavenge___tcmalloc::threadcache::deallocate___`anonymous namespace'::do_free_with_callback___operator delete[]___webcore::cssmutablestyledeclaration::`scalar deleting destructor'
PREFIX : tcmalloc::threadcache::allocate___`anonymous namespace'::do_malloc___malloc___wtf::fastmalloc___webcore::renderstyle::setbordertopcolor___webcore::cssstyleselector::applyproperty___webcore::cssstyleselector::applydeclarations<0>___webcore::cssstyleselector::styleforelement
PREFIX : tcmalloc::threadcache::allocate___`anonymous namespace'::do_malloc___malloc___wtf::fastmalloc___webcore::renderstyle::create___webcore::cssstyleselector::styleforelement___webcore::node::styleforrenderer___webcore::node::createrendererifneeded___webcore::element::attach
PREFIX : tcmalloc::threadcache::releasetocentralcache___tcmalloc::threadcache::scavenge___tcmalloc::threadcache::deallocate___`anonymous namespace'::do_free_with_callback___operator delete[]___webcore::cssmutablestyledeclaration::`scalar deleting destructor'___webcore::cssstylerule::~cssstylerule

Project Member

Comment 12 by bugdroid1@chromium.org, Jan 11 2011

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=71054

------------------------------------------------------------------------
r71054 | mihaip@chromium.org | Tue Jan 11 10:13:47 PST 2011

Changed paths:
 M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/data/reliability/known_crashes.txt?r1=71054&r2=71053&pathrev=71054

Add  bug 68516  to the list of known crashes.

BUG= 68516 
TEST=none
Review URL: http://codereview.chromium.org/6111008
------------------------------------------------------------------------
Project Member

Comment 13 by bugdroid1@chromium.org, Jan 11 2011

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=71066

------------------------------------------------------------------------
r71066 | mihaip@chromium.org | Tue Jan 11 11:24:54 PST 2011

Changed paths:
 M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/data/reliability/known_crashes.txt?r1=71066&r2=71065&pathrev=71066

Add more  bug 68516  signatures.

TBR=joi@chromium.org
BUG= 68516 
TEST=none
Review URL: http://codereview.chromium.org/6109008
------------------------------------------------------------------------
Project Member

Comment 14 by bugdroid1@chromium.org, Jan 12 2011

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=71195

------------------------------------------------------------------------
r71195 | mihaip@chromium.org | Wed Jan 12 10:31:21 PST 2011

Changed paths:
 M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/data/reliability/known_crashes.txt?r1=71195&r2=71194&pathrev=71195

Broaden  bug 68516  signature even more.

BUG= 68516 
TEST=none
Review URL: http://codereview.chromium.org/6231005
------------------------------------------------------------------------
It is not DelayLowerToken that is crashing, the offset is too large...

chrome_2590000!DelayedLowerToken+0x111a3
Project Member

Comment 17 by bugdroid1@chromium.org, Jan 20 2011

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=71896

------------------------------------------------------------------------
r71896 | isherman@chromium.org | Wed Jan 19 17:35:46 PST 2011

Changed paths:
 M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/data/reliability/known_crashes.txt?r1=71896&r2=71895&pathrev=71896

Add some reliability bot exceptions for WebKit crashiness.

BUG= 68516 
TEST=greener tree

Review URL: http://codereview.chromium.org/6322005
------------------------------------------------------------------------
Thanks to tonyg@ and ace@, the regression window has been narrowed to WebKit r74803 to r74816. http://trac.webkit.org/changeset/74807/ looks like the most suspicious change in that range, more reliability bot runs will be done to confirm this.
http://trac.webkit.org/changeset/74807/ is indeed the culprit. Rolling it out may not be an option, since it's before the move of WebCore to Source (and it touches lots of file).
 Issue 70031  has been merged into this issue.
Mihai, can you try one more thing? See if adding NULL check in CSSImageValue fixes the problem?
Status: ExternalDependency
I've filed https://bugs.webkit.org/show_bug.cgi?id=53045 since the fix will have to be on the WebKit side.

Comment 23 by k...@google.com, Jan 27 2011

Labels: -Mstone-10 Mstone-11 MovedFrom-10
Move to M11 from M10, as we've now branched.  If you believe this bug was moved in error, please come talk to me.

Comment 24 by k...@google.com, Jan 27 2011

Labels: -Mstone-11 -MovedFrom-10 Mstone-10 ReleaseBlock-Stable
Here are some related stacks showing up on 10.0.648.45.

0x02597a5f	 [chrome.dll	 - thread_cache.h:349]	tcmalloc::ThreadCache::Allocate(unsigned int,unsigned int)
0x02598a29	 [chrome.dll	 - tcmalloc.cc:985]	`anonymous namespace'::do_malloc(unsigned int)
0x02598ddd	 [chrome.dll	 - allocator_shim.cc:110]	malloc
0x02d0f362	 [chrome.dll	 - fastmalloc.cpp:250]	WTF::fastMalloc(unsigned int)
0x02bd4d78	 [chrome.dll	 - domtimer.cpp:82]	WebCore::DOMTimer::install(WebCore::ScriptExecutionContext *,WTF::PassOwnPtr<WebCore::ScheduledAction>,int,bool)
0x02ba7b11	 [chrome.dll	 - v8domwindowcustom.cpp:132]	WebCore::WindowSetTimeoutImpl(v8::Arguments const &,bool)
0x02ba8a04	 [chrome.dll	 - v8domwindowcustom.cpp:529]	WebCore::V8DOMWindow::setTimeoutCallback(v8::Arguments const &)
0x033087a9	 [chrome.dll	 - builtins.cc:1063]	v8::internal::HandleApiCallHelper<0>
0x03308adf	 [chrome.dll	 + 0x00d88adf]	


- thread_cache.h:349]	tcmalloc::ThreadCache::Allocate(unsigned int,unsigned int)
0x02598a29	 [chrome.dll	 - tcmalloc.cc:985]	`anonymous namespace'::do_malloc(unsigned int)
0x02598ddd	 [chrome.dll	 - allocator_shim.cc:110]	malloc
0x02d0f362	 [chrome.dll	 - fastmalloc.cpp:250]	WTF::fastMalloc(unsigned int)
0x02b83ba8	 [chrome.dll	 - renderstyle.cpp:50]	WebCore::RenderStyle::create()
0x02b15943	 [chrome.dll	 - cssstyleselector.cpp:1280]	WebCore::CSSStyleSelector::styleForElement(WebCore::Element *,WebCore::RenderStyle *,bool,bool,bool)
0x02ad55a4	 [chrome.dll	 - node.cpp:1411]	WebCore::Node::styleForRenderer()
0x02ad54db	 [chrome.dll	 - node.cpp:1388]	WebCore::Node::createRendererIfNeeded()
0x02aeaa9b	 [chrome.dll	 - element.cpp:919]	WebCore::Element::attach()
0x02e07d56	 [chrome.dll	 - htmlconstructionsite.cpp:111]	WebCore::HTMLConstructionSite::attach<WebCore::DocumentType>(WebCore::ContainerNode *,WTF::PassRefPtr<WebCore::DocumentType>)
0x02e0761f	 [chrome.dll	 - htmlconstructionsite.cpp:237]	WebCore::HTMLConstructionSite::attachToCurrent(WTF::PassRefPtr<WebCore::Element>)
0x02e07754	 [chrome.dll	 - htmlconstructionsite.cpp:267]	WebCore::HTMLConstructionSite::insertHTMLElement(WebCore::AtomicHTMLToken &)
0x02dfe655	 [chrome.dll	 - htmltreebuilder.cpp:787]	WebCore::HTMLTreeBuilder::processStartTagForInBody(WebCore::AtomicHTMLToken &)
0x02dfeb2c	 [chrome.dll	 - htmltreebuilder.cpp:1220]	WebCore::HTMLTreeBuilder::processStartTag(WebCore::AtomicHTMLToken &)
0x02dfd666	 [chrome.dll	 - htmltreebuilder.cpp:472]	WebCore::HTMLTreeBuilder::processToken(WebCore::AtomicHTMLToken &)
0x02dfd5b7	 [chrome.dll	 - htmltreebuilder.cpp:453]	WebCore::HTMLTreeBuilder::constructTreeFromAtomicToken(WebCore::AtomicHTMLToken &)
0x02ddc5ef	 [chrome.dll	 - htmldocumentparser.cpp:232]	WebCore::HTMLDocumentParser::pumpTokenizer(WebCore::HTMLDocumentParser::SynchronousMode)
0x02ddc4aa	 [chrome.dll	 - htmldocumentparser.cpp:169]	WebCore::HTMLDocumentParser::pumpTokenizerIfPossible(WebCore::HTMLDocumentParser::SynchronousMode)
0x02ddc88d	 [chrome.dll	 - htmldocumentparser.cpp:320]	WebCore::HTMLDocumentParser::append(WebCore::SegmentedString const &)
0x02c47aae	 [chrome.dll	 - decodeddatadocumentparser.cpp:54]	WebCore::DecodedDataDocumentParser::appendBytes(WebCore::DocumentWriter *,char const *,int,bool)
0x02b33b60	 [chrome.dll	 - documentwriter.cpp:200]	WebCore::DocumentWriter::addData(char const *,int,bool)
0x02b3054e	 [chrome.dll	 - documentloader.cpp:310]	WebCore::DocumentLoader::commitData(char const *,int)
0x02e10202	 [chrome.dll	 - webframeimpl.cpp:1040]	WebKit::WebFrameImpl::commitDocumentData(char const *,unsigned int)
0x02e27eca	 [chrome.dll	 - frameloaderclientimpl.cpp:1066]	WebKit::FrameLoaderClientImpl::committedLoad(WebCore::DocumentLoader *,char const *,int)
0x02b304cf	 [chrome.dll	 - documentloader.cpp:295]	WebCore::DocumentLoader::commitLoad(char const *,int)
0x02c14d6d	 [chrome.dll	 - mainresourceloader.cpp:157]	WebCore::MainResourceLoader::addData(char const *,int,bool)
0x02c13e60	 [chrome.dll	 - resourceloader.cpp:277]	WebCore::ResourceLoader::didReceiveData(char const *,int,__int64,bool)
0x02c15496	 [chrome.dll	 - mainresourceloader.cpp:442]	WebCore::MainResourceLoader::didReceiveData(char const *,int,__int64,bool)
0x02c141a3	 [chrome.dll	 - resourceloader.cpp:428]	WebCore::ResourceLoader::didReceiveData(WebCore::ResourceHandle *,char const *,int,int)
0x02e19ecb	 [chrome.dll	 - resourcehandle.cpp:173]	WebCore::ResourceHandleInternal::didReceiveData(WebKit::WebURLLoader *,char const *,int)
0x02a86323	 [chrome.dll	 - weburlloader_impl.cc:612]	webkit_glue::WebURLLoaderImpl::Context::OnReceivedData(char const *,int)
0x02a0f3f0	 [chrome.dll	 - resource_dispatcher.cc:372]	ResourceDispatcher::OnReceivedData(IPC::Message const &,int,void *,int)
0x02a0f999	 [chrome.dll	 - resource_dispatcher.cc:528]	ResourceDispatcher::DispatchMessageW(IPC::Message const &)
0x02a0f1f2	 [chrome.dll	 - resource_dispatcher.cc:297]	ResourceDispatcher::OnMessageReceived(IPC::Message const &)
0x029ff622	 [chrome.dll	 - child_thread.cc:144]	ChildThread::OnMessageReceived(IPC::Message const &)
0x0282ffa0	 [chrome.dll	 - task.h:331]	RunnableMethod<CancelableRequest<CallbackRunner<Tuple4<int,GURL,bool,std::vector<GURL,std::allocator<GURL> > *> > >,void ( CancelableRequest<CallbackRunner<Tuple4<int,GURL,bool,std::vector<GURL,std::allocator<GURL> > *> > >::*)(Tuple4<int,GURL,bool,std::vector<GURL,std::allocator<GURL> > *> const &),Tuple1<Tuple4<int,GURL,bool,std::vector<GURL,std::allocator<GURL> > *> > >::Run()
0x0264f2ea	 [chrome.dll	 - message_loop.cc:356]	MessageLoop::RunTask(Task *)
0x0264f371	 [chrome.dll	 - message_loop.cc:365]	MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const &)
0x0264f723	 [chrome.dll	 - message_loop.cc:558]	MessageLoop::DoWork()
0x0266691d	 [chrome.dll	 - message_pump_default.cc:50]	base::MessagePumpDefault::Run(base::MessagePump::Delegate *)
0x0264f26b	 [chrome.dll	 - message_loop.cc:331]	MessageLoop::RunInternal()
0x0264f1f0	 [chrome.dll	 - message_loop.cc:304]	MessageLoop::RunHandler()
0x0264f0e4	 [chrome.dll	 - message_loop.cc:234]	MessageLoop::Run()
0x0267db9b	 [chrome.dll	 - renderer_main.cc:300]	RendererMain(MainFunctionParams const &)
0x02583fa3	 [chrome.dll	 - chrome_main.cc:925]	ChromeMain
0x00403ddb	 [chrome.exe	 - client_util.cc:280]	MainDllLoader::Launch(HINSTANCE__ *,sandbox::SandboxInterfaceInfo *)
0x0040419b	 [chrome.exe	 - chrome_exe_main_win.cc:46]	wWinMain
0x00449be6	 [chrome.exe	 - crt0.c:263]	__tmainCRTStartup
0x7c816fd6	 [kernel32.dll	 + 0x00016fd6]	BaseProcessStart

Comment 27 by k...@google.com, Feb 11 2011

 Issue 72483  has been merged into this issue.

Comment 28 by k...@google.com, Feb 11 2011

Labels: -ReleaseBlock-Stable ReleaseBlock-Beta
Unfortunately I haven't been able to make much headway with this. Things that I've tried to reproduce this and/or figure out what is triggering it:
- Running the automated_ui_test suite locally (on Windows and Mac) in both debug and release builds, using the steps that resulted in crashes on the reliability bots
- Running the automated_ui_test suite under threadsanitizer and memcheck
- Adding CRASH checks in addition to ASSERTs (http://trac.webkit.org/changeset/76575 and http://trac.webkit.org/changeset/76576).

Suggested next steps:
- Look at the dumps from the reliability bots (not sure how to go about doing that)
- Go over http://trac.webkit.org/changeset/74807/ with a fine tooth-comb
> Look at the dumps from the reliability bots (not sure how to go about doing that)

Huan (cc-ed) probably can help you with that.

Comment 31 by mal@chromium.org, Feb 11 2011

You can get crashes for recent official release reliability runs from
http://crash-staging/productview?product=Chrome&num=50 [internal only].
And you can get the full memory dumps off of:
  http://chromebot/

(I'll take a look through some of these today see if I can spot anything.)

Comment 33 by k...@google.com, Feb 14 2011

Any luck looking at dumps?  We're holding beta until we understand this one better.
I'm going to start on a rollout patch of http://trac.webkit.org/changeset/74807/ in the WebKit 648 branch.

Comment 35 by jar@chromium.org, Feb 15 2011

@huanr: Is there any nice way to run the Windows Heap checking allocator on our reliability tests?  

General:

If we believe it is a double free, and it can be reproduced on our bots, perhaps we can force the use of a different allocator that can (on Windows) better detect double free (or such corruption).  We could probably temporarily force the default use of a different allocator (current default is tcmalloc), and see what happens on the bots.  Optimally, we can use a smart heap-checking allocator... but even switching to a "different" allocator may reveal some helpful info.

The failure is shown above to be in ThreadCache::Allocate(), where there is an attempt to Pop() a correctly sized block of memory from a local free list.  These singly linked lists are potentially corrupted by a double free.  A double free cause a single (freed) block to potentially be placed either twice on the same list (creating a loop), or placed onto two list at the same time <oops> leaking the tail of one list, and causing the two lists to overlap (share a tail).  The lists each have a length associated with them, and the overlap implies that one of the lists "has the wrong length".  In the most simple scenario, when tcmalloc tries to pop one too many times (on the list with the errantly short count) it will fall off the end and crash.
http://webkit.org/b/54424 has a rollout patch for http://trac.webkit.org/changeset/74807/ that should apply cleanly in the WebKit 648 branch. It's rather large and scary (since there have been changes to that code after r74807 and before we branched 648 at r76408). I'm running layout tests in the branch, any other ideas for how to test this before landing?
Roll out patch landed as http://trac.webkit.org/changeset/78529 (as best as I can tell, it compiles and all the layout tests pass).

Comment 38 by mal@chromium.org, Feb 15 2011

Sweet. This should be in Chrome 10.0.648.76.

I hope that we can get that to the dev channel tomorrow night (15 Feb 2011).
Not sure if this is related: I've just found a use-after-free Valgrind report in TabContents with commits around r70290-70300 in the stacktraces.

Comment 40 by k...@google.com, Feb 15 2011

Labels: -Mstone-10 -ReleaseBlock-Beta Mstone-11
I am going pull the rel-block off this, as this is hopefully fixed on the branch.  We'll still need the right fix for trunk, so moving to M11
@Timur: I don't think that's related, this is a crash in the renderer process, TabContents lives in the browser process.
Regarding comment 35 (using a debugging allocator), it looks like the tcmalloc debug mode is still pending (http://code.google.com/p/google-perftools/issues/detail?id=295). Have we tried building Chromium with any other allocators?

As far as tracking this down, http://trac.webkit.org/changeset/78602 was recently landed on the WebKit side (but hasn't been picked up by a roll), that might catch any use-after-free problems with CachedResourceLoader.
@mihaip Instead if trying fully functional debug allocator we can try something way simpler/stupider: for example when we put a block of size > 4 words into ThreadCache we can zap contents (or first 4 words) with a special value and we also check that contents of the block were not already zapped (if they were --- hooray, we found who does the double free), and malloc clears zapping. This will not catch all cases of double-free and neither it is false-positive safe, but it might catch a lot of simple cases. 

I did a patch doing this kind of poor-man-verification and Søren Gjesse is trying to run it on bots (there were some difficulties today). I will update the issue if we manage to get any valuable information.

Comment 44 by vegorov@google.com, Feb 22 2011

I managed to catch (by running chrome with patched tcmalloc on bots) something that looks like a double free:

We are in CSSSelectorList::deleteSelector (called from ~CSSSelectorList, called from ~CSSStyleRule) and CSSSelector::m_selectorArray point so something that looks like it was already deleted (it's contents are zapped with 0xDEADDEAD which I use as a magic value in the tcmalloc's free). Unfortunately I can't say where exactly it was deleted, because my naive checks do not track this information. I will try to address this tomorrow.

Maybe somebody with WebKit knowledge can deduce something just from looking at CSSSelectorList code.    
Vyacheslav, any luck with fancier double free checks?

I've uploaded https://bugs.webkit.org/show_bug.cgi?id=55596 on the WebKit side to check for CSSSelector double frees.
Unfortunately there was no progress during last 2 weeks. 

Issue seems to be perfectly reproducable on the chromebot so we decided to get a similar VM running locally. This will give us more control over the environment as chromebot is not really suitable for this kind of experiments.

AFAIK Søren managed to get some basic setup up and running, but both he and I become preoccupied by other issues. We are planning to return to our experiments on the next week as we are really interested in resolving this problem.

Did your check catch anything? 

It might be that we are actually double freeing an array of CSSSelector's. I noticed that CSSSelectorList tries to optimize the case when we have only one selector, this optimization might be backfiring but I did not spot any obvious bug there.
My double-free of CSSSelector check was rolled, and it didn't catch anything (reliability runs are still just crashing with tcmalloc failures: http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/2427/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio).

I'll also look into adding a check for m_selectorArray being freeed twice.

Is it possible to set CHROME_ALLOCATOR=winheap temporary on some (or all?) reliability bots (or in the code?) to see if it catches any issues?
I added a check for m_selectorArray being freed twice with http://trac.webkit.org/changeset/80269, but it hasn't tripped up the reliability bots (which are still seeing regular tcmalloc crashes).
My initial double-free check (http://trac.webkit.org/changeset/80155) was in 11.0.691.0 (WebKit was rolled to 80210 was it was cut) and it looks like it's triggering in the wild:

http://crash/reportdetail?reportid=dd95bfc6635ca836
http://crash/reportdetail?reportid=dd0f453bde97dbb9
http://crash/reportdetail?reportid=4d9d5e5a1dfee4d3

However, the stacks look suspicious, e.g. the "repost_form_warning_view.cc:81]	`anonymous namespace'::ResetDefaultsConfirmBox::DeleteDelegate()" line looks like browser process code.

Comment 53 by kareng@google.com, Mar 9 2011

Labels: -Pri-0 Pri-1

Comment 54 by kareng@google.com, Mar 9 2011

Labels: -Mstone-11 MovedFrom-11 Mstone-12
rolling non releaseblocker mstone 11 bugs to mstone 12. 
Labels: -MovedFrom-11 -Mstone-12 ReleaseBlock-Stable Mstone-11
This definitely needs to be fixed for M11. It's responsible for ~12% of crashes (all the TCMalloc corruption ones on http://chromecrash/browse?q=v%3DChrome-11.0.696.0%2Fptype%3Drenderer). The only reason why this didn't get fixed for M10 was because we were able to do a revert of the change that triggered the bug in the M10 branch.

There is some hope for fixing this. Thanks to Timur and Allen, we now have a reliability bot run using the winheap allocator: http://chromebot/buildsummary?id=ui_buildbot_77274_winheap. That has crashes of the form:

chrome_25a0000!WTF::HashTable<WTF::String,WTF::String,WTF::IdentityExtractor<WTF::String>,WTF::StringHash,WTF::HashTraits<WTF::String>,WTF::HashTraits<WTF::String> >::add<WTF::String,WTF::String,WTF::IdentityHashTranslator<WTF::String,WTF::String,WTF::StringHash> >+0x39 [c:\b\build\slave\win\build\src\third_party\webkit\source\javascriptcore\wtf\hashtable.h @ 644]
chrome_25a0000!WebCore::CachedResourceLoader::revalidateResource+0xbc [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\loader\cache\cachedresourceloader.cpp @ 355]
chrome_25a0000!WebCore::CachedResourceLoader::requestResource+0x1b5 [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\loader\cache\cachedresourceloader.cpp @ 318]
chrome_25a0000!WebCore::CachedResourceLoader::requestImage+0xfa [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\loader\cache\cachedresourceloader.cpp @ 144]
chrome_25a0000!WebCore::ImageLoader::updateFromElement+0x1b0 [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\loader\imageloader.cpp @ 172]
chrome_25a0000!WebCore::HTMLImageElement::parseMappedAttribute+0x5a [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\html\htmlimageelement.cpp @ 167]
chrome_25a0000!WebCore::StyledElement::attributeChanged+0xf7 [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\dom\styledelement.cpp @ 189]
chrome_25a0000!WebCore::NamedNodeMap::addAttribute+0x77 [c:\b\build\slave\win\build\src\third_party\webkit\source\webcore\dom\namednodemap.cpp @ 263]

Looking at the implementation of CachedResourceLoader::revalidateResource (http://trac.webkit.org/browser/trunk/Source/WebCore/loader/cache/CachedResourceLoader.cpp#L335) we crash when calling m_validatedURLs.add(url). We got url earlier, from "const String& url = resource->url();". However, resource could have been deleted by the "memoryCache()->remove(resource);" call, thus the url String reference may not be good anymore. I think making an explicit copy of the url String should fix things.

Comment 56 by k...@google.com, Mar 9 2011

Labels: -ReleaseBlock-Stable ReleaseBlock-Beta
The presumed fix landed with http://trac.webkit.org/changeset/80686. 

Reliability bot runs after that got picked up by a DEPS roll (starting with http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/2556/steps/reliability%3A%20complete%20result%20of%20previous%20build/logs/stdio, and thus far 6 other runs) no longer show tcmalloc crashes. I'm therefore going to conclude that the problem is fixed, and will merge the WebKit change into the M11 branch.
Merged into the M11 branch with http://trac.webkit.org/changeset/80732
Status: Fixed
Great news. I am curious though how CSSSelectorList-related crashes fit into this picture.
Great news!

Mihai,
Is this code Windows-specific or we were just unlucky with different Linux/Mac scheduling strategies?
I wonder why we haven't caught this with Valgrind or TSan - looks like the test coverage is not good enough.

Allen, Nirnimesh,
do you plan to increase the winheap coverage on the reliability bots?
Can we have something like this for Linux and Mac reliability bots?
eFence, for example?
I will file a separate bug about the CSSSelectorList issue, since that seems unrelated.

The code is not Windows-specific. The triggering conditions were very specific (required getting a 304 during revalidation of resources that are in the memory cache). I believe that's why the regular reliability bot runs didn't trigger this (they visit many different URLs) but the UI action ones did (since they visit a random combination of bing.com, craigslist.org and google.com, thus have a much higher likelihood of using the memory cache between repeated visits)
I've filed http://webkit.org/b/56124 about the CSSSelectorList double-free.
The conditions requiring a 304 explains why we where seeing this a lot on the V8 stressing reliability runs, as there each page is loaded 5 times - not just once.

Comment 65 by ace@chromium.org, Mar 10 2011

from this result it looks like winheap reliability runs is a capability we will want to keep around.

I would love to hear some input about at what frequency people would like to see these runs (continuous / nightly / on demand) since we have a constrained resource pool.

Also, I'm not familiar with eFence or other options for linux/mac.
Project Member

Comment 66 by bugdroid1@chromium.org, Mar 11 2011

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=77748

------------------------------------------------------------------------
r77748 | mihaip@chromium.org | Thu Mar 10 17:31:51 PST 2011

Changed paths:
 M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/data/reliability/known_crashes.txt?r1=77748&r2=77747&pathrev=77748

Remove crash signatures for  bug 68516  (and  bug 70031 , which was a dupe of it)
since it has been fixed. There are probably other signatures that had the same
cause (especially ones that were in constructors or destructors); I will remove
them once the reliability bot has cycled for a few more hours and we can be more
sure that they have not appeared.

BUG= 68516 
TEST=none
Review URL: http://codereview.chromium.org/6612072
------------------------------------------------------------------------

Comment 67 by ace@chromium.org, Mar 18 2011

Cc: jam@chromium.org brettw%c...@gtempaccount.com
 Issue 75447  has been merged into this issue.
Labels: -Memory bulkmove Stability-Memory
Reliability bot is red at the time of this writing. Going back in time I see that it had three green runs and then started failing at r70323. However, I believe the first three failures are intermittent failures (flakiness) and are not necessarily related to that changelist. 

What is more pressing is that it seems that after the WebKit roll from 74791 to 74816 (r70328) we started seeing more consistent failures, which still happen after many Webkit rolls.

Here is where the failures started (we can probably ignore the first three for now, as those are the intermittent ones): 

1) Reliability bot seemed to be doing fine until r70323 when it started failing (browser crash in Notify during TabContents::UpdateTitleForEntry):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1292/steps/reliability%3A%20complete%20result%20of%20previous%20build/logs/stdio

2) Next run was green, though, but then failures occurred again (r70325) (this time DelayedLowerToken):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1294/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

3) Next run failed (memcpy):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1295/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

4) And then we at r70328 get a WebKit roll (from 74791 to 74816), which produced WebCore failures, which seem to be more consistent:
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1296/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

5) This is the most recent failure log (it looks even scarier than the previous log):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1342/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

Webkit has been rolled so many times I'm not sure it is safe to revert this change and I'm not sure who is the best person to investigate, but I'm assigning this to mihaip who performed the WebKit roll to at least get the ball rolling.
Labels: -Crash Stability-Crash
Reliability bot is red at the time of this writing. Going back in time I see that it had three green runs and then started failing at r70323. However, I believe the first three failures are intermittent failures (flakiness) and are not necessarily related to that changelist. 

What is more pressing is that it seems that after the WebKit roll from 74791 to 74816 (r70328) we started seeing more consistent failures, which still happen after many Webkit rolls.

Here is where the failures started (we can probably ignore the first three for now, as those are the intermittent ones): 

1) Reliability bot seemed to be doing fine until r70323 when it started failing (browser crash in Notify during TabContents::UpdateTitleForEntry):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1292/steps/reliability%3A%20complete%20result%20of%20previous%20build/logs/stdio

2) Next run was green, though, but then failures occurred again (r70325) (this time DelayedLowerToken):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1294/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

3) Next run failed (memcpy):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1295/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

4) And then we at r70328 get a WebKit roll (from 74791 to 74816), which produced WebCore failures, which seem to be more consistent:
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1296/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

5) This is the most recent failure log (it looks even scarier than the previous log):
http://build.chromium.org/p/chromium/builders/Win%20Reliability/builds/1342/steps/reliability%3A%20partial%20result%20of%20current%20build/logs/stdio

Webkit has been rolled so many times I'm not sure it is safe to revert this change and I'm not sure who is the best person to investigate, but I'm assigning this to mihaip who performed the WebKit roll to at least get the ball rolling.
Project Member

Comment 70 by bugdroid1@chromium.org, Oct 13 2012

Blocking: -chromium:72336 chromium:72336
Labels: Restrict-AddIssueComment-Commit
This issue has been closed for some time. No one will pay attention to new comments.
If you are seeing this bug or have new data, please click New Issue to start a new bug.
Project Member

Comment 71 by bugdroid1@chromium.org, Mar 10 2013

Labels: -Area-WebKit -Mstone-11 -Stability-Memory Cr-Content Performance-Memory M-11
Project Member

Comment 72 by bugdroid1@chromium.org, Mar 13 2013

Labels: -Restrict-AddIssueComment-Commit Restrict-AddIssueComment-EditIssue
Project Member

Comment 73 by bugdroid1@chromium.org, Apr 1 2013

Labels: -Performance-Memory Stability-Memory
Project Member

Comment 74 by bugdroid1@chromium.org, Apr 6 2013

Labels: -Cr-Content Cr-Blink

Sign in to add a comment