New issue
Advanced search Search tips

Issue 777409 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Sign in to add a comment

ActivityTracker: something in RecordProcessExit causing trybot failures in browser_tests

Project Member Reported by siggi@chromium.org, Oct 23 2017

Issue description

I've been chasing this for a good two weeks now in trying to re-land https://chromium-review.googlesource.com/c/chromium/src/+/713734 a s prerequisite for enabling crumbs more widely.

It looks like this only trips win7_chromium_rel_ng where it causes PrerenderBrowserTest.PrerenderLocationReplaceGWSHistograms to time out. This is a non-debug, non-component build with dchecks_always_on=true, running on a Win7 bot.

I cannot repro the failure locally, neither on my Win10 dev box, nor on a Win7 HyperV virtual machine.

I've tried a remote desktop connection to a bot, where the timeout readily reproduces, even running in a debugger. There's no logging nor exceptions nor anything else that I noticed that'd help figuring out where the test run is derailing. I didn't see signs of a deadlock either, the processes all look idle, and then the test times out.

I believe I've bisected this down to something about ActivityTracker::RecordProcessExit, though the bisection is weird.
See https://chromium-review.googlesource.com/c/chromium/src/+/726344.

If I disable the body of CleanupAfterProcess (see https://chromium-review.googlesource.com/c/chromium/src/+/726344/10/base/debug/activity_tracker.cc), the test doesn't hang.

If, however, I add logging to RecordProcessExit (see https://chromium-review.googlesource.com/c/chromium/src/+/726344/12/base/debug/activity_tracker.cc#1442), this produces no logging on the bot, but the test still hangs.

As much as I'd like to understand what's going on here, the cleanup code is doing work against a feature that nobody uses so far. Namely cleaning up after multiple processes track activity into a shared file mapping.


 

Comment 1 by siggi@chromium.org, Oct 23 2017

Owner: siggi@chromium.org

Comment 2 by siggi@chromium.org, Oct 23 2017

Ok, so I just reproed a crash in CleanupAfterProcess where GetOwningProcessId was passed a nullptr.
Project Member

Comment 3 by bugdroid1@chromium.org, Oct 24 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/140c68eef1492d894ba12905ef047ce6c0625d55

commit 140c68eef1492d894ba12905ef047ce6c0625d55
Author: Brian White <bcwhite@chromium.org>
Date: Tue Oct 24 12:04:02 2017

Validate memory return.

If the cleanup after a process is done twice (or more) in parallel
then a race may cause the type to change between the iteration and
the conversion to a memory address, resulting in a nullptr crash when
it tries to extract the owning process ID below.

Note that multiple cleanups is itself not desired behavior (i.e. a bug)
but it shouldn't be fatal.

Bug:  777409 
Change-Id: I1759eaa406cbea13b6a1f7f606a09a9426bbe3ac
Reviewed-on: https://chromium-review.googlesource.com/733871
Commit-Queue: Brian White <bcwhite@chromium.org>
Reviewed-by: Sigurður Ásgeirsson <siggi@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511102}
[modify] https://crrev.com/140c68eef1492d894ba12905ef047ce6c0625d55/base/debug/activity_tracker.cc

Comment 4 by siggi@chromium.org, Oct 24 2017

I've managed to get this to flake on my local Win7 VM, by downloading the build from swarming. When run under debugger it doesn't flake.

Running on my local dev box in a shell with /affinity f (e.g. 4 cores), I can also provoke this.

Comment 5 by siggi@chromium.org, Oct 24 2017

Also worth mentioning that this isn't exclusive to CleanupAfterProcess, as I've seen this flake with that function disabled. Most vexing!
Siggi established that the process is stuck in TitleWatcher::WaitAndGetTitle().

This method will run forever until the loop is Quit() by TestTitle() detecting that the title has become what is desired.

A debugger shows that it is called at the top of WaitAndGetTitle() and never again.  It could be called by either TitleWasSet() or DidStopLoading(), both of which are virtual methods of the base WebContentsObserver class.

Presumably, then, neither of the "webcontents" things are ever happening.

By doing a Debug x86 build, this happens every time with tracking enabled but never without it (build from HEAD).

Starting with it enabled (failing), I cumulatively commented-out methods with the following results:
- RecordProcessExit:  still fails
- RecordProcessLaunch:  still fails
- memory flush:  still fails
- use local memory instead of disk file:  still fails
- CreateTrackerForCurrentThread:  PASSES!

So it's some type of thread-specific information.  A lock?  Restore CreateTrackerForCurrentThread and try removing...
- ActivityTracker.ThreadTrackers.Count histogram:  still fails 
- ReturnTrackerMemory:  still fails
- PushActivity:  still fails  (no, not a lock)
- turns out PopActivity is never called at all; something started but never finished
- unless I run it under VS then it is
- oh!  PopActivity is being called but a crash (i.e. NOTREACHED) there causes a timeout.  Strange.
- PopActivity:  PASSES!  (when commented-out; adding NOTREACHED times-out)

Removing the Push/PopActivity causes the test to again work.  It is something to do with one of the two.
- restore PushActivity:  PASSES
- restore PopActivity:  fails

Let's see what the debugger has to say...
- there are a lot of lock push/pop during startup, which is to be expected
- push without pop means that we quickly fill the stack and push becomes just an increment
- hack to always store data:  fails
- breakpoint at overflow detection never trips
- hack to never store data:  PASSES
- breakpoint at same places now trips
- remove hack: fails
- breakpoint now doesn't trip

Disabling the Pop causes the test to pass.  The primary side-effect of this is that the stack grows forever, all slots on the activity stack get filled and pushes have to stop adding data and only increment the depth counter.  VS confirms this effect and the test passes.  However, hacking the code to always store data (in slot 0) regardless of depth goes back to failure.  That says that it's the actual storage of the data that is the problem.

Let's try not storing specific activity types:
- ACT_TASK_RUN: fails
- ACT_LOCK_ACQUIRE: PASSES

It's the lock tracking!  But why?  There's nothing specific to it being a lock... except that it happens A LOT!

Removing parts of Activity::FillFrom...
- everything:  PASSES
- everything but TimeTicks::Now().ToInternalValue():  fails
- only TimeTicks::Now().ToInternalValue():  PASSES

Well now we know!  It's the frequent calls to TimeTicks::Now() causing the issue.  It seems unlikely that this extra time spent querying the clock is so much that its causing test timeouts.  Using the debugger to see what is being locked, most of them are CalledOnValidSequence.  This is a Debug build after all and does explain why it happens all the time on this build when its just flaky on a Release build.

Also seen were:
- TaskTracker::IsShutdownComplete
- SchedulerWorker::Cleanup
- SafeAcquisitionTacker::AssertSafeAcquire
- SchedulerWorkerPoolImpl::SchedulerWorkerDelegateImpl::GetWork

Given the high frequency of lock acquisitions, I'm thinking that this needs to be optimized.  The first method that comes to my mind is to "try" the lock and then only advance to the (tracked) blocking acquire if it fails.  Thoughts?

Of course, if the issue really is a problem with TimeTicks::Now() then such a change would just avoid the problem rather than fix it... if it does fix it -- won't know without trying.

I'll look into it tomorrow.  It's still worth doing either way given the frequency of "lock" calls, especially for SYZYASAN builds which capture the calling stack each time.

Comment 7 by siggi@chromium.org, Oct 25 2017

The test config CL got reverted again, see https://chromium-review.googlesource.com/c/chromium/src/+/736969. The reason listed is the two tests BrowserTest.WindowOpenClose
ExtensionLoadingTest.RuntimeValidWhileDevToolsOpen in this build https://build.chromium.org/p/chromium.win/builders/Win7%20Tests%20%28dbg%29%281%29/builds/64200.
Output is in these two shards:
https://chromium-swarm.appspot.com/task?id=3968a0e1df139310&refresh=10&show_raw=1
https://chromium-swarm.appspot.com/task?id=3968a0e77ab93310&refresh=10&show_raw=1

Comment 8 by siggi@chromium.org, Oct 25 2017

And the ExtensionLoadingTest.RuntimeValidWhileDevToolsOpen timeout repros readily on my local dev box. This looks like a proper perf issue:

with crumbs disabled:
[----------] 1 test from ExtensionLoadingTest (10797 ms total)

with crumbs enabled:
[----------] 1 test from ExtensionLoadingTest (90919 ms total)

and with cl/737858 applied, crumbs enabled:
[----------] 1 test from ExtensionLoadingTest (10161 ms total)


Project Member

Comment 9 by bugdroid1@chromium.org, Oct 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/62ae282c378bd6658c877bf3a9e911e41764ec32

commit 62ae282c378bd6658c877bf3a9e911e41764ec32
Author: Brian White <bcwhite@chromium.org>
Date: Wed Oct 25 21:44:03 2017

Don't track locks unless likely to block.

Chrome has a huge number of locks that are constantly acquired and
released.  Most acquisitions don't block and so having them "tracked"
by the ActivityTracker tends to be expensive with no possible benefit.

If tracking is enabled, simply "try" the lock first and skip the
(tracked) blocking call completely.

This fixes problems where some run-times were 9x longer than when
tracking is not enabled.

Bug:  777409 
Change-Id: Ic132d50db094a7bfe25677fd35157d4473a106cd
Reviewed-on: https://chromium-review.googlesource.com/737858
Reviewed-by: Mark Mentovai <mark@chromium.org>
Reviewed-by: Sigurður Ásgeirsson <siggi@chromium.org>
Commit-Queue: Brian White <bcwhite@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511599}
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/debug/activity_tracker.cc
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/debug/activity_tracker.h
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/debug/activity_tracker_unittest.cc
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/synchronization/lock_impl_posix.cc
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/synchronization/lock_impl_win.cc
[modify] https://crrev.com/62ae282c378bd6658c877bf3a9e911e41764ec32/base/synchronization/lock_unittest.cc

Project Member

Comment 10 by bugdroid1@chromium.org, Oct 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4275cc391268d1845fffc86b11c0c98d8e90e46f

commit 4275cc391268d1845fffc86b11c0c98d8e90e46f
Author: Pierre-Antoine Manzagol <manzagop@chromium.org>
Date: Thu Oct 26 13:09:30 2017

Reland "Field trial testing config for StabilityDebugging"

After https://chromium-review.googlesource.com/737858 lands, this should
no longer cause timeouts on Win7 Dbg bots. Turns out the lock
instrumentation was causing ~9X longer runtime for at least some tests,
but with the patch, this is back down to ~1X.

This is a reland of 384db9aa94f082cf13419ba2eb6b82bc7c53ba5f
Original change's description:
> Field trial testing config for StabilityDebugging
> 
> This enables testing for the "record, collect on crash, process
> postmortem without sending postmortem reports" configuration.
> 
> Bug: 620813
> Change-Id: Iefcb7701ec9122c79dcd8d7b083cf1d96aa64c77
> Reviewed-on: https://chromium-review.googlesource.com/692714
> Reviewed-by: Jesse Doherty <jwd@chromium.org>
> Commit-Queue: Pierre-Antoine Manzagol (departed) <manzagop@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#505439}

Bug: 620813
Bug:  777409 
Change-Id: I3b3b2a7d2f1b12646bfed1741f87951b0806683d
Reviewed-on: https://chromium-review.googlesource.com/738330
Commit-Queue: Sigurður Ásgeirsson <siggi@chromium.org>
Reviewed-by: Brian White <bcwhite@chromium.org>
Reviewed-by: Robert Kaplow <rkaplow@chromium.org>
Cr-Commit-Position: refs/heads/master@{#511800}
[modify] https://crrev.com/4275cc391268d1845fffc86b11c0c98d8e90e46f/testing/variations/fieldtrial_testing_config.json

Comment 11 by siggi@chromium.org, Oct 27 2017

Status: Fixed (was: Started)
Looks like this is fixed, as the test config is sticking now with crumbs collection turned on in browser_tests et al.

Sign in to add a comment