Monorail Project: gerrit
Status: Submitted
Closed: Oct 2016
Deadlock during Lucene shutdown
Project Member Reported by dborowitz@google.com, Aug 30 2016
I believe this is caused by https://gerrit-review.googlesource.com/79042 but I haven't bisected fully.

Basically, from ReindexAfterUpdate in the batch index executor, ChangeDataResults#toList calls future.get(), and that call hangs. This causes the main thread to hang in the WorkQueue#stop loop.

I think the reason is that the interactive executor pool, which is where ChangeDataResults' future executes, was already shut down. But what I don't understand is why it hangs instead of completing with an exception or something. This is just a normal future, not anything fancy like NrtFuture.

Stack trace attached.
 
Attachment: gerrit-test-deadlock (75.8 KB)
Project Member Comment 1 by dborowitz@google.com, Aug 30 2016
I guess I should mention that this happens at least a third of the time when I run "buck test".
I wonder if this is the cause of some of the CI builds failing due to timeout recently.

Project Member Comment 3 by andyb...@chromium.org, Sep 8 2016
Components: Lucene
Status: Accepted
Project Member Comment 4 by ekempin@google.com, Sep 16 2016
 > I think the reason is that the interactive executor pool, which is where 
 > ChangeDataResults' future executes, was already shut down. But what I don't 
 > understand is why it hangs instead of completing with an exception or 
 > something. This is just a normal future, not anything fancy like NrtFuture.

Wouldn't we expect it to complete normally without exception?
WorkQueue uses ExecutorService.shutdown() [1], which "Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted." Next we await the termination of the previously submitted tasks [2], and in this loop we are hanging because the task never completes. So the question is rather why it doesn't complete normally. In your thread dump it looks like we are still waiting in future.get() although there is no task running anymore, and this prevents ExecutorService.awaitTermination() from ever returning true?

If we wanted to shut down the executor and let the previously submitted tasks fail, we would need to use ExecutorService.shutdownNow(), but that is not what we want. Maybe we could still do it as a fallback after waiting long enough for the orderly shutdown to succeed. But that is only a workaround...
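The fallback described above could be sketched like this (a sketch only, not the actual WorkQueue code; the timeout values are invented for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownFallback {
    // Orderly shutdown first; if tasks don't finish in time, fall back to
    // shutdownNow() so pending futures fail instead of hanging forever.
    static void stopOrderlyWithFallback(ExecutorService pool) throws InterruptedException {
        pool.shutdown(); // run queued tasks, accept no new ones
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            // Orderly shutdown did not complete: drop queued tasks and
            // interrupt running ones.
            pool.shutdownNow();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("task ran"));
        stopOrderlyWithFallback(pool);
        System.out.println("terminated: " + pool.isTerminated());
    }
}
```

As the comment says, this is only a workaround: it bounds the hang but doesn't explain why the orderly shutdown leaves a future incomplete.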

[1] https://gerrit.googlesource.com/gerrit/+/refs/heads/master/gerrit-server/src/main/java/com/google/gerrit/server/git/WorkQueue.java#166
[2] https://gerrit.googlesource.com/gerrit/+/refs/heads/master/gerrit-server/src/main/java/com/google/gerrit/server/git/WorkQueue.java#170
Project Member Comment 5 by andyb...@chromium.org, Sep 27 2016
This is also affecting imports. I can’t get past the rest_project acceptance test, which just hangs indefinitely.
Project Member Comment 6 by han...@google.com, Sep 28 2016
looking.
Project Member Comment 7 by han...@google.com, Oct 4 2016
I ran rest_project test through TSAN, and found a bunch of race conditions (mostly in external code), but nothing that looks like a smoking gun.

I'm now trying to see if I can isolate the problem to a single test.
Project Member Comment 8 by han...@google.com, Oct 4 2016
Ran this with 30 shards. Found hangs in shard 10:

com.google.gerrit.acceptance.rest.project.AccessIT#updateParentAsUser
com.google.gerrit.acceptance.rest.project.AccessIT#removePermissionRulesAndCleanupEmptyEntries
com.google.gerrit.acceptance.rest.project.BanCommitIT#banCommit_Forbidden
com.google.gerrit.acceptance.rest.project.DeleteBranchesIT#deleteBranchesNotFound
com.google.gerrit.acceptance.rest.project.ListBranchesIT#listBranchesOfNonExistingProject_NotFound

shard 16:

com.google.gerrit.acceptance.rest.project.AccessIT#setPermissionsWithDisallowedUser
com.google.gerrit.acceptance.rest.project.CheckMergeabilityIT#checkMergeableCommit
com.google.gerrit.acceptance.rest.project.FileBranchIT#getFileContent
com.google.gerrit.acceptance.rest.project.ListChildProjectsIT#listChildrenOfNonExistingProject_NotFound

Project Member Comment 9 by han...@google.com, Oct 4 2016
for shard 16, running each of these methods separately, 24-fold, doesn't yield any hangs. 

Maybe lingering state of the last test is interfering with the next one? 

Double check this by inspecting stacks for all threads at the end of each test class shutdown.
Project Member Comment 10 by han...@google.com, Oct 5 2016
AfterClass is run twice.

Buck trace:

call 1

java.lang.Exception: Stack trace
        at java.lang.Thread.dumpStack(Thread.java:1329)
        at com.google.gerrit.acceptance.AbstractDaemonTest.stopCommonServer(AbstractDaemonTest.java:322)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runners.Suite.runChild(Suite.java:127)
        at org.junit.runners.Suite.runChild(Suite.java:26)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runners.Suite.runChild(Suite.java:127)
        at org.junit.runners.Suite.runChild(Suite.java:26)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
        at com.facebook.buck.testrunner.JUnitRunner.run(JUnitRunner.java:135)
        at com.facebook.buck.testrunner.BaseRunner.runAndExit(BaseRunner.java:234)
        at com.facebook.buck.testrunner.JUnitMain.main(JUnitMain.java:47)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.facebook.buck.jvm.java.runner.FileClassPathRunner.main(FileClassPathRunner.java:87)


call 2:

 java.lang.Exception: Stack trace
        at java.lang.Thread.dumpStack(Thread.java:1329)
        at com.google.gerrit.acceptance.AbstractDaemonTest.stopCommonServer(AbstractDaemonTest.java:322)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runners.Suite.runChild(Suite.java:127)
        at org.junit.runners.Suite.runChild(Suite.java:26)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
        at com.facebook.buck.testrunner.JUnitRunner.run(JUnitRunner.java:135)
        at com.facebook.buck.testrunner.BaseRunner.runAndExit(BaseRunner.java:234)
        at com.facebook.buck.testrunner.JUnitMain.main(JUnitMain.java:47)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.facebook.buck.jvm.java.runner.FileClassPathRunner.main(FileClassPathRunner.java:87)


The Bazel trace is similar, but uses the Google JUnit test runner further down the stack.
Project Member Comment 11 by han...@google.com, Oct 6 2016
I opened a bug for @BeforeClass/@AfterClass; see https://bugs.chromium.org/p/gerrit/issues/detail?id=4714

There is only one @AfterClass, and it is protected against being run twice, so it is probably not the cause of this problem.
Project Member Comment 12 by han...@google.com, Oct 6 2016
In a normal run of shard 10, I see only parked threads surviving a test class.

the only odd thing is

 sun.misc.Unsafe.park(Native Method)
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 org.apache.lucene.search.ControlledRealTimeReopenThread.run(ControlledRealTimeReopenThread.java:223)

which seemingly checks something every second. Our use of this thread seems correct, though: we call close(), and then it breaks out of the loop after the wait.
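The shutdown idiom described here, in generic form, looks roughly like this (a sketch of the pattern only, not Lucene's actual ControlledRealTimeReopenThread code; the class and field names are illustrative):

```java
// Generic sketch of a shutdown-aware polling thread: it wakes up at least
// once a second, and close() sets a flag so the loop exits after the wait.
public class ReopenLoop extends Thread {
    private volatile boolean finish;

    @Override
    public void run() {
        synchronized (this) {
            while (!finish) {
                try {
                    wait(1000); // check for work (or shutdown) every second
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // ... periodic work would happen here ...
            }
        }
    }

    public void close() throws InterruptedException {
        synchronized (this) {
            finish = true; // breaks the loop after the current wait
            notifyAll();
        }
        join();
    }

    public static void main(String[] args) throws InterruptedException {
        ReopenLoop t = new ReopenLoop();
        t.start();
        Thread.sleep(100);
        t.close();
        System.out.println("alive: " + t.isAlive()); // prints "alive: false"
    }
}
```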
Project Member Comment 13 by han...@google.com, Oct 6 2016
In another hang with the thread dump enabled, the reindex thread isn't there in the shutdown of the last class; maybe the Lucene thread doesn't exit properly?
Project Member Comment 14 by ziv...@gmail.com, Oct 7 2016
I am not sure if this is related to this issue but today I could reproduce deadlocking
by running offline reindexing. Bisecting the issue brought me to the change:

"Fix data race in MultiProgressMonitor"
https://gerrit-review.googlesource.com/87442


Deadlock info:

Found one Java-level deadlock:
=============================
"Index-Batch-8":
  waiting to lock monitor 0x00007fd5d78ae688 (object 0x000000077563ea90, a com.google.gerrit.server.git.MultiProgressMonitor$Task),
  which is held by "Index-Batch-7"
"Index-Batch-7":
  waiting to lock monitor 0x00007fd5d400a748 (object 0x000000077563eab0, a com.google.gerrit.server.git.MultiProgressMonitor),
  which is held by "main"
"main":
  waiting to lock monitor 0x00007fd5d78ae688 (object 0x000000077563ea90, a com.google.gerrit.server.git.MultiProgressMonitor$Task),
  which is held by "Index-Batch-7"

Java stack information for the threads listed above:
===================================================
"Index-Batch-8":
	at com.google.gerrit.server.index.SiteIndexer$ErrorListener.run(SiteIndexer.java:121)
	- waiting to lock <0x000000077563ea90> (a com.google.gerrit.server.git.MultiProgressMonitor$Task)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:613)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:407)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
"Index-Batch-7":
	at com.google.gerrit.server.git.MultiProgressMonitor.wakeUp(MultiProgressMonitor.java:286)
	- waiting to lock <0x000000077563eab0> (a com.google.gerrit.server.git.MultiProgressMonitor)
	at com.google.gerrit.server.git.MultiProgressMonitor.access$000(MultiProgressMonitor.java:52)
	at com.google.gerrit.server.git.MultiProgressMonitor$Task.update(MultiProgressMonitor.java:95)
	at com.google.gerrit.server.index.SiteIndexer$ErrorListener.run(SiteIndexer.java:121)
	- locked <0x000000077563ea90> (a com.google.gerrit.server.git.MultiProgressMonitor$Task)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:456)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:817)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:753)
	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:613)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:407)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
"main":
	at com.google.gerrit.server.git.MultiProgressMonitor$Task.format(MultiProgressMonitor.java:100)
	- waiting to lock <0x000000077563ea90> (a com.google.gerrit.server.git.MultiProgressMonitor$Task)
	at com.google.gerrit.server.git.MultiProgressMonitor.format(MultiProgressMonitor.java:344)
	at com.google.gerrit.server.git.MultiProgressMonitor.sendUpdate(MultiProgressMonitor.java:335)
	at com.google.gerrit.server.git.MultiProgressMonitor.waitFor(MultiProgressMonitor.java:259)
	- locked <0x000000077563eab0> (a com.google.gerrit.server.git.MultiProgressMonitor)
	at com.google.gerrit.server.git.MultiProgressMonitor.waitFor(MultiProgressMonitor.java:196)
	at com.google.gerrit.server.index.change.AllChangesIndexer.indexAll(AllChangesIndexer.java:186)
	at com.google.gerrit.server.index.change.AllChangesIndexer.indexAll(AllChangesIndexer.java:158)
	at com.google.gerrit.server.index.change.AllChangesIndexer.indexAll(AllChangesIndexer.java:83)
	at com.google.gerrit.pgm.Reindex.reindex(Reindex.java:206)
	at com.google.gerrit.pgm.Reindex.reindex(Reindex.java:124)
	at com.google.gerrit.pgm.Reindex.run(Reindex.java:103)
	at com.google.gerrit.pgm.util.AbstractProgram.main(AbstractProgram.java:64)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.google.gerrit.launcher.GerritLauncher.invokeProgram(GerritLauncher.java:201)
	at com.google.gerrit.launcher.GerritLauncher.mainImpl(GerritLauncher.java:104)
	at com.google.gerrit.launcher.GerritLauncher.main(GerritLauncher.java:61)
	at Main.main(Main.java:25)

Found 1 deadlock.
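The cycle above is a lock-order inversion: Index-Batch-7 holds the Task monitor and wants the Monitor, while main holds the Monitor and wants the Task. A minimal two-thread sketch of the same shape, verified with the JVM's built-in monitor-deadlock detector (the same one jstack uses); the lock names are generic stand-ins, not the Gerrit classes:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {
    static final Object taskLock = new Object();    // stands in for MultiProgressMonitor$Task
    static final Object monitorLock = new Object(); // stands in for MultiProgressMonitor

    // Provoke the two-lock cycle and return the deadlocked thread ids.
    static long[] provoke() throws InterruptedException {
        CountDownLatch bothHold = new CountDownLatch(2);

        Thread worker = new Thread(() -> {
            synchronized (taskLock) {          // like Task.update(): Task first...
                bothHold.countDown();
                await(bothHold);
                synchronized (monitorLock) { } // ...then Monitor (wakeUp)
            }
        });
        Thread waiter = new Thread(() -> {
            synchronized (monitorLock) {       // like waitFor(): Monitor first...
                bothHold.countDown();
                await(bothHold);
                synchronized (taskLock) { }    // ...then Task (format)
            }
        });
        worker.setDaemon(true); // daemons, so the JVM can still exit
        waiter.setDaemon(true);
        worker.start();
        waiter.start();

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] deadlocked = bean.findMonitorDeadlockedThreads();
        while (deadlocked == null) { // poll until the detector reports the cycle
            Thread.sleep(50);
            deadlocked = bean.findMonitorDeadlockedThreads();
        }
        return deadlocked;
    }

    static void await(CountDownLatch l) {
        try { l.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + provoke().length);
    }
}
```

The usual fix for this shape of bug is a consistent lock order (always Task before Monitor, or always the reverse) on every path.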

Project Member Comment 15 by ziv...@gmail.com, Oct 7 2016
Created a new issue for deadlocking in offline reindexing:
https://bugs.chromium.org/p/gerrit/issues/detail?id=4726
Project Member Comment 16 by han...@google.com, Oct 10 2016
ControlledRealTimeReopenThread is also implicated in a data race:

WARNING: ThreadSanitizer: data race (pid=21636)
  Write of size 8 at 0x7f24ad9d3210 by thread T36 (mutexes: write M431922409782456832):
    #0 org.apache.lucene.index.SegmentInfos.changed()V (SegmentInfos.java:944)  
    #1 org.apache.lucene.index.IndexWriter.newSegmentName()Ljava/lang/String; (IndexWriter.java:1652)  
    #2 org.apache.lucene.index.DocumentsWriter.ensureInitialized(Lorg/apache/lucene/index/DocumentsWriterPerThreadPool$ThreadState;)V (DocumentsWriter.java:391)  
    #3 org.apache.lucene.index.DocumentsWriter.updateDocument(Ljava/lang/Iterable;Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/index/Term;)Z (DocumentsWriter.java:445)  
    #4 org.apache.lucene.index.IndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)V (IndexWriter.java:1477)  
    #5 com.google.gerrit.lucene.AutoCommitWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)V (AutoCommitWriter.java:100)  
    #6 org.apache.lucene.index.TrackingIndexWriter.updateDocument(Lorg/apache/lucene/index/Term;Ljava/lang/Iterable;)J (TrackingIndexWriter.java:55)  
    #7 com.google.gerrit.lucene.AbstractLuceneIndex$4.call()Ljava/lang/Long; (AbstractLuceneIndex.java:250)  
    #8 com.google.gerrit.lucene.AbstractLuceneIndex$4.call()Ljava/lang/Object; (AbstractLuceneIndex.java:247)  
    #9 com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly()V (TrustedListenableFutureTask.java:108)  
    #10 com.google.common.util.concurrent.InterruptibleTask.run()V (InterruptibleTask.java:41)  
    #11 com.google.common.util.concurrent.TrustedListenableFutureTask.run()V (TrustedListenableFutureTask.java:77)  
    #12 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1142)  
    #13 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:617)  
    #14 java.lang.Thread.run()V (Thread.java:745)  
    #15 (Generated Stub)  

  Previous read of size 8 at 0x7f24ad9d3210 by thread T29 (mutexes: write M1060737507754061632):
    #0 org.apache.lucene.index.IndexWriter.nrtIsCurrent(Lorg/apache/lucene/index/SegmentInfos;)Z (IndexWriter.java:4592)  
    #1 org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(Lorg/apache/lucene/index/IndexCommit;)Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:282)  
    #2 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(Lorg/apache/lucene/index/IndexCommit;)Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:261)  
    #3 org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged()Lorg/apache/lucene/index/DirectoryReader; (StandardDirectoryReader.java:251)  
    #4 org.apache.lucene.index.DirectoryReader.openIfChanged(Lorg/apache/lucene/index/DirectoryReader;)Lorg/apache/lucene/index/DirectoryReader; (DirectoryReader.java:137)  
    #5 com.google.gerrit.lucene.WrappableSearcherManager.refreshIfNeeded(Lorg/apache/lucene/search/IndexSearcher;)Lorg/apache/lucene/search/IndexSearcher; (WrappableSearcherManager.java:148)  
    #6 com.google.gerrit.lucene.WrappableSearcherManager.refreshIfNeeded(Ljava/lang/Object;)Ljava/lang/Object; (WrappableSearcherManager.java:68)  
    #7 org.apache.lucene.search.ReferenceManager.doMaybeRefresh()V (ReferenceManager.java:176)  
    #8 org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking()V (ReferenceManager.java:253)  
    #9 org.apache.lucene.search.ControlledRealTimeReopenThread.run()V (ControlledRealTimeReopenThread.java:245)  
    #10 (Generated Stub)  

Project Member Comment 17 by ziv...@gmail.com, Oct 10 2016
> ControlledRealTimeReopenThread is also implicated in a data race,

A data race does not necessarily cause a deadlock.
How much does it make sense to concentrate on finding data races in this case?
Project Member Comment 18 by han...@google.com, Oct 10 2016
I'm using this bug to keep notes on what I'm finding out. I don't know if the race is connected to the hang or not, but clearly some part of {Gerrit + Lucene} has incorrect parallel code causing a lockup, and a race is clearly incorrect parallel code.

If you have a better line of attack, by all means add your findings to this bug too.
Project Member Comment 19 by han...@google.com, Oct 10 2016
stack trace of the hung thread:

"Index-Batch-2" Id=157 WAITING on com.google.common.util.concurrent.TrustedListenableFutureTask@70700d56
	at sun.misc.Unsafe.park(Native Method)
	-  waiting on com.google.common.util.concurrent.TrustedListenableFutureTask@70700d56
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:445)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
	at com.google.gerrit.lucene.LuceneChangeIndex$ChangeDataResults.toList(LuceneChangeIndex.java:385)
	at com.google.gerrit.server.index.change.IndexedChangeQuery$1.toList(IndexedChangeQuery.java:104)
	at com.google.gerrit.server.query.QueryProcessor.query(QueryProcessor.java:196)
	at com.google.gerrit.server.query.QueryProcessor.query(QueryProcessor.java:135)
	at com.google.gerrit.server.query.QueryProcessor.query(QueryProcessor.java:122)
	at com.google.gerrit.server.query.InternalQuery.query(InternalQuery.java:72)
	at com.google.gerrit.server.query.change.InternalChangeQuery.byBranchNew(InternalChangeQuery.java:154)
	at com.google.gerrit.server.index.change.ReindexAfterUpdate$GetChanges.impl(ReindexAfterUpdate.java:130)
	at com.google.gerrit.server.index.change.ReindexAfterUpdate$GetChanges.impl(ReindexAfterUpdate.java:118)
	at com.google.gerrit.server.index.change.ReindexAfterUpdate$Task.call(ReindexAfterUpdate.java:108)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:295)
	at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:407)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

	Number of locked synchronizers = 1
	- java.util.concurrent.ThreadPoolExecutor$Worker@5afa0b1a
Project Member Comment 20 by han...@google.com, Oct 10 2016
The code for the hung thread does the following:

ReindexAfterRefUpdate is a background process waiting for ref updates; it discards all but branch and tag updates. (The fact that it also kicks off on tag updates is probably an oversight?)

Since a branch update affects the mergeability of pending changes, all open changes (byBranchNew) are fetched for reindexing. byBranchNew constructs a query

 (and REPO BRANCH STATUS:NEW)

which is passed to the QueryProcessor and LuceneChangeIndex. 

The single query is extended to a list in a QueryProcessor.query() overload. Since it is now a list, we use the generic query code, which is async so it can handle lists of size > 1 more "efficiently".

In this process, the query is rewritten to be in terms of IndexedChangeQuery, which is a DataSource. The rewrite hooks up the Lucene specific code: LuceneChangeIndex implements the ChangeIndex interface.

LuceneChangeIndex runs on the INTERACTIVE executor, which is odd since this originates from the BATCH executor.


Project Member Comment 21 by han...@google.com, Oct 10 2016
In #stop() we shut down the executors one by one, so could this be some sort of priority inversion? It seems odd that a batch process places tasks on the interactive queue.
Project Member Comment 22 by han...@google.com, Oct 10 2016
If any task fails to schedule on an executor (e.g. due to shutdown having been called), it throws RejectedExecutionException, which is a runtime exception and therefore nowhere explicitly dealt with. Any exceptions are logged, however, as log.error; is there any way to have those be dumped from the test too?

The order of executor creation is random: index-interactive and index-batch are always created first, but not in any fixed order.
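For reference, submitting to an already shut-down executor fails fast with the unchecked RejectedExecutionException described above; it does not hang:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectAfterShutdown {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown();
        try {
            pool.submit(() -> {}); // no new tasks accepted after shutdown()
            System.out.println("accepted");
        } catch (RejectedExecutionException e) {
            // Unchecked, so callers that don't catch it just propagate it;
            // it would only show up via the error log.
            System.out.println("rejected"); // prints "rejected"
        }
    }
}
```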
Project Member Comment 23 by han...@google.com, Oct 10 2016
In a failing run, Interactive was created before Batch. 
Project Member Comment 24 by han...@google.com, Oct 10 2016
WorkQueue#stop is waiting for the Batch queue to terminate, consistent with the stack trace for the hung ReindexAfterRefUpdate (which runs in batch executor).
Project Member Comment 25 by han...@google.com, Oct 10 2016
No RejectedExecutionException happens for a hanging test.
Project Member Comment 26 by han...@google.com, Oct 10 2016
From an earlier comment: "Maybe lingering state of the last test is interfering with the next one?" Probably not, since the executors are created afresh for each test. I suppose I could make it fail for individual tests too if I tried hard enough.
Project Member Comment 27 by ziv...@gmail.com, Oct 11 2016
> If you have a better line of attack, by all means add your findings to this bug too.

No, I don't have anything better to propose right now.

Actually, I haven't been able to reproduce the issue even a single time :-/
Just now I again ran all the tests 5 times and nothing ever hung.

Btw, did you see a real JVM deadlock in any of the thread dumps? Normally, jstack
then prints something like "Found one Java-level deadlock" in the thread dump.
Or do we just have hanging threads but no JVM deadlocks?
Is the title of this issue a bit misleading?

Project Member Comment 28 by han...@google.com, Oct 11 2016
More reproing; the following is curious:
createQueue Index-Interactive sz 1
createQueue Index-Batch sz 12
createQueue ReceiveCommits sz 1
createQueue SshCommandStart sz 1
createQueue SSH-Stream-Worker sz 13
createQueue SSH-Interactive-Worker sz 1
createQueue SSH-Batch-Worker sz 2
createQueue WorkQueue sz 1
GetChanges.ctor
GetChanges.ctor
GetChanges.impl
GetChanges.impl
ResultSet.doRead
shutdown called for Index-Interactive
shutdown terminated for Index-Interactive
shutdown called for Index-Batch

(hung)
Dumping stack traces for all threads

In this test shard (nr. 10), the last test has two events get in at the same time, and then the first futures (GetChanges#impl) run in parallel.
Project Member Comment 29 by han...@google.com, Oct 11 2016
Further data: one future is correctly scheduled, but shutdown never executes the future.

createQueue WorkQueue sz 1
GetChanges.ctor
GetChanges.ctor
GetChanges.impl
GetChanges.impl
ResultSet.doRead scheduled.
ResultSet.doRead scheduled.
ResultSet.doRead
shutdown called for Index-Interactive
shutdown terminated for Index-Interactive
shutdown called for Index-Batch


Project Member Comment 30 by han...@google.com, Oct 12 2016
Looks like our implementation of cancel() in WorkQueue is broken. The hang occurs if we have a scheduled future, but we cancel it in shutdown.

For this to happen, we must have a future in state NEW in the executor; this happens only in tests where 2 ref updates happen simultaneously. Since the interactive thread pool has 1 thread, one of the updates gets put in the queue, where we have time to 'successfully' cancel the pending computation.


createQueue WorkQueue sz 1
GetChanges.ctor
GetChanges.ctor
GetChanges.impl
GetChanges.impl
ResultSet.doRead scheduled.
ResultSet.doRead scheduled.
ResultSet.doRead
Cancel1682360942 on Index-Interactive mayinterrupt: false
java.lang.Exception: Stack trace
        at java.lang.Thread.dumpStack(Thread.java:1329)
        at com.google.gerrit.server.git.WorkQueue$Task.cancel(WorkQueue.java:352)
        at java.util.concurrent.ScheduledThreadPoolExecutor.onShutdown(ScheduledThreadPoolExecutor.java:369)
        at java.util.concurrent.ThreadPoolExecutor.shutdown(ThreadPoolExecutor.java:1390)
        at java.util.concurrent.ScheduledThreadPoolExecutor.shutdown(ScheduledThreadPoolExecutor.java:783)
        at com.google.gerrit.server.git.WorkQueue.stop(WorkQueue.java:167)
        at com.google.gerrit.server.git.WorkQueue.access$000(WorkQueue.java:51)
        at com.google.gerrit.server.git.WorkQueue$Lifecycle.stop(WorkQueue.java:66)
        at com.google.gerrit.lifecycle.LifecycleManager.stop(LifecycleManager.java:103)
        at com.google.gerrit.acceptance.GerritServer.stop(GerritServer.java:298)
        at com.google.gerrit.acceptance.AbstractDaemonTest.stopCommonServer(AbstractDaemonTest.java:316)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runners.Suite.runChild(Suite.java:127)
        at org.junit.runners.Suite.runChild(Suite.java:26)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at org.junit.runners.Suite.runChild(Suite.java:127)
        at org.junit.runners.Suite.runChild(Suite.java:26)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        at com.google.testing.junit.runner.junit4.CancellableRequestFactory$CancellableRunner.run(CancellableRequestFactory.java:89)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
        at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
        at com.google.testing.junit.runner.junit4.JUnit4Runner.run(JUnit4Runner.java:110)
        at com.google.testing.junit.runner.GoogleTestRunner.runTestsInSuite(GoogleTestRunner.java:197)
        at com.google.testing.junit.runner.GoogleTestRunner.runTestsInSuite(GoogleTestRunner.java:174)
        at com.google.testing.junit.runner.GoogleTestRunner.main(GoogleTestRunner.java:133)
shutdown called for Index-Interactive
shutdown terminated for Index-Interactive
shutdown called for Index-Batch
Dumping stack traces for all threads

Project Member Comment 31 by han...@google.com, Oct 12 2016
sasa: not sure what to call it. We have one piece of code that expects to be woken up by another piece of code, but the wake-up never happens.

I'm not super-familiar with JVM deadlock detection, but it's hard to have useful deadlock detection in real programs. There is always some thread somewhere that is waiting on I/O, and even if the I/O never comes, the JVM can't conclude it's a real deadlock.
Project Member Comment 32 by han...@google.com, Oct 12 2016
For reproducing, you need to slow down all the JVMs enough so the time window is large enough. I have a 6-core (12 HT) machine, and run 50 copies of the offending 5 test methods.
Project Member Comment 33 by han...@google.com, Oct 13 2016
Some further reflections:

1. shutdown cancels pending tasks in the thread pool. This is bad for correctness, since the tasks are related to re-indexing changes. If we cancel tasks, the index and the Git refs will diverge; only the thread pool for accepting incoming requests should be shut down.

2. the future we create and cancel inside WorkQueue is a ScheduledThreadPoolExecutor.ScheduledFutureTask, which comes from the JDK. The future we hang on is a TrustedListenableFutureTask, which comes from Guava. Both have their own parallel code for running the computation and waking up callers.

This means either: a) we nest one future inside the other, which seems questionable, or b) we have two different "Future" views of the same callable, which seems like it should never work.
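Case (b) can be sketched with the JDK alone: a plain java.util.concurrent.FutureTask stands in for Guava's TrustedListenableFutureTask (the class name TwoFutureViews and the variable names are made up for illustration). Wrapping one FutureTask inside another as a bare Runnable gives two independent completion states for the same computation, and cancelling the outer view leaves the inner one pending forever:

```java
import java.util.concurrent.FutureTask;

public class TwoFutureViews {
  public static void main(String[] args) {
    // The inner future is what the caller holds (Guava's task in our case).
    FutureTask<String> innerView = new FutureTask<>(() -> "result");
    // The outer future wraps the inner one as a plain Runnable, the way
    // ScheduledThreadPoolExecutor wraps whatever is handed to execute().
    FutureTask<Void> outerView = new FutureTask<>(innerView, null);

    outerView.cancel(false); // cancel only the outer view

    System.out.println("outer cancelled: " + outerView.isCancelled()); // true
    System.out.println("inner cancelled: " + innerView.isCancelled()); // false
    System.out.println("inner done: " + innerView.isDone()); // false: get() would block forever
  }
}
```

A get() on innerView here would never return, which is exactly the shape of the hang in the thread dump.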
Project Member Comment 34 by han...@google.com, Oct 13 2016
WorkQueue.java does

  public Executor createQueue(int poolsize, String prefix) {
    final Executor r = new Executor(poolsize, prefix);
    r.setContinueExistingPeriodicTasksAfterShutdownPolicy(false);
    r.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);


The latter policy (setExecuteExistingDelayedTasksAfterShutdownPolicy(false)) is wrong for any work that is generated by the batch threads.
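The reason this policy bites even plain execute() calls is that ScheduledThreadPoolExecutor#execute delegates to schedule(..., 0, NANOSECONDS), so with both shutdown policies set to false, shutdown() cancels and clears every queued task, delayed or not. A small standalone sketch (the class name is made up; the single worker is deliberately blocked so the task stays queued):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownPolicyDemo {
  public static void main(String[] args) throws Exception {
    ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
    // The same policies WorkQueue#createQueue sets:
    pool.setContinueExistingPeriodicTasksAfterShutdownPolicy(false);
    pool.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);

    // Occupy the single worker so the next task sits in the queue.
    CountDownLatch release = new CountDownLatch(1);
    pool.execute(() -> {
      try { release.await(); } catch (InterruptedException ignored) {}
    });

    // Zero-delay task, equivalent to what execute() produces internally.
    Future<?> queued = pool.schedule(() -> {}, 0, TimeUnit.NANOSECONDS);

    pool.shutdown(); // onShutdown() cancels and clears the queued task
    System.out.println("queued task cancelled: " + queued.isCancelled());

    release.countDown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

So any re-index work queued by the batch threads at shutdown time is silently dropped, not drained.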
Project Member Comment 35 by han...@google.com, Oct 13 2016
OK, I think I understand what happens.

We have something that computes an interesting result, which is a Callable (returning the query results).

that gets wrapped as follows


  Callable<> 
 
gets wrapped into

  TrustedFutureInterruptibleTask (guava)

gets wrapped into 

  TrustedListenableFutureTask (guava) - this is a {Runnable, Future}.

The Lucene module wants a ListeningExecutorService; its submit() method returns this object, and this is what the Lucene indexing code is waiting on.

Guava delegates the actual running of the task, in this case to gerrit.WorkQueue (which is essentially a jdk.ScheduledThreadPoolExecutor), by calling executor.execute().
 
This calls jdk.ScheduledThreadPoolExecutor#execute, which calls jdk.ScheduledThreadPoolExecutor#schedule.

#schedule wraps TrustedListenableFutureTask as Runnable in a 

  RunnableAdapter 

to make it a Callable again.  The Callable is then wrapped in a 

  ScheduledFutureTask - implements RunnableScheduledFuture 

and this is passed to gerrit.WorkQueue#decorateTask, where it is wrapped in a gerrit.WorkQueue#Task. The ScheduledFutureTask is put into the ScheduledThreadPoolExecutor, from where it is executed and/or canceled.

On the surface, WorkQueue seems sane: it nicely delegates delicate operations to the inner executor.

When we cancel the pending tasks, the task is removed and the outer future correctly reports itself as canceled. Unfortunately, the Lucene code holds the inner future, which isn't connected to any execution after cancellation, so it never completes. The JDK squashes the Guava task type into a Runnable, which has no call surface for cancellation, so the Guava task cannot be canceled.
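The whole chain can be reproduced without Guava; a plain FutureTask again stands in for the TrustedListenableFutureTask (both are {Runnable, Future} objects handed to the pool as a bare Runnable). Once shutdown() drops the pool-side ScheduledFutureTask, the caller-side future never completes (class and variable names below are made up for the sketch):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.FutureTask;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DanglingFutureDemo {
  public static void main(String[] args) throws Exception {
    ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
    // Same policies WorkQueue#createQueue sets:
    pool.setContinueExistingPeriodicTasksAfterShutdownPolicy(false);
    pool.setExecuteExistingDelayedTasksAfterShutdownPolicy(false);

    // Occupy the single worker so the next task stays in the queue.
    CountDownLatch release = new CountDownLatch(1);
    pool.execute(() -> {
      try { release.await(); } catch (InterruptedException ignored) {}
    });

    // Stand-in for Guava's TrustedListenableFutureTask: a {Runnable, Future}
    // handed to the pool as a plain Runnable. The pool wraps it in its own
    // ScheduledFutureTask; the two futures are not linked.
    FutureTask<String> callerView = new FutureTask<>(() -> "done");
    pool.execute(callerView);

    // With both policies false, shutdown() cancels and clears the queued
    // ScheduledFutureTask. callerView is never run and never cancelled.
    pool.shutdown();
    release.countDown();
    pool.awaitTermination(5, TimeUnit.SECONDS);

    try {
      callerView.get(1, TimeUnit.SECONDS); // an untimed get() hangs forever
      System.out.println("completed");
    } catch (TimeoutException e) {
      System.out.println("caller-side future still pending after shutdown");
    }
  }
}
```

In the real deadlock the get() is untimed (ChangeDataResults#toList), so the caller blocks forever and WorkQueue#stop's awaitTermination loop never finishes.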

 

Project Member Comment 36 by dborowitz@google.com, Oct 13 2016
Status: Submitted
Project Member Comment 37 by jrn@google.com, Oct 13 2016
Owner: han...@google.com
https://gerrit-review.googlesource.com/88519

Thanks for tracking it down and fixing it.
Labels: FixedIn-2.13.2