New issue
Advanced search Search tips

Issue 9836 link

Starred by 12 users

Issue metadata

Status: Released
Owner:
Closed: Oct 12
Cc:
Components:



Sign in to add a comment

Database Connection Issues After Upgrading To 2.15.5

Project Member Reported by luca.mil...@gmail.com, Oct 10

Issue description

*****************************************************************
*****                                                       *****
***** !!!! THIS BUG TRACKER IS FOR GERRIT CODE REVIEW !!!!  *****
*****                                                       *****
***** DO NOT SUBMIT BUGS FOR CHROME, ANDROID, CYANOGENMOD,  *****
***** INTERNAL ISSUES WITH YOUR COMPANY'S GERRIT SETUP, ETC.*****
*****                                                       *****
*****   THOSE ISSUES BELONG IN DIFFERENT ISSUE TRACKERS     *****
*****                                                       *****
*****************************************************************

Affected Version: 2.15.5, 2.14.15

What steps will reproduce the problem?
1. Install 2.15.5 (or 2.14.15)
2. Execute any Git/SSH command

What is the expected output?
The command is executed and all DB connections are released to the pool

What do you see instead?
One DB connection is leaked.

Please provide any additional information below.
Possibly related to the Git/SSH security context fix (See   Issue 9823  ) 

 
Project Member

Comment 1 by luca.mil...@gmail.com, Oct 10

Problem found:
"HTTP-152" prio=5 WAITING
        sun.misc.Unsafe.park(Native Method)
        java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:472)
        com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
        com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:142)
        com.google.common.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3755)
        com.google.common.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2319)
        com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2206)
        com.google.common.cache.LocalCache.get(LocalCache.java:4147)
        com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4151)
        com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5140)
        com.google.gerrit.server.cache.h2.H2CacheImpl.get(H2CacheImpl.java:127)

There is a bunch of threads that are blocking on the Guava cache calls and the DB leak is a consequence of that leak.

The issue isn't related at all to  Issue 9823 .
Project Member

Comment 2 by luca.mil...@gmail.com, Oct 10

One of those commits introduced the problem:
3e05704 Fix support for setting owner of group to a single user
1a882e7 Set version to 2.15.5
b6ce6e9 Merge branch 'stable-2.14' into stable-2.15
1f56ae4 Set version to 2.14.15
685bb84 Add tests for .gitmodules validation
5fe24a9 Merge branch 'stable-2.14' into stable-2.15
6c02820 Merge "Add support for "max_object_size_limit" effective value" into stable-2.15
b6865e4 Remove occurrence of "is="gr-select"" in gr-create-project-dialog
d114554 Add support for "max_object_size_limit" effective value
527e015 Update JGit to 4.7.5.201810051826-r
9e1a662 ElasticContainer: Use Elasticsearch 6.4.2 for V6_4 tests
0408166 Upgrade elasticsearch-rest-client to 6.4.2
4e873c1 Upgrade jackson-core to 2.9.7

Bisecting now ...
Project Member

Comment 3 by luca.mil...@gmail.com, Oct 10

6c02820 => works like a charm.

Comment 4 Deleted

Project Member

Comment 5 by luca.mil...@gmail.com, Oct 10

5fe24a9 => works like a charm.
Project Member

Comment 6 by luca.mil...@gmail.com, Oct 11

1a882e7 => works fine

Trying now with the last one left (3e05704) and if that works alto, then I'm very confused :-(
Project Member

Comment 7 by luca.mil...@gmail.com, Oct 11

3e05704 => problem reproduced and I've got the blocked threads on the LocalCache.

"HTTP-8854" prio=5 WAITING
	sun.misc.Unsafe.park(Native Method)
	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:472)
	com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:79)
	com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:142)
	com.google.common.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3755)
	com.google.common.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2319)
	com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2206)
	com.google.common.cache.LocalCache.get(LocalCache.java:4147)
	com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4151)
	com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5140)
Project Member

Comment 8 by luca.mil...@gmail.com, Oct 11

Cc: dpurseho...@collab.net
Project Member

Comment 9 by luca.mil...@gmail.com, Oct 11

@David: what do you think? I'm a bit baffled that 3e05704 introduced the problem :-O
Project Member

Comment 10 by thomasmu...@yahoo.com, Oct 11

3e05704 coulden't cause the problem. It's a frontend change :)
Project Member

Comment 11 by luca.mil...@gmail.com, Oct 11

Yes, I know, that's why I am buffled :-O

But man, I tried the commit before, works, the one after, breaks :-(
Let me try again with the one before the UX change.
Cc: -dpurseho...@collab.net david.pu...@gmail.com
Project Member

Comment 13 by thomasmu...@yahoo.com, Oct 11

Did you git revert istead of git checking out?
Project Member

Comment 14 by sven.sel...@axis.com, Oct 11

When you say: "But man, I tried the commit before, works, the one after, breaks :-("
Do you mean that 1a882e7b0d works and 914a17d662 breaks? Haven't you tried 3e05704f5b exactly?

To me it seems like "f6e7913d14 - Fix access path propagation on git/ssh protocol" is the most likely culprit. It deals with the threads of the git commands and the issue is triggered by git commands and are caused by blocked threads. 
Project Member

Comment 15 by luca.mil...@gmail.com, Oct 11

> To me it seems like "f6e7913d14 - Fix access path propagation on git/ssh protocol" is the most likely culprit. It deals with the threads of the git commands and the issue is triggered by git commands and are caused by blocked threads. 

Yep, that what I thought at the beginning as well. But then I found out that:
a) The DB leak is a mere consequence of the threads stuck. You know that in Gerrit each thread has its own connection that is released when the thread finishes his job.

b) The threads are locked in Guava Cache, not related to a Git/SSH command

That's why I've excluded f6e7913d14 afterward.

Project Member

Comment 16 by sven.sel...@axis.com, Oct 11

I don't quite follow you here.
a) Why would that exclude f6e7913d14?
b) Are you saying that git/SSH command threads doesn't use guava caches?
Project Member

Comment 17 by luca.mil...@gmail.com, Oct 11

> a) Why would that exclude f6e7913d14?
Because has nothing to do with Guava cache

> b) Are you saying that git/SSH command threads doesn't use guava caches?
I am saying that the threads blocked on Guava cache have nothing to do with Git/SSH commands.
Project Member

Comment 18 by luca.mil...@gmail.com, Oct 11

Let me check again, possibly I am seeing two different problems:

Problem 1) Threads locked on Guava cache (not related to f6e7913d14)
Problem 2) DB connections leaks (possibly related to f6e7913d14 ?)
Project Member

Comment 19 by sven.sel...@axis.com, Oct 11

From a naive test (just issuing a series of git commands and counting db connection in one of our staging environment) I find that 914a17d662 (v2.15.5 which contains f6e7913d14) keeps twice as many db-connections compared to 3e05704f5b (Mulhalls UI fix).

f6e7913d14 fiddles with the context of the git-command-threads and you know that it's the context that has the connection.
This in turn might trigger something otherwise dormant in the Guava caches which the git-command-threads most certainly ends up in sooner or later.

If you excluded f6e7913d14 and when doing so everything you have left is the UI fix of 3e05704f5b. Wouldn't you say that between those two f6e7913d14 is the more likely culprit?


Project Member

Comment 20 by luca.mil...@gmail.com, Oct 11

> From a naive test (just issuing a series of git commands and counting db connection in one of our staging environment) I find that 914a17d662 (v2.15.5 which contains f6e7913d14) keeps twice as many db-connections compared to 3e05704f5b (Mulhalls UI fix).

Yes, I did the same locally and it worked: the connections were first allocated and then released. I could have done something differently?

> f6e7913d14 fiddles with the context of the git-command-threads and you know that it's the context that has the connection.

Yes, that's why it was my first target to test.

> This in turn might trigger something otherwise dormant in the Guava caches which the git-command-threads most certainly ends up in sooner or later.

I doubt it, because the Guava cache problem happens even without that commit. I truly believe that the two are different problems.

> If you excluded f6e7913d14 and when doing so everything you have left is the UI fix of 3e05704f5b. Wouldn't you say that between those two f6e7913d14 is the more likely culprit?

Yes, doesn't make sense to me also that a UX fix could trigger a Guava cache problem. No way.


Project Member

Comment 21 by luca.mil...@gmail.com, Oct 11

Testing 914a17d662 on review-2.gerrithub.io right now, let's see how it goes.
(v2.15.5 which contains f6e7913d14)

Will keep production traffic there for 15-20' and see the results.
Project Member

Comment 22 by luca.mil...@gmail.com, Oct 11

Yes, we have two problems: the DB leak with thread leaks + DB leak without the thread leak.

The first is related to Guava Cache, the second to the Git/SSH access path propagation.

I can reproduce both on 2.15.5.

Let me work on the second and issue a fix for it.
Project Member

Comment 23 by luca.mil...@gmail.com, Oct 11

So, the problem is much bigger than I thought: only because we are with ReviewDb then the leak is quite contained.

From what I see in the code, the cleanup task is added to a request-scoped RequestCleanup instance, which works well with HTTP, and not so well with SSH.

The SshScope.Context tries to simulate the same with:
  class Context implements RequestContext {
    private final RequestCleanup cleanup = new RequestCleanup();

However, the cleanup instance is created directly and not injected: then when the DB provider creates a db instance, associates the cleanup to the "request-scoped" cleanup and not to the one of the Ssh context.

That's the reason why the DB connection is leaked.
Project Member

Comment 24 by luca.mil...@gmail.com, Oct 11

Got the culprit: when the DB connection gets opened creates a lambda to the cleanup on the *same thread* that happens to be the same one of the Git/SSH context. But in 914a17d662 the context is passed to another thread (otherwise the access path would be lost) and thus the cleanup is not getting added as it should.

What I need to do then is, instead of passing the entire context, is just passing the access path and set it to the other thread's context. 
Is 2.14.15 also affected by this?
Project Member

Comment 26 by luca.mil...@gmail.com, Oct 11

Actually, the first tests I did were on 2.14.15 and, for some reasons that I need to investigate, the cleanup() was working fine there.

Let me compose and review the fix, if it is needed on 2.14.15 as well, will submit there. 

Nobody reported any problem yet on 2.14.15.
Project Member

Comment 27 by luca.mil...@gmail.com, Oct 12

Status: ChangeUnderReview (was: New)
Two changes have been uploaded for review:
- Gerrit v2.14 (Change-Id: I2b9e1369e53)
- Gerrit v2.15 (Change-Id: I0e724fb029e)
Project Member

Comment 28 by luca.mil...@gmail.com, Oct 12

Description: Show this description
Project Member

Comment 29 by sven.sel...@axis.com, Oct 12

Thank you Luca for fixing this. Awesome work!

Did this fix the threads-hung-in-Guava-cache problem also?
Since the threads were stuck in loading it could suggest that they where waiting on the DB connections kept by the git command contexts.
Project Member

Comment 30 by luca.mil...@gmail.com, Oct 12

> Did this fix the threads-hung-in-Guava-cache problem also?

Unfortunately, not at all :-(
I was 100% mislead by the Guava-Cache issue, which is separate and not related to the Git/SSH use-case.

Will do more testing and open a separate ticket for it. I believe the Guava-Cache deadlocks were raised by Hugo as well on the mailing list and he attempted as well to replace it with Caffeine (alternative to Guava-cache)
See issue 7486 for the cache deadlocks problem.
Project Member

Comment 32 by sven.sel...@axis.com, Oct 12

Darn, it looked so neat in your graph with 8 leaked connections and 8 hung threads.
Labels: FixedIn-2.14.16
Status: Submitted (was: ChangeUnderReview)
Will there be a release for 2.15 this week? This db connection leak is causing weekly downtime for my Gerrit instance.

Thanks for your help! 
Project Member

Comment 35 by luca.mil...@gmail.com, Oct 15

The change has been merged already and is available on the Gerrit-CI:
- 2.14 => https://gerrit-ci.gerritforge.com/job/Gerrit-bazel-stable-2.14/lastSuccessfulBuild/artifact/gerrit/bazel-bin/release.war
- 2.15 => https://gerrit-ci.gerritforge.com/job/Gerrit-bazel-stable-2.15/lastSuccessfulBuild/artifact/gerrit/bazel-bin/release.war

I typically never run with the released version because there is always some extra fix I need on top of it :-)
There will be a release soon, hopefully within this week.  I'm waiting to see if there will be a new JGit release which will also fix the LFS file handles leak [1]

[1] https://bugs.eclipse.org/bugs/show_bug.cgi?id=540049

Labels: FixedIn-2.15.6
Status: Released (was: Submitted)

Sign in to add a comment