
Issue metadata

Status: Released
Owner: ----
Closed: Jan 22


Issue 10326: calling /projects/ endpoint on a 2.15 and 2.16 causes an explosion of memory usage

Reported by, Jan 17 Project Member

Issue description


Affected Version: 2.15 and 2.16

What steps will reproduce the problem?
1. Run curl -XGET 'https://<gerrit-server>/projects/?d&n=1&S=0' against a Gerrit 2.16 server that hosts a large number of projects.

What is the expected output?

1 project is returned in a timely manner

What do you see instead?

The endpoint hangs for a very long time and often never returns.
A long-running thread is started that consumes most of the host machine's resources, sometimes to the point of exhaustion (see the attached screenshot).

Please provide any additional information below.

Instead of loading only the requested number of projects, the endpoint loads all projects into memory and sorts them _before_ returning the requested project.
Attachment: Screenshot 2019-01-17 at 17.26.57.png (87.5 KB)

Comment 1 by, Jan 17

Project Member
Labels: -Priority-1 Priority-0
I believe this is related to the project index sorting bug in v2.16: because the index isn't sorted, we have to get the list of projects from the in-memory cache, which on large setups will blow up the JVM heap.

Why don't we apply the master fix (the Lucene index fix) to the v2.16 branch and then re-implement the project list using Lucene?

Escalating to P0 because every time someone or some tool (e.g. the Gerrit Trigger plugin) calls that API, it doesn't work and the server risks dying.

Comment 2 by, Jan 17

Project Member
@Dave @David @Edwin WDYT?

Comment 3 by, Jan 17

Project Member
Can you confirm that the master fix [1] would fix the problem?

I'm in favor of fixing the problem on stable-2.16 then.

My proposal for the way forward is here: [2]. In the same change
there was a discussion about whether it is appropriate to bump
the index schema versions for the account, group and project indexes
on the stable-2.16 branch.

I still prefer to bump the index schema versions on the stable-2.16
branch, so that anyone upgrading to 2.16.4 is guaranteed to have
the schema versions of the related indexes upgraded.

At the same time we would bump the schema versions of the affected
indexes on master as well (with or without the Lucene version upgrade).

* [1]
* [2]

Comment 4 by, Jan 17

Project Member
+1 to @luca's suggestion of applying the Lucene fix to the stable branch (if it fixes this issue).

Comment 5 by, Jan 17

Project Member

Comment 6 by, Jan 17

Project Member
I'd suggest working in three steps:

Step-1: The code should use streams instead of iterating over the full collection of projects in memory, which is *VERY VERY* slow.

Step-2: Let's fix the project query in the way you've suggested.

Step-3: Let's rewrite the list in terms of a query.

I'd like to get Dave's and Edwin's opinions on the above plan.
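
To illustrate what Step-1 is aiming for, here is a minimal sketch, not the actual Gerrit code. It assumes the sorted project names are cheap to enumerate and that loading the full state of a project is the expensive part. The class `PagedProjectListing`, the `ProjectInfo` type, and the `load` method are hypothetical names invented for this example. The stream applies `skip` and `limit` before the expensive `map` step, so only the requested page is loaded.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class PagedProjectListing {
  /** Hypothetical stand-in for a fully loaded project. */
  static final class ProjectInfo {
    final String name;

    ProjectInfo(String name) {
      this.name = name;
    }
  }

  /** Counts how many projects were actually loaded, to make the cost visible. */
  static final AtomicInteger LOADS = new AtomicInteger();

  /** Hypothetical loader; stands in for the expensive per-project lookup. */
  static ProjectInfo load(String name) {
    LOADS.incrementAndGet();
    return new ProjectInfo(name);
  }

  /** Returns at most {@code limit} projects, starting at offset {@code start}. */
  static List<ProjectInfo> page(SortedSet<String> sortedNames, int start, int limit) {
    return sortedNames.stream()
        .skip(start) // drop names before the requested offset without loading them
        .limit(limit) // stop pulling names once the page is full
        .map(PagedProjectListing::load) // load only the projects on the page
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    SortedSet<String> names = new TreeSet<>();
    for (int i = 0; i < 10_000; i++) {
      names.add(String.format("project-%05d", i));
    }
    List<ProjectInfo> result = page(names, 0, 1);
    System.out.println(result.get(0).name + " loads=" + LOADS.get());
  }
}
```

With `n=1` and `S=0` as in the reproduction steps, this sketch calls `load` once regardless of how many project names exist.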

P.S. It is really a P0 !!! Thanks to the HA / multi-site setup, our system crashes were not visible, but today it was a continuous fire-fighting.

Comment 7 by, Jan 17

Project Member
I see you have already cherry-picked the Lucene fix to master: [1].
However, I think we are facing two different problems:

1. Sorting issue in PG UI: the project list is fast and doesn't consume memory, but it is not sorted

2. Performance and memory consumption issue in GWT UI, because the project list is retrieved and sorted in memory

Can you confirm?

For 1. we have a fix that requires the cherry-pick [1] and a full reindex of the affected indexes. For 2. we don't have a fix yet, but we could do the same as what is done for PG UI.


Comment 8 by, Jan 17

Project Member
This ticket is about issue 2. My "step-1" is going to mitigate but not resolve the problem. It can be pushed in a few minutes and doesn't require any migration.

The Lucene fix is mandatory for sorting in PG UI, so that we can then use *that* implementation for the project list.

P.S. It isn't a PG vs. GWT UI issue: the gerrit ls-projects SSH command causes exactly the same explosion. It is the ListProjects API that needs to be fixed.

Does that make sense?

Comment 9 by, Jan 17

Project Member
Could this bug have been introduced with ? (that's the only thing that sticks out to me)

Comment 10 by, Jan 17

Project Member
But PG UI is using the project index directly, isn't it?
Otherwise why is the result not sorted on 2.16? And if it's
already using the index directly, then it shouldn't
consume much memory, right? IOW, GWT UI and other places
should just use the index query directly. In fact they
could do it without porting the Lucene sorting fix from master,
but in that case the result wouldn't be sorted.

Comment 12 by, Jan 18

Project Member
Status: ChangeUnderReview (was: New)

Comment 13 by, Jan 18

Project Member
Is this a regression relative to 2.15?
I believe this was always slow, which is why we added the project index.

Comment 14 by, Jan 18

Project Member
I believe it was not introduced in 2.16, but possibly in 2.15, which is when I started seeing huge slowdowns.

However, since the introduction of the project index, the projects cache is unlikely to be fully loaded all the time. In 2.15 and before, it was constantly hit and so it stayed warm.

What do you think about the fix?

Comment 15 by, Jan 18

Project Member
I'm quite busy these days and likely don't have time to look at this soon.

Comment 16 by, Jan 18

Project Member
No problem, I will ask DavidO and Dave to look at it if they have time.

Comment 17 by, Jan 18

Project Member
Description: Show this description

Comment 18 by, Jan 18

Project Member
Summary: calling /projects/ endpoint on a 2.15 and 2.16 causes an explosion of memory usage (was: calling /projects/ endpoint on a 2.16 causes an explosion of memory usage)

Comment 19 by, Jan 18

Project Member
Moving the fix to 2.15, where we started noticing the problem. It wasn't so bad because the projects cache was always full, but it is essentially the same issue.

Comment 20 by, Jan 21

Project Member
After a bit of research, I found out that this is really a regression specific to v2.15, caused by the permission backend refactoring at Change-Id: I0ba5491fc.

Basically, instead of keeping the Iterator() throughout the listing, the code materializes it into a collection, which has the side effect of blowing up the JVM heap.

Until v2.14 that didn't happen and the listing was fast and light.
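
The difference between the two approaches can be shown with a small, self-contained sketch. This is not the Gerrit code touched by the change above; `LazyVersusMaterialized`, `projectNames`, `firstLazily` and `firstMaterialized` are hypothetical names used only for illustration. A counter records how many elements the source had to produce in each case.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LazyVersusMaterialized {
  /** Number of elements the source has produced so far. */
  static int produced = 0;

  /** Hypothetical lazy source that yields "p0" .. "p(total-1)" on demand. */
  static Iterator<String> projectNames(int total) {
    return new Iterator<String>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < total;
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        produced++;
        return "p" + next++;
      }
    };
  }

  /** Lazy: pulls only as many elements as the caller asked for. */
  static List<String> firstLazily(Iterator<String> it, int limit) {
    List<String> out = new ArrayList<>();
    while (out.size() < limit && it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  /** Materialized: drains the whole iterator into a list, then trims it. */
  static List<String> firstMaterialized(Iterator<String> it, int limit) {
    List<String> all = new ArrayList<>();
    it.forEachRemaining(all::add); // every element is held in memory at once
    return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
  }

  public static void main(String[] args) {
    produced = 0;
    firstLazily(projectNames(100_000), 1);
    System.out.println("lazy produced=" + produced);

    produced = 0;
    firstMaterialized(projectNames(100_000), 1);
    System.out.println("materialized produced=" + produced);
  }
}
```

Both methods return the same single element, but the materialized variant has to produce and hold every element first, which is the heap growth described in this comment.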

A more specific fix to THIS regression has been pushed to:

Comment 21 by, Jan 22

Labels: FixedIn-2.15.9
Status: Submitted (was: ChangeUnderReview)

Comment 22 by, Jan 24

Project Member
I filed a follow-up issue 10380.

Comment 23 by, Jan 27

Project Member
Labels: FixedIn-2.16.4
Status: Released (was: Submitted)
