|Email generation uses too many database queries and is slow.|
|Reported by jeblai...@gmail.com, Sep 19||Back to list|
Affected Version: 2.13.9 Outgoing emails are very slow; Gerrit appears to deliver one email to the MTA approximately every minute. This is much slower than email events are added to the queue, so the task queue grows without bound with email tasks. This is a regression from the previous version we were running, 2.11. It appears that for every email Gerrit sends, the following queries are performed for every account in the system: SELECT T.notify_new_changes,T.notify_all_comments,T.notify_submitted_changes,T.notify_new_patch_sets,T.notify_abandoned_changes,T.account_id,T.project_name,T.filter FROM account_project_watches T WHERE T.account_id=... SELECT T.registered_on,T.full_name,T.preferred_email,T.inactive,T.account_id FROM accounts T WHERE T.account_id=... SELECT T.account_id,T.email_address,T.password,T.external_id FROM account_external_ids T WHERE T.account_id=... SELECT T.account_id,T.group_id FROM account_group_members T WHERE T.account_id=... In our system, that's about 26,000 accounts, meaning that each email requires over 100,000 individual database queries.
What is the size of your account cache? What's the hit rate for this cache? With 2.13.9 project watchers are now retrieved from the account index. This retrieves all accounts that watch the project from the cache. The cache loads the account (including project watches, external IDs and group memberships). Next time the account should be already cached. However if the account cache is to small you might have no cache hits. Those SQL selects are most likely triggered by the account cache on a cache miss. This is why I assume your account cache is too small.
Our account cache was previously the default 1024 entries; after noticing a 64% hit rate, we doubled it to 2048 entries, but that did not substantially change the hit rate (it lowered it to 61%).
This is still very small considering the amount of accounts that you have, e.g. our account cache size is set to 10000. Ideally it would be large enough to hold all accounts. I once calculated the expected memory usage for the 10000 accounts and it was about 17MB. So maybe you can effort an even bigger account cache. An account cache that is too small can also lead to bad performance for other use-cases.
Thanks for the additional information. It seems that due to the behavior change in how watchlists for emails are generated, the function that the account cache serves has changed significantly. Whereas before it could be a small percentage of the number of accounts (usually only a small number of the total accounts are using the site at any one time, so the working set is smaller than the total), it now needs to hold all of the account data all of the time in order to avoid performance regressions. If that's the case, I'd suggest deprecating it as a tunable parameter and simply have Gerrit always cache the entire account set in memory (since that is apparently what is actually required).
I agree with James here. There is a lot place for improvement of default behaviour for account cache configuration. Even if OpenStack project size is an exception and not the rule, we should not spawn 100K SQL queries to send one single mail.
The project watch handling doesn't require loading all accounts, but we only load the accounts that have a project watch on the project that is being changed. Normally this are not all accounts and for average projects the number of watchers is surely below 1000. Also before 2.13 we always loaded the accounts for all users that are notified. This means there is only a small difference between the releases, load all accounts that watch the project (2.13 and later) vs. load all accounts that watch the project with a matching filter (earlier releases). So unless all your accounts have project watches for all projects I don't understand how sending an email would require loading all accounts. Maybe this is not caused by the project watches, but some other code in the email processing. Can you debug this?
Here's a histogram (in buckets of 10 up to 100, and then 100 after that) with the number of projects for given number of watchers. I.e., we have 1161 projects with 0-10 watchers, and only one project with 800-900 watchers. Watcher Projects Count 0 1161 10 161 20 87 30 49 40 24 50 21 60 18 70 13 80 13 90 12 100 108 200 102 300 45 500 1 800 1 I think this is in line with what you expect. And it certainly means there is no single project that all 26k users are watching. The thread dump suggested that Gerrit was going out to the database after failing the account cache. That seems to match your earlier comment (#3) that suggested that since the watches are collected by account, Gerrit needed to load all of the accounts in order to find which have watches, as well as the observed behavior of querying the database for all accounts. I can confirm that setting the account cache size to 32768 entries has resolved the performance issue for us. I'm happy to provide any other information, but I'm not in a position to be able to debug the code at the moment.
Then this seems to be unrelated to the handling of project watches. There are still many other places in the email sending that load accounts via the account cache. We should have a closer look at those and see if accounts are loaded unnecessarily.
I debugged sending emails for new changes / new patch sets on stable-2.14 and for me it didn't access all accounts. Maybe you have a special condition that triggers this bug.
|► Sign in to add a comment|