Issue 908636

Issue metadata

Status: Fixed
Owner:
Closed: Dec 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Gerrit polling frequently fails for tricium-prod

Project Member Reported by qyearsley@google.com, Nov 26

Issue description

Tricium's method for polling Gerrit involves a cron job that polls once per minute and iterates through all changes returned. This has worked fine for small repos, and has even worked fine for Chromium some of the time.

But upon inspecting the logs, it looks like it usually doesn't work fine for Chromium; we get the logged error:

  failed to poll :: {"error":"failed to query for change: urlfetch: truncated body"}

That is, the query to Gerrit for changes in the last minute failed.

We also frequently see:

  Exceeded soft memory limit of 128 MB with ... MB after servicing 1 requests total. Consider setting a larger instance class in app.yaml.

Basically, polling once per minute and processing all changes for all projects in one request doesn't scale very well.

To mitigate this, we can first just add a limit to how many changes might be processed in one poll, as suggested by Emma at https://chromium.googlesource.com/infra/infra/+/78bd9ceaf1b70e055cecae447beafb6e7fef17af/go/src/infra/tricium/appengine/gerrit/poll.go#131.

To help with scaling, we could:

1. Make a request for each project or repo, using task queue.
2. Somehow dynamically change the polling rate depending on how active the project is? We probably want to poll chromium/src more than once per minute, but for most projects once per minute is more than sufficient.

Marc-Antoine, do you have any thoughts about what would be a good approach here?
 
Do 1, and increase the instance memory size, which will also increase its performance.
Status: Started (was: Assigned)
Sounds good, will do this this week.

The current behavior seems to be: approximately every other poll request succeeds, and prints:

...
Last poll for "chromium-review.googlesource.com:chromium/src": 2018-10-25 20:37:52 +0000 UTC
Last poll for "chromium-review.googlesource.com:chromiumos/chromite": 2018-12-11 17:35:44 +0000 UTC
Last poll for "chromium-review.googlesource.com:chromiumos/overlays/board-overlays": 2018-12-11 17:25:21 +0000 UTC
Last poll for "chromium-review.googlesource.com:infra/luci/luci-go": 2018-12-11 17:54:50 +0000 UTC
Last poll for "chromium-review.googlesource.com:infra/luci/luci-py": 2018-12-11 17:44:44 +0000 UTC
Last poll for "chromium-review.googlesource.com:playground/gerrit-tricium/demo": 2018-11-29 23:23:35 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:build": 2018-12-11 17:23:59 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:garnet": 2018-12-11 17:52:05 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:peridot": 2018-12-11 17:58:30 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:tools": 2018-12-11 17:03:23 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:topaz": 2018-12-11 17:59:17 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:zircon": 2018-12-11 18:00:16 +0000 UTC
Last poll for "gerrit-review.googlesource.com:gerrit": 2018-12-11 16:57:53 +0000 UTC
Last poll for "pdfium-review.googlesource.com:pdfium": 2018-12-11 17:45:03 +0000 UTC
Last poll for "skia-review.googlesource.com:buildbot": 2018-12-11 18:00:21 +0000 UTC

This indicates that the fuchsia requests are usually succeeding, and the chromium/src request hasn't succeeded since October!

The poll requests that fail, fail with:
failed to poll :: {"error":"failed to query for change: unexpected end of JSON input"}

Increasing the instance size is definitely not the main thing that needs to be done here, so first I'll look into:
 - splitting requests for different projects into different requests, and
 - limiting the number of changes to handle per request.
And then decide what instance class to use.
Project Member

Comment 3 by bugdroid1@chromium.org, Dec 12

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/d1fad5666d825e5746c4bc639282ecfecda2cab7

commit d1fad5666d825e5746c4bc639282ecfecda2cab7
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Wed Dec 12 19:11:08 2018

[tricium] Poll for each project in a separate request

Currently, all projects are polled in parallel in one request,
which means that having one large repo (like chromium/src) may
cause the request to fail due to out of memory etc.

In general, in the future we want to adopt a "multi-tenant" approach
wherein operations for separate LUCI projects are always kept
separate.

So, to bring the Gerrit polling in line with this approach,
to scale up, and to avoid OOM due to one heavy project, this CL
would split polling by project.

Specifically, the poll handler now just puts "poll-project"
requests in a queue, which are then picked up, and each of those
requests polls for each repo in the project.

Bug:  908636 
Change-Id: If85c595bc1910d624897511573835b17ceeb8206
Reviewed-on: https://chromium-review.googlesource.com/c/1372741
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19519}
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/update_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/pb.discovery.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/driver.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/pb.discovery.go
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gen.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/reporter.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/server.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/tricium.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/function.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/reporter.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/poll_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/tracker.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/launcher.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/triciumtest/testing.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/launcher.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/handlers.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/init.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/generate.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/tracker.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/tricium.pb.go
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gerrit.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/config.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/pubsub.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/update.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/frontend/queue.yaml
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/config.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/data.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/driver.proto
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gerrit.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/common.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/workflow.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/driver/handlers_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/platform.pb.go
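
For illustration only, the fan-out described in the commit message above could be sketched roughly as follows. This is not the code from the CL: `Task`, `Enqueuer`, `memQueue`, `enqueuePollTasks`, the queue name, and the URL path are all hypothetical, and an in-memory queue stands in for the App Engine task queue so that the sketch runs on its own.

```go
package main

import (
	"fmt"
	"net/url"
)

// Task is a hypothetical stand-in for a push-queue task.
type Task struct {
	Method string
	Path   string
	Params url.Values
}

// Enqueuer is a hypothetical stand-in for a task queue client.
type Enqueuer interface {
	Add(queue string, t Task) error
}

// memQueue stores tasks in memory, keyed by queue name.
type memQueue struct {
	tasks map[string][]Task
}

// Add appends the task to the named queue.
func (q *memQueue) Add(queue string, t Task) error {
	if q.tasks == nil {
		q.tasks = make(map[string][]Task)
	}
	q.tasks[queue] = append(q.tasks[queue], t)
	return nil
}

const (
	pollQueue = "poll-project-queue"    // hypothetical queue name
	pollPath  = "/example/poll-project" // hypothetical URL path
)

// enqueuePollTasks does the work of the cron-triggered handler in this
// sketch: it makes no Gerrit calls and only adds one small task per project.
func enqueuePollTasks(q Enqueuer, projects []string) error {
	for _, p := range projects {
		t := Task{
			Method: "POST",
			Path:   pollPath,
			Params: url.Values{"project": {p}},
		}
		if err := q.Add(pollQueue, t); err != nil {
			return fmt.Errorf("failed to enqueue poll for %q: %v", p, err)
		}
	}
	return nil
}

func main() {
	q := &memQueue{}
	// Example project names; any list of configured projects would do.
	if err := enqueuePollTasks(q, []string{"chromium", "fuchsia", "skia"}); err != nil {
		fmt.Println(err)
		return
	}
	for _, t := range q.tasks[pollQueue] {
		fmt.Println(t.Method, t.Path, t.Params.Encode())
	}
}
```

With this shape, a failure or out-of-memory condition while polling one heavy project only affects that project's task, and the other projects' tasks still run.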

Project Member

Comment 4 by bugdroid1@chromium.org, Dec 12

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52

commit 3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Wed Dec 12 22:13:11 2018

[tricium] Fix URL for poll-project tasks

This is a follow-up to
https://chromium-review.googlesource.com/c/infra/infra/+/1372741
for two problems discovered after deploying to tricium-dev:

 1. The URL had gerrit and internal mixed up;
 2. also, other task queue tasks are POST requests, not GET.

Bug:  908636 
Change-Id: Ic5198e7760c2d4a3cefcaadd86ece3d13b01b8bb
Reviewed-on: https://chromium-review.googlesource.com/c/1373980
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19526}
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/handlers.go
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/init.go
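
To illustrate why the request method matters, here is a minimal sketch that assumes the task handler only accepts POST. That is an assumption: the actual routing in Tricium is not shown in this thread, and `pollProjectHandler`, `statusFor`, and the URL path are hypothetical. A GET to such a handler is rejected, while a POST carrying the project as a form parameter is served.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// pollProjectHandler is a hypothetical handler for poll-project tasks.
func pollProjectHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "poll-project tasks must be POST requests", http.StatusMethodNotAllowed)
		return
	}
	project := r.FormValue("project")
	if project == "" {
		http.Error(w, "missing project parameter", http.StatusBadRequest)
		return
	}
	// A real handler would poll Gerrit for this project here.
	fmt.Fprintf(w, "polled %s\n", project)
}

// statusFor sends one request to the handler and returns the status code.
func statusFor(method, target, body string) int {
	req := httptest.NewRequest(method, target, strings.NewReader(body))
	if body != "" {
		req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	}
	rec := httptest.NewRecorder()
	pollProjectHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("GET", "/example/poll-project?project=chromium", ""))
	fmt.Println(statusFor("POST", "/example/poll-project", "project=chromium"))
}
```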

Project Member

Comment 5 by bugdroid1@chromium.org, Dec 17

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/f810e326df26b91baf6db1f107379deaba374e4b

commit f810e326df26b91baf6db1f107379deaba374e4b
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Mon Dec 17 22:55:59 2018

[tricium] Limit polling to one request with a fixed number of changes

Specifically, this CL changes the limit to 60 per minute, as an
initial number that could be adjusted later.

For smaller projects there tends to be only a few changes per minute
at most, so this will always be more than enough.  For Chromium,
most of the time it appears that there is also usually only a few
published changes per minute, so this will also be sufficient for
chromium/src, even considering that we include WIP CLs.
(https://chromium-review.googlesource.com/q/project:chromium%252Fsrc+status:open)

In the rare event of a burst of >60 changes in a minute, it may
be good to limit requests anyway, to avoid OOM (which is what is
happening currently). We also hope to avoid too many requests to
Gerrit anyway, and a limit of one request per minute per project
seems reasonable.

The expected effect of this CL is that the polling of chromium/src
should succeed, and the GerritProject entity will be updated,
so then polling will start working on current CLs.

Bug:  908636 
Change-Id: Ibfc40483473d4606a714fe80472ea65a32ee8fc9
Reviewed-on: https://chromium-review.googlesource.com/c/1378746
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19614}
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/gerrit_test.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/poll_test.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/gerrit.go

Status: Fixed (was: Started)
Polling now works again for tricium-prod for new CLs; tested with example CL https://chromium-review.googlesource.com/c/chromium/src/+/1381374.
