Gerrit polling frequently fails for tricium-prod
Issue description
Tricium's method for polling Gerrit involves a cron job that polls once per minute and iterates through all changes returned. This has worked fine for small repos, and has even worked fine for Chromium some of the time.
But upon inspecting the logs, it looks like it usually doesn't work fine for Chromium; we get the logged error:
failed to poll :: {"error":"failed to query for change: urlfetch: truncated body"}
That is, the query to Gerrit for changes in the last minute failed.
We also frequently see:
Exceeded soft memory limit of 128 MB with ... MB after servicing 1 requests total. Consider setting a larger instance class in app.yaml.
Basically, polling once per minute and processing all changes for all projects in one request doesn't scale very well.
To mitigate this, we can first just add a limit to how many changes might be processed in one poll, as suggested by Emma at https://chromium.googlesource.com/infra/infra/+/78bd9ceaf1b70e055cecae447beafb6e7fef17af/go/src/infra/tricium/appengine/gerrit/poll.go#131.
To help with scaling, we could:
1. Make a separate request for each project or repo, using a task queue.
2. Somehow dynamically change the polling rate depending on how active the project is? We probably want to poll chromium/src more than once per minute, but for most projects once per minute is more than sufficient.
Marc-Antoine, do you have any thoughts about what would be a good approach here?
Dec 11
Sounds good, will do this this week.
The current behavior seems to be: approximately every other poll request succeeds, and prints:
...
Last poll for "chromium-review.googlesource.com:chromium/src": 2018-10-25 20:37:52 +0000 UTC
Last poll for "chromium-review.googlesource.com:chromiumos/chromite": 2018-12-11 17:35:44 +0000 UTC
Last poll for "chromium-review.googlesource.com:chromiumos/overlays/board-overlays": 2018-12-11 17:25:21 +0000 UTC
Last poll for "chromium-review.googlesource.com:infra/luci/luci-go": 2018-12-11 17:54:50 +0000 UTC
Last poll for "chromium-review.googlesource.com:infra/luci/luci-py": 2018-12-11 17:44:44 +0000 UTC
Last poll for "chromium-review.googlesource.com:playground/gerrit-tricium/demo": 2018-11-29 23:23:35 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:build": 2018-12-11 17:23:59 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:garnet": 2018-12-11 17:52:05 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:peridot": 2018-12-11 17:58:30 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:tools": 2018-12-11 17:03:23 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:topaz": 2018-12-11 17:59:17 +0000 UTC
Last poll for "fuchsia-review.googlesource.com:zircon": 2018-12-11 18:00:16 +0000 UTC
Last poll for "gerrit-review.googlesource.com:gerrit": 2018-12-11 16:57:53 +0000 UTC
Last poll for "pdfium-review.googlesource.com:pdfium": 2018-12-11 17:45:03 +0000 UTC
Last poll for "skia-review.googlesource.com:buildbot": 2018-12-11 18:00:21 +0000 UTC
This indicates that the fuchsia requests are usually succeeding, and the chromium/src request hasn't succeeded since October!
The poll requests that fail, fail with:
failed to poll :: {"error":"failed to query for change: unexpected end of JSON input"}
Increasing the instance size is definitely not the main thing that needs to be done here, so first I'll look into:
- splitting requests for different projects into different requests, and
- limiting the number of changes to handle per request.
And then decide what instance class to use.
Dec 12
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/d1fad5666d825e5746c4bc639282ecfecda2cab7
commit d1fad5666d825e5746c4bc639282ecfecda2cab7
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Wed Dec 12 19:11:08 2018
[tricium] Poll for each project in a separate request
Currently, all projects are polled in parallel in one request, which means that having one large repo (like chromium/src) may cause the request to fail due to out of memory etc.
In general, in the future we want to adopt a "multi-tenant" approach wherein operations for separate LUCI projects are always kept separate. So, to bring the Gerrit polling in line with this approach, to scale up, and to avoid OOM due to one heavy project, this CL would split polling by project.
Specifically, the poll handler now just puts "poll-project" requests in a queue, which are then picked up, and each of those requests polls for each repo in the project.
Bug: 908636
Change-Id: If85c595bc1910d624897511573835b17ceeb8206
Reviewed-on: https://chromium-review.googlesource.com/c/1372741
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19519}
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/update_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/pb.discovery.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/driver.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/pb.discovery.go
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gen.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/reporter.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/server.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/tricium.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/function.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/reporter.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/poll_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/tracker.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/launcher.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/triciumtest/testing.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/launcher.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/handlers.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/init.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/generate.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/tracker.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/tricium.pb.go
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gerrit.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/config.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/pubsub.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/config/update.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/frontend/queue.yaml
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/config.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/data.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/driver.proto
[add] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/gerrit.proto
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/common/common.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/admin/v1/workflow.pb.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/appengine/driver/handlers_test.go
[modify] https://crrev.com/d1fad5666d825e5746c4bc639282ecfecda2cab7/go/src/infra/tricium/api/v1/platform.pb.go
Dec 12
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52
commit 3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Wed Dec 12 22:13:11 2018
[tricium] Fix URL for poll-project tasks
This is a follow-up to https://chromium-review.googlesource.com/c/infra/infra/+/1372741 for two problems discovered after deploying to tricium-dev:
1. The URL had gerrit and internal mixed up;
2. also, other task queue tasks are POST requests, not GET.
Bug: 908636
Change-Id: Ic5198e7760c2d4a3cefcaadd86ece3d13b01b8bb
Reviewed-on: https://chromium-review.googlesource.com/c/1373980
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19526}
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/handlers.go
[modify] https://crrev.com/3bdb10bc1bf7852f5fc3686269c0e3c1d0b02e52/go/src/infra/tricium/appengine/gerrit/init.go
Dec 17
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/f810e326df26b91baf6db1f107379deaba374e4b
commit f810e326df26b91baf6db1f107379deaba374e4b
Author: Quinten Yearsley <qyearsley@chromium.org>
Date: Mon Dec 17 22:55:59 2018
[tricium] Limit polling to one request with a fixed number of changes
Specifically, this CL changes the limit to 60 per minute, as an initial number that could be adjusted later.
For smaller projects there tends to be only a few changes per minute at most, so this will always be more than enough. For Chromium, most of the time it appears that there is also usually only a few published changes per minute, so this will also be sufficient for chromium/src, even considering that we include WIP CLs. (https://chromium-review.googlesource.com/q/project:chromium%252Fsrc+status:open)
In the rare event of a burst of >60 changes in a minute, it may be good to limit requests anyway, to avoid OOM (which is what is happening currently). We also hope to avoid too many requests to Gerrit anyway, and a limit of one request per minute per project seems reasonable.
The expected effect of this CL is that the polling of chromium/src should succeed, and the GerritProject entity will be updated, so then polling will start working on current CLs.
Bug: 908636
Change-Id: Ibfc40483473d4606a714fe80472ea65a32ee8fc9
Reviewed-on: https://chromium-review.googlesource.com/c/1378746
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Quinten Yearsley <qyearsley@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19614}
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/gerrit_test.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/poll_test.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/poll.go
[modify] https://crrev.com/f810e326df26b91baf6db1f107379deaba374e4b/go/src/infra/tricium/appengine/gerrit/gerrit.go
Dec 18
Polling now works again for tricium-prod for new CLs; tested with example CL https://chromium-review.googlesource.com/c/chromium/src/+/1381374.
Comment 1 by mar...@chromium.org, Nov 27