Swarming client frequently returning 500 response code
Issue description

A swarming build step is frequently hitting the following error:

```
swarming: googleapi: got HTTP response code 500 with body: 500 Internal Server Error The server has either erred or is incapable of performing the requested operation.
```

The step uses the luci-go swarming client.

Examples of failure:
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933629361170312928
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933633520339708592
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933637154446287456
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933641144116126416

Example of success:
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933630877829294432
,
Oct 5
You'll have to make the Go implementation of swarming retry properly.
,
Oct 6
Could you please share pointers to the code where this logic would need to be added?
,
Oct 9
Hi Robbie, Marc-Antoine mentioned hoping you might chime in on this point. Any thoughts?
,
Oct 10
Looking at https://chromium.googlesource.com/infra/luci/luci-go/+/master/client/cmd/swarming/trigger.go, the generated code uses https://google.golang.org/api/gensupport to do the actual call, but I don't see any retry mechanism there. I used https://go.chromium.org/luci/common/lhttp in the previous code, but the hot new thing is to use https://go.chromium.org/luci/common/retry. Someone should chime in, as I haven't followed along on how much work this entails, and package retry has zero examples and no package documentation (oops?).
,
Oct 10
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/da86e63132f33771ab628b10d2bc8e874cedeccc

commit da86e63132f33771ab628b10d2bc8e874cedeccc
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Oct 10 16:09:10 2018

[swarming] Add retries on transient RPC errors

Bug: 892008
Change-Id: I0e8f521865443be3c93c26fbce3b63477b037538
Reviewed-on: https://chromium-review.googlesource.com/c/1273665
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/da86e63132f33771ab628b10d2bc8e874cedeccc/client/cmd/swarming/common.go
,
Oct 10
please do the deployments as needed
,
Oct 10
Thanks a lot, Nodir and Marc-Antoine!
,
Oct 12
The 500 errors persist, unfortunately:
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8932966211634311760
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8932943323653439776
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8932921356596406224
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8932908687731399984

Those four are from today alone. Our swarming recipe module is also using the latest swarming go tool CIPD upload. Any other thoughts on what might be going on here?
,
Oct 12
Have you rolled new CIPD packages? We're not using the Go implementation in Chromium, so we don't have anything to roll ourselves. That's eventually going to be done as part of issue 894348. :)
,
Oct 12
We always use the 'latest' of infra/tools/luci/swarming, so we don't have anything to roll on our end. Has the version with Nodir's update not yet been uploaded to CIPD? There have been other uploads over the past three days, so I assumed it had.
,
Oct 12
The latest built commit is https://chromium.googlesource.com/infra/infra/+/master/DEPS#32, which is from Oct 8. The change that adds retries is from Oct 10. I'll update DEPS to pick it up.
,
Oct 12
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43

commit e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Fri Oct 12 18:20:53 2018

Roll infra/go/src/go.chromium.org/luci/ 4ecce4047..1bbbd9829 (17 commits)

https://chromium.googlesource.com/infra/luci/luci-go/+log/4ecce4047b80..1bbbd982903a

$ git log 4ecce4047..1bbbd9829 --date=short --no-merges --format='%ad %ae %s'
2018-10-12 vadimsh [token-server] Fix flake in rpc_inspect_machine_token_test.go.
2018-10-12 vadimsh [token-server] Use bqschemaupdater for updating BigQuery schemas.
2018-10-12 vadimsh [everything] Regenerate everything after protobuf lib change.
2018-10-11 nodir [buildbucket] Check response-level error
2018-10-11 fmatenaar [auth] Use IAM:GenerateAccessToken default lifetime if not set in client
2018-10-10 fmatenaar [auth] Use new IAM generateAccessToken API to obtain OAuth tokens
2018-10-10 nodir [swarming] Add retries on transient RPC errors
2018-10-10 vadimsh [cipd] Add boilerplate for cipd.Admin API.
2018-10-10 vadimsh [mapper] Implement AbortJob.
2018-10-10 vadimsh [mapper] Add Controller.GetJob method that fetches the job.
2018-10-10 vadimsh [mapper] Return info about a job as proto.
2018-10-09 sergiyb Replace LevelsDeep with BuildComponent hiearchy to render nested steps
2018-10-09 vadimsh [mapper] Query number of entities to be processed when launching.
2018-10-09 vadimsh [mapper] Keep track of how many entities have been processed.
2018-10-09 vadimsh [mapper] Move State enum definition to proto.
2018-10-08 vadimsh [mapper] Update jobs status when all shards finish running.
2018-10-08 vadimsh [mapper] Implement shards processing.

Created with:
roll-dep infra/go/src/go.chromium.org/luci

TBR=nodir@chromium.org, maruel@chromium.org
BUG=892008
Change-Id: Ia6b480d65cb158e900e7ed028a01efea455eaf22
Reviewed-on: https://chromium-review.googlesource.com/c/1277904
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18288}

[modify] https://crrev.com/e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43/DEPS
,
Oct 12
The new version is being built now: https://ci.chromium.org/p/infra-internal/builders/luci.infra-internal.prod/infra-packager-linux-64/2396
,
Oct 12
Thanks, Vadim
,
Oct 15
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-go.git/+/d731b1f51064de94e8dff9328b459b8120104a2c

commit d731b1f51064de94e8dff9328b459b8120104a2c
Author: Nodir Turakulov <nodir@google.com>
Date: Mon Oct 15 17:35:33 2018

Fix typo in swarming retry logic

TBR=maruel@chromium.org
Bug: 892008
Change-Id: I5d48e27abd7635e98762b439103be3f8cdd1da20
Reviewed-on: https://chromium-review.googlesource.com/c/1280964
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d731b1f51064de94e8dff9328b459b8120104a2c/client/cmd/swarming/common.go
,
Oct 18
,
Oct 18
Comment 1 by hinoka@chromium.org, Oct 5
Labels: -Foundation-Troopers