
Issue 892008


Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Swarming client frequently returning 500 response code

Project Member Reported by joshuaseaton@google.com, Oct 4

Issue description

A swarming build step is frequently hitting the following error:
```
swarming: googleapi: got HTTP response code 500 with body: 500 Internal Server Error

The server has either erred or is incapable of performing the requested operation.
```

The step uses the luci-go swarming client.

Examples of failure:
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933629361170312928
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933633520339708592
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933637154446287456
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933641144116126416

Example of success:
https://ci.chromium.org/p/fuchsia/builders/luci.fuchsia.ci/garnet-x64-debug/b8933630877829294432
 
Components: -Infra>Platform Infra>Platform>Swarming
Labels: -Foundation-Troopers
Status: Available (was: Untriaged)
You'll have to make the Go implementation of swarming retry properly.
Could you please link to code pointers showing where this logic would need to be added?
Cc: iannucci@chromium.org
Labels: -Pri-3 Pri-1
Hi Robbie, Marc-Antoine mentioned hoping you might chime in on this point.
Any thoughts?
Cc: vadimsh@chromium.org no...@chromium.org tandrii@chromium.org
Looking at https://chromium.googlesource.com/infra/luci/luci-go/+/master/client/cmd/swarming/trigger.go, the generated code uses https://google.golang.org/api/gensupport to do the actual call, but I don't see any retry mechanism there. I used https://go.chromium.org/luci/common/lhttp in the previous code, but the hot new thing is to use https://go.chromium.org/luci/common/retry.

Someone should chime in, as I haven't followed along on how much work this entails, and package retry has no examples or package documentation (oops?).
Project Member

Comment 7 by bugdroid1@chromium.org, Oct 10

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/da86e63132f33771ab628b10d2bc8e874cedeccc

commit da86e63132f33771ab628b10d2bc8e874cedeccc
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Oct 10 16:09:10 2018

[swarming] Add retries on transient RPC errors

Bug: 892008
Change-Id: I0e8f521865443be3c93c26fbce3b63477b037538
Reviewed-on: https://chromium-review.googlesource.com/c/1273665
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/da86e63132f33771ab628b10d2bc8e874cedeccc/client/cmd/swarming/common.go

Please do the deployments as needed.
Thanks a lot, Nodir and Marc-Antoine!

Comment 10 Deleted

Have you rolled new CIPD packages? We're not using the Go implementation in Chromium, so we don't have anything to roll ourselves. That's eventually going to be done as part of issue 894348. :)
We always use the 'latest' of infra/tools/luci/swarming, so we don't have anything to roll on our end.

Has the most recent version with Nodir's update not yet been uploaded to CIPD? There have been other uploads over the past three days, so I assumed it had.
The latest built commit is https://chromium.googlesource.com/infra/infra/+/master/DEPS#32, which is from Oct 8. The change that adds retries is from Oct 10. I'll update DEPS to pick it up.
Project Member

Comment 15 by bugdroid1@chromium.org, Oct 12

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43

commit e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Fri Oct 12 18:20:53 2018

Roll infra/go/src/go.chromium.org/luci/ 4ecce4047..1bbbd9829 (17 commits)

https://chromium.googlesource.com/infra/luci/luci-go/+log/4ecce4047b80..1bbbd982903a

$ git log 4ecce4047..1bbbd9829 --date=short --no-merges --format='%ad %ae %s'
2018-10-12 vadimsh [token-server] Fix flake in rpc_inspect_machine_token_test.go.
2018-10-12 vadimsh [token-server] Use bqschemaupdater for updating BigQuery schemas.
2018-10-12 vadimsh [everything] Regenerate everything after protobuf lib change.
2018-10-11 nodir [buildbucket] Check response-level error
2018-10-11 fmatenaar [auth] Use IAM:GenerateAccessToken default lifetime if not set in client
2018-10-10 fmatenaar [auth] Use new IAM generateAccessToken API to obtain OAuth tokens
2018-10-10 nodir [swarming] Add retries on transient RPC errors
2018-10-10 vadimsh [cipd] Add boilerplate for cipd.Admin API.
2018-10-10 vadimsh [mapper] Implement AbortJob.
2018-10-10 vadimsh [mapper] Add Controller.GetJob method that fetches the job.
2018-10-10 vadimsh [mapper] Return info about a job as proto.
2018-10-09 sergiyb Replace LevelsDeep with BuildComponent hiearchy to render nested steps
2018-10-09 vadimsh [mapper] Query number of entities to be processed when launching.
2018-10-09 vadimsh [mapper] Keep track of how many entities have been processed.
2018-10-09 vadimsh [mapper] Move State enum definition to proto.
2018-10-08 vadimsh [mapper] Update jobs status when all shards finish running.
2018-10-08 vadimsh [mapper] Implement shards processing.

Created with:
  roll-dep infra/go/src/go.chromium.org/luci

TBR=nodir@chromium.org, maruel@chromium.org
BUG=892008

Change-Id: Ia6b480d65cb158e900e7ed028a01efea455eaf22
Reviewed-on: https://chromium-review.googlesource.com/c/1277904
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18288}
[modify] https://crrev.com/e3966fc00b906b7d9abe74bfc3838f0ce0d4ae43/DEPS

Thanks, Vadim
Project Member

Comment 18 by bugdroid1@chromium.org, Oct 15

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/d731b1f51064de94e8dff9328b459b8120104a2c

commit d731b1f51064de94e8dff9328b459b8120104a2c
Author: Nodir Turakulov <nodir@google.com>
Date: Mon Oct 15 17:35:33 2018

Fix typo in swarming retry logic

TBR=maruel@chromium.org

Bug: 892008
Change-Id: I5d48e27abd7635e98762b439103be3f8cdd1da20
Reviewed-on: https://chromium-review.googlesource.com/c/1280964
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d731b1f51064de94e8dff9328b459b8120104a2c/client/cmd/swarming/common.go

Cc: iannu...@google.com
Cc: -iannucci@chromium.org
