Increase the builder's GoB quotas to better accommodate current load
Project Member Reported by diand...@chromium.org, Jan 10
Basically the same thing happened as was described in bug #769088. This paladin:
https://luci-milo.appspot.com/buildbot/chromeos/fizz-paladin/3054
...failed in CommitQueueSync. In the text of the error you see much of:
---
09:52:42: WARNING: git reported transient error (cmd=fetch -f https://chrome-internal-review.googlesource.com/chromeos/overlays/overlay-fizz-private refs/changes/39/540139/2); retrying
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/retry_util.py", line 177, in _Wrapper
    ret = func(*args, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/retry_util.py", line 243, in _run
    return functor(*args, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 654, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 128; command: git fetch -f https://chrome-internal-review.googlesource.com/chromeos/overlays/overlay-fizz-private refs/changes/39/540139/2
fatal: remote error: Short term ls-remote-gerrit rate limit exceeded for <redacted>
fatal: The remote end hung up unexpectedly
[W git.go:283] Transient error string identified in STDERR: "fatal: The remote end hung up unexpectedly\n"
[W git.go:294] Retrying after 3s (rc=128): transient error string encountered
---
The previous bug was closed as WontFix since the problem didn't reproduce.
...actually, this happened on at least one other paladin too. I'll see if I can find any more as well... https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-paladin/18175
Looking at the logs on the master paladin, it seems that the master first tried fizz-paladin/3054, observed the failure, and restarted/retried with fizz-paladin/3055. That second run passed. The theory from dgarrett@ is that since we added new GCE builders some months ago, our usage has increased enough that we're brushing up against our GoB quota limits. In this particular case "brushing up against" became "outright exceeded". Since the quota we hit was short term, a retry worked in this case. Assuming the theory is correct, the right response is to request more quota.
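To illustrate why a retry can get past a short-term limit, here is a minimal sketch of retrying with exponential backoff only when stderr contains the rate-limit message. This is not chromite's retry_util; run_with_backoff, RATE_LIMIT_MARKER, and the default attempt count are hypothetical names and values chosen for illustration. The 3s starting delay is taken from the "Retrying after 3s" line in the log above.

```python
import time

# Assumption: substring of the GoB error seen in the logs above.
RATE_LIMIT_MARKER = "rate limit exceeded"


def run_with_backoff(run, max_attempts=4, base_delay=3.0, sleep=time.sleep):
    """Call run() until it succeeds, backing off on rate-limit errors.

    run() is a hypothetical callable returning (returncode, stderr).
    Returns the number of attempts that were made.
    """
    for attempt in range(max_attempts):
        returncode, stderr = run()
        if returncode == 0:
            return attempt + 1
        rate_limited = RATE_LIMIT_MARKER in stderr
        # Give up on non-quota errors, or once the attempts are used up.
        if not rate_limited or attempt == max_attempts - 1:
            raise RuntimeError(
                "command failed (rc=%d): %s" % (returncode, stderr.strip()))
        # Wait 3s, 6s, 12s, ... so the short-term window can reset.
        sleep(base_delay * 2 ** attempt)
```

Note that backoff only helps with the short-term limit; it does not reduce the total number of requests counted against the quota.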
This didn't get done last week. We need to find time to move it forward.
-> don to annotate this bug with the metric that shows the problem, then rediscuss in next meeting
This graph shows our usage over the last year. https://viceroy.corp.google.com/chromeos/gerrit?duration=38707218
I had a thought. We currently have wrappers around git commands that look for a variety of errors. On builders, there is another wrapper that also does retries. Are we using excessive quota in some cases because of an excessive number of retries? <outer retries> * <inner retries> ?
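A rough sketch of how the two layers would multiply, assuming each wrapper simply re-invokes whatever it wraps on failure. The names with_retries, FakeGitFetch, and count_requests are hypothetical and are not the actual chromite or builder wrappers.

```python
def with_retries(max_retries, func):
    """Wrap func so it is retried up to max_retries times after a failure."""
    def wrapper():
        last_error = None
        for _ in range(max_retries + 1):
            try:
                return func()
            except RuntimeError as e:
                last_error = e
        raise last_error
    return wrapper


class FakeGitFetch:
    """Stand-in for a git fetch that always hits the rate limit."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        raise RuntimeError("Short term ls-remote-gerrit rate limit exceeded")


def count_requests(outer_retries, inner_retries):
    """Return how many fetches one logical operation issues when all fail."""
    fetch = FakeGitFetch()
    wrapped = with_retries(outer_retries, with_retries(inner_retries, fetch))
    try:
        wrapped()
    except RuntimeError:
        pass
    return fetch.calls
```

Under these assumptions, one failing operation issues (outer retries + 1) * (inner retries + 1) requests, so count_requests(2, 3) is 12. Every one of those counts against the quota while the limit is already being hit.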
will ask to increase quota.
I see several paladins failing with the same error now.
https://uberchromegw.corp.google.com/i/chromeos/builders/chell-paladin/builds/3813
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_skate-paladin/builds/12494
https://uberchromegw.corp.google.com/i/chromeos/builders/eve-paladin/builds/2248
https://uberchromegw.corp.google.com/i/chromeos/builders/falco-paladin/builds/17779
https://uberchromegw.corp.google.com/i/chromeos/builders/fizz-paladin/builds/3213
https://uberchromegw.corp.google.com/i/chromeos/builders/kahlee-paladin/builds/1247
https://uberchromegw.corp.google.com/i/chromeos/builders/kip-paladin/builds/4223
...are a few of them.
Filing at go/gob-quota-ticket
akeshet, any idea how to increase the quota? That seems like the right fix.
Have not had a chance to follow up on above b/ . I don't believe this still needs to be a Chase bug. Recommend dropping the Chase label and downgrading to P2.
Saw a few instances of quota being exceeded this morning:
InfrastructureFailure: <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 128; command: git fetch -f https://chrome-internal-review.googlesource.com/chromeos/overlays/overlay-nautilus-private refs/changes/98/566698/2 fatal: remote error: Short te
RunCommandError: return code: 128; command: git fetch -f https://chrome-internal-review.googlesource.com/chromeos/overlays/overlay-nautilus-private refs/changes/98/566698/2 fatal: remote error: Short term ls-remote-gerrit rate limit exceeded for 3su6n15k.de
RunCommandError: return code: 128; command: git fetch -f https://chromium-review.googlesource.com/chromiumos/platform/arc-camera refs/changes/66/906266/1 fatal: remote error: Short term ls-remote-gerrit rate limit exceeded for firstname.lastname@example.org
https://luci-milo.appspot.com/buildbot/chromeos/auron_yuna-paladin/2172
https://luci-milo.appspot.com/buildbot/chromeos/daisy_spring-paladin/17987
https://luci-milo.appspot.com/buildbot/chromeos/parrot-paladin/25715
Ok, back to Chase it is.
Still awaiting assistance on https://b.corp.google.com/issues/72697187
Fixed, see https://b.corp.google.com/issues/72697187 for context. However, the steps described there to add reader permissions to new internal repos will be needed when new repos are added. The command to do so is: gob-ctl acl chrome-internal/NEW/REPO/NAME --email@example.com