
Issue 854652


Issue metadata

Status: Fixed
Owner:
Closed: Jul 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Out of GOB quota

Project Member Reported by athilenius@chromium.org, Jun 20 2018

Issue description

Swarming bots are failing with the following error (https://chrome-swarming.appspot.com/task?id=3e36a6caacc65810&refresh=10&show_raw=1):

Daily bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com

Failure rate: http://shortn/_Y0NDt3fX3h

 
Cc: mar...@chromium.org
Components: Infra
Components: -Infra Infra>Client>ChromeOS>CI
Oh... this is a GoB issue. Totally different team, not the trooper.
Status: WontFix (was: Untriaged)
Moved to b/110468393
Owner: vapier@chromium.org
Status: Assigned (was: WontFix)
Actually, going to re-open this as we need to look into why we exceeded quota. Dropping it down to a P1 though. Mike, people are pointing to you as the expert on this (a rare occurrence I know :p). Feel free to re-assign it though.
Labels: -Pri-0 Pri-1
Owner: ----
Status: WontFix (was: Assigned)
Sorry, my mistake; I copy-pasted something incorrectly into crosoncall. Resetting the state back. Sorry vapier/athilenius.
Owner: athilenius@chromium.org
Status: Started (was: WontFix)
Actually, Alec is going to drive a postmortem for this.
In IRC, the GoB team mentioned that we are using more than normal. If that's true, will we run out of quota again shortly?
I went through all CLs to land in chromite since the start of the 19th, and didn't find much that seemed relevant.

The spikes appear to be happening roughly an hour after each release builder run starts. That might sorta kinda line up with when "SyncChrome" starts on each of the build slaves. If that's the case, this CL might be to blame https://crrev.com/c/1103551.  However, I'm pretty sure it was included in a release build before the problems started appearing (fuzzy timezones make it hard to be certain).

Also, I wouldn't have thought that chrome syncing would be enough data to push us up that high.
Should we speculatively revert that CL to see if it puts things back to the old consumption rates?
Hum.... That does mean that repo and gclient are sharing a git cache directory, which is supposed to be safe, but maybe not regularly done.

Maybe the chrome sync did something that prevented the next repo build from using git cache?

PS: I BELIEVE that fetches into the cache come from GS, not GoB.
Yeah, a revert seems good.

But I'm also confused that the graphs are showing usage go down for each release build. If that theory was right, they should be constant.

PS: The CL fixed a bug in which our builders were creating a random, unexpected persistent directory and just using it.
What graph are you looking at?

GoB team gave us: https://screenshot.googleplex.com/YonoO1QXnQ4

With the panopticon query linked.
Come to think of it, we also have: 

https://viceroy.corp.google.com/chromeos/gerrit
Let's add those graphs to the postmortem: we have the data but didn't alert on it.
Ah... picking a random release builder and looking closely, I believe SyncChrome was the issue. It got WAY slower in all of the relevant builds.

https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=samus-release&buildBranch=master
Cc: hinoka@chromium.org
+Ryan
Chromium uses bot_update, which does the "fetch packfile from GCS, then incremental fetch from GoB" dance, but gclient itself doesn't do that by default.
Cc: iannucci@chromium.org
gclient can do that, if you set cache_dir.

It looks like your remote_run is doing a git checkout. We switched it to CIPD a couple of months ago; you probably want to do that too. +iannucci for kitchen/remote_run git -> cipd next steps.
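
For anyone unfamiliar with the cache_dir setting, a minimal sketch of a .gclient that uses it (the solution and cache path below are placeholders, not our actual bot config): with cache_dir set, gclient keeps bare mirrors in a shared git cache, and checkouts do incremental fetches from that cache instead of full fetches from GoB on every sync.

  # Hypothetical .gclient sketch; the cache path is a placeholder.
  solutions = [
    {
      "name": "src",
      "url": "https://chromium.googlesource.com/chromium/src.git",
      "managed": False,
      "custom_deps": {},
    },
  ]
  # Shared bare-repo cache reused across builds on the same bot
  # (assumed location; use whatever persistent directory the bot already has).
  cache_dir = "/b/git-cache"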
What changed at 11:48 PDT? http://shortn/_tmKeRz8pgw

All of the abnormal bandwidth consumption stopped then.
The revert of my gclient change landed at 10:48, but builds already in progress wouldn't pick it up.

  https://crrev.com/c/1108391

I believe it was either my CL, or some external problem biting us (ChOps git wrapper + GoB flake interacting badly, or something).

re #19; you can enable CIPD recipes by doing e.g. https://chromium.googlesource.com/chromium/src.git/+/master/infra/config/global/cr-buildbucket.cfg#556 in your cr-buildbucket.cfg file.
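
Roughly, the stanza that line points at looks like the sketch below (the bucket, builder, and recipe names here are placeholders, and the CIPD package shown is the chromium recipe bundle from the linked config; a CrOS config would point at its own bundle):

  buckets {
    name: "luci.example.try"  # placeholder bucket
    swarming {
      builders {
        name: "example-builder"  # placeholder builder
        recipe {
          name: "example_recipe"  # placeholder recipe name
          cipd_package: "infra/recipe_bundles/chromium.googlesource.com/chromium/tools/build"
          cipd_version: "refs/heads/master"
        }
      }
    }
  }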
Do we need to do anything to prepare the CIPD packages for the recipe?
Actually, I filed https://crbug.com/854830 for the CIPD change.
I don't see the connection with the tryjobs. What makes you think they are the same failure?
Revving git file /b/swarming/w/ir/cache/cbuild/repository/src/third_party/chromiumos-overlay/chromeos/config/make.conf.amd64-host
upload_prebuilts: Unhandled exception:
Traceback (most recent call last):
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/bin/upload_prebuilts", line 169, in <module>
    DoMain()
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/bin/upload_prebuilts", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 863, in main
    options.sync_binhost_conf)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 573, in SyncHostPrebuilts
    RevGitFile(git_file, {key: binhost}, dryrun=self._dryrun)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 165, in RevGitFile
    git.CreatePushBranch(prebuilt_branch, cwd)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/git.py", line 1317, in CreatePushBranch
    RunGit(git_repo, cmd)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/git.py", line 822, in RunGit
    return cros_build_lib.RunCommand(['git'] + cmd, **kwargs)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/cros_build_lib.py", line 669, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: git remote update cros
remote: Daily bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com
fatal: protocol error: bad pack header
error: Could not fetch cros
Fetching cros

The "daily bandwidth rate limit exceeded" is the only reason I matched this bug. $subject doesn't really target a specific root cause, and so it's tough for me to understand if there's something new going on, or an existing known issue.
Owner: jclinton@chromium.org
Oh... that does match up.

These tryjobs ran Friday, which is the same day we brought a lot of new builders online.

Our total bandwidth usage did spike way up, especially on the 30th.

I guess this answers one question.... 256 cold builders is more than our quota can currently handle.
I believe the same tryjobs should be good now.
Cc: athilenius@chromium.org
Owner: athilenius@chromium.org
Alec, please add the postmortem to this bug and link all of your follow-up bugs here. Then mark this bug as fixed.
Status: Fixed (was: Started)
Postmortem Info

Requiem link: requiem/pm/postmortem101909
Doc link: requiem/doc/postmortem101909

Follow-up bugs
Add GoB quota threshold alerts b/110774464
Add Swarming bot failure rate threshold alerts b/110773411
Improve Git cache characteristics for CrOS builders (long term project) b/110773139

Issue 838911 has been merged into this issue.
