Out of GOB quota
Issue description
Swarming bots are failing with the following error (https://chrome-swarming.appspot.com/task?id=3e36a6caacc65810&refresh=10&show_raw=1): Daily bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com Failure rate: http://shortn/_Y0NDt3fX3h
,
Jun 20 2018
Oh... this is a GoB issue. Totally different team, not the trooper.
,
Jun 20 2018
,
Jun 20 2018
Actually, going to re-open this as we need to look into why we exceeded quota. Dropping it down to a P1 though. Mike, people are pointing to you as the expert on this (a rare occurrence I know :p). Feel free to re-assign it though.
,
Jun 20 2018
,
Jun 20 2018
Sorry, my mistake: I copy-pasted something incorrectly into crosoncall. Resetting the state back. Sorry vapier/athilenius.
,
Jun 20 2018
Actually, Alec is going to drive a postmortem for this.
,
Jun 20 2018
In IRC, the GoB team mentioned that we are using more than normal. If that's true, will we run out of quota again shortly?
,
Jun 20 2018
I went through all CLs to land in chromite since the start of the 19th, and didn't find much that seemed relevant. The spikes appear to be happening roughly an hour after each release builder run starts. That might sorta kinda line up with when "SyncChrome" starts on each of the build slaves. If that's the case, this CL might be to blame https://crrev.com/c/1103551. However, I'm pretty sure it was included in a release build before the problems started appearing (fuzzy timezones make it hard to be certain). Also, I wouldn't have thought that chrome syncing would be enough data to push us up that high.
,
Jun 20 2018
Should we speculatively revert that CL to see if it puts things back to the old consumption rates?
,
Jun 20 2018
Hum.... That does mean that repo and gclient are sharing a git cache directory, which is supposed to be safe, but maybe not regularly done. Maybe the chrome sync did something that prevented the next repo build from using git cache? PS: I BELIEVE that fetches into the cache come from GS, not GoB.
,
Jun 20 2018
Yeah, a revert seems good. But I'm also confused that the graphs are showing usage go down for each release build. If that theory was right, they should be constant. PS: The CL fixed a bug in which our builders were creating a random, unexpected persistent directory and just using it.
,
Jun 20 2018
What graph are you looking at?
,
Jun 20 2018
GoB team gave us: https://screenshot.googleplex.com/YonoO1QXnQ4 With the panopticon query linked.
,
Jun 20 2018
Come to think of it, we also have: https://viceroy.corp.google.com/chromeos/gerrit
,
Jun 20 2018
Let's add those graphs to the postmortem: we have the data but didn't alert on it.
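A rough sketch of what alerting on that data could look like. The `check_quota` helper, the `QuotaAlert` type, and the 70%/90% thresholds are all hypothetical; the real quota figure and usage numbers would have to come from the GoB team and the graphs linked above.

```python
from dataclasses import dataclass


@dataclass
class QuotaAlert:
    level: str       # "ok", "warn", or "page"
    fraction: float  # fraction of the daily quota already consumed


def check_quota(used_bytes, daily_quota_bytes, warn_at=0.7, page_at=0.9):
    """Classify usage against hypothetical warn/page thresholds."""
    if daily_quota_bytes <= 0:
        raise ValueError("daily_quota_bytes must be positive")
    fraction = used_bytes / daily_quota_bytes
    if fraction >= page_at:
        level = "page"
    elif fraction >= warn_at:
        level = "warn"
    else:
        level = "ok"
    return QuotaAlert(level, fraction)
```

Anything along these lines would have flagged the climb in usage before the bots started failing, instead of after.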
,
Jun 20 2018
Ah... picking a random release builder and looking closely, I believe SyncChrome was the issue. It got WAY slower in all of the relevant builds. https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=samus-release&buildBranch=master
,
Jun 20 2018
+Ryan Chromium uses bot_update, which does the "fetch packfile from GCS, then incremental fetch from GoB" but gclient itself doesn't do that.
,
Jun 20 2018
gclient can do that if you set cache_dir. It looks like your remote_run is doing a git checkout. We switched it to cipd a couple of months ago; you probably want to do that too. +iannucci for kitchen/remote_run git -> cipd next steps.
,
Jun 20 2018
What changed at 11:48 PDT? http://shortn/_tmKeRz8pgw All of the abnormal bandwidth consumption stopped then.
,
Jun 20 2018
The revert of my gclient change landed at 10:48, but already in-progress builds wouldn't pick it up. https://crrev.com/c/1108391 I believe it was either my CL, or some external problem biting us (ChOps git wrapper + GoB flake interacting badly, or something).
,
Jun 20 2018
re #19; you can enable CIPD recipes by doing e.g. https://chromium.googlesource.com/chromium/src.git/+/master/infra/config/global/cr-buildbucket.cfg#556 in your cr-buildbucket.cfg file.
,
Jun 20 2018
Do we need to do anything to prepare the CIPD packages for the recipe?
,
Jun 20 2018
Actually, I filed https://crbug.com/854830 for the CIPD change.
,
Jul 2
Is this why my chromiumos-sdk-tryjobs failed here, or should I file a new bug? https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942330288719869376 https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8942330590227190256
,
Jul 2
I don't see the connection with the tryjobs. What makes you think they are the same failure?
,
Jul 2
Revving git file /b/swarming/w/ir/cache/cbuild/repository/src/third_party/chromiumos-overlay/chromeos/config/make.conf.amd64-host
upload_prebuilts: Unhandled exception:
Traceback (most recent call last):
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/bin/upload_prebuilts", line 169, in <module>
    DoMain()
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/bin/upload_prebuilts", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 863, in main
    options.sync_binhost_conf)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 573, in SyncHostPrebuilts
    RevGitFile(git_file, {key: binhost}, dryrun=self._dryrun)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/scripts/upload_prebuilts.py", line 165, in RevGitFile
    git.CreatePushBranch(prebuilt_branch, cwd)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/git.py", line 1317, in CreatePushBranch
    RunGit(git_repo, cmd)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/git.py", line 822, in RunGit
    return cros_build_lib.RunCommand(['git'] + cmd, **kwargs)
  File "/b/swarming/w/ir/cache/cbuild/repository/chromite/lib/cros_build_lib.py", line 669, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: git remote update cros
remote: Daily bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com
fatal: protocol error: bad pack header
error: Could not fetch cros
Fetching cros
The "daily bandwidth rate limit exceeded" is the only reason I matched this bug. $subject doesn't really target a specific root cause, and so it's tough for me to understand if there's something new going on, or an existing known issue.
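For what it's worth, matching on that message could be automated so quota failures get tagged instead of showing up as a generic git error. A minimal sketch: the helper name `find_quota_account` is made up, and the pattern is only based on the one error line in the log above, so the real message format may vary.

```python
import re

# Pattern based on the single error line seen in the log above; the exact
# wording GoB uses in other cases is an assumption.
_QUOTA_RE = re.compile(r"Daily bandwidth rate limit exceeded for (\S+)")


def find_quota_account(output):
    """Return the account named in a quota error, or None if absent."""
    match = _QUOTA_RE.search(output)
    return match.group(1) if match else None
```

A caller could run this over the captured git output and, when it returns an account, report the failure as a quota problem rather than a fetch error.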
,
Jul 2
Oh... that does match up. These tryjobs ran Friday, which is the same day we brought a lot of new builders online. Our total bandwidth usage did spike way up, especially on the 30th. I guess this answers one question.... 256 cold builders is more than our quota can currently handle.
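Back-of-the-envelope version of that capacity question, as a sketch. The function names are hypothetical, and the per-builder sync size and daily quota used in the example are placeholders, not measured values; only the 256 builder count comes from this bug.

```python
def cold_sync_fraction(builders, bytes_per_cold_sync, daily_quota_bytes):
    """Fraction of the daily quota used if every builder does one cold sync."""
    return builders * bytes_per_cold_sync / daily_quota_bytes


def max_cold_builders(bytes_per_cold_sync, daily_quota_bytes):
    """Largest number of cold syncs that fits within one day's quota."""
    return daily_quota_bytes // bytes_per_cold_sync


GIB = 2**30
TIB = 2**40

# Placeholder inputs: 20 GiB per cold sync and a 4 TiB daily quota are
# made-up numbers for illustration only.
fraction = cold_sync_fraction(256, 20 * GIB, 4 * TIB)
limit = max_cold_builders(20 * GIB, 4 * TIB)
```

With those placeholder inputs, `fraction` comes out above 1.0, meaning the cold syncs alone would exceed the quota. Plugging in the real numbers would tell us how many builders we can safely bring online per day.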
,
Jul 2
I believe the same tryjobs should be good now.
,
Jul 2
Alec, please add the postmortem to this bug and link all of your follow-up bugs here. Then mark this bug as fixed.
,
Jul 3
Postmortem Info
Requiem link: requiem/pm/postmortem101909
Doc link: requiem/doc/postmortem101909
Follow up bugs
Add GoB quota threshold alerts: b/110774464
Add Swarming bot failure rate threshold alerts: b/110773411
Improve Git cache characteristics for CrOS builders (long term project): b/110773139
,
Jul 3
Issue 838911 has been merged into this issue.
Comment 1 by dgarr...@chromium.org, Jun 20 2018
Components: Infra