CQ failures - GoB bandwidth exceeded
Reported by jrbarnette@chromium.org, Jun 29 2016
Issue description
All paladin builders have failed in the CommitQueueSync stage.
The error looks like this:
09:36:06: INFO: Updating manifest-versions checkout.
09:36:26: WARNING: git reported transient error (cmd=remote update origin); retrying
Traceback (most recent call last):
File "/b/build/slave/lakitu-paladin-master/build/chromite/lib/retry_util.py", line 88, in GenericRetry
ret = functor(*args, **kwargs)
File "/b/build/slave/lakitu-paladin-master/build/chromite/lib/cros_build_lib.py", line 619, in RunCommand
raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: git remote update origin
remote: Short term bandwidth rate limit exceeded for chromeos-commit-bot@chromium.org
fatal: protocol error: bad pack header
error: Could not fetch origin
The problem is probably related to attempts to fix a recent canary problem.
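For context on where the "retrying" line comes from: chromite runs the git command through retry_util.GenericRetry, which re-runs the command when it fails and only raises RunCommandError once the retries are exhausted. A minimal sketch of that pattern (simplified; the real chromite code differs in its backoff and error classification):

import subprocess
import time

def run_with_retry(cmd, max_retries=2, sleep_seconds=20):
    # Simplified illustration of chromite's GenericRetry + RunCommand pairing;
    # not the actual implementation.
    for attempt in range(max_retries + 1):
        result = subprocess.run(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, text=True)
        if result.returncode == 0:
            return result
        # "Short term bandwidth rate limit exceeded" is reported by the server,
        # so retrying only helps once the quota window has passed.
        if attempt < max_retries:
            print('WARNING: git reported transient error (cmd=%s); retrying'
                  % ' '.join(cmd[1:]))
            time.sleep(sleep_seconds)
    raise RuntimeError('return code: %d; command: %s'
                       % (result.returncode, ' '.join(cmd)))

run_with_retry(['git', 'remote', 'update', 'origin'])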
Jun 29 2016
We have an initial plan of action:
* We'll be asking for a short term increase in the bandwidth
quota. The CQ is fine except for this bandwidth problem,
so once that's done, we'll be able to let the CQ move forward.
* The PFQ and canary are both failing because of bug 624177.
So, we need to resolve that problem. When that's done, the
PFQ will be able to move forward.
* The canaries will go last. We're investigating how best to proceed
without exhausting our bandwidth quota again.
Initial estimate (based on prior experience) is that resolution of the
canaries will require at least 12 hours. We hope to have the CQ and PFQ
up by CoB, Pacific time.
Jun 29 2016
For a longer term fix, can we get a GoB bandwidth quota in place that would allow a complete flush of at least one set of builders while keeping everything else running? It seems like a bug to me that our GoB bandwidth quota is unable to keep up with a fleet clobber.
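To make the scale of the request concrete, here is a purely illustrative sizing calculation; the checkout size and builder count below are hypothetical placeholders, not measurements of the fleet:

# Hypothetical numbers, for illustration only.
checkout_gb = 20            # assumed size of one full source checkout
builders_in_one_set = 50    # assumed number of builders clobbered at once

# A fleet clobber forces every builder in the set to re-fetch from scratch.
clobber_fetch_gb = checkout_gb * builders_in_one_set     # 1000 GB in one wave

# Steady-state incremental syncs are far smaller per builder.
incremental_gb = 2
steady_state_gb = incremental_gb * builders_in_one_set   # 100 GB in one wave

print(clobber_fetch_gb / steady_state_gb)   # a clobber needs ~10x the usual quota

A quota sized only for the steady-state case gets exhausted almost immediately by a clobber, which is what the question above is getting at.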
Jun 29 2016
Update:
We've moved the tree from "closed" to "throttled", and our
quota seems to have recovered sufficiently. So:
* The PFQ is already running, but slated to fail because
of bug 624177.
* The CQ is running, and apparently is able to sync.
* The 10:00 canary was already stopped; the next one is
due for 18:00. We think that most of them will be able
to sync, but we may need to kill a few selected builders.
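For readers unfamiliar with the "closed"/"throttled" distinction: builders consult the tree status before starting, and in this incident moving from closed to throttled was enough to let the automated runs sync again. A rough sketch of such a check (the URL and field name are placeholders, not necessarily the real endpoint):

import json
import urllib.request

# Placeholder URL and field name; the real ChromeOS tree-status service may differ.
TREE_STATUS_URL = 'https://example-tree-status.appspot.com/current?format=json'

def tree_allows_automated_runs():
    # Assumed convention: 'open' and 'throttled' permit automated builds,
    # 'closed' does not.
    with urllib.request.urlopen(TREE_STATUS_URL) as response:
        status = json.load(response)
    return status.get('general_state') in ('open', 'throttled')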
Jun 29 2016
Update: The CQ is now running normally. The fix for bug 624177 is in, and the next PFQ run should pick it up. We don't expect that problem to affect the next canary run. The canaries are being repaired. The run at 18:00 should be able to complete normally, but success is not guaranteed.
Jun 30 2016
The canary repair was only partly successful. Some builders are on track to finish the 18:00 run; most have already failed. More repair work is underway. I'm hopeful that by 02:00 at least 20 more canaries will be clear of trouble. If there are still failures, we'll have time on Thursday morning to get everything working in time for the 10:00 canary run.
Jun 30 2016
Update:
Every builder has now gotten through ManifestVersionedSync. However,
many canaries are still failing with bandwidth complaints in SyncChrome.
Additionally, there's one other unexplained failure symptom in the
Uprev phase on some canaries:
02:20:57: INFO: RunCommand: /b/cbuild/internal_master/chromite/bin/cros_mark_as_stable --all '--boards=mccloud' ...
02:21:17: INFO: Committing changes with commit message: Marking 9999 ebuild for chromeos-base/plaso-cronlite as stable.
02:21:40: INFO: Attempting refresh to obtain initial access_token
02:21:40: INFO: Refreshing access_token
02:21:47: ERROR: Package chromeos-kernel-4_4 has a chromeos-version.sh script but it returned no valid version for "/b/cbuild/internal_master/src/third_party/kernel/v4.4"
For me, this is inscrutable. I didn't think canaries should do Uprev
at all, so I don't know why this happens, let alone why it fails.
Action is ongoing.
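For reference, the failing check is the one where the build finds a package-level chromeos-version.sh script and expects it to print a usable version string for the given source directory. A rough reproduction of that check (an assumption about its behavior, not the actual chromite code):

import re
import subprocess

def get_package_version(version_script, srcdir):
    # Sketch only: run the package's chromeos-version.sh against srcdir and
    # insist that it prints something version-shaped.
    result = subprocess.run(['bash', version_script, srcdir],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            text=True)
    version = result.stdout.strip()
    if result.returncode != 0 or not re.match(r'^\d+(\.\d+)*$', version):
        raise ValueError('Package has a chromeos-version.sh script but it '
                         'returned no valid version for %r' % srcdir)
    return version

# e.g. get_package_version(kernel_version_script,   # hypothetical script path
#                          '/b/cbuild/internal_master/src/third_party/kernel/v4.4')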
Jun 30 2016
Update:
Uprev is normal. Who knew?
After a deeper dive into the "kernel/v4.4" complaint (including testing),
the best available theory is that
A) The problem is a missing or corrupted git repository caused by
the lingering bandwidth failures in the 18:00 canary run, and
B) The builders will repair themselves in the 10:00 run.
Waiting to see how that bears out.
Jun 30 2016
None of the builders that previously failed Uprev have repaired
themselves on their own.
Logging in to one of the builders, you get this from the kernel/v4.4
repo:
$ git describe --match "${PATTERN}" --abbrev=0 HEAD
fatal: No names found, cannot describe anything.
So, lingering corruption. Not yet clear how to fix it.
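The describe failure means the local kernel/v4.4 checkout has no tags matching the version pattern, which fits the missing-or-corrupted-repository theory above. One plausible manual repair (a hedged sketch, not necessarily what will actually be done) is to re-fetch tags from the remote and re-run the failing check:

import subprocess

def refetch_tags_and_describe(repo_dir, pattern):
    # Assumes the remote is named 'origin' and that only the tag refs are
    # missing, not the underlying objects.
    subprocess.run(['git', '-C', repo_dir, 'fetch', '--tags', 'origin'],
                   check=True)
    result = subprocess.run(
        ['git', '-C', repo_dir, 'describe', '--match', pattern,
         '--abbrev=0', 'HEAD'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        # Tags are still missing; the checkout probably needs a full re-sync.
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip()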
Jun 30 2016
Update: The corruption problem should be resolved (knock on wood). The remaining problem is a number of canaries still bumping up against the bandwidth limit during SyncChrome. There's a good chance that this can be resolved in time for the 18:00 canary.
Jun 30 2016
Update: All the canaries that previously failed have been synced, and should now be able to sync for the next build without fear of the bandwidth quota. Assuming that it all goes as expected, we're done now.
Jul 1 2016
Removing RBD since this is resolved now.
Jul 1 2016
Update: Overnight, slightly more than a dozen canaries were still being blocked by the bandwidth quota. I repaired about a half dozen this morning, but I couldn't get to all of them. The canary of 10:00 7/1 still had 10 failures due to the bandwidth quota. I've stepped in to try to repair them. All the others are able to build images, and shouldn't be seeing the problem any more.
Jul 1 2016
Update: After attempting repair of the ~10 builders at ~11:30 Pacific, we bumped up against the daily bandwidth quota. That quota is now reset, so we're OK for the moment. A re-attempt to get about half of them synced has succeeded, and we're now down to 6 that have yet to successfully sync. Those six are repairing now. One way or the other they'll be done before the 18:00 canary run. Lord willin' and the crick don't rise, these are the last 6 needing manual attention.
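Since repairing all of the stragglers at once tripped the daily quota, the remaining repairs are being paced: a few builders at a time, with a pause in between so the short-term quota can recover. A rough sketch of that pacing (sync_builder, the batch size, and the pause length are all hypothetical):

import time

def repair_in_batches(builders, sync_builder, batch_size=3, pause_minutes=60):
    # sync_builder is a hypothetical callable that performs one builder's sync.
    for start in range(0, len(builders), batch_size):
        for builder in builders[start:start + batch_size]:
            sync_builder(builder)
        if start + batch_size < len(builders):
            time.sleep(pause_minutes * 60)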
Jul 2 2016
The bandwidth problems have cleared up. I'm declaring victory.
Aug 12 2016
Problem is occurring again (and was occurring throughout the evening): https://uberchromegw.corp.google.com/i/chromeos/builders/pre-cq-launcher/builds/7499
Aug 12 2016
This time it looks like it might be happening on some default service account:
Fetching project chromiumos/third_party/blu
remote: Short term bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com
fatal: protocol error: bad pack header
remote: Short term bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com
Aug 12 2016
Please, please, please, open new bugs for new events/failures. This bug was about a very specific event, with a well-defined beginning and a well-defined point of being fixed. The failure in c#19 is a new, different failure, needing new, different attention.
Aug 12 2016
New bug: issue 637440