
Issue 624460

Starred by 3 users

Issue metadata

Status: Verified
Owner:
Closed: Aug 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug




CQ failures - GoB bandwidth exceeded

Reported by jrbarnette@chromium.org, Jun 29 2016

Issue description

All paladin builders have failed in the CommitQueueSync stage.

The error looks like this:

09:36:06: INFO: Updating manifest-versions checkout.
09:36:26: WARNING: git reported transient error (cmd=remote update origin); retrying
Traceback (most recent call last):
  File "/b/build/slave/lakitu-paladin-master/build/chromite/lib/retry_util.py", line 88, in GenericRetry
    ret = functor(*args, **kwargs)
  File "/b/build/slave/lakitu-paladin-master/build/chromite/lib/cros_build_lib.py", line 619, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: git remote update origin
remote: Short term bandwidth rate limit exceeded for chromeos-commit-bot@chromium.org
fatal: protocol error: bad pack header
error: Could not fetch origin
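
For reference, the log shows the builder retrying the fetch before giving up.  A
rough shell rendering of that retry behavior (illustrative only; the real logic
lives in chromite's retry_util.GenericRetry, in Python) looks like this:

    # Illustrative sketch of the retry visible in the log above; not the
    # actual chromite code.  Each attempt fails because GoB rejects the
    # fetch with "Short term bandwidth rate limit exceeded".
    for attempt in 1 2 3; do
        git remote update origin && break
        echo "git reported transient error (cmd=remote update origin); retrying" >&2
        sleep 30
    done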

The problem is probably related to attempts to fix a recent canary problem.

 
More information:

Recently, the canaries went a lovely shade of red due to bug 624177.

The attempted remedy was to force all of the canary builders to clobber.
Unfortunately, "clobber" forces "repo init; repo sync" across all of the
canaries, of which there are several dozen.  Simultaneously syncing an
entire tree across so many builders exceeds our available bandwidth
quota.  Hence this symptom.
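
To make the cost concrete: each clobbered builder rebuilds its checkout from
nothing, roughly as in the sketch below (illustrative only; the manifest URL is
a placeholder, not the builders' exact recipe):

    # Illustrative sketch of what "clobber" forces on each builder:
    # a from-scratch client instead of an incremental sync.
    repo init -u "${MANIFEST_URL}"   # MANIFEST_URL is a placeholder, not the real URL
    repo sync                        # full fetch of every project in the tree
    # Several dozen canaries doing this simultaneously is what exhausted
    # the GoB bandwidth quota.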

We're working to form a plan for a fix; there's no ETA.

We have an initial plan of action:
  * We'll be asking for a short term increase in the bandwidth
    quota.  The CQ is fine except for this bandwidth problem,
    so once that's done, we'll be able to let the CQ move forward.
  * The PFQ and canary are both failing because of bug 624177.
    So, we need to resolve that problem.  When that's done, the
    PFQ will be able to move forward.
  * The canaries will go last.  We're investigating how best to proceed
    without exhausting our bandwidth quota again.

Initial estimate (based on prior experience) is that resolution of the
canaries will require at least 12 hours.  We hope to have the CQ and PFQ
up by CoB, Pacific time.

For a longer-term fix, can we get a GoB bandwidth quota in place that would allow a complete flush of at least one set of builders while keeping everything else running?

It seems like a bug to me that our GoB bandwidth quota is unable to keep up with a fleet clobber. 

Cc: bhthompson@chromium.org
Update:

We've moved the tree from "closed" to "throttled", and our
quota seems to have recovered sufficiently.  So:
  * The PFQ is already running, but slated to fail because
    of bug 624177.
  * The CQ is running, and apparently is able to sync.
  * The 10:00 canary was already stopped; the next one is
    due at 18:00.  We think that most of them will be able
    to sync, but we may need to kill a few selected builders.

Update:

The CQ is now running normally.

The fix for bug 624177 is in, and the next PFQ run should
pick it up.  We expect that problem won't affect the
next canary run.

The canaries are being repaired.  The run at 18:00 should
be able to complete normally, but success is not guaranteed.

The canary repair was only partly successful.  Some builders are
on track to finish the 18:00 run; most have already failed.

There will be more efforts.  I'm hopeful that by 02:00 at least
20 more canaries will be clear of trouble.  If there are still
failures, we'll have time on Thursday morning to get everything
working in time for the 10:00 canary run.

Update:

Every builder has now gotten through ManifestVersionedSync.  However,
many canaries are still failing with bandwidth complaints in SyncChrome.

Additionally, there's one other unexplained failure symptom in the
Uprev phase on some canaries:
    02:20:57: INFO: RunCommand: /b/cbuild/internal_master/chromite/bin/cros_mark_as_stable --all '--boards=mccloud' ...
    02:21:17: INFO: Committing changes with commit message: Marking 9999 ebuild for chromeos-base/plaso-cronlite as stable.
    02:21:40: INFO: Attempting refresh to obtain initial access_token
    02:21:40: INFO: Refreshing access_token
    02:21:47: ERROR: Package chromeos-kernel-4_4 has a chromeos-version.sh script but it returned no valid version for "/b/cbuild/internal_master/src/third_party/kernel/v4.4"

For me, this is inscrutable.  I didn't think canaries should do Uprev
at all, so I don't know why this happens, let alone why it fails.

Action is ongoing.

Labels: ReleaseBlock-Dev M-53
Update:

Uprev is normal.  Who knew?

After a deeper dive into the "kernel/v4.4" complaint (including testing),
the best available theory is that
 A) The problem is a missing or corrupted git repository caused by
    the lingering bandwidth failures in the 18:00 canary run, and
 B) The builders will repair themselves in the 10:00 run.

Waiting to see whether that bears out.

None of the builders that previously failed Uprev have repaired themselves.

Logging in to one of the builders, you get this from the kernel/v4.4
repo:
    $ git describe --match "${PATTERN}" --abbrev=0 HEAD
    fatal: No names found, cannot describe anything.

So, lingering corruption.  Not yet clear how to fix it.
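
For context on why those two symptoms are the same problem: a chromeos-version.sh
script for the kernel typically derives its version from the repository's git
history, along the lines of the sketch below (illustrative; not the actual
script, and the tag pattern is assumed).  With the checkout's tag data damaged,
git describe finds nothing, the script produces no version, and Uprev fails
exactly as in the log above.

    # Hedged sketch of the kind of lookup chromeos-version.sh performs
    # (illustrative; not the real script).
    PATTERN="v4.4*"    # assumed tag pattern for kernel/v4.4
    version=$(git describe --match "${PATTERN}" --abbrev=0 HEAD 2>/dev/null)
    # In a corrupted checkout this prints nothing ("fatal: No names found"),
    # so there is no valid version to report and the Uprev stage fails.
    echo "${version#v}"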

Update:

The corruption problem should be resolved (knock on wood).

The remaining problem is a number of canaries still bumping up
against the bandwidth limit during SyncChrome.

There's a good chance that this can be resolved in time for the 18:00
canary.

Update:

All the canaries that previously failed have been synced, and
should now be able to sync for the next build without fear of
the bandwidth quota.

Assuming that it all goes as expected, we're done now.

Labels: -ReleaseBlock-Dev
Removing RBD since this is resolved now.
Update:

Overnight, slightly more than a dozen canaries were still being
blocked by the bandwidth quota.  I repaired about a half dozen
this morning, but I couldn't get to all of them.

The 10:00 canary run on 7/1 still had 10 failures due to the
bandwidth quota.  I've stepped in to try to repair them.
All the others are able to build images, and shouldn't be
seeing the problem any more.

Update:

After attempting repair of the ~10 builders at ~11:30 Pacific,
we bumped up against the daily bandwidth quota.  That quota is
now reset, so we're OK for the moment.

A second attempt got about half of them synced, and we're now
down to 6 that have yet to sync successfully.  Those six are
repairing now.  One way or the other, they'll be done before the
18:00 canary run.

Lord willin' and the crick don't rise, these are the last 6 needing
manual attention.

The bandwidth problems have cleared up.  I'm declaring victory.

Status: Verified (was: Assigned)
Cc: dshi@chromium.org davidri...@chromium.org vprupis@chromium.org
Status: Available (was: Verified)
Problem is occurring again (and was occurring throughout the evening):
https://uberchromegw.corp.google.com/i/chromeos/builders/pre-cq-launcher/builds/7499
This time it looks like it might be happening on some default service account:
Fetching project chromiumos/third_party/bluremote: Short term bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com
fatal: protocol error: bad pack header
remote: Short term bandwidth rate limit exceeded for 3su6n15k.default@developer.gserviceaccount.com

Status: Verified (was: Available)
Please, please, please, open new bugs for new events/failures.
This bug was about a very specific event, with a well-defined
beginning and a well-defined point of being fixed.  The failure
in c#19 is a new, different failure, needing new, different
attention.

New bug: issue 637440
