
Issue 611240

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Do not compound GoB flake by wiping out local checkouts.

Project Member Reported by dgarr...@chromium.org, May 11 2016

Issue description

Today, we had a GoB outage that caused GoB sync requests to be flaky, and sometimes fail with 502 errors.

We believe that this failure caused some builders to decide that the local repo checkouts were corrupt and wipe them out. Having many builders sync from scratch will exceed our GoB quota, compounding the general GoB errors.
 
Cc: davidjames@chromium.org shuqianz@chromium.org nxia@chromium.org

Comment 2 by hinoka@chromium.org, May 11 2016

Generally, how this works is that the script that runs git commands needs to be able to differentiate between:
1. Definitely transient network failures (quota, etc.)
2. Definitely permanent network failures (403, 404, etc)
3. Definitely permanent local failures (Corrupt checkout, etc)
4. Unknown

And react accordingly:
1. Retry a few times, and fail on the Nth retry
2. Fail immediately
3. Clobber and retry
4. Well, this is where you get to choose one of the above (see the sketch below).
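
Roughly, that dispatch could look like the following sketch. This is only an illustration of the policy above; FailureClass, run_git_with_policy, and the classify/clobber_and_reinit callables are made-up names, not existing depot_tools or chromite APIs.

import enum
import subprocess
import time


class FailureClass(enum.Enum):
  TRANSIENT_NETWORK = 1  # Quota, 5xx, etc.
  PERMANENT_NETWORK = 2  # 403, 404, etc.
  CORRUPT_CHECKOUT = 3   # Local corruption.
  UNKNOWN = 4


def run_git_with_policy(args, cwd, classify, clobber_and_reinit, max_retries=3):
  """Run a git command and react according to the failure class."""
  for attempt in range(max_retries):
    result = subprocess.run(['git'] + args, cwd=cwd,
                            capture_output=True, text=True)
    if result.returncode == 0:
      return result
    failure = classify(result)
    if failure == FailureClass.TRANSIENT_NETWORK:
      # (1) Retry with backoff, and fail on the Nth retry.
      time.sleep(2 ** attempt)
      continue
    if failure == FailureClass.PERMANENT_NETWORK:
      # (2) Fail immediately.
      raise RuntimeError(result.stderr)
    if failure == FailureClass.CORRUPT_CHECKOUT:
      # (3) Clobber and retry.
      clobber_and_reinit(cwd)
      continue
    # (4) Unknown: pick one of the above; here we fail without retrying.
    raise RuntimeError(result.stderr)
  raise RuntimeError('git %s still failing after %d attempts' % (args, max_retries))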

Currently git doesn't provide a way to differentiate between classes of failures based on the return code (or any other way, really), other than inspecting the stdout and matching it against a list of known error messages. The problem is that this set of error messages can be hard to discover without experiencing the failure, and it also changes often when git client versions change and when GoB gets updated (quota errors are 400s with a special message from GoB, iirc).


In Chrome OS's case, we default to just failing, but have built up a library of all the known flaky error messages from git and gerrit. If we encounter a known-flaky error message, then we retry.
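
For illustration, the matching side of that could look like the sketch below. The patterns are only examples of the kind of thing such a list contains, not Chrome OS's actual list, and is_known_flaky is a made-up name.

import re

# Illustrative patterns only; the real list is hand-curated from observed failures.
KNOWN_FLAKY_PATTERNS = [
  re.compile(r'The requested URL returned error: 502'),
  re.compile(r'RPC failed; result=\d+, HTTP code = 5\d\d'),
  re.compile(r'Over quota'),
]


def is_known_flaky(error_output):
  """Return True if git/gerrit error output matches a known transient failure."""
  return any(p.search(error_output) for p in KNOWN_FLAKY_PATTERNS)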

To figure out whether the checkout is corrupt, we analyse the checkout and run some basic git commands (e.g. fsck). If those fail, then this indicates the checkout needs to be blown away.
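
A purely local check along those lines might look like this sketch; the real chromite helpers referenced later in this thread (GitRepositoryCorrupted, BuildRootGitCleanup) are more involved, and checkout_looks_corrupt is a made-up name.

import subprocess


def checkout_looks_corrupt(git_dir):
  """Run local-only git commands; a failure suggests the checkout should be clobbered."""
  for args in (['git', 'fsck', '--no-progress'],
               ['git', 'rev-parse', 'HEAD']):
    if subprocess.run(args, cwd=git_dir, capture_output=True).returncode != 0:
      return True
  return False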

In unknown cases we just fail and don't retry.
So... can GoB errors trigger any of the git commands we test with to fail?
Owner: nxia@chromium.org
On Chrome OS we test for corruption using local commands and don't involve the network, so that helps ensure that we won't compound GoB flake with our git commands.

depot_tools doesn't use our libraries, so it doesn't benefit from this logic; it blows away the directories unnecessarily, triggering a pile of traffic.

The repo sync / gclient sync retrying logic in chromite could be improved to avoid doing so many retries. It would also be good if a failure in project 1 didn't cause automatic retries in every other project.
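
One possible shape for that, purely as a sketch: re-sync only the projects that failed instead of retrying the whole checkout. parse_failed_projects here is assumed to pull project paths out of repo's error output, which is not a stable interface, so treat this as illustrative only.

import subprocess


def sync_with_per_project_retry(buildroot, parse_failed_projects, max_retries=2):
  """Do a full network sync once, then retry only the projects that failed."""
  result = subprocess.run(['repo', 'sync', '-n'], cwd=buildroot,
                          capture_output=True, text=True)
  if result.returncode == 0:
    return
  for project in parse_failed_projects(result.stderr):
    for _ in range(max_retries):
      if subprocess.run(['repo', 'sync', '-n', project],
                        cwd=buildroot).returncode == 0:
        break
    else:
      raise RuntimeError('Project %s still failing after retries' % project)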


Comment 7 by nxia@chromium.org, May 13 2016

So when we retry the sync in repository.Sync(), we first run git fsck; if GitRepositoryCorrupted is True, we clean up the repository and sync; otherwise, we just retry the sync. Is this procedure correct?
I think you're talking about BuildRootGitCleanup. Yes, the logic there is correct.

This code from cbuildbot/repository.py is bad:

      # Do the network half of the sync; retry as necessary to get the content.
      try:
        cros_build_lib.RunCommand(cmd + ['-n'], cwd=self.directory)
      except cros_build_lib.RunCommandError:
        if constants.SYNC_RETRIES > 0:
          # Retry on clean up and repo sync commands,
          # decrement max_retry for this command
          retry_util.RetryCommand(self._CleanUpAndRunCommand,
                                  constants.SYNC_RETRIES - 1,
                                  cmd + ['-n'], cwd=self.directory,
                                  local_manifest=local_manifest)
        else:
          # No need to retry
          raise


  def _CleanUpAndRunCommand(self, *args, **kwargs):
    """Clean up repository and run command"""
    ClearBuildRoot(self.directory, self.preserve_paths)
    local_manifest = kwargs.pop('local_manifest', None)
    # Always re-initialize to the current branch.
    self.Initialize(local_manifest)
    # Fix existing broken mirroring configurations.
    self._EnsureMirroring()
    cros_build_lib.RunCommand(*args, **kwargs)


Rather than running ClearBuildRoot (which deletes the entire repository!), it should run commands.BuildRootGitCleanup, which is more careful and checks whether there is any git corruption before deleting the repository.

Would also be worth adding some more logging for when it deletes the repository. Don't think we have any currently.
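
Something along these lines, as a sketch of the suggestion (the exact BuildRootGitCleanup signature and the logging import are assumed here; the actual change landed in comment 9):

  def _CleanUpAndRunCommand(self, *args, **kwargs):
    """Clean up only corrupt git repositories, then re-run the command."""
    # Assumes chromite's `commands` module and a logging module are imported here.
    logging.info('Sync failed; running BuildRootGitCleanup on %s', self.directory)
    commands.BuildRootGitCleanup(self.directory)
    local_manifest = kwargs.pop('local_manifest', None)
    # Always re-initialize to the current branch.
    self.Initialize(local_manifest)
    # Fix existing broken mirroring configurations.
    self._EnsureMirroring()
    cros_build_lib.RunCommand(*args, **kwargs)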

Project Member Comment 9 by bugdroid1@chromium.org, May 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/a902fc2296fb074ed93a5f01c2bf6b8af9bd809e

commit a902fc2296fb074ed93a5f01c2bf6b8af9bd809e
Author: Ningning Xia <nxia@chromium.org>
Date: Fri May 13 18:36:08 2016

Use BuildRootGitCleanup in Sync retry

Use BuildRootGitCleanup instead of ClearBuildRoot in repo Sync retry.
BuildRootGitCleanUp calls fsck in order to only delete projects when
git repository is corrupted, which reduces the unnecessary cleanup caused
by ClearBuildRoot.

BUG= chromium:611240 
TEST=run_tests

Change-Id: Ib38cef4631872edc58e193b9529feaedf3c6d181
Reviewed-on: https://chromium-review.googlesource.com/344572
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/a902fc2296fb074ed93a5f01c2bf6b8af9bd809e/cbuildbot/repository.py

Labels: -current-issue
Is this issue resolved now? 
Status: Fixed (was: Untriaged)
We believe so.
This is resolved as far as ChromeOS is concerned. Chrome still has a bug where it compounds GoB flake by wiping out local checkouts.
As long as Chrome does the initial sync from GS, not GoB, having it wipe is okay quota-wise.

We had problems with Chrome syncs on GCE builders because the service account didn't have read access to the right GS files. That permission problem has been addressed.
Status: Verified (was: Fixed)
bulk verified
