Do not compound GoB flake by wiping out local checkouts. |
|||||
Issue descriptionToday, we had a GoB outage that caused GoB sync requests to be flaky, and sometimes fail with 502 errors. We believe that this failure caused some builders to decide that the local repo checkouts were corrupt and wiping them out. Having many builders sync from scratch will exceed out GoB quota, compounding the general GoB errors.
,
May 11 2016
Generally how this works is that the script that runs git commands need to be able to differentiable between: 1. Definitely network transient failures (Quota, etc) 2. Definitely permanent network failures (403, 404, etc) 3. Definitely permanent local failures (Corrupt checkout, etc) 4. Unknown And react accordingly: 1. Retry a few times, and fail on the Nth retry 2. Fail immediately 3. Clobber and retry 4. Well, this is where you get to choose one of above. Currently git doesn't provide a way to differentiate between classes of failures based off return code (or any other way really), other than introspecting the stdout and matching it up to a list of known error messages. The problem is that this set of error messages can be hard to find out without experiencing the failure, and also changes often when git client versions change, and when GoB gets updated (quota errors are 400's with a special message from GoB iirc).
,
May 11 2016
In Chrome OS's case, we default to just failing, but have built up a library of all the known flaky error messages from git and gerrit. If we encounter a known-flaky error message, then we retry. To figure out whether the checkout is corrupt, we analyse the checkout and run some basic git commands (e.g. fsck). If those fail,then this indicates the checkout needs to be blown away. In unknown cases we just fail and don't retry.
,
May 12 2016
So... can GoB errors trigger any of the git commands we test with to fail?
,
May 12 2016
,
May 12 2016
On Chrome OS we test for corruption using local commands and don't involve the network, so that helps ensure that we won't compound GoB flake with our git commands. depot_tools doesn't use our libraries so they don't benefit from the logic, and they blow away the directories unnecessarily and triggering a pile of traffic. The repo sync / gclient sync retrying logic in chromite could be improved to avoid doing so many retries. It would also be good if a failure in project 1 didn't cause automatic retries in every other project.
,
May 13 2016
So when we retry sync in repository.Sync(), we first do git fsck, if GitRepositoryCorrupted is True, we clean up repository and sync; else, we just retry sync. Is the procedure correct?
,
May 13 2016
I think you're talking about BuildRootGitCleanup. Yes, the logic there is correct.
This code from cbuildbot/repository.py is bad:
# Do the network half of the sync; retry as necessary to get the content.
try:
cros_build_lib.RunCommand(cmd + ['-n'], cwd=self.directory)
except cros_build_lib.RunCommandError:
if constants.SYNC_RETRIES > 0:
# Retry on clean up and repo sync commands,
# decrement max_retry for this command
retry_util.RetryCommand(self._CleanUpAndRunCommand,
constants.SYNC_RETRIES - 1,
cmd + ['-n'], cwd=self.directory,
local_manifest=local_manifest)
else:
# No need to retry
raise
def _CleanUpAndRunCommand(self, *args, **kwargs):
"""Clean up repository and run command"""
ClearBuildRoot(self.directory, self.preserve_paths)
local_manifest = kwargs.pop('local_manifest', None)
# Always re-initialize to the current branch.
self.Initialize(local_manifest)
# Fix existing broken mirroring configurations.
self._EnsureMirroring()
cros_build_lib.RunCommand(*args, **kwargs)
Rather than running ClearBuildRoot (which deletes the entire repository!), it should run commands.BuildRootGitCleanup, which is more careful and checks whether there is any git corruption before deleting the repository.
Would also be worth adding some more logging for when it deletes the repository. Don't think we have any currently.
,
May 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/a902fc2296fb074ed93a5f01c2bf6b8af9bd809e commit a902fc2296fb074ed93a5f01c2bf6b8af9bd809e Author: Ningning Xia <nxia@chromium.org> Date: Fri May 13 18:36:08 2016 Use BuildRootGitCleanup in Sync retry Use BuildRootGitCleanup instead of ClearBuildRoot in repo Sync retry. BuildRootGitCleanUp calls fsck in order to only delete projects when git repository is corrupted, which reduces the unnecessary cleanup caused by ClearBuildRoot. BUG= chromium:611240 TEST=run_tests Change-Id: Ib38cef4631872edc58e193b9529feaedf3c6d181 Reviewed-on: https://chromium-review.googlesource.com/344572 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/a902fc2296fb074ed93a5f01c2bf6b8af9bd809e/cbuildbot/repository.py
,
May 16 2016
Is this issue resolved now?
,
May 16 2016
We believe so.
,
May 16 2016
This is resolved as far as ChromeOS is concerned. Chrome still has a bug where it compounds GoB flake by wiping out local checkouts.
,
May 16 2016
As long as Chrome does the initial sync from GS not GoB, having it wipe is okay quota-wise. We had problems with Chrome syncs on GCE builders because the service account didn't have read access to the right GS files. That permission problem has been addressed.
,
May 23 2016
bulk verified |
|||||
►
Sign in to add a comment |
|||||
Comment 1 by dgarr...@chromium.org
, May 11 2016