
Issue 671666


Issue metadata

Status: Verified
Owner:
Closed: Dec 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




build156-m2 can't run 'repo sync', can't build

Reported by afakhry@chromium.org (Project Member), Dec 6 2016

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-chrome-pfq

The build has failed 3 times so far. There are a lot of repo sync errors, such as:

repo has been initialized in /b/cbuild/internal_master
01:05:48: INFO: RunCommand: repo --time sync '--cache-dir=/b/cros_git_cache' -n in /b/cbuild/internal_master
Usage: repo sync [<project>...]

main.py: error: no such option: --cache-dir
01:05:49: WARNING: Command failed with retriable error.
return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
01:05:49: ERROR: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master


Finally, the stack trace of the failure is:

@@@STEP_FAILURE@@@
01:05:49: ERROR: <class 'chromite.cbuildbot.repository.SrcCheckOutException'>: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
Traceback (most recent call last):
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/lib/failures_lib.py", line 172, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/stages/sync_stages.py", line 889, in PerformStage
    self.ManifestCheckout(new_manifest)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/stages/sync_stages.py", line 435, in ManifestCheckout
    self.repo.Sync(next_manifest)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/repository.py", line 517, in Sync
    raise SrcCheckOutException(err_msg)
SrcCheckOutException: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
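
Aside, for anyone triaging this on another machine: the error means the installed repo launcher predates the --cache-dir option. A minimal check, assuming the buildroot path from the log above:

cd /b/cbuild/internal_master
# Report the launcher/wrapper versions currently in use.
repo --version
# A current repo lists --cache-dir among its sync options; an old one does not.
repo help sync 2>&1 | grep -- '--cache-dir' || echo 'repo too old: no --cache-dir support'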
 
Cc: dgarr...@chromium.org
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)
Don't know what's going on except to say "repo sync is failing".

Starting an investigation.
Cc: vapier@chromium.org
This seems to be the error that is stopping us from getting the current version of repo:


10:39:22: INFO: Chrome version was found in the manifest: 57.0.2943.0
BUILDROOT: /b/cbuild/internal_master
TRACKING BRANCH: master
NEXT MANIFEST: /b/cbuild/internal_master/manifest-versions-internal/chrome-LKGM-candidates/buildspecs/57/9055.0.0-rc1.xml
10:39:22: INFO: RunCommand: repo manifest in /b/cbuild/internal_master
10:39:23: INFO: RunCommand: repo selfupdate in /b/cbuild/internal_master
info: A new version of repo is available


object 69034721e607ecbe8dad736c8ce07d91efdc8353
type commit
tag v1.12.37-cr1
tagger Mike Frysinger <vapier@chromium.org> 1476811349 -0400

Chromium-specific release

gpg: Signature made Tue 18 Oct 2016 10:22:29 AM PDT
gpg:                using RSA key DA03FD3916B500A8
gpg: Can't check signature: public key not found


warning: Skipped upgrade to unverified version
This builder has been sitting idle for about 6 months, so it has never fetched the version of repo with git-cache support.

It also doesn't seem to have vapier's key, which is used to validate new versions of repo. That means it hasn't updated depot_tools, but I'm not sure which copy of depot_tools wasn't updated. It might be the version included in the chromeos checkout, which would be a catch-22 (can't get the new key without syncing, can't sync without the new key).
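
A possible way out of that catch-22, sketched on the assumption that the repo launcher keeps its verification keyring in ~/.repoconfig/gnupg, would be to import the key directly instead of getting it via a sync:

# Key ID taken from the gpg output above; check whether it's already present.
GNUPGHOME=~/.repoconfig/gnupg gpg --list-keys DA03FD3916B500A8
# If not, fetch it from a keyserver and retry the self-update.
GNUPGHOME=~/.repoconfig/gnupg gpg --recv-keys DA03FD3916B500A8
repo selfupdate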
Spoke w/ dgarrett@ and nxia@.  There were some changes in assignments of
builder hardware yesterday.  It seems x86-alex-chrome-pfq picked up a
system that had been idle for an extended period of time.  The result
was that it needed an update to 'repo'.

dgarrett@ wiped out large chunks of stuff on the builder, and rebooted.
The builder should reconstitute itself, and generally start working on
its own.  In theory.

I saw this problem on a couple of GCE instances and just reimaged them without investigating.

For this builder, we wiped the buildroot and rebooted. We'll see if it comes back up correctly or not.
Wiping the buildroot didn't work. I wonder if the Chrome Infra move from svn to git is also involved. I renamed /b to /b.old and rebooted as a test. I believe everything in there will be recreated on boot and/or on the next puppet run.
(In my limited understanding) I thought depot_tools was provided/managed by c-i-t and we used that? I don't think we can (or should, or want to) use the depot_tools that's part of the CrOS checkout, because we need repo to get that checkout in the first place (i.e., the initial repo init & repo sync).
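
For context, the bootstrap in question looks roughly like this (public manifest URL shown for illustration; the internal builders use a different manifest):

# repo has to exist before the checkout does, so it can't come from
# inside the CrOS checkout:
repo init -u https://chromium.googlesource.com/chromiumos/manifest -b master
repo sync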
Cc: friedman@chromium.org
I really don't know which version of depot_tools is used (there seem to be several); I was just hypothesizing about what would explain the error. After wiping the chromeos buildroot, I still see the same error, so that answer was wrong.

Also:

/b was NOT recreated by puppet as expected.

I manually updated the various Chrome Infra checkouts, and rebooted... again, but they generally seemed to be current.
Oh... I did mv /b.old back to /b. No love after the manual updates.

I think we pass this to Chrome Operations for repair.
Components: -Infra>Client>ChromeOS Infra
Owner: ----
Status: Untriaged (was: Started)
Summary: build156-m2 can't run 'repo sync', can't build (was: x86-alex-chrome-pfq failing in MasterSlaveLKGMSync)
Passing to the Chrome Troopers.

Full history of the problem is above.  I'm available to answer
questions, although dgarrett@ knows more of the specifics of what's
been done.

The basic summary is that the builder is failing, and needs to be
re-initialized.

Cc: chrome-trooper-bugs@chromium.org
Owner: andyb...@chromium.org
andybons, could you please take a look at this issue and triage it if necessary?
Labels: Infra-Troopers
Status: Started (was: Untriaged)
Components: -Infra Infra>Labs
Owner: friedman@chromium.org
Status: Assigned (was: Started)
Build root re-creation is a manual process.
Ah.
Status: Fixed (was: Assigned)
Please let me know if it's not working right.
Just to ask.... what did you do?
Cc: lhchavez@chromium.org
FWIW, I am seeing the same type of failure across several buildslaves used for M56 branch builders. I have been investigating the lack of M56 builds for veyron_minnie since November 30, and saw the following machines failing in a manner similar to the initial report here.

board (all release-R56-9000.B)   machine       step
stout-release                    build288-m2   manifestversionedsync
veyron_minnie                    build293-m2   manifestversionedsync
x86-mario-release                build300-m2   manifestversionedsync
guado_moblab-release             build301-m2   manifestversionedsync

In the ones I saw, the pattern seemed to be:
- runs on a new build machine (the prior build had a different build slave)
- the first build fails in the Cleanup step - example: https://uberchromegw.corp.google.com/i/chromeos_release/builders/veyron_minnie-release%20release-R56-9000.B/builds/11
- subsequent builds all fail in the ManifestVersionedSync step, with symptoms like the initial report (complaints about --cache-dir, failing with the same stack trace). Example: https://uberchromegw.corp.google.com/i/chromeos_release/builders/veyron_minnie-release%20release-R56-9000.B/builds/12/steps/ManifestVersionedSync/logs/stdio

Not sure if I should open an independent bug, or hope that whatever remediation is happening here can be applied to these machines as well.
Removed: /b /c /home/chrome-bot/{build,.netrc,.gitconfig} /var/lib/puppet/ssl
Cleaned the puppet cert on the puppet master.
Mounted and ran the linux setup script.
Signed the new puppet cert on the puppet master.
Reboot.

This isn't as clean as a fresh install, so if it's still weird, that's the next step.
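
For future reference, those steps correspond roughly to the following (the builder FQDN and the setup script are site-specific placeholders):

# On the builder: remove build state and the old puppet client cert.
rm -rf /b /c /home/chrome-bot/build /home/chrome-bot/.netrc \
       /home/chrome-bot/.gitconfig /var/lib/puppet/ssl
# On the puppet master: revoke the old cert.
puppet cert clean <builder-fqdn>
# On the builder: mount and run the (internal) linux setup script, which
# requests a new cert; then sign it on the puppet master:
puppet cert sign <builder-fqdn>
# Finally, reboot the builder.
sudo reboot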
A number of those builders have the same history.

They were in active use, then left idle for several months, and have just been brought back online. Many of them (like build156-m2) were not connected to buildbot until after a manual reboot.
Is there any way we can follow the same steps, but leave the puppet cert in place?

That would make it something the deputy can do.
I've filed bug 671796 for the four R56 release builders.

I don't know if we are ready to go down that path just yet.  It's actually easier to just reinstall.
Okay. Fair warning: almost all of our unused builders are probably in the same state, so there will likely be more of this over time.
How many of 'em ya got?  We might like to carpe diem the opportunity to migrate you to R620 machines in that case, but I will most likely be on leave so I don't want to throw people under the bus.
I'm not sure exactly, and there is no rush since these are currently
unallocated machines, but I recently went through and rebooted about 20-30
of the builders in these two pools that were disconnected to get them to
reconnect.

https://uberchromegw.corp.google.com/i/chromeos/builders/unallocated-slave-pool
https://uberchromegw.corp.google.com/i/chromeos_release/builders/unallocated-slave-pool

Almost all of them had been up since before August, which means they
haven't done a build (we reboot after every build), and so I'm guessing
they'll have similar issues whenever they happen to be put back into
service.

Comment 27 by dchan@google.com, Mar 4 2017

Labels: VerifyIn-58

Comment 28 by dchan@google.com, Apr 17 2017

Labels: VerifyIn-59

Comment 29 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61
Status: Verified (was: Fixed)
Closing. Please reopen it if it's not fixed. Thanks!
