build156-m2 can't run 'repo sync', can't build
Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/x86-alex-chrome-pfq

The build has failed 3 times so far. There are a lot of repo sync errors, such as:

repo has been initialized in /b/cbuild/internal_master
01:05:48: INFO: RunCommand: repo --time sync '--cache-dir=/b/cros_git_cache' -n in /b/cbuild/internal_master
Usage: repo sync [<project>...]
main.py: error: no such option: --cache-dir
01:05:49: WARNING: Command failed with retriable error.
return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
01:05:49: ERROR: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master

Finally, the stack trace of the failure is:

@@@STEP_FAILURE@@@
01:05:49: ERROR: <class 'chromite.cbuildbot.repository.SrcCheckOutException'>: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
Traceback (most recent call last):
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/lib/failures_lib.py", line 172, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/stages/sync_stages.py", line 889, in PerformStage
    self.ManifestCheckout(new_manifest)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/stages/sync_stages.py", line 435, in ManifestCheckout
    self.repo.Sync(next_manifest)
  File "/b/build/slave/x86-alex-chrome-pfq-master/build/chromite/cbuildbot/repository.py", line 517, in Sync
    raise SrcCheckOutException(err_msg)
SrcCheckOutException: return code: 2; command: repo --time sync '--cache-dir=/b/cros_git_cache' -n
cwd=/b/cbuild/internal_master
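For anyone reproducing this: the error means the repo wrapper on the slave is too old to know the --cache-dir option. A quick way to confirm that might be the following (a sketch, assuming shell access to the slave; the subcmds path matches upstream repo's layout):

    # Report the launcher version and the per-checkout repo copy's version.
    cd /b/cbuild/internal_master && repo version
    git -C /b/cbuild/internal_master/.repo/repo log -1 --oneline
    # The sync command's options live in subcmds/sync.py; a stale copy won't mention cache-dir.
    grep -c 'cache-dir' /b/cbuild/internal_master/.repo/repo/subcmds/sync.py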
,
Dec 6 2016
This seems to be the error which is stopping us from getting the current version of repo:

10:39:22: INFO: Chrome version was found in the manifest: 57.0.2943.0
BUILDROOT: /b/cbuild/internal_master
TRACKING BRANCH: master
NEXT MANIFEST: /b/cbuild/internal_master/manifest-versions-internal/chrome-LKGM-candidates/buildspecs/57/9055.0.0-rc1.xml
10:39:22: INFO: RunCommand: repo manifest in /b/cbuild/internal_master
10:39:23: INFO: RunCommand: repo selfupdate in /b/cbuild/internal_master
info: A new version of repo is available
object 69034721e607ecbe8dad736c8ce07d91efdc8353
type commit
tag v1.12.37-cr1
tagger Mike Frysinger <vapier@chromium.org> 1476811349 -0400

Chromium-specific release

gpg: Signature made Tue 18 Oct 2016 10:22:29 AM PDT
gpg: using RSA key DA03FD3916B500A8
gpg: Can't check signature: public key not found
warning: Skipped upgrade to unverified version
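In case it helps the next person who hits this: repo verifies release tags against its own GnuPG keyring (the upstream default location is ~/.repoconfig/gnupg), so one possible unblock for selfupdate would be importing the missing key there. This is a sketch, not a verified fix; the keyserver choice is an assumption, and whether selfupdate honors a hand-imported key may depend on the wrapper version:

    # Inspect repo's private keyring.
    GNUPGHOME=~/.repoconfig/gnupg gpg --list-keys
    # Import the signing key reported missing in the log above (keyserver is an assumption).
    GNUPGHOME=~/.repoconfig/gnupg gpg --keyserver keyserver.ubuntu.com --recv-keys DA03FD3916B500A8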
,
Dec 6 2016
This builder has been sitting idle for about 6 months, so it has never fetched the version of repo with git-cache support. It also doesn't seem to have vapier's key, which is used to validate new versions of repo. That means it hasn't updated depot_tools, but I'm not sure which version of depot_tools wasn't updated. It might be the version included in the ChromeOS checkout, which would be a catch-22 (can't get the new key without syncing, can't sync without the new key).
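One way out of that catch-22, sketched here as an untested suggestion, would be fetching a fresh repo launcher directly, since a current launcher ships with the current maintainer keys built in. The download URL is the one documented for repo; the install location is an assumption:

    # Fetch the current repo launcher and put it ahead of the stale one on $PATH.
    curl -o /usr/local/bin/repo https://storage.googleapis.com/git-repo-downloads/repo
    chmod a+x /usr/local/bin/repo
    # Re-running 'repo init' inside the existing checkout re-seeds ~/.repoconfig/gnupg
    # from the launcher's built-in keys.
    cd /b/cbuild/internal_master && repo init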
,
Dec 6 2016
Spoke w/ dgarrett@ and nxia@. There were some changes in builder-hardware assignments yesterday. It seems x86-alex-chrome-pfq picked up a system that had been idle for an extended period of time, with the result that it needed an update to 'repo'. dgarrett@ wiped out large chunks of stuff on the builder and rebooted. The builder should reconstitute itself and generally start working on its own. In theory.
,
Dec 6 2016
I saw this problem on a couple of GCE instances and just reimaged them without investigating. For this builder, we wiped the buildroot and rebooted. We'll see if it comes back up correctly or not.
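For the record, "wiped the buildroot" amounts to something like the following (a sketch; the path is the buildroot from the logs above, and the exact cleanup may have differed):

    # Remove the ChromeOS buildroot so the next build re-creates it from scratch.
    rm -rf /b/cbuild/internal_master
    sudo reboot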
,
Dec 6 2016
Wiping the buildroot didn't work. I wonder if the Chrome Infra move from svn to git is also involved. I renamed /b to /b.old and rebooted as a test. I believe everything in there will be recreated on boot and/or on the next puppet run.
,
Dec 6 2016
(In my limited understanding) I thought depot_tools was provided/managed by c-i-t, and we used that? I don't think we can (or should, or want to) use the depot_tools that's part of the CrOS checkout, because we need repo to get that checkout in the first place (i.e. the initial repo init & repo sync).
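For context, the bootstrap this refers to looks roughly like the following. This is a sketch; the manifest URL is the public ChromiumOS one and is an assumption about what this internal builder actually uses:

    # Initial checkout: repo itself has to exist before the in-tree depot_tools does.
    repo init -u https://chromium.googlesource.com/chromiumos/manifest.git
    repo sync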
,
Dec 6 2016
I really don't know which version of depot_tools is used (there seem to be several); I was just hypothesizing about what would explain the error. After wiping the chromeos buildroot, I still see the same error, so that hypothesis was wrong. Also: /b was NOT recreated by puppet as expected. I manually updated the various Chrome Infra checkouts and rebooted... again, but they generally seemed to be current.
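For anyone repeating this, "manually updated the various Chrome Infra checkouts" would be something along these lines. A sketch only: the exact set of checkouts on these slaves isn't recorded here, and the paths are assumptions:

    # Update the buildbot slave checkout via gclient (path is an assumption).
    cd /b/build && gclient sync
    # depot_tools normally updates itself, but can be nudged with a pull.
    git -C /b/depot_tools pull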
,
Dec 6 2016
Oh... I did mv /b.old back to /b. No love after the manual updates. I think we should pass this to Chrome Operations for repair.
,
Dec 6 2016
Passing to the Chrome Troopers. The full history of the problem is above. I'm available to answer questions, although dgarrett@ knows more of the specifics of what's been done. The basic summary is that the builder is failing and needs to be re-initialized.
,
Dec 6 2016
andybons, could you please take a look at this issue and triage it if necessary?
,
Dec 6 2016
Build root re-creation is a manual process.
,
Dec 6 2016
Ah.
,
Dec 6 2016
Please let me know if it's not working right.
,
Dec 6 2016
Just to ask... what did you do?
,
Dec 6 2016
FWIW, I am seeing the same type of failure across several buildslaves used for M56 branch builders. I was investigating the lack of M56 builds for veyron_minnie since November 30, and saw the following machines failing in a similar manner to the initial report here:

board (all release-R56-9000.B)   machine       step
stout-release                    build288-m2   manifestversionedsync
veyron_minnie                    build293-m2   manifestversionedsync
x86-mario-release                build300-m2   manifestversionedsync
guado_moblab-release             build301-m2   manifestversionedsync

In the ones I saw, the pattern seemed to be:
- run on a new build machine (the prior build had a different build slave)
- first build fails in the Cleanup step; example: https://uberchromegw.corp.google.com/i/chromeos_release/builders/veyron_minnie-release%20release-R56-9000.B/builds/11
- next builds all fail in the manifestversionedsync step, with symptoms like the initial report (complains about --cache-dir, failing with the same stack trace); example: https://uberchromegw.corp.google.com/i/chromeos_release/builders/veyron_minnie-release%20release-R56-9000.B/builds/12/steps/ManifestVersionedSync/logs/stdio

Not sure if I should open an independent bug, or hope that whatever remediation is happening here can be applied to these machines as well.
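In case it saves someone a click-through, a quick cross-machine check for the stale-repo symptom might look like this (hostnames are from the table above; ssh access and a checkout under /b/cbuild are assumptions):

    for h in build288-m2 build293-m2 build300-m2 build301-m2; do
      echo "== $h"
      # Print the repo launcher/wrapper versions reported on each slave.
      ssh "$h" 'cd /b/cbuild/* 2>/dev/null && repo version 2>&1 | head -2'
    done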
,
Dec 6 2016
Removed: /b /c /home/chrome-bot/{build,.netrc,.gitconfig} /var/lib/puppet/ssl
Cleaned the puppet cert on the puppet master.
Mounted and ran the linux setup script.
Signed the new puppet cert on the puppet master.
Reboot.
This isn't as clean as a fresh install, so if it's still weird, that's the next step.
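For future reference, a sketch of those steps as commands. The FQDN is a placeholder, and the setup script name/path is site-specific and hypothetical:

    # On the slave: remove state so it re-registers cleanly.
    rm -rf /b /c /home/chrome-bot/build /home/chrome-bot/.netrc /home/chrome-bot/.gitconfig /var/lib/puppet/ssl
    # On the puppet master: revoke the old client cert.
    puppet cert clean build156-m2.example.com
    # On the slave: mount and run the linux setup script (name hypothetical).
    # ./setup_linux.sh
    # On the puppet master: sign the freshly generated cert.
    puppet cert sign build156-m2.example.com
    # Back on the slave:
    sudo reboot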
,
Dec 6 2016
A number of those builders have the same history: they were in active use, sat idle for several months, and have just been brought back into service. Many of them (like builder156) were not connected to buildbot until after a manual reboot.
,
Dec 6 2016
Is there any way we can follow the same steps, but leave the puppet cert in place? That would make it something the deputy can do.
,
Dec 6 2016
I've filed bug 671796 for the four R56 release builders.
,
Dec 7 2016
I don't know if we are ready to go down that path just yet. It's actually easier to just reinstall.
,
Dec 7 2016
Okay. Fair warning: almost all of our unused builders are probably in the same state, so expect more of this over time.
,
Dec 7 2016
How many of 'em ya got? We might like to carpe diem the opportunity to migrate you to R620 machines in that case, but I will most likely be on leave so I don't want to throw people under the bus.
,
Dec 7 2016
I'm not sure exactly, and there is no rush since these are currently unallocated machines, but I recently went through and rebooted about 20-30 disconnected builders in these two pools to get them to reconnect:
https://uberchromegw.corp.google.com/i/chromeos/builders/unallocated-slave-pool
https://uberchromegw.corp.google.com/i/chromeos_release/builders/unallocated-slave-pool
Almost all of them had been up since before August, which means they haven't done a build (we reboot after every build), so I'm guessing they'll have similar issues whenever they are put back into service.
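A sketch of how one might spot those long-idle machines (hostnames are placeholders; ssh access is an assumption):

    # 'uptime -s' prints the boot time; anything from before August is a reimage candidate.
    for h in build156-m2 build157-m2; do
      printf '%s: ' "$h"; ssh "$h" uptime -s
    done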
,
Aug 3 2017
Closing. Please reopen it if it's not fixed. Thanks!
Comment 1 by jrbarnette@chromium.org, Dec 6 2016
Owner: jrbarnette@chromium.org
Status: Started (was: Untriaged)