master/slave inconsistency between using fresh / old chroot
Issue description
master build: https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/13661
slave: https://luci-milo.appspot.com/buildbot/chromeos/samus-paladin/12878
Master logic chose to use a fresh chroot (CommitQueueSync (Using fresh chroot)) because the prior build failed. However, the slaves appeared to use an existing chroot. This appears to have caused a failure which was blamed, possibly erroneously, on https://chromium-review.googlesource.com/c/432078/
Need to investigate why the chroot reuse logic has this apparent inconsistency.
,
Feb 21 2017
Ok, the mystery to me is something like this: The previous master-paladin was https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/13660 . Its manifest version was 9283.0.0-rc1. It failed. Nevertheless, on the subsequent run, I see logging on the master that looks like "05:55:37: INFO: LKGM version was found in the manifest: 9283.0.0-rc1" (https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13661/steps/CommitQueueSync/logs/stdio ). So, I'm trying to understand how/why 9283.0.0-rc1 got marked as the LKGM despite the build that produced it failing. +davidjames@ can you remind me where the LKGM-marking logic triggers?
,
Mar 7 2017
Aviv, are you still investigating this?
,
Mar 8 2017
I lost track of it, but I wouldn't be shocked if it's still happening, and it can still bite us again if so. Don or David, any hints on #2?
,
Mar 8 2017
I'm not sure where the LKGM logic is. I even thought the CQ master used TOT, not LKGM, as the basis for the test manifests it generates. I would strongly argue that it should. Note that what it syncs for itself can be different from what is used as the basis of manifests for slaves.
,
Mar 8 2017
It does use ToT for the test manifest, but it has some logic such that if ToT == LKGM it can reuse chroots. That's the theory anyway.
,
Mar 8 2017
self._run.attrs.manifest_manager.PromoteCandidate() is called in the master (not slaves) to mark the run as complete and as LKGM if the CommitQueueCompletion stage completes. If a run has failed and is still marked as LKGM, then this is a bug.
# We only promote for the pfq, not chrome pfq.
# TODO(build): Run this logic in debug mode too.
if (not self._run.options.debug and
    config_lib.IsPFQType(self._run.config.build_type) and
    self._run.config.master and
    self._run.manifest_branch == 'master' and
    self._run.config.build_type != constants.CHROME_PFQ_TYPE):
  self._run.attrs.manifest_manager.PromoteCandidate()
  if sync_stages.MasterSlaveLKGMSyncStage.external_manager:
    sync_stages.MasterSlaveLKGMSyncStage.external_manager.PromoteCandidate()
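To make the promotion condition quoted above easier to reason about in isolation, here is a minimal sketch of the same check as a pure function. The RunInfo type, its field names, and the build-type strings are hypothetical stand-ins chosen for illustration; they are not chromite's actual objects or constant values.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the build-type constants. The real values live in
# chromite's constants module and may differ.
PFQ_TYPES = frozenset({"paladin", "pfq", "chrome-pfq"})
CHROME_PFQ_TYPE = "chrome-pfq"


@dataclass(frozen=True)
class RunInfo:
  """Hypothetical summary of the fields the promotion check reads."""
  debug: bool
  build_type: str
  is_master: bool
  manifest_branch: str


def should_promote_candidate(run):
  """Returns True if this run would be allowed to promote its manifest."""
  return (not run.debug and
          run.build_type in PFQ_TYPES and
          run.is_master and
          run.manifest_branch == "master" and
          run.build_type != CHROME_PFQ_TYPE)
```

Nothing in this predicate looks at whether the build passed. As described above, that part depends on the CommitQueueCompletion stage completing before this code is reached, so that is probably the place to look when a failed run ends up promoted.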
The LKGM is fetched by the next master in _AddLKGMToManifest in cbuildbot/lkgm_manager.py
Then in cbuildbot/stages/sync_stages.py we have the following, which compares the version of the chroot with the version of the LKGM:
lkgm_version = self._GetLKGMVersionFromManifest(next_manifest)
chroot_manager = chroot_lib.ChrootManager(self._build_root)
# Make sure the chroot version is valid.
chroot_manager.EnsureChrootAtVersion(lkgm_version)
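The version comparison described above can be sketched roughly as follows. This is a simplified illustration, not the actual chroot_lib implementation: the class name, the marker file name, and the wipe-on-mismatch behaviour are all assumptions made for the example.

```python
import os
import shutil

# Hypothetical marker file name; the real chroot_lib may store the version
# somewhere else.
VERSION_FILE = "chroot_version"


class SketchChrootManager:
  """Illustrative stand-in for a chroot manager keyed on a version string."""

  def __init__(self, build_root):
    self._chroot = os.path.join(build_root, "chroot")

  def get_version(self):
    """Returns the recorded chroot version, or None if there is none."""
    try:
      with open(os.path.join(self._chroot, VERSION_FILE)) as f:
        return f.read().strip()
    except FileNotFoundError:
      return None

  def set_version(self, version):
    """Records the version this chroot was built at."""
    os.makedirs(self._chroot, exist_ok=True)
    with open(os.path.join(self._chroot, VERSION_FILE), "w") as f:
      f.write(version)

  def ensure_at_version(self, version):
    """Keeps the chroot only if it matches `version`. Returns True if reused."""
    if version and self.get_version() == version:
      return True
    # Mismatch or unknown version: discard the chroot so a fresh one is built.
    shutil.rmtree(self._chroot, ignore_errors=True)
    return False
```

If each builder evaluates a check like this against its own local chroot, a master and a slave can reach different answers for the same LKGM version, which would be consistent with the behaviour reported in this bug.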
,
Mar 15 2017
What are the next steps on this issue?
,
Mar 15 2017
We need someone to debug the issue. If this issue isn't occurring regularly you could deprioritize it, but note that this issue could be happening undetected and it does have potential to cause spurious and hard-to-debug commit queue failures when it does happen.
,
Mar 15 2017
Maybe we should make the system more explicit. Have the master pass a flag to slaves saying if the chroot is allowed to be reused.
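A rough sketch of what such an explicit flag could look like, assuming the master serializes its decision into some metadata the slaves already read. The field name allow_chroot_reuse, the JSON transport, and both function names are hypothetical; none of this reflects existing cbuildbot options.

```python
import json


def master_build_slave_metadata(lkgm_version, previous_build_passed):
  """Master side: decide once, and record the decision explicitly."""
  return json.dumps({
      "lkgm_version": lkgm_version,
      # Hypothetical flag: slaves obey this instead of inferring it themselves.
      "allow_chroot_reuse": bool(previous_build_passed),
  })


def slave_should_reuse_chroot(metadata_json, local_chroot_version):
  """Slave side: reuse only if the master allows it and the versions match."""
  metadata = json.loads(metadata_json)
  if not metadata.get("allow_chroot_reuse", False):
    return False
  return local_chroot_version == metadata["lkgm_version"]
```

Treating a missing flag as "do not reuse" means a slave that gets incomplete metadata falls back to a fresh chroot, which is slower but avoids the inconsistency.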
,
Mar 21 2017
FixIt candidate.
,
May 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/9bc985d12b668e51e1cfea1938ee0befa3a3b0fb

commit 9bc985d12b668e51e1cfea1938ee0befa3a3b0fb
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu May 18 02:06:28 2017

chroot_lib: record a metric about whether we reused existing chroot

BUG= chromium:692845
TEST=None

Change-Id: I05e382c1d12e66f6ed139e4cbd40f67492d9d7ea
Reviewed-on: https://chromium-review.googlesource.com/503592
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/9bc985d12b668e51e1cfea1938ee0befa3a3b0fb/lib/constants.py
[modify] https://crrev.com/9bc985d12b668e51e1cfea1938ee0befa3a3b0fb/cbuildbot/chroot_lib.py
[modify] https://crrev.com/9bc985d12b668e51e1cfea1938ee0befa3a3b0fb/cbuildbot/chroot_lib_unittest.py
[modify] https://crrev.com/9bc985d12b668e51e1cfea1938ee0befa3a3b0fb/cbuildbot/stages/sync_stages.py
,
Mar 14 2018
This bug is very old, is Untriaged, and has no owner. If it is still relevant, reopen as Untriaged or open a new bug.
Comment 1 by tfiga@chromium.org
, Feb 16 2017