
Issue 692845


Issue metadata

Status: Archived
Owner: ----
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug


master/slave inconsistency between using fresh / old chroot

Project Member Reported by akes...@chromium.org, Feb 16 2017

Issue description

master build: https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/13661

slave: https://luci-milo.appspot.com/buildbot/chromeos/samus-paladin/12878

The master chose to use a fresh chroot (CommitQueueSync (Using fresh chroot)) because the prior build failed. However, the slaves appear to have reused their existing chroots.

This appears to have caused a failure that was blamed, possibly erroneously, on https://chromium-review.googlesource.com/c/432078/.

Need to investigate why the chroot reuse logic has this seeming inconsistency.
 

Comment 1 by tfiga@chromium.org, Feb 16 2017

My observation is that one CQ run picked up all the CLs from the series to which CL:432078 belongs and actually completed the build stage, but failed on hwtest due to a lab issue.

Then some of the later CLs lost their CQ+1 while some earlier ones kept it, leading the CQ to retry with only part of the series, but for some reason the chroot was reused from the previous run.
Cc: davidjames@chromium.org
Ok, the mystery to me is something like this:

The previous master-paladin was https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/13660. Its manifest version was 9283.0.0-rc1. It failed.

Nevertheless, on the subsequent run, I see logging on the master that looks like "05:55:37: INFO: LKGM version was found in the manifest: 9283.0.0-rc1" https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/13661/steps/CommitQueueSync/logs/stdio

So, trying to understand how/why 9283.0.0-rc1 got marked as the LKGM despite the build that produced it failing. +davidjames@ can you remind me where the LKGM-marking logic triggers?
Aviv, are you still investigating this? 
Cc: nxia@chromium.org
I lost track of it, but I wouldn't be shocked if it's still happening, and it can still bite us again if so.

Don or David, any hints on #2?
I'm not sure where the LKGM logic is. I even thought the CQ master used ToT, not LKGM, as the basis for the test manifests it generates. I would strongly argue that it should.

Note that what it syncs for itself can be different from what is used as the basis of the manifests for slaves.
It does use ToT for the test manifest, but it has some logic such that if ToT == LKGM it can reuse chroots. That's the theory anyway.
self._run.attrs.manifest_manager.PromoteCandidate() is called in the master (not the slaves) to mark the run as complete and as LKGM if the CommitQueueCompletion stage completes. If a run failed and is still marked as LKGM, that is a bug.


    # We only promote for the pfq, not chrome pfq.
    # TODO(build): Run this logic in debug mode too.
    if (not self._run.options.debug and
        config_lib.IsPFQType(self._run.config.build_type) and
        self._run.config.master and
        self._run.manifest_branch == 'master' and
        self._run.config.build_type != constants.CHROME_PFQ_TYPE):
      self._run.attrs.manifest_manager.PromoteCandidate()
      if sync_stages.MasterSlaveLKGMSyncStage.external_manager:
        sync_stages.MasterSlaveLKGMSyncStage.external_manager.PromoteCandidate()
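
One way to look for the suspected bug (a failed run that is still marked as LKGM) is a consistency check over recorded build results. The following is only a sketch: `find_lkgm_inconsistency` and the `build_results` mapping are hypothetical names made up for illustration and are not part of chromite.

```python
def find_lkgm_inconsistency(lkgm_version, build_results):
    """Return a description of the problem, or None if the LKGM looks sane.

    build_results is a hypothetical mapping of manifest version -> bool
    (True if the master build for that version passed).
    """
    if lkgm_version not in build_results:
        return 'LKGM %s has no recorded master build' % lkgm_version
    if not build_results[lkgm_version]:
        return 'LKGM %s was promoted from a failed build' % lkgm_version
    return None


# The situation described in this bug: 9283.0.0-rc1 failed, yet it was
# later reported as the LKGM.
problem = find_lkgm_inconsistency('9283.0.0-rc1', {'9283.0.0-rc1': False})
```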


The LKGM is fetched by the next master in _AddLKGMToManifest in cbuildbot/lkgm_manager.py.

Then in cbuildbot/stages/sync_stages.py we have the following, which compares the version of the chroot with the version of the LKGM:

    lkgm_version = self._GetLKGMVersionFromManifest(next_manifest)
    chroot_manager = chroot_lib.ChrootManager(self._build_root)
    # Make sure the chroot version is valid.
    chroot_manager.EnsureChrootAtVersion(lkgm_version)
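
To illustrate the kind of check that the snippet above implies, here is a minimal, self-contained sketch of version-gated chroot reuse. It is not the chromite implementation: the marker file name, the `chroot` directory layout, and both function names are assumptions made for this example.

```python
import os
import shutil

# Hypothetical marker file name, chosen for this sketch only.
VERSION_FILE = '.chroot_version'


def read_chroot_version(build_root):
    """Return the version recorded for the chroot, or None if absent."""
    try:
        with open(os.path.join(build_root, VERSION_FILE)) as f:
            return f.read().strip() or None
    except FileNotFoundError:
        return None


def ensure_chroot_at_version(build_root, wanted_version):
    """Keep the chroot only if it was recorded at wanted_version.

    Returns True if the existing chroot is kept, and False if it was
    discarded (or never existed), so the caller must create a fresh one.
    """
    chroot_dir = os.path.join(build_root, 'chroot')
    recorded = read_chroot_version(build_root)
    if (wanted_version is not None and recorded == wanted_version
            and os.path.isdir(chroot_dir)):
        return True
    # Version mismatch or missing marker: drop both the chroot and the marker.
    shutil.rmtree(chroot_dir, ignore_errors=True)
    try:
        os.remove(os.path.join(build_root, VERSION_FILE))
    except FileNotFoundError:
        pass
    return False
```

Under this scheme, a slave whose chroot is stamped with 9283.0.0-rc1 would keep it whenever the manifest names 9283.0.0-rc1 as the LKGM, regardless of whether the build that produced that version actually passed.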


Comment 9 by autumn@chromium.org, Mar 15 2017

What are the next steps on this issue?
We need someone to debug the issue. If this issue isn't occurring regularly you could deprioritize it, but note that it could be happening undetected, and when it does happen it has the potential to cause spurious and hard-to-debug commit queue failures.
Maybe we should make the system more explicit. Have the master pass a flag to the slaves saying whether the chroot is allowed to be reused.
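
A rough sketch of what such an explicit flag could look like. Both function names and all parameters are hypothetical, and nothing here reflects an existing cbuildbot option.

```python
def master_allows_chroot_reuse(previous_build_passed, previous_version,
                               lkgm_version):
    """Master-side decision: allow reuse only after a passing prior build."""
    return previous_build_passed and previous_version == lkgm_version


def slave_reuses_chroot(master_allows_reuse, local_chroot_version,
                        lkgm_version):
    """Slave-side decision: the master's flag always takes precedence."""
    if not master_allows_reuse:
        return False
    return local_chroot_version == lkgm_version


# The scenario from this bug: the prior master build (9283.0.0-rc1) failed,
# so the master clears the flag and the slave starts from a fresh chroot
# even though its local chroot version matches the reported LKGM.
allow = master_allows_chroot_reuse(False, '9283.0.0-rc1', '9283.0.0-rc1')
reuse = slave_reuses_chroot(allow, '9283.0.0-rc1', '9283.0.0-rc1')
```

With the decision made once on the master and handed down, master and slaves cannot disagree about whether the chroot is fresh.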
Labels: -current-issue Hotlist-Fixit
FixIt candidate. 
Status: Archived (was: Untriaged)
This bug is very old, is Untriaged, and has no owner. If it is still relevant, reopen it as Untriaged or open a new bug.
