Issue 867032

Starred by 3 users

Issue metadata

Status: Duplicate
Merged: issue 791061
Owner:
Closed: Dec 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



CQ run "Failed to promote manifest" repeatedly

Project Member Reported by la...@chromium.org, Jul 24

Issue description

It's *possible* that we just lost a race 20 times in a row...

https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F19204%2F%2B%2Frecipes%2Fsteps%2FCommitQueueCompletion%2F0%2Fstdout

09:16:35: INFO: Retrying to promote manifest:  Retry 21/20
09:16:35: ERROR: Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 682, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 713, in PerformStage
    super(CommitQueueCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 297, in PerformStage
    self.HandleSuccess()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 207, in HandleSuccess
    self._run.attrs.manifest_manager.PromoteCandidate()
  File "/b/c/cbuild/repository/chromite/cbuildbot/lkgm_manager.py", line 501, in PromoteCandidate
    raise PromoteCandidateException(last_error)
PromoteCandidateException: Failed to promote manifest. error: return code: 1; command: git push origin temp_auto_checkin_branch:refs/heads/master
 
We've auto-submitted 869 CLs to manifest-versions in the last 24 hours, and many of them are clustered in time.

It doesn't seem super unlikely that we would have really bad luck once in a while.
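
For illustration only, here is a minimal sketch of how a racy push can be retried with a re-sync and randomized backoff between attempts, so that builders which collided once are less likely to collide again. This is not chromite's implementation: `push_with_retry`, `sync`, `push`, and all parameter names and defaults other than the 20-retry limit are hypothetical, and the two callables stand in for whatever rebases the local change and runs the `git push`.

```python
import random
import time


def push_with_retry(sync, push, max_retries=20, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep, rng=random.random):
    """Retry a racy push, re-syncing and backing off between attempts.

    `sync` and `push` are hypothetical callables supplied by the caller:
    `sync()` rebases the local change onto the current remote head, and
    `push()` returns True on success or False when the remote rejected it.
    """
    attempts = max_retries + 1  # one initial try plus max_retries retries
    for attempt in range(attempts):
        sync()
        if push():
            return attempt  # number of retries that were needed
        if attempt < max_retries:
            # Full jitter: sleep a random time up to an exponentially
            # growing cap before trying again.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
    raise RuntimeError("push failed after %d attempts" % attempts)
```

Backoff only reduces the chance of repeated collisions. It does not remove the contention itself, which is why reducing the number of submits is still the more durable fix.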

I suspect the best fix is to stop generating most of the auto-submits. I doubt we still need to submit anything at all, other than the pinned manifests generated by master builders for consumption by slaves.
There is an open bug with the GOB team to investigate these types of errors: http://b/111686132

Can you update it with your most recent findings to give them some additional information?

-- Mike
#2: This isn't obviously-incorrect behavior from GoB.
Cc: shu...@chromium.org mikenichols@chromium.org vapier@chromium.org la...@chromium.org
Issue 867462 has been merged into this issue.
Cc: sjg@chromium.org
Labels: -Pri-3 Pri-2
Status: Available (was: Untriaged)
We need to fix this using the strategy that dgarrett@ outlined in Comment #1: there's no reason that we should be submitting anything from the child builders.
Cc: evgreen@chromium.org
Issue 865038 has been merged into this issue.
Labels: -Pri-2 OS-Chrome Pri-1
Owner: athilenius@chromium.org
Status: Assigned (was: Available)
This is also affecting canaries. Raising priority and assigning to oncall. Alec, you'll want to look at Chromite to figure out how to stop uploading empty manifest-versions from the child builders.
I'll take a look and see what can be grokked; it might take me a bit, though.
I suggest looking for code that uses the "manifest_version" library. We probably generate all of the submits through there.

I've also been meaning to look at cleaning this up, but haven't gotten to it yet.
Where is manifest_versions.manifest checked in? I see https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/manifest_version.py, which seems to take care of all the manifest stuff, but I can't find the actual file it's generating anywhere in codesearch.
Hum... that's a thought.

We used to have to submit CLs via the gerrit web API. Eventually permissions were updated, and we started doing it directly through git as a performance improvement.

Are we still submitting manifest-versions CLs via the gerrit API? That could be related to the flake.
Also, those two repos aren't in the manifest, and aren't in code search. Maybe they should be.
I know we still need a temporary solution, but this sounds like a thing that needs to be rolled into the CI API, even if the final datasource lives in GOB, no?
It needs rethinking badly, that's for certain.

There are two questions to answer as part of that:

A) How do we distribute manifests from scheduler to builder?
B) How do we track the exact manifest used for a given release number, e.g. "123.0.0"?

And of course, we need to preserve all existing history.
Labels: -Pri-1 Pri-2
Mergedinto: 791061
Status: Duplicate (was: Assigned)