New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 791061 link

Starred by 16 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 4
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on: View detail
issue 903372
issue 904271

Blocking:
issue 917421



Sign in to add a comment

stop writing to manifest-version from canary slaves | No failed slave, CommitQueueCompletion Failed | PromoteCandidateException: Failed to promote manifest.

Project Member Reported by xixuan@chromium.org, Dec 1 2017

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17082

Error msg:

06:13:08: INFO: Failed to promote manifest. error: return code: 1; command: git push origin temp_auto_checkin_branch:refs/heads/master
To https://chrome-internal.googlesource.com/chromeos/manifest-versions
 ! [rejected]                temp_auto_checkin_branch -> master (fetch first)
error: failed to push some refs to 'https://chrome-internal.googlesource.com/chromeos/manifest-versions'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
cmd=['git', 'push', 'origin', 'temp_auto_checkin_branch:refs/heads/master'], cwd=/b/c/cbuild/repository/manifest-versions-internal
 
Cc: xixuan@chromium.org
 Issue 790043  has been merged into this issue.

Comment 2 by nxia@chromium.org, Dec 4 2017

Status: WontFix (was: Untriaged)
"Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref."

It's because there were other builds updating the repository 'https://chrome-internal.googlesource.com/chromeos/manifest-versions' at the same time, CQ gave up after 20 times of retry. 
Status: Untriaged (was: WontFix)
How to avoid this? or Will this block any CLs to be submitted?

Comment 4 by nxia@chromium.org, Dec 5 2017

Owner: ----
CQ has already retried from 06:13:08 to 06:23:31 (20 times) to avoid this and still failed, this does happen to CQ sometimes but rarely. 

yes. this will block CQ from submitting CLs, but it won't reject CLs.
Labels: Chase-Pending
Move to chase-pending for discussion to see whether we have more ideas then.
Cc: nxia@chromium.org akes...@chromium.org
Labels: -Pri-1 -Chase-Pending Hotlist-Fixit Pri-2
Summary: stop writing to manifest-version from canary slaves | No failed slave, CommitQueueCompletion Failed | PromoteCandidateException: Failed to promote manifest. (was: No failed slave, CommitQueueCompletion Failed)
Possible race between cq master and (cq/canary) slaves (that might also be writing their status here?)

Possible solutions:
 - stop writing to manifest-version from slaves
 - more retries? :(

Prefer the former.
Cc: jclinton@chromium.org
+jclinton: 
An example bug caused because of our dependence on manifest(-internal) repos.
We've been slowly whiting away at this dependence, but given resources, it may be worth a concerted push.

Another instance: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17940
logs: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17940
Why would a CQ canary slave be pushing any manifest updates in the first place? That's the part that I don't understand.
Cc: pprabhu@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Can someone take a swing at answering Comment 8?

Comment 10 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Labels: -Pri-2 Pri-1 Type-Feature
Owner: jclinton@chromium.org
Status: Available (was: Untriaged)
Okay, so this caused another round of breakages this week. So, let's get this fixed.
Status: Assigned (was: Available)
Cc: apronin@chromium.org yamaguchi@chromium.org
 Issue 891542  has been merged into this issue.
And again (per dupe).
Can  issue 892292  also be related then?
Definitely unrelated.
I was assuming these status files were unused, and going to stop writing them as a quick side project.

What I found is that pass/fail information written out IS used to find the past passing build, so that we can correctly display blame lists.

This can be revamped to be buildbucket or CIDB based, but is a bigger change than I was expecting to tackle.
Cc: -apronin@chromium.org
Blockedon: 904271 903372
If we implement the design described in the blocking bugs, then we can use the gitiles commit metadata in Buildbucket to derive the CL blamelist delta from build N-1 to N. That would address the last remaining need for child commits to display blamelists and solve this bug.

Do we actually care if it's a SHA1? Or would a buildspec work just as well?

The buildspec would be somewhat more friendly to users. Roughly, something like one of these:
  1.2.3
  CQ/buildspec/1.2.3-rc2


I'm not sure but if we look at an example from Chrome Browser today <https://ci.chromium.org/p/chrome/g/internal.bling.main/console?limit=3>, we can see that the gitiles reference is actually used to turn the left-hand row header into a clickable link to the parent repo commit from which the build was launched. I'm not sure if we would get that feature if we used something else.


 Issue 905726  has been merged into this issue.
Cc: zwisler@chromium.org
Another reproduction on nami-release:

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929707842041731488

see the output in the CanaryCompletion stage.
Cc: kbleicher@chromium.org vapier@chromium.org gu...@chromium.org djmm@chromium.org bhthompson@chromium.org josa...@chromium.org
 Issue 913470  has been merged into this issue.
Cc: yueherngl@chromium.org caveh@chromium.org moragues@chromium.org
 Issue 913759  has been merged into this issue.
Looks like this happened again for the M71 master builder again (assuming it's this issue per #27 from yesterday).

Logs:
https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927465760759591888/+/steps/CanaryCompletion/0/stdout

PLEASE DO NOT HALT OR START a new M71 build without consulting kbleicher@; thanks!

May also have impacted the M71 PFQ build yesterday (again):

https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927471284596733616/+/steps/MasterSlaveLKGMSync/0/stdout

If the same issue for this and #29 can we escalate priority to P0 since we're failing on time-critical builds?  Thanks
Cc: matth...@chromium.org
Cc: la...@chromium.org
 Issue 913094  has been merged into this issue.
Cc: yawano@google.com
we've been seeing issues with android pfq.
https://bugs.chromium.org/p/chromium/issues/detail?id=913769

Issue 913769 has been merged into this issue.
Cc: bpastene@chromium.org achuith@chromium.org
 Issue 914446  has been merged into this issue.
This is the stage that is writing to manifest versions. Nearly all completion stages inherit from it.

http://cs/chromeos_public/chromite/cbuildbot/stages/completion_stages.py?l=59&rcl=572100bbdaa15bb70d6d7ae4d5d3259fb9481d36
I'm going to try to look at this this week. This is newly urgent because of another bug we fixed: we were previously not getting uprevs of binary packages because the CQ is rarely green. So, we added a new set of builders called post-submit that do uprevs from ToT landed CL's. Those new builders are adding further contention on this file.
Labels: -Type-Feature Type-Bug
Cc: dgarr...@chromium.org shu...@chromium.org sjg@chromium.org evgreen@chromium.org mikenichols@chromium.org
 Issue 867032  has been merged into this issue.
Looking at this now.
spitballing, but there's no value in slaves committing their manifest to git right ?  they can upload it like any other artifact alongside the files they upload.

however, for long term stuff, it makes sense for the master builders to commit theirs to git still ?
Right, I'm going to intentionally break blamelist creation for child builders because the blamelist will still be available on the parent builder.
As long as whatever changes are done here don't break the ability to have individual board --production tryjobs generate real OS images with official versions and manifest versions updates. 

Comment 42 makes me a bit nervous that if we did something like `cros tryjob --remote --production --branch=stabilize-11101.B caroline-release` we might get a partially formed build...
Does a production tryjob like that result in a new build number being generated?
It does today, for example the command above generated results seen in https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=&milestone=&chromeOsVersion=11101&chromeVersion=&startTimeFrom=&startTimeTo=&token=AAVs2SZYWZln6k4yW7Sq2Klv19j5%3A1544744030888#/

This is a key tool for the release owners to generate builds that are used to fill in missing builds and generate single patch builds for specific boards to get the releases out as quickly as possible.

While one could argue we could just build all of the boards against the stabilization branch, we lost that capability with the move to swarming (waterfall allowed us to point at a particular branch for the master), and even if we could I am afraid it would use up too many resources (both risk of running out of GCE bots available, and overloading in the HWTest lab). If we had an easy way to start the master against a unique branch, and we were comfortable with the resource loading, we could drop support for making individual board official builds. 
Just a sanity check ; 
"git push" retries are happening 20 times without doing a git pull (I think somewhere in the code there's `git remote update` but I don't see them in the logs) isn't that the reason it's failing?

Or are we that contentious that 20 retries between push / pull will fail?

... I only see about 2 CLs / minute.
https://chrome-internal.googlesource.com/chromeos/manifest-versions


Also if we don't need to push directly to git and use an indirect push.
https://gerrit-review.googlesource.com/Documentation/user-upload.html#auto_merge

`push to refs/for/refs/heads/master%submit` and it would work?


Looks like the latest CQ failure in CommitQueueSync is related to this issue.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927183519708937648

23:39:41: INFO: Translating result <class 'chromite.cbuildbot.manifest_version.GenerateBuildSpecException'>: Failed to generate LKGM Candidate. error: return code: 1; command: git push origin temp_auto_checkin_branch:refs/heads/master
To https://chrome-internal.googlesource.com/chromeos/manifest-versions
 ! [rejected]                temp_auto_checkin_branch -> master (fetch first)
error: failed to push some refs to 'https://chrome-internal.googlesource.com/chromeos/manifest-versions'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
The problematic commits are at the end of the build where slaves record pass/fail, which conceptually shouldn't be in git at all (anymore). They aren't actually committing the manifests they built.

When slaves are run with --production, they act like a master and create a new builspec for themselves (and conceptually for other builders). I believe that to be an accidental feature, and not really supported. I do believe the CI team should create a supported workflow for that use case, and there is an existing bug for that (somewhere).
Re: #47: I checked and we are re-syncing on every push attempt: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/manifest_version.py?q=%22Updating+manifest-versions+checkout%22&sq=package:%5Echromeos&g=0&l=138 . Every time that 'Updating manifest-versions checkout' shows up in the logs, we are doing 'git remote update ...'.

The %submit API does appear to be a quick, elegant solution to this problem so I'm about finished with that implementation. As it's the end of the day on a Friday and there's really no way to test this, I'll stay late, chump the change as everyone goes home, and look for impacts as they arise.
Project Member

Comment 51 by bugdroid1@chromium.org, Dec 15

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/35c2661c2fba66b519542503b556037ac9c00a08

commit 35c2661c2fba66b519542503b556037ac9c00a08
Author: Jason D. Clinton <jclinton@chromium.org>
Date: Sat Dec 15 00:14:26 2018

cbuildbot/manifest_version.py: Use Gerrit autosubmit feature

When we are pushing manifests, let Gerrit manage automerging rather
than trying to do it client-side.

BUG= chromium:791061 
TEST=run_tests

Change-Id: Ibb0748a822d3e7b547c72f62b1a56ddee8bc0c54
Reviewed-on: https://chromium-review.googlesource.com/c/1378909
Tested-by: Jason Clinton <jclinton@chromium.org>
Trybot-Ready: Jason Clinton <jclinton@chromium.org>
Reviewed-by: Mike Nichols <mikenichols@chromium.org>

[modify] https://crrev.com/35c2661c2fba66b519542503b556037ac9c00a08/cbuildbot/manifest_version.py

Much to my surprise, that seems to have worked on almost the first try. Automerge feature requires that the repo "Submit type" *not* be set to "Fast forward only". I didn't realize that until I got quite a few error messages on the postsubmit builder. Having changed the repo to "Rebase if necessary", it seems that things are now merging without error.

However, we're not quite out of the woods yet. This change affected the vast majority of commits and especially those from child builders. But, we still have manifest-versions submits happening a few other places such as in the parent builders when establishing the LKGM manifest and also a similar codepath for PFQ's. These commits will be fighting against the automerged ones so there's still a chance that they will fail even if the chance is reduced.

Things should be much better for the moment, though, so I'll leave that follow-up until Monday.
 Issue 915432  has been merged into this issue.
Labels: -Pri-1 Pri-2
That resulted in our first successful post-submit run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927109628284744048

The Portage packages have been updated to days-newer versions, as a result.

It also appears that we're closing to passing the CQ. And I'm seeing successful Release builder child completions.


Blocking: 917421
Cc: allenwebb@chromium.org rspangler@chromium.org x...@chromium.org
 Issue 917421  has been merged into this issue.
From merged bug, LKGM isn't using the fancy automerge feature yet so it still has to contend with other merges to succeed.
the flakes have gone down significantly though even by converting some, so thanks for that
Labels: -Pri-2 Pri-1
Right, so time to apply the optimisation to the other code paths.
Project Member

Comment 61 by bugdroid1@chromium.org, Jan 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/c48496fe54a9ff1850b2b718697d29d4a1264ff4

commit c48496fe54a9ff1850b2b718697d29d4a1264ff4
Author: Jason D. Clinton <jclinton@chromium.org>
Date: Thu Jan 03 22:26:08 2019

cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout

This will use the autosubmit feature for all manifest_version pushes.

BUG= chromium:791061 
TEST=run_tests

Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6
Reviewed-on: https://chromium-review.googlesource.com/c/1394573
Tested-by: Jason Clinton <jclinton@chromium.org>
Reviewed-by: David Burger <dburger@chromium.org>

[modify] https://crrev.com/c48496fe54a9ff1850b2b718697d29d4a1264ff4/cbuildbot/manifest_version.py

Status: Fixed (was: Assigned)
Like before, there's no way to test that CL in CQ and I'm pretty confident based on previous success prior to this refactor so I chumped it. Hopefully, this fixes this issue once and for all. Once. And. For. All.

Will monitor builds tonight to ensure.


Status: Started (was: Fixed)
Nope. Rolling back.
Project Member

Comment 64 by bugdroid1@chromium.org, Jan 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/8743424b5d6d73e0c62cc87a21f9ac17f151495e

commit 8743424b5d6d73e0c62cc87a21f9ac17f151495e
Author: Jason Clinton <jclinton@chromium.org>
Date: Fri Jan 04 00:53:31 2019

Revert "cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout"

This reverts commit c48496fe54a9ff1850b2b718697d29d4a1264ff4.

Reason for revert: Breaks pushes on some code paths with "missing Change-Id in message footer"

Original change's description:
> cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout
> 
> This will use the autosubmit feature for all manifest_version pushes.
> 
> BUG= chromium:791061 
> TEST=run_tests
> 
> Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6
> Reviewed-on: https://chromium-review.googlesource.com/c/1394573
> Tested-by: Jason Clinton <jclinton@chromium.org>
> Reviewed-by: David Burger <dburger@chromium.org>

Bug:  chromium:791061 
Change-Id: Icc06a2c6a0db2502bc6267ea5b5a59a5b1958cab
Reviewed-on: https://chromium-review.googlesource.com/c/1395597
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/8743424b5d6d73e0c62cc87a21f9ac17f151495e/cbuildbot/manifest_version.py

Cc: ecgh@chromium.org cjmcdonald@chromium.org
The CL broke the CQ and the Android, Chrome, and Release Sync stages. A period of one hour was affected. Rolling back returned everything to status-quo.

It appears that the change landed on Dec 15 may have actually broken the Chrome PFQ, too, in a different way, but the Chrome PFQ has been broken for a chromeos-fonts issue for so long that we didn't notice. One such example slipped in between the fonts issue here: https://bugs.chromium.org/p/chromium/issues/detail?id=913317#c34 . So, as of this moment, even with the rollback above, I believe that Chrome PFQ is still broken in this way.

That said, I'm not entirely sure why it broke everything. The error message was:

    remote: ERROR: commit 67ba119: missing Change-Id in message footer        

The CL did not modify the commit message in any way and we have never used Change-Id for this repo. An example failed commit message was:

  Automatic: Start master-paladin master 11519.0.0-rc3
  CrOS-Build-Id: 3308631

The failure example from the Chrome PFQ was:

  Automatic checkin: status=pass build_version 11483.0.0-rc1 for amd64-generic-full 

But an example CL that submits with the change that landed on the 15th is:

  Automatic checkin: status=pass build_version 11316.58.0 for falco-release

Clearly the commit message is actually not to blame.

Also, Change-Id requirement is turned off for manifest-versions repo via permission inheritance.

I think we're dealing with another Gerrit bug. Continuing to investigate.

Cc: djkurtz@chromium.org
 Issue 918201  has been merged into this issue.
It turns out that all of the -full builders have been failing this way since Dec 15: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?builderGroups=full&limit=3&buildBranch=master

I believe that the root cause was that "Require Change-Id in commit message" was set to "true" for chromiumos/manifest-versions while it was set to "false" for chromeos/manifest-versions.

I've explicitly set the "Require Change-Id in commit message" to false for external manifest-versions and for internal manifest-versions instead of relying on inheritance.

I will watch the -full builders tonight to confirm that fixes them and then, if so, reland the change above.

Yes, that fixed the full builders. Will reland the change tomorrow when I am around to monitor it. Then, we can call this fixed.
Project Member

Comment 69 by bugdroid1@chromium.org, Jan 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/6551d805f7ca4b096754c9da754721074d690336

commit 6551d805f7ca4b096754c9da754721074d690336
Author: Jason D. Clinton <jclinton@chromium.org>
Date: Fri Jan 04 14:54:26 2019

Reland "cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout"

This is a reland of c48496fe54a9ff1850b2b718697d29d4a1264ff4

Original change's description:
> cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout
> 
> This will use the autosubmit feature for all manifest_version pushes.
> 
> BUG= chromium:791061 
> TEST=run_tests
> 
> Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6
> Reviewed-on: https://chromium-review.googlesource.com/c/1394573
> Tested-by: Jason Clinton <jclinton@chromium.org>
> Reviewed-by: David Burger <dburger@chromium.org>

Bug:  chromium:791061 
Change-Id: I0abfc0c7a9c28d45aa4dee6c9abaec79c603d385
Reviewed-on: https://chromium-review.googlesource.com/c/1396257
Reviewed-by: Jason Clinton <jclinton@chromium.org>
Tested-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/6551d805f7ca4b096754c9da754721074d690336/cbuildbot/manifest_version.py

Status: Fixed (was: Started)
Right. That should do it. Will monitor.
Everything looks hunky-dory.
Cc: kblairh@chromium.org jclinton@google.com martinroth@chromium.org ayatane@chromium.org teravest@chromium.org adurbin@chromium.org npoojary@chromium.org vineeths@chromium.org
 Issue 921347  has been merged into this issue.

Comment 73 by jclinton@chromium.org, Jan 18 (4 days ago)

Cc: nednguyen@chromium.org ahass...@chromium.org
 Issue 923530  has been merged into this issue.

Sign in to add a comment