stop writing to manifest-version from canary slaves | No failed slave, CommitQueueCompletion Failed | PromoteCandidateException: Failed to promote manifest. |
|||||||||||||||||||||||||||||||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17082 Error msg: 06:13:08: INFO: Failed to promote manifest. error: return code: 1; command: git push origin temp_auto_checkin_branch:refs/heads/master To https://chrome-internal.googlesource.com/chromeos/manifest-versions ! [rejected] temp_auto_checkin_branch -> master (fetch first) error: failed to push some refs to 'https://chrome-internal.googlesource.com/chromeos/manifest-versions' hint: Updates were rejected because the remote contains work that you do hint: not have locally. This is usually caused by another repository pushing hint: to the same ref. You may want to first integrate the remote changes hint: (e.g., 'git pull ...') before pushing again. hint: See the 'Note about fast-forwards' in 'git push --help' for details. cmd=['git', 'push', 'origin', 'temp_auto_checkin_branch:refs/heads/master'], cwd=/b/c/cbuild/repository/manifest-versions-internal ⛆ |
|
|
,
Dec 4 2017
"Updates were rejected because the remote contains work that you do hint: not have locally. This is usually caused by another repository pushing hint: to the same ref." It's because there were other builds updating the repository 'https://chrome-internal.googlesource.com/chromeos/manifest-versions' at the same time, CQ gave up after 20 times of retry.
,
Dec 5 2017
How to avoid this? or Will this block any CLs to be submitted?
,
Dec 5 2017
CQ has already retried from 06:13:08 to 06:23:31 (20 times) to avoid this and still failed, this does happen to CQ sometimes but rarely. yes. this will block CQ from submitting CLs, but it won't reject CLs.
,
Dec 5 2017
Move to chase-pending for discussion to see whether we have more ideas then.
,
Dec 11 2017
Possible race between cq master and (cq/canary) slaves (that might also be writing their status here?) Possible solutions: - stop writing to manifest-version from slaves - more retries? :( Prefer the former.
,
Mar 7 2018
+jclinton: An example bug caused because of our dependence on manifest(-internal) repos. We've been slowly whiting away at this dependence, but given resources, it may be worth a concerted push. Another instance: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17940 logs: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17940
,
Mar 8 2018
Why would a CQ canary slave be pushing any manifest updates in the first place? That's the part that I don't understand.
,
Apr 20 2018
Can someone take a swing at answering Comment 8?
,
Jun 8 2018
,
Jul 27
Okay, so this caused another round of breakages this week. So, let's get this fixed.
,
Aug 2
,
Sep 24
This seems to happen again. https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/19670
,
Oct 4
,
Oct 4
And again (per dupe).
,
Oct 4
Can issue 892292 also be related then?
,
Oct 4
Definitely unrelated.
,
Oct 10
I was assuming these status files were unused, and going to stop writing them as a quick side project. What I found is that pass/fail information written out IS used to find the past passing build, so that we can correctly display blame lists. This can be revamped to be buildbucket or CIDB based, but is a bigger change than I was expecting to tackle.
,
Oct 12
,
Nov 7
another instance on grunt release build: https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8930584118511634480/+/steps/CanaryCompletion/0/stdout
,
Nov 12
,
Nov 12
Do we actually care if it's a SHA1? Or would a buildspec work just as well? The buildspec would be somewhat more friendly to users. Roughly, something like one of these: 1.2.3 CQ/buildspec/1.2.3-rc2
,
Nov 12
I'm not sure but if we look at an example from Chrome Browser today <https://ci.chromium.org/p/chrome/g/internal.bling.main/console?limit=3>, we can see that the gitiles reference is actually used to turn the left-hand row header into a clickable link to the parent repo commit from which the build was launched. I'm not sure if we would get that feature if we used something else.
,
Nov 15
Issue 905726 has been merged into this issue.
,
Nov 15
,
Nov 16
Another reproduction on nami-release: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929707842041731488 see the output in the CanaryCompletion stage.
,
Dec 10
Issue 913470 has been merged into this issue.
,
Dec 11
Issue 913759 has been merged into this issue.
,
Dec 11
Looks like this happened again for the M71 master builder again (assuming it's this issue per #27 from yesterday). Logs: https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927465760759591888/+/steps/CanaryCompletion/0/stdout PLEASE DO NOT HALT OR START a new M71 build without consulting kbleicher@; thanks!
,
Dec 11
May also have impacted the M71 PFQ build yesterday (again): https://logs.chromium.org/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927471284596733616/+/steps/MasterSlaveLKGMSync/0/stdout If the same issue for this and #29 can we escalate priority to P0 since we're failing on time-critical builds? Thanks
,
Dec 11
,
Dec 11
,
Dec 11
,
Dec 12
we've been seeing issues with android pfq. https://bugs.chromium.org/p/chromium/issues/detail?id=913769
,
Dec 12
Issue 913769 has been merged into this issue.
,
Dec 12
,
Dec 12
This is the stage that is writing to manifest versions. Nearly all completion stages inherit from it. http://cs/chromeos_public/chromite/cbuildbot/stages/completion_stages.py?l=59&rcl=572100bbdaa15bb70d6d7ae4d5d3259fb9481d36
,
Dec 12
I'm going to try to look at this this week. This is newly urgent because of another bug we fixed: we were previously not getting uprevs of binary packages because the CQ is rarely green. So, we added a new set of builders called post-submit that do uprevs from ToT landed CL's. Those new builders are adding further contention on this file.
,
Dec 12
,
Dec 12
Issue 867032 has been merged into this issue.
,
Dec 13
Looking at this now.
,
Dec 13
spitballing, but there's no value in slaves committing their manifest to git right ? they can upload it like any other artifact alongside the files they upload. however, for long term stuff, it makes sense for the master builders to commit theirs to git still ?
,
Dec 13
Right, I'm going to intentionally break blamelist creation for child builders because the blamelist will still be available on the parent builder.
,
Dec 13
As long as whatever changes are done here don't break the ability to have individual board --production tryjobs generate real OS images with official versions and manifest versions updates. Comment 42 makes me a bit nervous that if we did something like `cros tryjob --remote --production --branch=stabilize-11101.B caroline-release` we might get a partially formed build...
,
Dec 13
Does a production tryjob like that result in a new build number being generated?
,
Dec 13
It does today, for example the command above generated results seen in https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=&milestone=&chromeOsVersion=11101&chromeVersion=&startTimeFrom=&startTimeTo=&token=AAVs2SZYWZln6k4yW7Sq2Klv19j5%3A1544744030888#/ This is a key tool for the release owners to generate builds that are used to fill in missing builds and generate single patch builds for specific boards to get the releases out as quickly as possible. While one could argue we could just build all of the boards against the stabilization branch, we lost that capability with the move to swarming (waterfall allowed us to point at a particular branch for the master), and even if we could I am afraid it would use up too many resources (both risk of running out of GCE bots available, and overloading in the HWTest lab). If we had an easy way to start the master against a unique branch, and we were comfortable with the resource loading, we could drop support for making individual board official builds.
,
Dec 14
Just a sanity check ; "git push" retries are happening 20 times without doing a git pull (I think somewhere in the code there's `git remote update` but I don't see them in the logs) isn't that the reason it's failing? Or are we that contentious that 20 retries between push / pull will fail? ... I only see about 2 CLs / minute. https://chrome-internal.googlesource.com/chromeos/manifest-versions Also if we don't need to push directly to git and use an indirect push. https://gerrit-review.googlesource.com/Documentation/user-upload.html#auto_merge `push to refs/for/refs/heads/master%submit` and it would work?
,
Dec 14
Looks like the latest CQ failure in CommitQueueSync is related to this issue. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927183519708937648 23:39:41: INFO: Translating result <class 'chromite.cbuildbot.manifest_version.GenerateBuildSpecException'>: Failed to generate LKGM Candidate. error: return code: 1; command: git push origin temp_auto_checkin_branch:refs/heads/master To https://chrome-internal.googlesource.com/chromeos/manifest-versions ! [rejected] temp_auto_checkin_branch -> master (fetch first) error: failed to push some refs to 'https://chrome-internal.googlesource.com/chromeos/manifest-versions' hint: Updates were rejected because the remote contains work that you do hint: not have locally. This is usually caused by another repository pushing hint: to the same ref. You may want to first integrate the remote changes hint: (e.g., 'git pull ...') before pushing again. hint: See the 'Note about fast-forwards' in 'git push --help' for details.
,
Dec 14
The problematic commits are at the end of the build where slaves record pass/fail, which conceptually shouldn't be in git at all (anymore). They aren't actually committing the manifests they built. When slaves are run with --production, they act like a master and create a new builspec for themselves (and conceptually for other builders). I believe that to be an accidental feature, and not really supported. I do believe the CI team should create a supported workflow for that use case, and there is an existing bug for that (somewhere).
,
Dec 14
Re: #47: I checked and we are re-syncing on every push attempt: https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/manifest_version.py?q=%22Updating+manifest-versions+checkout%22&sq=package:%5Echromeos&g=0&l=138 . Every time that 'Updating manifest-versions checkout' shows up in the logs, we are doing 'git remote update ...'. The %submit API does appear to be a quick, elegant solution to this problem so I'm about finished with that implementation. As it's the end of the day on a Friday and there's really no way to test this, I'll stay late, chump the change as everyone goes home, and look for impacts as they arise.
,
Dec 15
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/35c2661c2fba66b519542503b556037ac9c00a08 commit 35c2661c2fba66b519542503b556037ac9c00a08 Author: Jason D. Clinton <jclinton@chromium.org> Date: Sat Dec 15 00:14:26 2018 cbuildbot/manifest_version.py: Use Gerrit autosubmit feature When we are pushing manifests, let Gerrit manage automerging rather than trying to do it client-side. BUG= chromium:791061 TEST=run_tests Change-Id: Ibb0748a822d3e7b547c72f62b1a56ddee8bc0c54 Reviewed-on: https://chromium-review.googlesource.com/c/1378909 Tested-by: Jason Clinton <jclinton@chromium.org> Trybot-Ready: Jason Clinton <jclinton@chromium.org> Reviewed-by: Mike Nichols <mikenichols@chromium.org> [modify] https://crrev.com/35c2661c2fba66b519542503b556037ac9c00a08/cbuildbot/manifest_version.py
,
Dec 15
Much to my surprise, that seems to have worked on almost the first try. Automerge feature requires that the repo "Submit type" *not* be set to "Fast forward only". I didn't realize that until I got quite a few error messages on the postsubmit builder. Having changed the repo to "Rebase if necessary", it seems that things are now merging without error. However, we're not quite out of the woods yet. This change affected the vast majority of commits and especially those from child builders. But, we still have manifest-versions submits happening a few other places such as in the parent builders when establishing the LKGM manifest and also a similar codepath for PFQ's. These commits will be fighting against the automerged ones so there's still a chance that they will fail even if the chance is reduced. Things should be much better for the moment, though, so I'll leave that follow-up until Monday.
,
Dec 15
Issue 915432 has been merged into this issue.
,
Dec 15
That resulted in our first successful post-submit run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8927109628284744048 The Portage packages have been updated to days-newer versions, as a result. It also appears that we're closing to passing the CQ. And I'm seeing successful Release builder child completions.
,
Dec 21
,
Dec 28
Issue 917421 has been merged into this issue.
,
Dec 28
From merged bug, LKGM isn't using the fancy automerge feature yet so it still has to contend with other merges to succeed.
,
Dec 28
the flakes have gone down significantly though even by converting some, so thanks for that
,
Jan 3
In case it is helpful to track here: looks like this caused 2 CQ failures (Failed to generate LKGM Candidate) and a M-72 release build failure yesterday/today. https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8925375357436629296/+/steps/CommitQueueSync/0/stdout https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8925329198877985920/+/steps/CommitQueueSync/0/stdout https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8925375959038817984/+/steps/ManifestVersionedSync/0/stdout
,
Jan 3
Right, so time to apply the optimisation to the other code paths.
,
Jan 3
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/c48496fe54a9ff1850b2b718697d29d4a1264ff4 commit c48496fe54a9ff1850b2b718697d29d4a1264ff4 Author: Jason D. Clinton <jclinton@chromium.org> Date: Thu Jan 03 22:26:08 2019 cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout This will use the autosubmit feature for all manifest_version pushes. BUG= chromium:791061 TEST=run_tests Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6 Reviewed-on: https://chromium-review.googlesource.com/c/1394573 Tested-by: Jason Clinton <jclinton@chromium.org> Reviewed-by: David Burger <dburger@chromium.org> [modify] https://crrev.com/c48496fe54a9ff1850b2b718697d29d4a1264ff4/cbuildbot/manifest_version.py
,
Jan 3
Like before, there's no way to test that CL in CQ and I'm pretty confident based on previous success prior to this refactor so I chumped it. Hopefully, this fixes this issue once and for all. Once. And. For. All. Will monitor builds tonight to ensure.
,
Jan 4
Nope. Rolling back.
,
Jan 4
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/8743424b5d6d73e0c62cc87a21f9ac17f151495e commit 8743424b5d6d73e0c62cc87a21f9ac17f151495e Author: Jason Clinton <jclinton@chromium.org> Date: Fri Jan 04 00:53:31 2019 Revert "cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout" This reverts commit c48496fe54a9ff1850b2b718697d29d4a1264ff4. Reason for revert: Breaks pushes on some code paths with "missing Change-Id in message footer" Original change's description: > cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout > > This will use the autosubmit feature for all manifest_version pushes. > > BUG= chromium:791061 > TEST=run_tests > > Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6 > Reviewed-on: https://chromium-review.googlesource.com/c/1394573 > Tested-by: Jason Clinton <jclinton@chromium.org> > Reviewed-by: David Burger <dburger@chromium.org> Bug: chromium:791061 Change-Id: Icc06a2c6a0db2502bc6267ea5b5a59a5b1958cab Reviewed-on: https://chromium-review.googlesource.com/c/1395597 Reviewed-by: Jason Clinton <jclinton@chromium.org> Tested-by: Jason Clinton <jclinton@chromium.org> [modify] https://crrev.com/8743424b5d6d73e0c62cc87a21f9ac17f151495e/cbuildbot/manifest_version.py
,
Jan 4
The CL broke the CQ and the Android, Chrome, and Release Sync stages. A period of one hour was affected. Rolling back returned everything to status-quo. It appears that the change landed on Dec 15 may have actually broken the Chrome PFQ, too, in a different way, but the Chrome PFQ has been broken for a chromeos-fonts issue for so long that we didn't notice. One such example slipped in between the fonts issue here: https://bugs.chromium.org/p/chromium/issues/detail?id=913317#c34 . So, as of this moment, even with the rollback above, I believe that Chrome PFQ is still broken in this way. That said, I'm not entirely sure why it broke everything. The error message was: remote: ERROR: commit 67ba119: missing Change-Id in message footer The CL did not modify the commit message in any way and we have never used Change-Id for this repo. An example failed commit message was: Automatic: Start master-paladin master 11519.0.0-rc3 CrOS-Build-Id: 3308631 The failure example from the Chrome PFQ was: Automatic checkin: status=pass build_version 11483.0.0-rc1 for amd64-generic-full But an example CL that submits with the change that landed on the 15th is: Automatic checkin: status=pass build_version 11316.58.0 for falco-release Clearly the commit message is actually not to blame. Also, Change-Id requirement is turned off for manifest-versions repo via permission inheritance. I think we're dealing with another Gerrit bug. Continuing to investigate.
,
Jan 4
,
Jan 4
It turns out that all of the -full builders have been failing this way since Dec 15: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?builderGroups=full&limit=3&buildBranch=master I believe that the root cause was that "Require Change-Id in commit message" was set to "true" for chromiumos/manifest-versions while it was set to "false" for chromeos/manifest-versions. I've explicitly set the "Require Change-Id in commit message" to false for external manifest-versions and for internal manifest-versions instead of relying on inheritance. I will watch the -full builders tonight to confirm that fixes them and then, if so, reland the change above.
,
Jan 4
Yes, that fixed the full builders. Will reland the change tomorrow when I am around to monitor it. Then, we can call this fixed.
,
Jan 4
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/6551d805f7ca4b096754c9da754721074d690336 commit 6551d805f7ca4b096754c9da754721074d690336 Author: Jason D. Clinton <jclinton@chromium.org> Date: Fri Jan 04 14:54:26 2019 Reland "cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout" This is a reland of c48496fe54a9ff1850b2b718697d29d4a1264ff4 Original change's description: > cbuildbot/manifest_version.py: Use Gerrit autosubmit feature throughout > > This will use the autosubmit feature for all manifest_version pushes. > > BUG= chromium:791061 > TEST=run_tests > > Change-Id: I1f3560056e2353ef2074cdcbef08a4a0216cf0f6 > Reviewed-on: https://chromium-review.googlesource.com/c/1394573 > Tested-by: Jason Clinton <jclinton@chromium.org> > Reviewed-by: David Burger <dburger@chromium.org> Bug: chromium:791061 Change-Id: I0abfc0c7a9c28d45aa4dee6c9abaec79c603d385 Reviewed-on: https://chromium-review.googlesource.com/c/1396257 Reviewed-by: Jason Clinton <jclinton@chromium.org> Tested-by: Jason Clinton <jclinton@chromium.org> [modify] https://crrev.com/6551d805f7ca4b096754c9da754721074d690336/cbuildbot/manifest_version.py
,
Jan 4
Right. That should do it. Will monitor.
,
Jan 4
Everything looks hunky-dory.
,
Jan 14
Issue 921347 has been merged into this issue.
,
Jan 18
(4 days ago)
|
||||||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||||||||
Comment 1 by slavamn@chromium.org
, Dec 2 2017