Uprev failing due to kernel version lookup failing
Issue description
M72 had build failures on the latest RC.
Boards not included: bob, veyron-mighty, clapper, snappy, celes
bob: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642390956290832/+/steps/Uprev/0/stdout
veyron-mighty: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642312582650800/+/steps/Uprev/0/stdout
clapper: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642385249505568/+/steps/Uprev/0/stdout
snappy: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642324744951600/+/steps/Uprev/0/stdout
celes: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642387478585872/+/steps/Uprev/0/stdout
Error: DieSystemExit: 1
22:49:36: ERROR: return code: 1; command: /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable commit --all '--boards=bob' '--drop_file=/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/scripts/cbuildbot_package.list' --buildroot /b/swarming/wzvhV6l/ir/cache/cbuild/repository --overlay-type both
cmd=['/b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable', 'commit', '--all', u'--boards=bob', '--drop_file=/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/scripts/cbuildbot_package.list', '--buildroot', '/b/swarming/wzvhV6l/ir/cache/cbuild/repository', '--overlay-type', u'both'], cwd=/b/swarming/wzvhV6l/ir/cache/cbuild/repository
22:49:36: ERROR: /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable failed (code=1)
22:49:36: INFO: Translating result /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable failed (code=1) to fail.
22:49:36: INFO: Running cidb query on pid 20718, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fbe328c2350>
22:49:36: INFO: Running cidb query on pid 20718, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7fbe328c2490>
,
Dec 21
,
Dec 21
this error is flaking on ToT too.
22:48:38: ERROR: Package chromeos-kernel-experimental has a chromeos-version.sh script but it returned no valid version for "/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/third_party/kernel/experimental"
if kernel/experimental isn't actually being used (doesn't seem to have been touched in 3 months), let's punt it.
,
Dec 21
Who'd take the AI for the punt? Fairly critical that builds not be flaky.... thanks
,
Dec 21
Most definitely not me. I'd rather understand _why_ kernel/experimental suddenly started to generate this error instead of just dropping it. Sure, we can always recreate it when needed, but whatever happened here can happen again, even more so if we don't know what is going on in the first place. In other words, dropping kernel/experimental will potentially just paint over some other problem, and I'd rather know what that is (and why our builders play with kernel/experimental in the first place instead of leaving it alone).
,
Dec 22
i agree it shouldn't be failing. but independently, we shouldn't be wasting resources on it. it's not clear to me why we even need this when we have "next" ...
,
Dec 22
'next' and 'experimental' were distinctly different, one being used for Intel development and one to test the ongoing kernel rebase. At the time we needed both.
,
Dec 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/fe7496a6260122df864c0f8045e7821a866ff2e6
commit fe7496a6260122df864c0f8045e7821a866ff2e6
Author: Mike Frysinger <vapier@chromium.org>
Date: Sat Dec 29 07:32:45 2018
chromeos-kernel-experimental: blacklist ebuild
This isn't actively used. Blacklist it so we can drop it from the manifest.
BUG=chromium:917099
TEST=None
Change-Id: I9b11c8b1816840e73d59294d16c9437f7c5e97d5
Reviewed-on: https://chromium-review.googlesource.com/c/1390215
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
[modify] https://crrev.com/fe7496a6260122df864c0f8045e7821a866ff2e6/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-9999.ebuild
[modify] https://crrev.com/fe7496a6260122df864c0f8045e7821a866ff2e6/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-4.18_rc2-r21.ebuild
,
Jan 7
Now fixed, right?
,
Jan 15
,
Jan 15
If this was happening in 72, do we need to merge this CL to 72?
,
Jan 15
> 'next' and 'experimental' were distinctly different, one being used for Intel development and one to test the ongoing kernel rebase. At the time we needed both.
if the ebuild isn't going to be in a builder, then imo it doesn't belong in manifest. kernel repos are not cheap.
> If this was happening in 72, do we need to merge this CL to 72?
we can, but we've started seeing the failure move on to other kernel repos (with less frequency it seems). so whatever the problem is, it's still there. but maybe the lower freq is good enough for existing release branches.
,
Jan 15
,
Jan 15
This bug requires manual review: We are only 13 days from stable. Please contact the milestone owner if you have questions. Owners: govind@(Android), kariahda@(iOS), djmm@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jan 15
,
Jan 16
FYI this is still breaking the CQ. (Although I am not sure it's exactly the same.)
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924218365360476160
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924218365360476160/+/steps/Uprev/0/stdout
16:55:48: ERROR: Package lakitu-kernel-4_14 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v4.14"
16:55:48: INFO: Determining whether to create new ebuild /b/swarming/w/ir/cache/cbuild/repository/src/overlays/overlay-lakitu/sys-kernel/dump-capture-kernel/dump-capture-kernel-0.0.1-r77.ebuild
16:55:48: INFO: Creating new stable ebuild /b/swarming/w/ir/cache/cbuild/repository/src/overlays/overlay-lakitu/sys-kernel/dump-capture-kernel/dump-capture-kernel-0.0.1-r77.ebuild
16:55:48: INFO: New ebuild commit id: "ef6df1cf8b33cf10779fe1f3102dca86f24a6e2c"
16:55:48: ERROR: Package lakitu-kernel-4_4 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v4.4"
16:55:50: INFO: Determining whether to create new ebuild /b/swarmi
,
Jan 16
FWIW, I don't see kernel-experimental in the logs from #16. I don't think the problem is really related to kernel-experimental.
,
Jan 16
(6 days ago)
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/manifest/+/ba2ff1b4510747c49d1c87e77a4bb27187db75a5
commit ba2ff1b4510747c49d1c87e77a4bb27187db75a5
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Jan 16 09:47:01 2019
drop unused kernel/experimental
This isn't actively used and is wasting space. Drop it.
BUG=chromium:917099
TEST=None
Change-Id: Ia1ff9941ed7c16831c37fb62706f8c30a344654c
Reviewed-on: https://chromium-review.googlesource.com/1412217
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
[modify] https://crrev.com/ba2ff1b4510747c49d1c87e77a4bb27187db75a5/full.xml
,
Jan 16
(6 days ago)
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/manifest-internal/+/7262d3b15a3b9cfa8264b105fa197a71e7cb50ec commit 7262d3b15a3b9cfa8264b105fa197a71e7cb50ec Author: Mike Frysinger <vapier@chromium.org> Date: Wed Jan 16 09:46:54 2019
,
Jan 16
(6 days ago)
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/dd856918e43cf460f0973a495d7039785a0feed4
commit dd856918e43cf460f0973a495d7039785a0feed4
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Jan 16 16:59:48 2019
chromeos-kernel-experimental: blacklist ebuild
This isn't actively used. Blacklist it so we can drop it from the manifest.
BUG=chromium:917099
TEST=None
Change-Id: I9b11c8b1816840e73d59294d16c9437f7c5e97d5
Reviewed-on: https://chromium-review.googlesource.com/c/1390216
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Commit-Queue: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>
[modify] https://crrev.com/dd856918e43cf460f0973a495d7039785a0feed4/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-9999.ebuild
[modify] https://crrev.com/dd856918e43cf460f0973a495d7039785a0feed4/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-4.18_rc2-r21.ebuild
,
Jan 16
(6 days ago)
We merged the blacklist CL to 72, but we need to get to the bottom of this. Random build flakes on release branches are not something we can allow to continue for long; this will quickly become a P0 as 72 nears stable if the blacklisting CL does not resolve it.
,
Jan 16
(6 days ago)
at this point, i suspect it might be related to git sync what with the other flakes/errors we've seen there. but we might need to add more debugging to the uprev code to display when there's a failure first.
,
Jan 16
(6 days ago)
This and other flakes that cause images to go missing on scheduled release days have a significant ripple effect, forcing severe scheduling adjustments across multiple teams. 72 will be going to stable in just a couple of weeks. We really can't afford much more of this. Can we make this a P0.5?
,
Jan 16
(6 days ago)
Re #16---that breakage didn't actually break the CQ because lakitu is experimental (unrelated to kernel-experimental). But I suppose it could have happened anywhere.
,
Jan 16
(6 days ago)
Here's another one, from the latest run of buddy-release.
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180839835796032
02:55:52: ERROR: Package chromeos-kernel-3_10 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v3.10"
,
Jan 16
(6 days ago)
Also setzer-release. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180694734934672 +jclinton FYI
,
Jan 16
(6 days ago)
dgarrett@: are you actively working on this? If not, we need to find someone to drive this.
,
Jan 16
(6 days ago)
Also wizpig-release. Interestingly, these are all for kernel 3.10. Sorry for the spam, I am probably done now. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180670429218160
,
Jan 16
(6 days ago)
Re #27: No, I never did, was just under the impression it was a solved issue. I thought it was the same thing as the kernel-experimental issue.
,
Jan 16
(6 days ago)
This isn't a CI issue and it looks solved from the log
,
Jan 16
(6 days ago)
> This isn't a CI issue and it looks solved from the log
Sorry---what do you mean by "this"? And by "solved"? Should I open a different bug for the failures in #25, #26, and #28? Thanks.
,
Jan 16
(6 days ago)
Jason, any chance this is the git corruption bug we've been seeing?
,
Jan 16
(6 days ago)
A chance? Sure. It could also be cosmic rays. We don't know because the logging isn't there. Build team, please add logging or maybe attempt to log in to a bot and repro manually. In the meantime, if it is related to the git corruption, we don't have a root cause on that one but mikenichols@ is working on a mitigation on issue 919166. We should assume that this issue is not related to git corruption and be working toward root-causing it. Uprev is owned by Build and my (incomplete) comment in #30 was meant to clarify why Don shouldn't be focusing on this bug report at this stage.
,
Jan 16
(6 days ago)
Lamont: can you take a look at this?
,
Jan 16
(6 days ago)
Never say never. Also peach_pit-release, also kernel 3.10. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924150541125996304 At this point it's a safe bet that there is a correlation.
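To check the suspected kernel 3.10 correlation against more than a handful of hand-picked builds, the uprev error lines could be tallied per package and per source directory. The following is only a rough sketch: the regex is derived from the error text quoted in this thread, and the function name `tally_version_failures` is hypothetical, not part of chromite.

```python
import collections
import posixpath
import re

# Pattern based on the error lines quoted in this bug; an assumption that
# the message format is stable across builders.
_ERR_RE = re.compile(
    r'ERROR: Package (?P<pkg>\S+) has a chromeos-version\.sh script but it '
    r'returned no valid version for "(?P<path>[^"]*)"')


def tally_version_failures(log_text):
    """Count version-lookup failures by package and by kernel source dir."""
    by_pkg = collections.Counter()
    by_dir = collections.Counter()
    for match in _ERR_RE.finditer(log_text):
        by_pkg[match.group('pkg')] += 1
        # Last path component, e.g. "v3.10" or "experimental".
        by_dir[posixpath.basename(match.group('path'))] += 1
    return by_pkg, by_dir
```

Feeding it the concatenated Uprev stdout from several failing builds would show whether one source directory really dominates or whether the failures are spread across kernel repos.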
,
Jan 16
(6 days ago)
let me implement my ideas in comment #22 and see if that helps drive further debugging. we'd want that in general anyways as sometimes people writing custom chromeos-version.sh need a little help.
,
Jan 16
(6 days ago)
i'm about to grab dinner. here's what i was thinking if you want to run with it more.
--- a/lib/portage_util.py
+++ b/lib/portage_util.py
@@ -885,16 +885,15 @@ class EBuild(object):
     # The chromeos-version script will output a usable raw version number,
     # or nothing in case of error or no available version
-    try:
-      output = self._RunCommand([vers_script] + srcdirs).strip()
-    except cros_build_lib.RunCommandError as e:
-      cros_build_lib.Die('Package %s chromeos-version.sh failed: %s' %
-                         (self.pkgname, e))
+    result = self._RunCommand(['bash', '-x', vers_script] + srcdirs,
+                              error_code_ok=True)
-    if not output:
-      cros_build_lib.Die('Package %s has a chromeos-version.sh script but '
-                         'it returned no valid version for "%s"' %
-                         (self.pkgname, ' '.join(srcdirs)))
+    output = result.output
+    if result.returncode or not output:
+      cros_build_lib.Die(
+          'Package %s has a chromeos-version.sh script but failed:\n'
+          'return code = %s\nstdout = %s\nstderr = %s\ndir listing = %s\n',
+          self.pkgname, result.returncode, result.output, result.error, ...)
     # Sanity check: disallow versions that will be larger than the 9999 ebuild
     # used by cros-workon.
prob want to include the srcdirs, the listing of the srcdirs, and the .git/ subdirs too. that should help us debug a bit more. although if it's git corruption, we might have to also run some `git` commands in each subdir to see what's wrong.
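A standalone sketch of the extra debugging described above, for anyone who wants to try it outside chromite. It uses plain `subprocess` rather than chromite's helpers, and the function names (`run_capture`, `describe_srcdir`, `version_failure_report`) and the choice of `git rev-parse HEAD` and `git status --porcelain` as sanity checks are assumptions for illustration, not the code that landed. Unlike the diff, it strips the script output before testing for emptiness.

```python
import os
import subprocess


def run_capture(cmd, cwd=None):
    """Run cmd and return (returncode, stdout, stderr) without raising."""
    try:
        proc = subprocess.run(cmd, cwd=cwd, stdout=subprocess.PIPE,
                              stderr=subprocess.PIPE, universal_newlines=True)
    except OSError as e:
        # Binary missing or cwd unusable; report it instead of crashing.
        return 127, '', str(e)
    return proc.returncode, proc.stdout, proc.stderr


def describe_srcdir(srcdir):
    """Collect a dir listing plus a few git sanity checks for one srcdir."""
    info = {'srcdir': srcdir, 'exists': os.path.isdir(srcdir)}
    if not info['exists']:
        return info
    info['listing'] = sorted(os.listdir(srcdir))
    info['has_dot_git'] = os.path.exists(os.path.join(srcdir, '.git'))
    # Hypothetical choice of checks; a corrupt checkout should fail these.
    for name, cmd in (('head', ['git', 'rev-parse', 'HEAD']),
                      ('status', ['git', 'status', '--porcelain'])):
        info[name] = run_capture(cmd, cwd=srcdir)
    return info


def version_failure_report(pkgname, vers_script, srcdirs):
    """Return None if the script printed a version, else a debug report."""
    code, out, err = run_capture(['bash', '-x', vers_script] + list(srcdirs))
    if code == 0 and out.strip():
        return None
    lines = ['Package %s: chromeos-version.sh failed' % pkgname,
             'return code = %s' % code,
             'stdout = %r' % out,
             'stderr = %r' % err]
    for srcdir in srcdirs:
        lines.append('srcdir info = %r' % describe_srcdir(srcdir))
    return '\n'.join(lines)
```

Because `bash -x` writes its trace to stderr, the report shows which command inside the version script produced the empty output, alongside the state of each source directory at the time of failure.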
,
Jan 17
(5 days ago)
Failed again in celes-release. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924089028221336256 Should I even bother reporting failures?
,
Jan 17
(5 days ago)
At this point it's understood and being worked. Until there is a change in that, I wouldn't bother updating.
,
Jan 17
(5 days ago)
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1417954 should give us more information about what is going on. It doesn't solve the problem, but may let us make progress.
,
Jan 17
(5 days ago)
,
Yesterday
(37 hours ago)
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible! If all merges have been completed, please remove any remaining Merge-Approved labels from this issue. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Today
(6 hours ago)
The change has landed in both master and R72-11316.B -- I am looking for the next example of a failure. Feel free to point one out if you see it.
Comment 1 by xixuan@chromium.org, Dec 20
Owner: dgarrett@google.com