
Issue 917099

Starred by 4 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Uprev failing due to kernel version lookup failing

Project Member Reported by dgagnon@google.com, Dec 20

Issue description

M72 had build failures on the latest RC

Boards not included:
bob
veyron-mighty
clapper
snappy
celes


bob:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642390956290832/+/steps/Uprev/0/stdout

veyron-mighty:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642312582650800/+/steps/Uprev/0/stdout

clapper:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642385249505568/+/steps/Uprev/0/stdout


snappy:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642324744951600/+/steps/Uprev/0/stdout


celes:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926642387478585872/+/steps/Uprev/0/stdout


Error:

DieSystemExit: 1

22:49:36: ERROR: 
return code: 1; command: /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable commit --all '--boards=bob' '--drop_file=/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/scripts/cbuildbot_package.list' --buildroot /b/swarming/wzvhV6l/ir/cache/cbuild/repository --overlay-type both
cmd=['/b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable', 'commit', '--all', u'--boards=bob', '--drop_file=/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/scripts/cbuildbot_package.list', '--buildroot', '/b/swarming/wzvhV6l/ir/cache/cbuild/repository', '--overlay-type', u'both'], cwd=/b/swarming/wzvhV6l/ir/cache/cbuild/repository

22:49:36: ERROR: /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable failed (code=1)
22:49:36: INFO: Translating result /b/swarming/wzvhV6l/ir/cache/cbuild/repository/chromite/bin/cros_mark_as_stable failed (code=1) to fail.
22:49:36: INFO: Running cidb query on pid 20718, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fbe328c2350>
22:49:36: INFO: Running cidb query on pid 20718, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7fbe328c2490>
 
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>Build
Owner: dgarrett@google.com
Re-assign to CI bobby as it looks like a build issue.
Cc: vapier@chromium.org dgarr...@chromium.org
Cc: groeck@chromium.org
Summary: Uprev failing due to chromeos-kernel-experimental version lookup failing (was: M72 latest RC board build failed - BuildScriptFailure)
this error is flaking on ToT too.
22:48:38: ERROR: Package chromeos-kernel-experimental has a chromeos-version.sh script but it returned no valid version for "/b/swarming/wzvhV6l/ir/cache/cbuild/repository/src/third_party/kernel/experimental"

if kernel/experimental isn't actually being used (doesn't seem to have been touched in 3 months), let's punt it.
Labels: M-72
Who'd take the AI for the punt?  Fairly critical that builds not be flaky....   thanks
Most definitely not me. I'd rather understand _why_  kernel/experimental suddenly started to generate this error instead of just dropping it. Sure, we can always recreate it when needed, but whatever happened here can happen again, even more so if we don't know what is going on in the first place.
In other words, dropping kernel/experimental will potentially just paint over some other problem, and I'd rather know what that is (and why our builders play with kernel/experimental in the first place instead of leaving it alone).
i agree it shouldn't be failing.  but independently, we shouldn't be wasting resources on it.  it's not clear to me why we even need this when we have "next" ...
'next' and 'experimental' were distinctly different, one being used for Intel development and one to test the ongoing kernel rebase. At the time we needed both.

Project Member

Comment 8 by bugdroid1@chromium.org, Dec 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/fe7496a6260122df864c0f8045e7821a866ff2e6

commit fe7496a6260122df864c0f8045e7821a866ff2e6
Author: Mike Frysinger <vapier@chromium.org>
Date: Sat Dec 29 07:32:45 2018

chromeos-kernel-experimental: blacklist ebuild

This isn't actively used.  Blacklist it so we can drop it from the manifest.

BUG=chromium:917099
TEST=None

Change-Id: I9b11c8b1816840e73d59294d16c9437f7c5e97d5
Reviewed-on: https://chromium-review.googlesource.com/c/1390215
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/fe7496a6260122df864c0f8045e7821a866ff2e6/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-9999.ebuild
[modify] https://crrev.com/fe7496a6260122df864c0f8045e7821a866ff2e6/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-4.18_rc2-r21.ebuild

Status: Fixed (was: Untriaged)
Now fixed, right?
Cc: akes...@chromium.org
Issue 921764 has been merged into this issue.
If this was happening in 72, do we need to merge this CL to 72?
Status: Available (was: Fixed)
> 'next' and 'experimental' were distinctly different, one being used for Intel development and one to test the ongoing kernel rebase. At the time we needed both.

if the ebuild isn't going to be in a builder, then imo it doesn't belong in manifest.  kernel repos are not cheap.

> If this was happening in 72, do we need to merge this CL to 72?

we can, but we've started seeing the failure move on to other kernel repos (with less frequency it seems).  so whatever the problem is, it's still there.

but maybe the lower freq is good enough for existing release branches.
Labels: Merge-Request-72
Project Member

Comment 14 by sheriffbot@chromium.org, Jan 15

Labels: -Merge-Request-72 Merge-Review-72 Hotlist-Merge-Review
This bug requires manual review: We are only 13 days from stable.
Please contact the milestone owner if you have questions.
Owners: govind@(Android), kariahda@(iOS), djmm@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-72 Merge-Approved-72
FYI this is still breaking the CQ.  (Although I am not sure it's exactly the same.)

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924218365360476160

https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8924218365360476160/+/steps/Uprev/0/stdout

16:55:48: ERROR: Package lakitu-kernel-4_14 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v4.14"
16:55:48: INFO: Determining whether to create new ebuild /b/swarming/w/ir/cache/cbuild/repository/src/overlays/overlay-lakitu/sys-kernel/dump-capture-kernel/dump-capture-kernel-0.0.1-r77.ebuild
16:55:48: INFO: Creating new stable ebuild /b/swarming/w/ir/cache/cbuild/repository/src/overlays/overlay-lakitu/sys-kernel/dump-capture-kernel/dump-capture-kernel-0.0.1-r77.ebuild
16:55:48: INFO: New ebuild commit id: "ef6df1cf8b33cf10779fe1f3102dca86f24a6e2c"
16:55:48: ERROR: Package lakitu-kernel-4_4 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v4.4"
16:55:50: INFO: Determining whether to create new ebuild /b/swarmi

FWIW, I don't see kernel-experimental in the logs from #16. I don't think the problem is really related to kernel-experimental.

Project Member

Comment 18 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/manifest/+/ba2ff1b4510747c49d1c87e77a4bb27187db75a5

commit ba2ff1b4510747c49d1c87e77a4bb27187db75a5
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Jan 16 09:47:01 2019

drop unused kernel/experimental

This isn't actively used and is wasting space.  Drop it.

BUG=chromium:917099
TEST=None

Change-Id: Ia1ff9941ed7c16831c37fb62706f8c30a344654c
Reviewed-on: https://chromium-review.googlesource.com/1412217
Commit-Ready: Mike Frysinger <vapier@chromium.org>
Tested-by: Mike Frysinger <vapier@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>

[modify] https://crrev.com/ba2ff1b4510747c49d1c87e77a4bb27187db75a5/full.xml

Project Member

Comment 19 by bugdroid1@chromium.org, Jan 16 (6 days ago)

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/manifest-internal/+/7262d3b15a3b9cfa8264b105fa197a71e7cb50ec

commit 7262d3b15a3b9cfa8264b105fa197a71e7cb50ec
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Jan 16 09:46:54 2019

Project Member

Comment 20 by bugdroid1@chromium.org, Jan 16 (6 days ago)

Labels: merge-merged-release-R72-11316.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/dd856918e43cf460f0973a495d7039785a0feed4

commit dd856918e43cf460f0973a495d7039785a0feed4
Author: Mike Frysinger <vapier@chromium.org>
Date: Wed Jan 16 16:59:48 2019

chromeos-kernel-experimental: blacklist ebuild

This isn't actively used.  Blacklist it so we can drop it from the manifest.

BUG=chromium:917099
TEST=None

Change-Id: I9b11c8b1816840e73d59294d16c9437f7c5e97d5
Reviewed-on: https://chromium-review.googlesource.com/c/1390216
Reviewed-by: Bernie Thompson <bhthompson@chromium.org>
Commit-Queue: Bernie Thompson <bhthompson@chromium.org>
Tested-by: Bernie Thompson <bhthompson@chromium.org>

[modify] https://crrev.com/dd856918e43cf460f0973a495d7039785a0feed4/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-9999.ebuild
[modify] https://crrev.com/dd856918e43cf460f0973a495d7039785a0feed4/sys-kernel/chromeos-kernel-experimental/chromeos-kernel-experimental-4.18_rc2-r21.ebuild

Comment 21 by bhthompson@google.com, Jan 16 (6 days ago)

We merged the blacklist CL to 72, but we need to get to the bottom of this. Having random build flakes on release branches is not something we can allow to continue for long; this will quickly become a P0 as 72 nears stable if the blacklisting CL does not resolve it.

Comment 22 by vapier@chromium.org, Jan 16 (6 days ago)

at this point, i suspect it might be related to git sync what with the other flakes/errors we've seen there.  but we might need to add more debugging to the uprev code to display when there's a failure first.

Comment 23 by djmm@google.com, Jan 16 (6 days ago)

This and other flakes that cause images to go missing on scheduled release days have a significant ripple effect that causes severe scheduling adjustments affecting multiple teams.  72 will be going to stable in just a couple of weeks.  We really can't afford much more of this.  Can we make this a P0.5?

Comment 24 by semenzato@chromium.org, Jan 16 (6 days ago)

Re #16---that breakage didn't actually break the CQ because lakitu is experimental (unrelated to kernel-experimental).  But I suppose it could have happened anywhere.

Comment 25 by semenzato@chromium.org, Jan 16 (6 days ago)

Summary: Uprev failing due to kernel version lookup failing (was: Uprev failing due to chromeos-kernel-experimental version lookup failing)
Here's another one, from the latest run of buddy-release.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180839835796032


02:55:52: ERROR: Package chromeos-kernel-3_10 has a chromeos-version.sh script but it returned no valid version for "/b/swarming/w/ir/cache/cbuild/repository/src/third_party/kernel/v3.10"


Comment 26 by semenzato@chromium.org, Jan 16 (6 days ago)

Cc: jclinton@chromium.org
Also setzer-release.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180694734934672

+jclinton FYI

Comment 27 by nedngu...@google.com, Jan 16 (6 days ago)

dgarrett@: are you actively working on this? If not, we need to find someone to drive this.

Comment 28 by semenzato@chromium.org, Jan 16 (6 days ago)

Also wizpig-release.  Interestingly, these are all for kernel 3.10.  Sorry for the spam, I am probably done now.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924180670429218160

Comment 29 by dgarr...@chromium.org, Jan 16 (6 days ago)

Re #27: No, I never did, was just under the impression it was a solved issue.

I thought it was the same thing as the kernel-experimental issue.

Comment 30 by jclinton@chromium.org, Jan 16 (6 days ago)

Owner: ----
This isn't a CI issue and it looks solved from the log

Comment 31 by semenzato@chromium.org, Jan 16 (6 days ago)

> This isn't a CI issue and it looks solved from the log

Sorry---what do you mean by "this"?  And by "solved"?

Should I open a different bug for the failures in #25, #26, and #28?

Thanks.


Comment 32 by dgarr...@chromium.org, Jan 16 (6 days ago)

Jason, any chance this is the git corruption bug we've been seeing?

Comment 33 by jclinton@chromium.org, Jan 16 (6 days ago)

A chance? Sure. It could also be cosmic rays. We don't know because the logging isn't there. Build team, please add logging or maybe attempt to log in to a bot and repro manually.

In the meantime, if it is related to the git corruption, we don't have a root cause on that one but mikenichols@ is working on a mitigation on issue 919166.

We should assume that this issue is not related to git corruption and be working toward root-causing it. Uprev is owned by Build and my (incomplete) comment in #30 was meant to clarify why Don shouldn't be focusing on this bug report at this stage.

Comment 34 by nedngu...@google.com, Jan 16 (6 days ago)

Owner: lamontjones@chromium.org
Lamont: can you take a look at this?

Comment 35 by semenzato@chromium.org, Jan 16 (6 days ago)

Cc: evanhernandez@chromium.org
Never say never.  Also peach_pit-release, also kernel 3.10.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924150541125996304

At this point it's a safe bet that there is a correlation.

Comment 36 by vapier@chromium.org, Jan 16 (6 days ago)

Owner: vapier@chromium.org
let me implement my ideas in comment #22 and see if that helps drive further debugging.  we'd want that in general anyways as sometimes people writing custom chromeos-version.sh need a little help.

Comment 37 by vapier@chromium.org, Jan 16 (6 days ago)

Owner: lamontjones@chromium.org
i'm about to grab dinner.  here's what i was thinking if you want to run with it more.

--- a/lib/portage_util.py
+++ b/lib/portage_util.py
@@ -885,16 +885,15 @@ class EBuild(object):

     # The chromeos-version script will output a usable raw version number,
     # or nothing in case of error or no available version
-    try:
-      output = self._RunCommand([vers_script] + srcdirs).strip()
-    except cros_build_lib.RunCommandError as e:
-      cros_build_lib.Die('Package %s chromeos-version.sh failed: %s' %
-                         (self.pkgname, e))
+    result = self._RunCommand(['bash', '-x', vers_script] + srcdirs,
+                              error_code_ok=True)

-    if not output:
-      cros_build_lib.Die('Package %s has a chromeos-version.sh script but '
-                         'it returned no valid version for "%s"' %
-                         (self.pkgname, ' '.join(srcdirs)))
+    output = result.output
+    if result.returncode or not output:
+      cros_build_lib.Die(
+          'Package %s has a chromeos-version.sh script but failed:\n'
+          'return code = %s\nstdout = %s\nstderr = %s\ndir listing = %s\n',
+          self.pkgname, result.returncode, result.output, result.error, ...)

     # Sanity check: disallow versions that will be larger than the 9999 ebuild
     # used by cros-workon.

prob want to include the srcdirs, the listing of the srcdirs, and the .git/ subdirs too.  that should help us debug a bit more.  although if it's git corruption, we might have to also run some `git` commands in each subdir to see what's wrong.
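
As a rough illustration of the reporting idea above, here is a standalone Python sketch that runs a version script and, on failure, gathers the return code, stdout, stderr, and a listing of each source directory and its .git/ subdirectory. It is not the chromite change itself: `describe_version_failure` and `cmd_prefix` are hypothetical names, and it uses the standard `subprocess` module rather than chromite's own command runner.

```python
import os
import subprocess


def describe_version_failure(vers_script, srcdirs, cmd_prefix=("bash", "-x")):
    """Run a version script and return (version, report).

    Hypothetical helper, not part of chromite. `report` is None on success;
    otherwise it is a multi-line string meant to be logged before failing.
    """
    result = subprocess.run(
        list(cmd_prefix) + [vers_script] + list(srcdirs),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,  # inspect the return code ourselves instead of raising
    )
    version = result.stdout.strip()
    if result.returncode == 0 and version:
        return version, None

    # Failure: either a non-zero exit or no version printed on stdout.
    lines = [
        "version script failed: %s" % vers_script,
        "return code = %d" % result.returncode,
        "stdout = %r" % result.stdout,
        "stderr = %r" % result.stderr,
    ]
    for srcdir in srcdirs:
        # List both the checkout and its .git/ directory, since a missing or
        # half-synced checkout is one of the suspected causes.
        for path in (srcdir, os.path.join(srcdir, ".git")):
            try:
                listing = sorted(os.listdir(path))
            except OSError as e:
                listing = "unreadable (%s)" % e
            lines.append("listing of %s = %s" % (path, listing))
    return None, "\n".join(lines)
```

With the default `cmd_prefix`, the `bash -x` trace ends up in the captured stderr, so the report shows which command inside the script produced no output. If git corruption is suspected, the same loop could also run `git` commands in each directory; that is left out here.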

Comment 38 by semenzato@chromium.org, Jan 17 (5 days ago)

Failed again in celes-release.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8924089028221336256

Should I even bother reporting failures?

Comment 39 by dgarr...@chromium.org, Jan 17 (5 days ago)

At this point it's understood and being worked. Until there is a change in that, I wouldn't bother updating.

Comment 40 by lamontjones@chromium.org, Jan 17 (5 days ago)

https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1417954 should give us more information about what is going on.  It doesn't solve the problem, but may let us make progress.

Comment 41 by lamontjones@chromium.org, Jan 17 (5 days ago)

Status: Started (was: Available)
Project Member

Comment 42 by sheriffbot@chromium.org, Yesterday (37 hours ago)

Cc: bhthompson@google.com djmm@google.com
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 43 by lamontjones@chromium.org, Today (6 hours ago)

The change has landed in both master and R72-11316.B -- I am looking for the next example of a failure.  Feel free to point one out if you see it.
