Starred by 3 users
Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



build_image failing again in canary archive step with cryptic error
Project Member Reported by semenzato@chromium.org, Feb 6 2017
Dozens of dead canaries.

I have some issues with this log file.  The root cause, if present, is difficult to find.  Also, the code seems to have invented time travel: the timestamps are all over the place.

https://uberchromegw.corp.google.com/i/chromeos/builders/gru-release/builds/863/steps/Archive/logs/stdio

@@@BUILD_STEP@Archive@@@
************************************************************
@@@STEP_LINK@stdout-->stdio@https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fgru-release%2F863%2F%2B%2Frecipes%2Fsteps%2FArchive%2F0%2Fstdout@@@
** Start Stage Archive - Mon, 06 Feb 2017 04:39:48 -0800 (PST)
** 
** Archives build and test artifacts for developer consumption.
** 
**   Attributes:
**     release_tag: The release tag. E.g. 2981.0.0
**     version: The full version string, including the milestone.
**         E.g. R26-2981.0.0-b123
************************************************************
04:39:48: INFO: Created cidb engine bot@173.194.81.53 for pid 30625
04:39:48: INFO: Running cidb query on pid 30625, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f0882ba7f90>
Preconditions for the stage successfully met. Beginning to execute stage...
04:39:48: INFO: Running cidb query on pid 30625, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f0882baff10>
04:39:55: INFO: RunCommand: /b/cbuild/internal_master/chromite/bin/cros_sdk 'USE=-cros-debug chrome_internal' 'PARALLEL_EMERGE_STATUS_FILE=/tmp/tmpFpPkKD' -- ./mod_image_for_recovery.sh '--board=gru' '--image=/mnt/host/source/src/build/images/gru/R58-9256.0.0/tmpTduJlU/chromiumos_base_image.bin' in /b/cbuild/internal_master
04:47:58: INFO: RunCommand: /b/cbuild/internal_master/chromite/bin/cros_sdk 'USE=-cros-debug chrome_internal' 'PARALLEL_EMERGE_STATUS_FILE=/tmp/tmpai4amL' -- ./build_image '--board=gru' --replace '--symlink=factory_shim' '--build_attempt=3' factory_install in /b/cbuild/internal_master
04:58:56: ERROR: 
return code: 1; command: /b/cbuild/internal_master/chromite/bin/cros_sdk 'USE=-cros-debug chrome_internal' 'PARALLEL_EMERGE_STATUS_FILE=/tmp/tmpai4amL' -- ./build_image '--board=gru' --replace '--symlink=factory_shim' '--build_attempt=3' factory_install
 * Generating locale-archive: forcing # of jobs to 1


 
I noticed that several builders were interrupted around Feb 3 14:23:29, for example:
  https://uberchromegw.corp.google.com/i/chromeos/builders/auron_yuna-release/builds/821
  https://uberchromegw.corp.google.com/i/chromeos/builders/celes-release/builds/820
  https://uberchromegw.corp.google.com/i/chromeos/builders/gandof-release/builds/812
  https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release/builds/830

All the builds after that time failed for the same reason as this bug.

I suspect it may be caused by the latest push-to-prod. akeshet@ sent the push-to-prod email around that time:

[push to prod autotest fae149c..4de4e74 chromite 9b91a54..2b7dd13]

autotest | grep autotest
4de4e74d1 autotest: Remove backend dead codes related to recurring jobs.
f353b10fc [autotest] Update servo-stat to support servo v4 duts.
d8220fa92 [autotest] Update TKO db schema
e351a2c14 [autotest] Add sql script to create a stored procedure to clean up old tests in TKO
20a5d8b39 [autotest] [atomic] Remove atomic groups from server/
335f2162d [autotest] [atomic] Remove atomic groups from cli
395b34785 [autotest] Cleanup adb_host.

chromite
2b7dd13e Do not raise certain 409 GOBErrors on RemoveReady.
00fbffae Simple Chrome: use same compiler as ebuild workflow.
911af741 Search obsolete slaves with right buckets and tags in CleanUpStage.
150c4863 sand: add builder configuration
50fd9e75 Add CHROMEOS_ARC_ANDROID_SDK_VERSION to /etc/lsb-release.
44ef524b chrome_stages: Remove obsolete test for 'Lucid'.
74d7c3fd jadeite: add builder configuration
1811e6af license_lib: Don't crash on bad characters.
bddfc794 Remote connection should not fail on failed ping by default
1edbc1da generic_stages: Treat ExitEarlyException as CIDB success.
af2b2a1d som_alerts_dispatcher: handle missing logdog annotations gracefully
ed0b56ed chromite: Fix som and prpc libraries to handle spaces in hostnames.
Cc: akes...@chromium.org
There was some maintenance around the time 03 Feb 14:20 according to chromiumos-status.appspot.com.

akeshet	Fri, 03 Feb 14:20	Tree is closed for maintenance (mass builder reimaging happening now)
akeshet	Fri, 03 Feb 14:11	Tree is closed for maintenance (waterfall restart after current CQ run)

It may be related. +akeshet to see if he has any clues.
Cc: bleung@chromium.org
It failed to build the factory_install_image. Other images are fine and generated, like:
  https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/gru-release/R58-9250.0.0

But GoldenEye seems to treat it as a failed build and does not show any of the images, which makes this more serious.
Cc: -bleung@chromium.org -akes...@chromium.org -jinjingl@chromium.org -sheckylin@chromium.org -waihong@chromium.org leecy@chromium.org
Cc: akes...@chromium.org sheckylin@chromium.org jinjingl@chromium.org waihong@chromium.org bleung@chromium.org
Didn't mean to minus the Ccs. Added Christine for more comments on images in GoldenEye.
Comment 6 by leecy@chromium.org, Feb 6 2017
Yes, the archive stage has failed on these builders, so nothing has been copied to the chromeos-releases bucket:

e.g. 
https://pantheon.corp.google.com/storage/browser/chromeos-releases/canary-channel/samus/9256.0.0/?pli=1

vs.

https://pantheon.corp.google.com/storage/browser/chromeos-releases/canary-channel/buddy/9256.0.0/?pli=1

or for the example above: https://pantheon.corp.google.com/storage/browser/chromeos-releases/canary-channel/gru/9250.0.0

That means there aren't any signed images or any payloads for these boards, only the unsigned image produced by the builders (still in chromeos-image-archives).

I think we need to figure out what is causing the factory_install image generation to fail so the archive stage works.
Note that there are two bugs here.  One is, why did build_image fail, and two, can we make it easier to tell why it failed from the logs.  I think it makes sense to leave it as a single bug for now (maybe forever) but the two issues may need separate resolutions.
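For the second issue, here is a minimal sketch of a log scanner, assuming the `HH:MM:SS: LEVEL: message` line format seen in the Archive log above. The `scan_log` name and the `SUSPECT_PATTERNS` list are hypothetical and not part of chromite. It flags ERROR lines, lines that look like a root cause, and places where the timestamp goes backwards. It does not handle a run that crosses midnight.

```python
import re

# Matches lines such as "04:58:56: ERROR: ..." (format taken from the log above).
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}): (INFO|WARNING|ERROR): ?(.*)$")

# Substrings treated as hints of a root cause. This list is an assumption.
SUSPECT_PATTERNS = ("No such file or directory", "Could not open", "return code:")


def scan_log(lines):
    """Return (suspects, regressions) for an iterable of log lines.

    suspects: (line_number, text) for ERROR lines and lines matching
        SUSPECT_PATTERNS.
    regressions: (line_number, text) for lines whose timestamp is earlier
        than the previous timestamped line.
    """
    suspects = []
    regressions = []
    last_seconds = None
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        match = LOG_LINE.match(text)
        if match:
            hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
            now = hours * 3600 + minutes * 60 + seconds
            if last_seconds is not None and now < last_seconds:
                regressions.append((number, text))
            last_seconds = now
        is_error = bool(match) and match.group(4) == "ERROR"
        if is_error or any(pattern in text for pattern in SUSPECT_PATTERNS):
            suspects.append((number, text))
    return suspects, regressions
```

Running something like this over the stdio log would surface a "Could not open ... No such file or directory" line even when it has no timestamp and is far from the final ERROR line.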

Comment 8 by sbasi@chromium.org, Feb 6 2017
Cc: sbasi@chromium.org
Owner: ----
If build_image is failing this looks more to be a sheriff issue rather than a deputy issue.
#8 this is unclear.  It's not failing for all builds, so it could be a flake, which could be due to the build_image code, but also to infra issues. :/

Is there a recent build with an example of this? If so, please link to it.
Added link to recent build of this failure: https://uberchromegw.corp.google.com/i/chromeos/builders/gru-release/builds/863.
Re: #8: It seems to be pretty consistently those particular builds that are failing.  Coincidentally, if I look at GoldenEye, I can see that these boards are all the boards that have an ARC container (publicly enabled or not).
https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?milestone=58#/details.

Click open one of the statuses with a low success rate: all the builds with a missing signed image have an ARC version.
Cc: vapier@chromium.org dgreid@chromium.org
I compared the log with the code that prints it. Take this one as an example:
  https://uberchromegw.corp.google.com/i/chromeos/builders/gru-release/builds/863/steps/Archive/logs/stdio

It looks like the code at line 220, which calls gconv_strip, still worked fine.
  https://cs.corp.google.com/chromeos_public/src/scripts/build_library/base_image_util.sh?type=cs&q=delete_prompt+package:%5Echromeos_public$&l=220

It generated this log:
04:58:51: INFO: Searching for unused gconv files defined in /mnt/host/source/src/build/images/gru/R58-9256.0.0-a3/rootfs/usr/lib/gconv/gconv-modules
04:58:52: INFO: Will search for 1131 strings in 10 files
04:58:53: INFO: Done. Using 20 gconv modules. Removed 226 unused modules (17140.1 KiB) and 6 unused dependencies (928.0 KiB)

But the code at line 244, which calls insert_container_publickey.sh, seems not to have been reached:
  https://cs.corp.google.com/chromeos_public/src/scripts/build_library/base_image_util.sh?type=cs&q=delete_prompt+package:%5Echromeos_public$&l=244

The insert_container_publickey.sh script is supposed to print the message "Container verification key was installed. Do not forget to resign the image!" when done.
  https://cs.corp.google.com/chromeos_public/src/platform/vboot_reference/scripts/image_signing/insert_container_publickey.sh?q=insert_container_publickey&dr&l=45

However, this string didn't show up in the log. Instead, an error string appeared:
Could not open /mnt/host/source/src/build/images/gru/R58-9256.0.0-a3/rootfs/opt/google/containers/android/system.raw.img, because No such file or directory

The recent change that added the insert_container_publickey.sh script is the most suspicious.
  https://chromium-review.googlesource.com/#/c/430830/

Added the author dgreid@ to clarify whether this script works on a factory install image.
Cc: elijahtaylor@chromium.org
Owner: nya@chromium.org
It is not related to https://chromium-review.googlesource.com/#/c/430830/.

The cause lies somewhere earlier, i.e. at line 232, which calls get_arc_build_info.
https://cs.corp.google.com/chromeos_public/src/scripts/build_library/base_image_util.sh?type=cs&q=delete_prompt+package:%5Echromeos_public$&l=232

The changes are:
  https://chromium-review.googlesource.com/#/c/433997/
  https://chrome-internal-review.googlesource.com/c/321325/

That matches the error message:
Could not open /mnt/host/source/src/build/images/gru/R58-9256.0.0-a3/rootfs/opt/google/containers/android/system.raw.img, because No such file or directory

Cc: bhthompson@chromium.org
There have been some chromite changes related to ARC, but I'm not sure if Bernie's changes have landed yet.
The factory install shim reuses the same create_base_image() method. But some ARC++ logic doesn't apply to the factory install shim, like the suspicious calls in c#14 (insert_container_publickey) and c#15 (get_arc_build_info). We should either skip these calls in the factory install shim case or create a separate create_factory_install_image().
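A minimal sketch of the first option, written as a Python model rather than as a patch to base_image_util.sh. The step names mirror the functions mentioned in this thread, but `plan_steps`, `BASE_STEPS`, and `ARC_STEPS` are hypothetical, and the real script may be structured quite differently:

```python
# Hypothetical model of the image build plan. The step names come from this
# thread; the list structure and function name are assumptions.
BASE_STEPS = ["gconv_strip"]
ARC_STEPS = ["get_arc_build_info", "insert_container_publickey"]


def plan_steps(image_type, has_arc_container):
    """Return the ordered steps for an image, skipping ARC steps for the shim."""
    steps = list(BASE_STEPS)
    # Per the error above, the factory_install rootfs appears to lack
    # system.raw.img, so ARC-specific steps are skipped for that image type.
    if has_arc_container and image_type != "factory_install":
        steps.extend(ARC_STEPS)
    return steps
```

With this guard, `plan_steps("factory_install", True)` yields only the base steps, while a regular image on an ARC board still gets both ARC steps in their original order.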
This is the changelog for 9248.0.0 which is when the issue began: https://crosland.corp.google.com/log/9247.0.0..9248.0.0
Re c#18: for an infra issue (or any non-image-related issue) like this one, looking at the changes between (build-1)..(build) is not enough. We should look at the changes between (second-latest-push-to-prod)..(latest-push-to-prod). However, there seems to be no easy way to get the build number of a push-to-prod, and the push-to-prod schedule doesn't align with build boundaries.

Changes like the following are only pushed to the production servers on the regular push-to-prod schedule.
  https://chromium-review.googlesource.com/#/c/430830/
  https://chromium-review.googlesource.com/#/c/433997/
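One way to get at the push-to-prod delta is to read the commit ranges straight from the push-to-prod email subject quoted earlier in this thread. This is a minimal sketch: `parse_push_to_prod` and `git_log_commands` are hypothetical helpers, and using the repo name as the checkout directory for `git -C` is an assumption.

```python
import re

# Matches "<repo> <old>..<new>" pairs such as "chromite 9b91a54..2b7dd13".
RANGE = re.compile(r"(\w+) ([0-9a-f]+)\.\.([0-9a-f]+)")


def parse_push_to_prod(subject):
    """Map each repo named in a push-to-prod subject to its (old, new) commits."""
    return {repo: (old, new) for repo, old, new in RANGE.findall(subject)}


def git_log_commands(ranges):
    """Build one `git log` argv per repo covering the pushed commit range.

    Assumes each repo is checked out in a directory named after it.
    """
    return [
        ["git", "-C", repo, "log", "--oneline", f"{old}..{new}"]
        for repo, (old, new) in ranges.items()
    ]


subject = "[push to prod autotest fae149c..4de4e74 chromite 9b91a54..2b7dd13]"
for argv in git_log_commands(parse_push_to_prod(subject)):
    print(" ".join(argv))
```

This only covers repos named in the subject line. Changes in other repos still need the (build-1)..(build) changelog.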
Re comment 16: these appear to be before my changes in 9250. I don't see anything obvious in the delta. Maybe the new glib?
This definitely looks like a failure of https://chrome-internal-review.googlesource.com/c/321325/ and dependent CLs.  I think we should just revert them in the meantime
Project Member Comment 23 by bugdroid1@chromium.org, Feb 7 2017
Project Member Comment 24 by bugdroid1@chromium.org, Feb 7 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosutils/+/d21e05fead7fd12ab58d43558903f9b7f0a4a3cd

commit d21e05fead7fd12ab58d43558903f9b7f0a4a3cd
Author: Wai-Hong Tam <waihong@google.com>
Date: Tue Feb 07 03:01:18 2017

Revert "Pass ARC release info to cros_set_lsb_release."

The change broke Canary builds.

CQ-DEPEND=CL:*325265
BUG=chromium:689072
TEST=build_image --board=gru --replace --symlink=factory_shim \
     --build_attempt=3 factory_install

This reverts commit 180d3f8a79ae3dcf022e4ae51f3850bfd4be26d9.

Change-Id: Ieb08f353886632059201f11a4b41b7d99cd36182
Reviewed-on: https://chromium-review.googlesource.com/438771
Reviewed-by: Elijah Taylor <elijahtaylor@chromium.org>
Commit-Queue: Wai-Hong Tam <waihong@google.com>
Tested-by: Wai-Hong Tam <waihong@google.com>

[modify] https://crrev.com/d21e05fead7fd12ab58d43558903f9b7f0a4a3cd/build_library/base_image_util.sh

Comment 25 by nya@chromium.org, Feb 7 2017
Cc: nya@chromium.org
Owner: waihong@chromium.org
Status: Started
Sorry for the breakage, and thanks for the reverts. I reproduced this issue locally with ./build_image factory_install, so I believe those reverts will fix the issue.

I'm reassigning this issue to waihong@ since he made the reverts. I think we can mark it fixed after we verify that the release builders are back to green.

I'll reland those patches soon.

Chrome OS CQ is now failing with "ERROR: Could not determine Android SDK version":
https://luci-milo.appspot.com/buildbot/chromeos/veyron_minnie-paladin/1594

Does it mean the following change
https://chrome-internal-review.googlesource.com/c/322768/
also needs to be reverted for a while?
Comment 27 by nya@chromium.org, Feb 7 2017
Ah, yes, that's true. Sorry I forgot that those tests are in bvt-cq. I will make a revert.

Comment 28 by nya@chromium.org, Feb 7 2017
I created a revert:
https://chrome-internal-review.googlesource.com/c/325328

I enqueued it to CQ, but maybe we can chump this change since CQ will fail for sure without it. I'll defer to sheriffs.

Project Member Comment 29 by bugdroid1@chromium.org, Feb 7 2017
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/autotest-cheets/+/08c9ceca9b10092eff27dabe7610edcad7cbbae0

commit 08c9ceca9b10092eff27dabe7610edcad7cbbae0
Author: Shuhei Takahashi <nya@google.com>
Date: Tue Feb 07 16:38:38 2017

Status: Fixed
Both issues (the factory install image failing in Archive on Canary and "Could not determine Android SDK version" in CQ) are fixed and no longer occur on recent builds.
Project Member Comment 31 by bugdroid1@chromium.org, Feb 8 2017
The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/overlays/project-cheets-private/+/94eced08087144057bd19f78cabbd3b3a783ceb3

commit 94eced08087144057bd19f78cabbd3b3a783ceb3
Author: Shuhei Takahashi <nya@google.com>
Date: Wed Feb 08 06:27:03 2017

Project Member Comment 32 by bugdroid1@chromium.org, Feb 8 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/crosutils/+/7a82c91c41d6cd35d76668523101a3e5413e0e42

commit 7a82c91c41d6cd35d76668523101a3e5413e0e42
Author: Shuhei Takahashi <nya@chromium.org>
Date: Wed Feb 08 06:27:03 2017

Reland: Pass ARC release info to cros_set_lsb_release.

The cause was identified and fixed in CL:*325403.

This reverts commit d21e05fead7fd12ab58d43558903f9b7f0a4a3cd.

Original change's description:
> Revert "Pass ARC release info to cros_set_lsb_release."
>
> The change broke Canary builds.
>
> CQ-DEPEND=CL:*325265
> BUG=chromium:689072
> TEST=build_image --board=gru --replace --symlink=factory_shim \
>      --build_attempt=3 factory_install
>
> This reverts commit 180d3f8a79ae3dcf022e4ae51f3850bfd4be26d9.
>
> Change-Id: Ieb08f353886632059201f11a4b41b7d99cd36182
> Reviewed-on: https://chromium-review.googlesource.com/438771
> Reviewed-by: Elijah Taylor <elijahtaylor@chromium.org>
> Commit-Queue: Wai-Hong Tam <waihong@google.com>
> Tested-by: Wai-Hong Tam <waihong@google.com>

CQ-DEPEND=CL:*325403
BUG=b:34693882
TEST=build_image --board=samus-cheets
TEST=build_image --board=samus-cheets factory_install

Change-Id: I5bcdf360af6f6404420a5335c533ebe4cd69e456
Reviewed-on: https://chromium-review.googlesource.com/438905
Commit-Ready: Shuhei Takahashi <nya@chromium.org>
Tested-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Elijah Taylor <elijahtaylor@chromium.org>

[modify] https://crrev.com/7a82c91c41d6cd35d76668523101a3e5413e0e42/build_library/base_image_util.sh

Status: Assigned
Postmortem questions

a) What was the root cause of this?

b) If the root cause was a CL that was reverted, how did the CL manage to land before breaking the CQ? Was it chumped, or was there some kind of hole in the CQ testing?
Comment 34 by leecy@chromium.org, Feb 14 2017
I don't think the CQ does the same archiving steps (creating factory install shim) as the canary?
Hmm. In the weekly summary, some CQ failures were blamed on this bug. Was that blame incorrect?
Comment 37 by leecy@chromium.org, Feb 14 2017
I think that was an incomplete revert (mentioned above).
Comment 38 by nya@chromium.org, Feb 15 2017
#33:

We had two problems:
1. canary builder breakage
2. CQ breakage

1 was caused because my patch:
https://chrome-internal-review.googlesource.com/c/321325/
did not consider the case of factory_install images.

2 was due to incomplete reverts. The following changes were the initial reverts:
https://chrome-internal-review.googlesource.com/c/325265/
https://chromium-review.googlesource.com/c/438771/
They broke CQ. Actually we also needed this revert:
https://chrome-internal-review.googlesource.com/c/325328/

We could have avoided these breakages if:
1: we had tested the patch with ./build_image factory_install
2: we had committed the reverts via CQ
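For point 1, a minimal sketch of enumerating the image types to build before landing such a change, following the two TEST= lines in the reland commit above. `build_image_commands` is a hypothetical helper, and the default list of image types is an assumption:

```python
def build_image_commands(board, image_types=("", "factory_install")):
    """Return one ./build_image argv per image type.

    An empty string stands for the default image (no image-type argument).
    """
    commands = []
    for image_type in image_types:
        argv = ["./build_image", f"--board={board}"]
        if image_type:
            argv.append(image_type)
        commands.append(argv)
    return commands


for argv in build_image_commands("samus-cheets"):
    print(" ".join(argv))
```

Each argv could then be run inside the chroot, stopping at the first non-zero exit status.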

Comment 39 by aut...@google.com, Feb 21 2017
Labels: -current-issue
Labels: akeshet-pending-downgrade
ChromeOS Infra P1 Bugscrub.

P1 Bugs in this component should be important enough to get weekly status updates.

Is this already fixed?  -> Fixed
Is this no longer relevant? -> Archived or WontFix
Is this not a P1, based on go/chromeos-infra-bug-slo rubric? -> lower priority.
Is this a Feature Request rather than a bug? Type -> Feature
Is this missing important information or scope needed to decide how to proceed? -> Ask question on bug, possibly reassign.
Does this bug have the wrong owner? -> reassign.

Bugs that remain in this state next week will be downgraded to P2.
Labels: -akeshet-pending-downgrade Pri-2
ChromeOS Infra P1 Bugscrub.

Issue untouched in a week after previous message. Downgrading to P2.