
Issue 873868

Starred by 2 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature
Labels: Build-Toolchain




CQ should promote chroot snapshots to stable after 2 successful runs

Project Member Reported by jwer...@chromium.org, Aug 14

Issue description

reef-paladin has been failing for the last three days (example: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Freef-paladin%2F6522%2F%2B%2Frecipes%2Fsteps%2FBuildPackages%2F0%2Fstdout), with a number of packages failing to build with errors such as

libpcre-8.41-r1: ./.libs/libpcrecpp.so: error: undefined reference to 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()'
libpcre-8.41-r1: clang-7: error: linker command failed with exit code 1 (use -v to see invocation)

smartmontools-6.6-r1: utility.cpp:458: error: undefined reference to 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()'
smartmontools-6.6-r1: clang-7: error: linker command failed with exit code 1 (use -v to see invocation)

protobuf-3.3.0: ./.libs/libprotoc.so: error: undefined reference to 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()'
protobuf-3.3.0: clang-7: error: linker command failed with exit code 1 (use -v to see invocation)

opencv-2.3.0-r12: ../../../OpenCV-2.3.0/modules/stitching/main.cpp:123: error: undefined reference to 'std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string()'
opencv-2.3.0-r12: clang-7: error: linker command failed with exit code 1 (use -v to see invocation)

Assigning to Luis since https://chromium-review.googlesource.com/1168987 looks vaguely toolchain/stdlib related and seems to fit time-wise. If this is something else, can you please help find the right owner for it?
 
Summary: reef-paladin failing, possibly related to LLVM update? (was: reef-paladin failing, possibly related to libc++ migration?)
Cc: vapier@chromium.org dgarr...@chromium.org
Components: Infra>Client>ChromeOS>Build
Labels: OS-Chrome
Note that reef-paladin is experimental (and has been for a long time).

Since reef-release is fine (https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=reef-release&buildBranch=master), it is probably a prebuilts issue.

Can someone clobber the builder, dgarrett@ or vapier@?
I reinstanced it, but don't really expect that to help. Prebuilts are usually clobbered after every failed build.

I think there is something special about being marked experimental. Looking at the setup_board stage logs, it is still using an old chroot.
Oh... that sounds bad.
Labels: -M-70
Owner: ----
Summary: CQ does not reset failing experimental board's chroots (was: reef-paladin failing, possibly related to LLVM update?)
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-paladin/builds/6545 is doing fine after reinstancing.

Changing the bug to reflect that CQ is not resetting chroots after failed builds for experimental builders.
Components: -Infra>Client>ChromeOS>Build Infra>Client>ChromeOS>CI
Owner: jclinton@chromium.org
Wait, how could it possibly be using an old chroot (#4) after reinstancing (#3)?

Labels: -Pri-1 Pri-2
Currently only affecting experimental builds so setting this to P2.
Comment#4 was referring to the state before re-instancing.

For example, these failing builds (failed in build_packages) show the InitSDK/SetupBoard stages reusing the old chroot:
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-paladin/builds/6543
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-paladin/builds/6544

After reinstancing, a new chroot was used, so build_packages passed:
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-paladin/builds/6545
Ah, thanks. I'll work on a fix now.

Summary: CQ does not reset failing board's chroots (was: CQ does not reset failing experimental board's chroots)
Looks like the issue is not limited to experimental builders but affects all CQ builders.

Taking kevin-paladin as an example:

It has been failing since build 5360, but the next runs are still using an old chroot.

First failing build:
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5360

The SetupBoard stage output shows that the next build, 5361, is still using the old chroot:

https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fkevin-paladin%2F5361%2F%2B%2Frecipes%2Fsteps%2FSetupBoard%2F0%2Fstdout

15:55:21: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk 'PARALLEL_EMERGE_STATUS_FILE=/tmp/tmp3zGKw9' 'USE=chrome_internal' 'FEATURES=separatedebug' -- ./setup_board '--board=kevin' '--accept_licenses=@CHROMEOS' --skip_chroot_upgrade '--save_install_plan=/tmp/kevin_install_plan.2866425' in /b/c/cbuild/repository
15:55:21: NOTICE: /b/c/cbuild/repository/chroot.img is using 38 GiB more than needed.  Running fstrim.
INFO    : Selecting profile: /mnt/host/source/src/private-overlays/overlay-kevin-private/profiles/base for /build/kevin
INFO    : Cross toolchain already up to date.  Nothing to do.
WARNING : Board output directory '/build/kevin' already exists.
WARNING : Exiting early.
WARNING : Use --force to clobber the board root and start again.
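The warnings above come from setup_board's reuse guard. As a minimal sketch of the behavior implied by the log, with hypothetical names rather than the actual setup_board code:

import os
import sys

def maybe_exit_early(board_root: str, force: bool) -> None:
    # Reuse an existing board root unless --force was passed; this is
    # why a stale chroot's /build/kevin survives from build to build.
    if os.path.isdir(board_root) and not force:
        print("WARNING : Board output directory '%s' already exists." % board_root)
        print("WARNING : Exiting early.")
        print("WARNING : Use --force to clobber the board root and start again.")
        sys.exit(0)

maybe_exit_early("/build/kevin", force=False)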

Same story for the following builds where an old chroot continues to be used:

https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5361
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5362
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/5363
Labels: -Pri-2 Pri-1
Bump back to P1. 
Labels: -Pri-1 Pri-2
Explained offline: we always reuse known-good chroots on all builds, so seeing that the chroot already exists does not imply that there is a bug. This is achieved with a known-good filesystem snapshot. In the kevin-paladin example, the chroot is from build 5359 and has been since the builder started failing.

LVM2 is used to implement this: when the build passes, LVM2 merges the filesystem delta from the build into the base snapshot, making a new known-good image.
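For illustration, here is a minimal sketch of that lifecycle using standard LVM2 commands driven from Python; the volume-group and LV names are hypothetical, not the actual chromite implementation:

import subprocess

VG = "chroot_vg"      # hypothetical volume group backing chroot.img
BASE = "chroot_base"  # known-good logical volume

def start_build() -> None:
    # The build runs against a writable snapshot of the known-good base.
    subprocess.run(
        ["lvcreate", "--snapshot", "--size", "20G",
         "--name", "chroot_build", "%s/%s" % (VG, BASE)],
        check=True)

def finish_build(passed: bool) -> None:
    if passed:
        # Merge the build's delta into the base: the origin takes on
        # the snapshot's contents, yielding a new known-good image.
        subprocess.run(["lvconvert", "--merge", "%s/chroot_build" % VG],
                       check=True)
    else:
        # Discard the delta; the base stays at the last passing state.
        subprocess.run(["lvremove", "-f", "%s/chroot_build" % VG],
                       check=True)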

Back to the original bug report: are we seeing actual consequences on the experimental builders that make you think something about the chroot is bad? I ask because reef-paladin has passed recently without resetting the chroot: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=reef-paladin&buildBranch=master

Status: WontFix (was: Available)
I've stared at this for hours and am pretty sure that there's no bug: there's no logic specific to experimental status, as that level of detail is not made available to the code that manages the chroots.

Please reopen if you feel that I've missed something.
Question: at which point do we snapshot the chroot? I.e., what do we reset to after a failure?
Explained in this design doc (which has been fully implemented): https://docs.google.com/document/d/1bPaB8ZzaCbghQYR3lv4eYTIQn6f-c0REEhCKa8GU0DA/edit

Cc: -manojgupta@chromium.org bjanakiraman@chromium.org anojgupta@chromium.org
I don't see how this is a WontFix. Yes, the root cause has changed to snapshots, but the issues caused by a bad snapshot have not been addressed.

If CQ is taking snapshots, the snapshots should have some sort of expiry, or other logic that ignores them after some number of failures.

The P0 bug in https://bugs.chromium.org/p/chromium/issues/detail?id=876634 was clearly a case of snapshots being incorrect or out of date.
Cc: -anojgupta@chromium.org bmgordon@chromium.org manojgupta@chromium.org
Labels: -Type-Bug -Pri-2 Pri-1 Type-Feature
Owner: ----
Status: Available (was: WontFix)
Summary: CQ should promote chroot snapshots to stable after 2 successful runs (was: CQ does not reset failing board's chroots)
> I don't see how this is a WontFix. Yes, the root cause has changed to snapshots, but the issues caused by a bad snapshot have not been addressed.
> 
> If CQ is taking snapshots, the snapshots should have some sort of expiry, or other logic that ignores them after some number of failures.

Implementing that kind of logic would be really hard and error-prone. However, the opposite would be attainable: only promoting chroot snapshots to stable after two successful runs (sketched below). That would prevent the N+1 style breakage. Reopening this bug and retitling it to track that.

> The P0 bug in https://bugs.chromium.org/p/chromium/issues/detail?id=876634 was clearly a case of snapshots being incorrect or out of date.

Yes, but ideally we'd stop bad CLs from breaking the chroots in the first place. However, I believe the proposal above is feasible and would provide that safety net without too much performance impact.
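For concreteness, a minimal sketch of the promotion rule proposed above; the names and the persistence of the counter are assumptions, not existing chromite code:

REQUIRED_PASSES = 2

def promote_to_stable() -> None:
    """Placeholder for the merge-into-base step (e.g. lvconvert --merge)."""

def update_pass_streak(streak: int, build_passed: bool) -> int:
    """Return the builder's new consecutive-pass count."""
    if not build_passed:
        return 0  # any failure resets the streak; the base stays put
    streak += 1
    if streak >= REQUIRED_PASSES:
        # Only now does the candidate snapshot become the known-good
        # base, so a single bad-but-green run cannot poison it.
        promote_to_stable()
        return 0
    return streak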

If we can improve chroot creation and setup_board performance, we could reset the chroot on every single build, giving us more reproducible results.

One option is to publish pre-created chroot.img files that builders download and use (sketched below). The images could include the results of additional setup steps, such as running setup_board for every board in advance from a clean tree.

Builders would then have to update the chroot and re-run setup_board to pick up new changes, but wouldn't have to start from scratch.

We would keep a few builders (probably release, full, and the new chroot.img publisher) that always start from scratch, to make certain we still can.
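As a rough illustration of that publish/consume split, assuming a hypothetical GS bucket and plain gsutil copies (none of this exists today):

import subprocess

BUCKET = "gs://hypothetical-chromeos-chroot-images"  # assumed location

def publish(version: str) -> None:
    # Publisher builder: build the chroot from scratch, run setup_board
    # for every board, then upload the resulting image.
    subprocess.run(
        ["gsutil", "cp", "chroot.img",
         "%s/chroot-%s.img" % (BUCKET, version)], check=True)

def bootstrap(version: str) -> None:
    # Regular builders: download the pre-created image, then only apply
    # incremental updates and re-run setup_board for new changes.
    subprocess.run(
        ["gsutil", "cp", "%s/chroot-%s.img" % (BUCKET, version),
         "chroot.img"], check=True)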

Labels: -Pri-1 Pri-2
