Quota flag causing "corrupted stateful partition" on older builds
Reported by jrbarnette@chromium.org, Jun 19 2018
Issue description
Overnight, there were a number of failures provisioning 'candy'
for BVT testing. I checked three different boards: candy, gandof,
and banon; only candy showed the failure.
The common feature of the failure was this message:
FAIL ---- verify.python timestamp=1529417339 localtime=Jun 19 07:08:59 Python is missing; may be caused by powerwash
That message is part of post-provisioning sanity checks.
Here are logs of a sample failure:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/
Studying debug/autoserv.DEBUG, you find that the final reboot after
installing the new build gets logged like this:
06/19 07:00:52.131 DEBUG| ssh_host:0301| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!' from 'log_op|run_op|reboot|run_background|run|run_very_slowly'
[ ... ]
06/19 07:01:10.997 DEBUG| abstract_ssh:0748| Host chromeos4-row8-rack2-host4 is now unreachable over ssh, is down
06/19 07:01:11.005 DEBUG| ssh_host:0301| Running (ssh) 'true' from 'wait_for_restart|wait_up|is_up|ssh_ping|run|run_very_slowly'
06/19 07:08:42.606 ERROR| utils:0283| [stderr] mux_client_request_session: read from master failed: Broken pipe
06/19 07:08:43.181 DEBUG| abstract_ssh:0670| Host chromeos4-row8-rack2-host4 is now up
This shows the reboot took 7.5 (!) minutes.
Looking at the EC eventlog, you see these entries:
215 | 2018-06-19 07:01:05 | Kernel Event | Clean Shutdown
216 | 2018-06-19 07:01:07 | System boot | 36541
217 | 2018-06-19 07:01:07 | System Reset
218 | 2018-06-19 07:06:27 | Kernel Event | Clean Shutdown
219 | 2018-06-19 07:06:27 | System boot | 36542
220 | 2018-06-19 07:06:27 | System Reset
So, there were _two_ reboots, and something in that process wiped
the stateful partition along the way.
,
Jun 19 2018
I did a survey of the "from" and "to" builds for each failure.
Here's what I found:
  #  From            To
  7  R69-10796.0.0   R67-10575.55.0
  1  R69-10794.0.0   R65-10323.91.0
  1  R69-10797.0.0   R66-10452.103.0
  1  R69-10798.0.0   R66-10452.103.0
The target builds range from R65 to R67; the source build is always
R69. So, at first blush, I'd say the problem originates in a recent
canary change, no later than candy-release/R69-10794.0.0.
Passing to a randomly selected sheriff for further triage/debug.
,
Jun 19 2018
The complete inventory of the failures to date, should it be needed:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1090810-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host22/1087202-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host20/1087204-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host18/1087201-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host16/1087199-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host12/1087205-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host10/1087200-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host15/1087198-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host12/1087203-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host8/1087197-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1085848-provision/
,
Jun 19 2018
<sigh> Two of the URLs above correspond to a different failure.
This is the correct inventory of failures:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1090810-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host22/1087202-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host20/1087204-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host18/1087201-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host16/1087199-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host12/1087205-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host15/1087198-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host12/1087203-provision/
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1085848-provision/
,
Jun 19 2018
As Richard said, this is a product bug.
,
Jun 19 2018
"Oh sh*t" is what comes to mind. Having all the logs wiped out by a switch to/from dev mode isn't helping at all. I'll start looking too.
,
Jun 19 2018
> "Oh sh*t" is what comes to mind.
Yeah. Glad I'm not you. :-)
> [ ... ] the logs wiped out by switch to/from dev mode [ ... ]
It's not a switch to/from dev mode. Something may be triggering
powerwash, though. But either way, yeah, the failure wipes out its
own history.
,
Jun 19 2018
> [ ... ] Something may be triggering powerwash, [ ... ]
... and, ITOT powerwash leaves behind a log in /var/log/clobber-state.log.
Which (praise be the deity of your choice) is preserved after the
failure.
The leading edge is the first line, which explains the reason for
powerwash:
2018/06/19 14:01:11 UTC (repair): /dev/mmcblk0p1 Self-repair corrupted stateful partition
,
Jun 19 2018
Richard, which $BOARDs are using quick-provision today? Is there a time line of when each was rolled out?
,
Jun 19 2018
> Richard, which $BOARDs are using quick-provision today?
> Is there a time line of when each was rolled out?
It's not done on a board-by-board basis. Basically, all canary, CQ,
and PFQ builds will use quick-provision. I think there's a small
handful such as jetstream that might still be using the old flow.

However: For this failure, because of bug 854061, some of the
failures were with quick-provision on, and some were seen with it
off. That's consistent with the basic theory of the failure: The
differences between quick-provision and the AU-based flow are small,
especially w.r.t. the stateful partition. In both cases, all that
happens is "unpack a tar file into stateful".

The evidence from the logs is that the stateful file system winds up
corrupted during shutdown, which triggers powerwash and failure. So,
we need to figure out what could be causing shutdown not to work
right.
,
Jun 19 2018
> Is there a time line of when each was rolled out?
Regarding specifically "timeline": quick-provision has been in use
for months. This problem is brand new, not even 24 hours old.
,
Jun 19 2018
Ah, OK - I had no clue when quick-provisioning was deployed - thanks
for clarifying.

I don't entirely agree with "the differences ... are small". The
workload the storage device sees will be quite different - that's the
main reason quick provision is faster. I believe the "tar" file in
this case is a full disk image which then gets dd'ed straight to the
block device, not the file system. I do agree conceptually they are
the same, and in neither case should the stateful partition get
corrupted.

I've asked a few other people to provide guidance here since I don't
think I have enough info to root cause this problem, especially since
this is mostly happening on candy machines (vs other HW that is
running the same build versions). And I'll take bug ownership since
this is close to my "area of expertise"; that should allow my
co-sheriff to continue focusing on annotation of other paladin
failures.
,
Jun 19 2018
> I believe the "tar" file in this case is a full disk image
> which then get's DD straight to the block device, not the file system.
For stateful, no. The tar file is a tar file, and whether it's
quick-provision or the regular stateful update, it gets extracted
by a command more or less like this:
curl $URL | tar xvf -
> [ ... ] this is mostly happening on candy machines [ ... ]
Probably... I did some spot checks trying to see if this was happening
on banjo (another rambi board). I couldn't find evidence of it there.
On my list was to do a more comprehensive search for provision failures
on all the rambi boards. The command in question looks something like
this:
dut-status -b $BOARD -u '2018-06-19 16:00:00' -f | grep provision | grep -v OK
That'll show links to all failed provision tasks for $BOARD prior
to 16:00 today (local time). Then you have to follow all the links,
looking for instances of this failure in "status.log".
,
Jun 19 2018
Stateful partition is getting overwritten as part of the
quick-provision. I'm not sure that is "normal". Have to look at
other logs.

https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/
mnt/stateful_partition/unencrypted/preserve/log/quick-provision.log
...
2018-06-19 06:50:00-07:00 INFO: Stateful reset
2018-06-19 06:50:00-07:00 INFO: Updated status: DUT: Stateful reset
2018-06-19 06:50:00-07:00 INFO: Stateful update
2018-06-19 06:50:00-07:00 INFO: Updated status: DUT: Stateful update
--2018-06-19 06:50:00-- http://100.115.219.135:8082/static/candy-release/R69-10798.0.0/stateful.tgz
...
2018-06-19 06:50:23 (17.4 MB/s) - written to stdout [423748786/423748786]
2018-06-19 06:50:24-07:00 INFO: Stateful clean
2018-06-19 06:50:24-07:00 INFO: Updated status: DUT: Stateful clean
KEYVAL: UPDATE_STATEFUL_start=1529416200
KEYVAL: UPDATE_STATEFUL_end=1529416224
KEYVAL: UPDATE_STATEFUL_elapsed=24
,
Jun 19 2018
> Stateful partition is getting over written as part of the
> quick-provision. I'm not sure that is "normal". Have to
> look at other logs.
It's normal, and it's done by simple extraction of a tar file.
Here's the relevant source from the script:
# Performs a stateful update using a specified stateful.tgz URL.
# Function will exit script on failure.
stateful_update() {
local url="$1"
# Stateful reset.
info "Stateful reset"
post_status "DUT: Stateful reset"
rm -rf "${STATEFUL_DIR}/${UPDATE_STATE_FILE}" \
"${STATEFUL_DIR}/var_new" \
"${STATEFUL_DIR}/dev_image_new" || die "Unable to reset stateful."
# Stateful update.
info "Stateful update"
post_status "DUT: Stateful update"
get_url_to_stdout "${url}" |
tar --ignore-command-error --overwrite --directory="${STATEFUL_DIR}" -xzf -
local pipestatus=("${PIPESTATUS[@]}")
if [[ "${pipestatus[0]}" -ne "0" ]]; then
die "Retrieving ${url} failed. (statuses ${pipestatus[*]})"
elif [[ "${pipestatus[1]}" -ne "0" ]]; then
die "Untarring to ${STATEFUL_DIR} failed. (statuses ${pipestatus[*]})"
fi
# Stateful clean.
info "Stateful clean"
post_status "DUT: Stateful clean"
printf "clobber" > "${STATEFUL_DIR}/${UPDATE_STATE_FILE}" || \
die "Unable to clean stateful."
}
,
Jun 19 2018
Yup: stateful update is at file system level and not block level (like rootfs is). I have the code in front of me: ~/trunk/src/platform/dev/quick-provision/quick-provision
But it doesn't "sync" or do anything else to make sure data has landed on storage. I guess "reboot" is supposed to do that...but if reboot ends up with "hung task" since IO might take more than 120 seconds to clear up, then it might be likely that the stateful is corrupted. :( [This is just speculation - I have no evidence this is the case.]
Something like this would give me warm fuzzies even if it's not the root cause of this bug:
diff --git a/quick-provision/quick-provision b/quick-provision/quick-provision
index 4f69ac3..5bc708f 100644
--- a/quick-provision/quick-provision
+++ b/quick-provision/quick-provision
@@ -167,6 +167,9 @@ update_partition() {
elif [[ "${pipestatus[2]}" -ne "0" ]]; then
die "Writing to ${part} failed. (statuses ${pipestatus[*]})"
fi
+
+ # force stall until all dirty buffers are at least scheduled to be written.
+ sync
}
# Performs a stateful update using a specified stateful.tgz URL.
And then perhaps something similar for the stateful partition.
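As a sketch only: the stateful-side analog might look like this (the function name and demo paths here are hypothetical; the real script streams the tarball from a URL rather than reading a local file):

```shell
# Hypothetical sketch (not the real quick-provision code): extract a
# stateful tarball, then sync so dirty buffers are flushed before the
# caller reboots. Without the sync, "reboot -f" can race writeback
# and leave ext4 needing recovery.
extract_stateful_synced() {
  local tarball="$1" dest="$2"
  tar --overwrite --directory="${dest}" -xzf "${tarball}" || return 1
  sync
}

# Demo against a throwaway directory rather than a real DUT:
workdir="$(mktemp -d)"
mkdir -p "${workdir}/src/dev_image"
echo "hello" > "${workdir}/src/dev_image/file"
tar -czf "${workdir}/stateful.tgz" -C "${workdir}/src" dev_image
mkdir -p "${workdir}/stateful"
extract_stateful_synced "${workdir}/stateful.tgz" "${workdir}/stateful"
cat "${workdir}/stateful/dev_image/file"
```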
,
Jun 20 2018
Looking at the dd rates, it looks like my theory about "hung tasks" is unlikely. dd emits statistics about writes that are consistent with eMMC write speeds - i.e., data is going directly to "media" (eMMC flash).
,
Jun 21 2018
Just talked with Gwendal and he has concerns with this change:

commit 86fed24396246733bae4c963ed5208bff777fb61
Author: Risan <risan@google.com>
Date:   Wed May 16 16:48:56 2018 +0900

    init: Conditionally enables Quota for ext4

    There are 2 parts:
    1. For fresh installation, chromeos-install will mkfs the ext4
       filesystem with quota option on.
    2. Otherwise, chromeos_startup conditionally checks whether the
       quota option is turned on. If it hasn't, the scripts turn it on.

    BUG=b:62995196
    TEST=- Turn off quota and checked if chromeos_startup enables it.
    TEST=- Checked that chromeos_startup doesn't trigger tune2fs when
          quota is on (by adding an else in the chromeos_startup change -
          and make sure that the else is triggered).
    TEST=- Turn off kernel quota config, and /mnt/stateful_partition is
          still correctly mounted, without quota.

    Change-Id: I7e62c7dd79ec65ec380b8049e2d77fd0778844da
    Reviewed-on: https://chromium-review.googlesource.com/1064571
    Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
    Tested-by: Risan <risan@chromium.org>
    Reviewed-by: Ryo Hashimoto <hashimoto@chromium.org>
    Reviewed-by: Mike Frysinger <vapier@chromium.org>

The concern is: before running tune2fs, if the file system has minor
corruption that fsck can fix, then do that before running tune2fs:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571/5/init/chromeos_startup#233
Gwendal is taking another look.
,
Jun 21 2018
We are seeing several other failures that look like they may be
related. Example chrome logs:
[9313:9313:0621/034950.979652:ERROR:device_event_log_impl.cc(159)] [03:49:50.979] Login: cryptohome_authenticator.cc:140 MountEx failed. Error: 1
[9313:9313:0621/034950.980047:ERROR:device_event_log_impl.cc(159)] [03:49:50.980] Login: cryptohome_authenticator.cc:951 Cryptohome failure: state(AuthState)=2, code(cryptohome::MountError)=1
[9313:9313:0621/034950.980081:VERBOSE1:cryptohome_authenticator.cc(791)] Resolved state to: 2
[9313:9313:0621/034950.980392:ERROR:device_event_log_impl.cc(159)] [03:49:50.980] Login: cryptohome_authenticator.cc:725 Login failed: Could not mount cryptohome.
[9313:9313:0621/034950.980440:ERROR:login_performer.cc(63)] Login failure, reason=1, error.state=0
[9313:9313:0621/034950.980503:VERBOSE1:existing_user_controller.cc(1482)] Could not mount cryptohome.
,
Jun 21 2018
Other issues with cryptohome mount problems will be set as blocked-by this one.
,
Jun 21 2018
Here's one that may have more useful logs in it. https://stainless.corp.google.com/browse/chromeos-autotest-results/210463773-chromeos-test/
,
Jun 21 2018
FTR, QUOTA support was recently enabled in the kernel:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1016226
"CHROMIUM: config: Kernel config to enable quota"
Landed in 10756.0.0

While this was recently introduced and still suspect, the
chromeos_startup change will reset the feature if the kernel doesn't
explicitly claim support for it. I'll test this anyway, since there
is only a very small chance reverting corrupts the filesystem.
,
Jun 24 2018
The kernel and user space without Quota support will still mount the
stateful partition even if quota is enabled. The first boot was with
a kernel + user space that enabled quota. The second boot was with a
kernel w/o Quota, and user space didn't change the quota settings on
the filesystem. Unless I screwed something up, the theory in comment
#26 is dead. :(
,
Jun 25 2018
My experiment on Friday was wrong: my build which reverted the /sbin/chromeos_startup change did in fact NOT revert the change. So let me try that again.
,
Jun 26 2018
TL;DR: confirmed enabling quota will cause older Chrome OS builds installed later to powerwash stateful. :( Need to determine the "right way" to handle older OS images within the test lab. *sigh*
My 10808.0.2018_06_20_1640 build has Quota support enabled:
./R69-10802.0.2018_06_20_1640-a1/chromiumos_test_image.bin
CHROMEOS_RELEASE_DESCRIPTION=10802.0.2018_06_20_1640 (Test Build - grundler) developer-build atlas
localhost ~ # fgrep quota /sbin/chromeos_startup
# Enable/disable quota feature.
if [ -d /proc/sys/fs/quota ]; then
# Quota is enabled in the kernel, make sure that quota is enabled in the
grep -qe "^Filesystem features:.* quota.*"; then
tune2fs -Oquota -Qusrquota,grpquota "${STATE_DEV}" || :
# Quota is not enabled in the kernel, make sure that quota is disabled in
grep -qe "^Filesystem features:.* quota.*"; then
tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :
localhost ~ # uname -a
Linux localhost 4.4.138 #7 SMP PREEMPT Wed Jun 20 15:47:07 PDT 2018 x86_64 Intel(R) Core(TM) i7-7Y75 CPU @ 1.30GHz GenuineIntel GNU/Linux
localhost ~ # dumpe2fs -h /dev/mmcblk0p1 | fgrep -i quota
dumpe2fs 1.44.1 (24-Mar-2018)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg encrypt sparse_super large_file huge_file dir_nlink extra_isize quota metadata_csum
User quota inode: 3
Group quota inode: 4
Build WITHOUT quota support in either kernel or chromeos_startup
./R69-10816.0.2018_06_25_1443-a1/chromiumos_test_image.bin
clobber.log after reboot:
2018/06/26 00:13:36 UTC (repair): /dev/mmcblk0p1 Self-repair corrupted stateful partition
dumpe2fs 1.44.1 (24-Mar-2018)
All "test account" info not present. Python not present. :(
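For what it's worth, the decision the new chromeos_startup makes can be reconstructed from the fgrep output above roughly as follows. This is a sketch, not the verbatim source, with tune2fs stubbed out to an echo so the decision table is easy to inspect:

```shell
# Reconstruction of the chromeos_startup quota logic (sketch only).
# Instead of running tune2fs against a real device, echo the command
# it would run.
quota_action() {
  local kernel_has_quota="$1"   # 1 if /proc/sys/fs/quota exists
  local features="$2"           # "Filesystem features:" line from dumpe2fs
  if [ "${kernel_has_quota}" = "1" ]; then
    # Kernel supports quota: enable it on the filesystem if missing.
    if ! echo "${features}" | grep -qe "^Filesystem features:.* quota.*"; then
      echo "tune2fs -Oquota -Qusrquota,grpquota"
    fi
  else
    # Kernel lacks quota: disable it so the old image's mount succeeds.
    if echo "${features}" | grep -qe "^Filesystem features:.* quota.*"; then
      echo "tune2fs -O^quota -Q^usrquota,^grpquota"
    fi
  fi
}

quota_action 1 "Filesystem features: has_journal extent"
quota_action 0 "Filesystem features: has_journal extent quota metadata_csum"
```

The point of the second build in this experiment is that a pre-10756 image has neither branch, so it just fails to mount a quota-enabled stateful and powerwashes.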
,
Jun 26 2018
Some context.

Current provision flow:
  Shard calls devserver to trigger provision on a DUT.
  Devserver runs every command in quick_provision on the DUT.
Richard will enable a new provision flow, which is:
  Shard runs every command in quick_provision on the DUT.

So, 2 options here:
1) Add this to the quick_provision script (http://shortn/_rKwb4fp8Zv)
   Pros: no need to make the same change twice.
   Cons: for fallback of a failed quick provision, we lose this coverage.
2) Add this to machine_install_by_devserver (http://shortn/_yvTrbn0493)
   first, then move it to machine_install (http://shortn/_ufPIOcpjkl)
   once Richard enables the new code flow.
   Pros: Full coverage everywhere.
   Cons: Need to make this change twice.

I actually prefer 2), but let the owner decide which is better :).
,
Jun 26 2018
I need more help from jrbarnette and/or xixuan.

We need to run this command when /mnt/stateful is NOT mounted:
    tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :
- when the target BUILD is older than 10756.0.0
- BEFORE booting the next build
- AFTER stateful is UNMOUNTED (or before it's mounted on next boot)
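A sketch of what that check might look like on the shutdown path. The file name `.target_version` and its contents (a version string like "10575.55.0") are assumptions here, not the final implementation:

```shell
# Hypothetical shutdown-side guard. Assumes a .target_version file
# written by the provisioning code, holding e.g. "10575.55.0".
QUOTA_MIN_BUILD=10756

# Return success if the pending provision targets a pre-quota build.
target_is_pre_quota() {
  local target_file="$1"
  [ -f "${target_file}" ] || return 1
  local major
  major="$(cut -d. -f1 "${target_file}")"
  # Numeric compare; fails (safely doing nothing) on garbage input.
  [ "${major}" -lt "${QUOTA_MIN_BUILD}" ] 2>/dev/null
}

# In chromeos_shutdown, after stateful is unmounted, this would gate:
#   tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}"
echo "10575.55.0" > /tmp/.target_version.demo
target_is_pre_quota /tmp/.target_version.demo && echo "would disable quota"
```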
,
Jun 26 2018
For the record, here are the two changes that started this problem:

"init: Conditionally enables Quota for ext4"
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571
(this was cherry-picked into the M68 branch as well)

"CHROMIUM: config: Kernel config to enable quota"
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1016226
landed in 10756.0.0 on Jun 5, 4:03pm PST(?)
,
Jun 26 2018
> We need to run this command when /mnt/stateful is NOT mounted:
> tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :
If the objective is to run this with the file system not mounted,
I think it would have to go into the shutdown code paths. Unmounting
stateful on a running system is tricky, at best. In particular, if
you want to run 'ssh', stateful has to be mounted, because the machine
identity data sshd must provide to clients is in stateful.
,
Jun 26 2018
My understanding is this command only needs to be run on the DUT before stateful is mounted by the next OLD build during provisioning, to make sure the old build doesn't mount stateful with quota. It doesn't matter what the current condition of the DUT is, i.e. what the current file system is or whether it is mounted or not.
,
Jun 26 2018
#37, #36: You cannot run tune2fs by yourself outside of the installer/init code; you will change the flow the user experiences. It is unfortunate that rolling back from R68+ triggers a clobber due to the quota change, but we can install the dev tool in the stateful image that matches the new rootfs [via RestoreStateful in auto_updater.py]. It will take more time than quick autoupdate, but we won't change the flow.
,
Jun 26 2018
> #37, #36: You can not run tune2fs by yourself outside of the
> installer/init code, you will change the flow the user experiences.
That's not quite true. The code to do this can be conditioned on
"only in test images, and only during updates." We already have such
conditions in places like chromeos_startup; we can add them for
chromeos_shutdown, too.

But if I understand comment #33 properly, we can only run tune2fs
while the target file system isn't mounted. In practical terms, that
means we _must_ perform the operation during chromeos_startup. That's
the only place in the system where it's practical to unmount stateful
for the necessary purpose. We can't do it in chromeos_startup,
because _that_ code will belong to some old image that we cannot
change.

> [ ... ] but we can install the dev tool [ ... ]
Which dev tool do you mean, exactly?

> [ ... ] [via RestoreStateful in auto_updater.py] [ ... ]
I'm actively working to delete that part of the code. Adding the
ability to restore stateful after update will require a non-trivial
amount of effort. I'm skeptical of requiring this in any event.
,
Jun 26 2018
> But if I understand comment #33 properly, we can only run tune2fs while
> the target file system isn't mounted. [ ... ]
Cutting through the uncertainty, the answer is easy to determine:
localhost ~ # tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :
tune2fs 1.44.1 (24-Mar-2018)
The quota feature may only be changed when the filesystem is unmounted.
So, the code changes must go into chromeos_shutdown.
grundler@ and I went through the source; the script already knows about
stateful_update, and the difference between test and non-test, so the
problem at this point is just a SMOP.
,
Jun 26 2018
Modifying the superblock at every shutdown does sound scary to me, but umount does it all the time, so that's possible. The drawback is we are increasing the boot time, because we need to enable quota at every reboot from now on.
,
Jun 26 2018
> Modifying the superblock at every shutdown [ ... ]
That's not the proposal. The code will only be invoked when we're
downgrading to a version that requires the operation.
,
Jun 26 2018
Sorry, I misunderstood. Grant pointed out chromeos_shutdown can guess if we are downgrading, so your proposal looks good.
,
Jun 27 2018
I've uploaded a change to chromeos_shutdown which implements 1/2 of
the proposal:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1116286

The "other part" likely needs to be in the quick-provision script,
which can create the .target_version file containing the build ID
that is going to be used on reboot. I'm open to other mechanisms to
communicate this. Feel free to modify/comment as you see fit.

I'd like to get this sorted out today, please. Either we go with
something like this ASAP, or wait until Richard's planned changes
land in a few weeks.
,
Jun 27 2018
> [ ... ] likely needs to be in quick-provision script [ ... ]
Probably, putting it in quick-provision won't help... The issue is
that quick-provision doesn't know the version number; it knows only a
URL. I'd be suspicious of any solution that required quick-provision
to parse the URL to extract the version number.

If we conclude that we don't want quick-provision to parse the URL
for a version string, then the remaining options are likely to be
summarized as "it's easier not to change quick-provision."
,
Jun 27 2018
Ugh, sorry: the effect of this on consumers was properly understood
(https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571#message-d34c77e508cf6e75a4576dc7ac4a06a24850edac)
but we missed the test-device case :(

Just to make sure I understand this correctly: your change only
affects test devices, and not the release product - is that correct?
,
Jun 27 2018
Risan, correct. Consumer devices do the equivalent of powerwash when using recovery media to go back to an older release. They are not affected by this bug. Chrome OS Enterprise is working on launching rollback features starting with R69 and this issue is primarily for releases older than R68 - so also not affected.
,
Jun 27 2018
Actually, rollback starts from R67, but since partial powerwash has not been implemented, and this causes a full powerwash anyway, this should be fine.
,
Jun 27 2018
> The issue is that quick-provision doesn't know the version number,
> it knows only a URL. I'd be suspicious of any solution that required
> quick-provision to parse the URL to extract the version number.
I've re-read the code, and as often happens with such things, my
memory was ... faulty. The quick-provision script _does_ get a build
name, plus a base URL that it combines with the build name to produce
the actual URL. However, even then, the build name is a string that
includes a version, and not just the version itself. So, the string
would still have to be parsed.

Parsing the build name may still be a dicey proposition. I think we
also use quick-provisioning for parts of Paygen testing. The strings
to identify builds used there will be quite different from the build
names used in ordinary provision tasks. So, it still needs
study/reflection.
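To illustrate why the parsing is dicey: a naive parse works on ordinary build names, but nothing guarantees it generalizes to Paygen-style identifiers. A hypothetical sketch:

```shell
# Naive extraction of the numeric build from names like
# "candy-release/R69-10798.0.0". Hypothetical; Paygen-style build
# identifiers may not follow this shape at all, in which case this
# prints nothing.
build_number() {
  local name="$1"
  # Strip through "R<milestone>-" and keep the leading numeric field.
  echo "${name}" | sed -n 's|.*/R[0-9]*-\([0-9]*\)\..*|\1|p'
}

build_number "candy-release/R69-10798.0.0"   # prints 10798
build_number "candy-release/R67-10575.55.0"  # prints 10575
```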
,
Jun 27 2018
I've included this thought as a comment to the proposed CL but I'll repeat here:
BTW, your use of the word "downgrade" reminds me that we should
decide where to test whether this downgrade crosses that build ID
threshold (high to low) or not (low to low, or high to high). The
provisioning code can test this to decide if it should create
.target_version, or this code can look up what is in /etc/lsb-release
and test that.
In fact, looking at this, it seems like we should just copy
/etc/lsb-release to ".target-lsb-release" and "." execute that file.
Then just compare the current ${CHROMEOS_RELEASE_BUILD_NUMBER} with
the one in .target-lsb-release. WDYT?
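That lsb-release idea could look something like the sketch below. `.target-lsb-release` is the proposed file, not an existing one, and the demo fabricates both inputs:

```shell
# Hypothetical sketch of the .target-lsb-release proposal: source
# both the running image's lsb-release and a captured copy of the
# target image's, then compare build numbers.
is_downgrade() {
  local current_lsb="$1" target_lsb="$2"
  local cur tgt
  # Source each file in a subshell so variables don't leak or collide.
  cur="$(. "${current_lsb}"; echo "${CHROMEOS_RELEASE_BUILD_NUMBER}")"
  tgt="$(. "${target_lsb}"; echo "${CHROMEOS_RELEASE_BUILD_NUMBER}")"
  [ "${tgt}" -lt "${cur}" ]
}

# Demo with fabricated lsb-release fragments:
printf 'CHROMEOS_RELEASE_BUILD_NUMBER=10796\n' > /tmp/lsb-current
printf 'CHROMEOS_RELEASE_BUILD_NUMBER=10575\n' > /tmp/lsb-target
is_downgrade /tmp/lsb-current /tmp/lsb-target && echo "downgrade"
```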
,
Jun 28 2018
> [ ... ] test if this downgrade crosses that build ID number (high to low)
> or not (low to low or high to high) [ ... ]
As currently constructed, the code would only run on high build numbers.
For that reason, if the ".target_version" indicates that the target
is below the threshold, we can be sure that this is a downgrade that
requires adjusting stateful.
> In fact, looking at this, seems like we should just copy
> /etc/lsb-release to ".target-lsb-release" and "." execute that
> file. Then just compare current ${CHROMEOS_RELEASE_BUILD_NUMBER}
> with the one in .target-lsb-release. WDYT?
That would be nice. However, we don't have ready access to
/etc/lsb-release as it would be installed in the new build.
,
Jun 28 2018
> Parsing the build name may still be a dicey proposition. I think we
> also use quick-provisioning for parts of Paygen testing. The strings
> to identify builds used there will be quite different from the build
> names used in ordinary provision tasks. So, it still needs
> study/reflection.
Looking it over, I think the "parsing" question is moot. The bigger
issue is that there are, and will be for the indefinite future, code
paths that don't use quick-provisioning at all. That's especially
true for Paygen testing, which is/will be the main source of
installing old, pre-quota kernels for the foreseeable future.

So, the code to create the ".target_version" file must live on the
ssh client side code (i.e. Autotest or devserver).
,
Jul 4
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform2/+/238cf8d8e41849b97515c8c6ae5ee11b381e6dd7

commit 238cf8d8e41849b97515c8c6ae5ee11b381e6dd7
Author: Grant Grundler <grundler@chromium.org>
Date:   Wed Jul 04 01:18:23 2018

    init: turn off ext4 quota for builds before 10756

    Build 10756.0.0 enabled quota on /mnt/stateful_partition in order
    to support ARC++. Unfortunately, in the test lab, we try to
    preserve stateful on a host when installing the next test image.
    If that next image is older than 10756.0.0:
    1) the mount command will fail when the next image boots
    2) the failure is reported as "corrupted stateful"
    3) machine will then powerwash to recover
    4) the provision step fails because python is not present on stateful
    5) the machine will get "repaired" successfully

    This has already happened to a pool of machines that was already
    low on devices. The "temporary" loss of additional machines
    prevented the corresponding paladin from running test suites.

    Turning off quota has to be done before we boot the buid image
    which knows nothing about quota. And we only need/want to do this
    for test lab machines. And we can only run tune2fs AFTER the
    stateful partition is unmounted.

    BUG=chromium:854278
    TEST=manual

    Change-Id: Id3e86c482857cb67710441b697855e35b8404173
    Reviewed-on: https://chromium-review.googlesource.com/1116286
    Commit-Ready: Grant Grundler <grundler@chromium.org>
    Tested-by: Grant Grundler <grundler@chromium.org>
    Reviewed-by: Richard Barnette <jrbarnette@google.com>
    Reviewed-by: Grant Grundler <grundler@chromium.org>
    Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/238cf8d8e41849b97515c8c6ae5ee11b381e6dd7/init/chromeos_shutdown
,
Jul 6
I think everything needed to fix this issue has landed. Please re-open if powerwash still appears to be happening ("python not found") after provisioning an older release.
,
Jul 9
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff

commit 3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff
Author: Richard Barnette <jrbarnette@chromium.org>
Date:   Mon Jul 09 19:42:38 2018

    [autotest] Set a "target version" for stateful updates

    In some cases in the lab, provisioning may downgrade a DUT.
    Recent OS changes have introduced a problem where downgrades from
    builds after R69-10756.0.0 to builds before may fail and cause a
    powerwash. This changes the provisioning flow to create a file in
    stateful that indicates the target version of an update. Shutdown
    code in the OS uses this file to recognize when a downgrade is
    occurring, and prevent the unwanted powerwash.

    BUG=chromium:854278
    TEST=Run sanity suite on a local Autotest instance; examine logs

    Change-Id: I757118ac94a6f4e590a961b654cd24fa220d633b
    Reviewed-on: https://chromium-review.googlesource.com/1119197
    Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
    Tested-by: Richard Barnette <jrbarnette@chromium.org>
    Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff/server/cros/autoupdater.py