
Issue 854278

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 6
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug

Blocking:
issue 855082




Quota flag causing "corrupted stateful partition" on older builds

Reported by jrbarnette@chromium.org, Jun 19 2018

Issue description

Overnight, there were a number of failures provisioning 'candy'
for BVT testing.  I checked three different boards: candy, gandof,
and banon; only candy showed the failure.

The common feature of the failure was this message:
	FAIL	----	verify.python	timestamp=1529417339	localtime=Jun 19 07:08:59	Python is missing; may be caused by powerwash

That message is part of post-provisioning sanity checks.

Here are logs of a sample failure:
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/

Studying debug/autoserv.DEBUG, you find that the final reboot after
installing the new build gets logged like this:
06/19 07:00:52.131 DEBUG|          ssh_host:0301| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!' from 'log_op|run_op|reboot|run_background|run|run_very_slowly'
[ ... ]
06/19 07:01:10.997 DEBUG|      abstract_ssh:0748| Host chromeos4-row8-rack2-host4 is now unreachable over ssh, is down
06/19 07:01:11.005 DEBUG|          ssh_host:0301| Running (ssh) 'true' from 'wait_for_restart|wait_up|is_up|ssh_ping|run|run_very_slowly'
06/19 07:08:42.606 ERROR|             utils:0283| [stderr] mux_client_request_session: read from master failed: Broken pipe
06/19 07:08:43.181 DEBUG|      abstract_ssh:0670| Host chromeos4-row8-rack2-host4 is now up

This shows the reboot took 7.5 (!) minutes.

Looking at the EC eventlog, you see these entries:
215 | 2018-06-19 07:01:05 | Kernel Event | Clean Shutdown
216 | 2018-06-19 07:01:07 | System boot | 36541
217 | 2018-06-19 07:01:07 | System Reset
218 | 2018-06-19 07:06:27 | Kernel Event | Clean Shutdown
219 | 2018-06-19 07:06:27 | System boot | 36542
220 | 2018-06-19 07:06:27 | System Reset

So, there were _two_ reboots, and something in that process wiped out
the stateful partition along the way.

 
Components: OS>Systems
Labels: OS-Chrome
Owner: jbrandmeyer@chromium.org
Status: Assigned (was: Available)
I did a survey of the "from" and "to" builds for each
failure.  Here's what I found:
#   From           To
7   R69-10796.0.0  R67-10575.55.0
1   R69-10794.0.0  R65-10323.91.0
1   R69-10797.0.0  R66-10452.103.0
1   R69-10798.0.0  R66-10452.103.0

The target builds range from R65 to R67; the source build is always R69.
So, at first blush, I'd say the problem originates in a recent canary
change, no later than candy-release/R69-10794.0.0.

Passing to a randomly selected sheriff for further triage/debug.
The complete inventory of the failures to date, should it be needed:
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1090810-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host22/1087202-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host20/1087204-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host18/1087201-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host16/1087199-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host12/1087205-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host10/1087200-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host15/1087198-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host12/1087203-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack3-host8/1087197-provision/
    https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos2-row11-rack4-host5/1085848-provision/
Owner: jkop@chromium.org

Comment 6 by jkop@chromium.org, Jun 19 2018

Owner: jbrandmeyer@chromium.org
As Richard said, this is a product bug.
Cc: athilenius@chromium.org la...@chromium.org

Comment 8 by grundler@google.com, Jun 19 2018

"Oh sh*t" is what comes to mind. Having all the logs wiped out by switch to/from dev mode isn't helping at all. I'll start looking too.
> "Oh sh*t" is what comes to mind.

Yeah.  Glad I'm not you.  :-)

> [ ... ] the logs wiped out by switch to/from dev mode [ ... ]

It's not a switch to/from dev mode.  Something may be triggering powerwash,
though.  But either way, yeah, the failure wipes out its own history.

> [ ... ] Something may be triggering powerwash, [ ... ]

... and, it turns out, powerwash leaves behind a log in /var/log/clobber-state.log.

Which (praise be the deity of your choice) is preserved after the
failure.

The leading edge is the first line, which explains the reason for
powerwash:
    2018/06/19 14:01:11 UTC (repair): /dev/mmcblk0p1 Self-repair corrupted stateful partition

Richard, which $BOARDs are using quick-provision today?
Is there a time line of when each was rolled out?
> Richard, which $BOARDs are using quick-provision today?
> Is there a time line of when each was rolled out?

It's not done on a board-by-board basis.  Basically, all
canary, CQ and PFQ builds will use quick-provision.  I think
there's a small handful such as jetstream that might still be
using the old flow.

However:  For this failure, because of bug 854061, some of the
failures were with quick-provision on, and some were seen with
it off.  That's consistent with the basic theory of the failure:
The differences between quick-provision and the AU based flow are
small, especially w.r.t. the stateful partition.  In both cases,
all that happens is "unpack a tar file into stateful".

The evidence from the logs is that the stateful file system winds
up corrupted during shutdown, which triggers powerwash and failure.
So, we need to figure out what could be causing shutdown not to work
right.

> Is there a time line of when each was rolled out?

Regarding specifically "timeline":  quick-provision has been in use
for months.  This problem is brand new, not even 24 hours old.

Cc: gwendal@chromium.org ahass...@chromium.org
Owner: grundler@chromium.org
Status: Started (was: Assigned)
Ah ok - I had no clue when quick-provisioning was deployed - thanks for clarifying.

I don't entirely agree with "the differences ... are small". The workload the storage device sees will be quite different - that's the main reason quick provision is faster.
 
I believe the "tar" file in this case is a full disk image which then gets dd'd straight to the block device, not the file system.

I do agree conceptually they are the same and in neither case should the stateful partition get corrupted.

I've asked a few other people to provide guidance here since I don't think I have enough info to root cause this problem.  Especially since this is mostly happening on candy machines (vs other HW that is running the same build versions).

And I'll take bug ownership since this is close to my "area of expertise" and that should allow my co-sheriff to continue focusing on annotation of other paladin failures.
> I believe the "tar" file in this case is a full disk image
> which then gets dd'd straight to the block device, not the file system.

For stateful, no.  The tar file is a tar file, and whether it's
quick-provision or the regular stateful update, it gets extracted
by a command more or less like this:
    curl $URL | tar xvf -

> [ ... ] this is mostly happening on candy machines [ ... ]

Probably...  I did some spot checks trying to see if this was happening
on banjo (another rambi board).  I couldn't find evidence of it there.
On my list was to do a more comprehensive search for provision failures
on all the rambi boards.  The command in question looks something like
this:
    dut-status -b $BOARD -u '2018-06-19 16:00:00' -f | grep provision | grep -v OK

That'll show links to all failed provision tasks for $BOARD prior
to 16:00 today (local time).  Then you have to follow all the links,
looking for instances of this failure in "status.log"

Stateful partition is getting overwritten as part of the quick-provision. I'm not sure that is "normal". Have to look at other logs.

https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row8-rack2-host4/1090823-provision/

mnt/stateful_partition/unencrypted/preserve/log/quick-provision.log

2018-06-19 06:50:00-07:00 INFO: Stateful reset
2018-06-19 06:50:00-07:00 INFO: Updated status: DUT: Stateful reset
2018-06-19 06:50:00-07:00 INFO: Stateful update
2018-06-19 06:50:00-07:00 INFO: Updated status: DUT: Stateful update
--2018-06-19 06:50:00--  http://100.115.219.135:8082/static/candy-release/R69-10798.0.0/stateful.tgz
...
2018-06-19 06:50:23 (17.4 MB/s) - written to stdout [423748786/423748786]

2018-06-19 06:50:24-07:00 INFO: Stateful clean
2018-06-19 06:50:24-07:00 INFO: Updated status: DUT: Stateful clean
KEYVAL: UPDATE_STATEFUL_start=1529416200
KEYVAL: UPDATE_STATEFUL_end=1529416224
KEYVAL: UPDATE_STATEFUL_elapsed=24

> Stateful partition is getting overwritten as part of the
> quick-provision. I'm not sure that is "normal". Have to
> look at other logs.

It's normal, and it's done by simple extraction of a tar file.

Here's the relevant source from the script:
# Performs a stateful update using a specified stateful.tgz URL.
# Function will exit script on failure.
stateful_update() {
  local url="$1"

  # Stateful reset.
  info "Stateful reset"
  post_status "DUT: Stateful reset"
  rm -rf "${STATEFUL_DIR}/${UPDATE_STATE_FILE}" \
    "${STATEFUL_DIR}/var_new" \
    "${STATEFUL_DIR}/dev_image_new" || die "Unable to reset stateful."

  # Stateful update.
  info "Stateful update"
  post_status "DUT: Stateful update"
  get_url_to_stdout "${url}" |
    tar --ignore-command-error --overwrite --directory="${STATEFUL_DIR}" -xzf -
  local pipestatus=("${PIPESTATUS[@]}")
  if [[ "${pipestatus[0]}" -ne "0" ]]; then
    die "Retrieving ${url} failed. (statuses ${pipestatus[*]})"
  elif [[ "${pipestatus[1]}" -ne "0" ]]; then
    die "Untarring to ${STATEFUL_DIR} failed. (statuses ${pipestatus[*]})"
  fi

  # Stateful clean.
  info "Stateful clean"
  post_status "DUT: Stateful clean"
  printf "clobber" > "${STATEFUL_DIR}/${UPDATE_STATE_FILE}" || \
    die "Unable to clean stateful."
}
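One subtlety in the script above worth noting: bash resets PIPESTATUS on the very next command, which is why the script copies it into a local array immediately after the pipeline. A minimal demonstration (illustrative only, not part of quick-provision):

```shell
#!/bin/bash
# PIPESTATUS holds the exit status of every stage of the last pipeline,
# but only until the next command runs -- so copy it right away.
false | true
pipestatus=("${PIPESTATUS[@]}")   # copy immediately, before anything else
echo "left=${pipestatus[0]} right=${pipestatus[1]}"
```

This is why the quoted `stateful_update` can distinguish "download failed" (`pipestatus[0]`) from "untar failed" (`pipestatus[1]`) even though `curl | tar` reports only `tar`'s status as `$?`.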

Yup: stateful update is at file system level and not block level (like rootfs is). I have the code in front of me: ~/trunk/src/platform/dev/quick-provision/quick-provision

But it doesn't "sync" or do anything else to make sure data has landed on storage.  I guess "reboot" is supposed to do that...but if reboot ends up with "hung task" since IO might take more than 120 seconds to clear up, then it might be likely that the stateful is corrupted. :(  [This is just speculation - I have no evidence this is the case.]

Something like this would give me warm fuzzies even if it's not the root cause of this bug:

diff --git a/quick-provision/quick-provision b/quick-provision/quick-provision
index 4f69ac3..5bc708f 100644
--- a/quick-provision/quick-provision
+++ b/quick-provision/quick-provision
@@ -167,6 +167,9 @@ update_partition() {
   elif [[ "${pipestatus[2]}" -ne "0" ]]; then
     die "Writing to ${part} failed. (statuses ${pipestatus[*]})"
   fi
+
+  # force stall until all dirty buffers are at least scheduled to be written.
+  sync
 }
 
 # Performs a stateful update using a specified stateful.tgz URL.


And then perhaps something similar for the stateful partition.
Looking at the dd rates, it looks like my theory about "hung tasks" is unlikely. The dd emits statistics about writes that are consistent with eMMC write speeds - i.e. data is going directly to "media" (eMMC flash).
Just talked with Gwendal and he has concerns with this change:

commit 86fed24396246733bae4c963ed5208bff777fb61
Author: Risan <risan@google.com>
Date:   Wed May 16 16:48:56 2018 +0900


    init: Conditionally enables Quota for ext4
    
    There are 2 parts:
    1. For fresh installation, chromeos-install will mkfs the ext4
    filesystem with quota option on.
    2. Otherwise, chromeos_startup conditionally checks whether the quota
    option is turned on. If it hasn't, the scripts turn it on.
    
    BUG=b:62995196
    TEST=- Turn off quota and checked if chromeos_startup enables it.
    TEST=- Checked that chromeos_startup doesn't trigger tune2fs when quota
    is on (by adding an else in the chromeos_startup change - and make sure
    that the else is triggered).
    TEST=- Turn off kernel quota config, and /mnt/stateful_partition is
    still correctly mounted, without quota.
    
    Change-Id: I7e62c7dd79ec65ec380b8049e2d77fd0778844da
    Reviewed-on: https://chromium-review.googlesource.com/1064571
    Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
    Tested-by: Risan <risan@chromium.org>
    Reviewed-by: Ryo Hashimoto <hashimoto@chromium.org>
    Reviewed-by: Mike Frysinger <vapier@chromium.org>

The concern is: if the file system has minor corruption that fsck can fix, that fsck should run before tune2fs does:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571/5/init/chromeos_startup#233

Gwendal is taking another look.
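The ordering concern above can be sketched in shell. This is a hypothetical wrapper, not the actual chromeos_startup code: the function name, the `-p` (preen) flag choice, and the error handling are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: repair minor corruption with fsck BEFORE letting
# tune2fs flip the quota feature bits, so tune2fs never operates on a
# dirty filesystem. Not the real ChromeOS init code.
enable_quota_safely() {
  local state_dev="$1"
  # fsck exit codes 0 and 1 both mean the filesystem is now clean
  # (1 = errors were corrected); anything higher means repair failed.
  local rc=0
  fsck.ext4 -p "${state_dev}" || rc=$?
  if [ "${rc}" -gt 1 ]; then
    echo "fsck could not repair ${state_dev} (rc=${rc})" >&2
    return 1
  fi
  # Only now is it safe to toggle the quota feature.
  tune2fs -Oquota -Qusrquota,grpquota "${state_dev}"
}
```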
We are seeing several other failures that look like they may be related.  Example chrome logs:


[9313:9313:0621/034950.979652:ERROR:device_event_log_impl.cc(159)] [03:49:50.979] Login: cryptohome_authenticator.cc:140 MountEx failed. Error: 1
[9313:9313:0621/034950.980047:ERROR:device_event_log_impl.cc(159)] [03:49:50.980] Login: cryptohome_authenticator.cc:951 Cryptohome failure: state(AuthState)=2, code(cryptohome::MountError)=1
[9313:9313:0621/034950.980081:VERBOSE1:cryptohome_authenticator.cc(791)] Resolved state to: 2
[9313:9313:0621/034950.980392:ERROR:device_event_log_impl.cc(159)] [03:49:50.980] Login: cryptohome_authenticator.cc:725 Login failed: Could not mount cryptohome.
[9313:9313:0621/034950.980440:ERROR:login_performer.cc(63)] Login failure, reason=1, error.state=0
[9313:9313:0621/034950.980503:VERBOSE1:existing_user_controller.cc(1482)] Could not mount cryptohome.
Labels: -Pri-1 Pri-0
Summary: Stateful partition corruption (was: candy units failing provision)
Other issues with cryptohome mount problems will be set as blocked-by this one.
Blocking: 855072
Blocking: 855082
FTR, QUOTA support was recently enabled in the kernel:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1016226
"CHROMIUM: config: Kernel config to enable quota"

Landed in 10756.0.0

While this was recently introduced and still suspect, the chromeos_startup change will reset the feature if the kernel doesn't explicitly claim support for it. I'll test this anyway since there is only a very small chance reverting corrupts the filesystem.
Blocking: -855072
The kernel and user space without Quota support will still mount the stateful partition even if quota is enabled. The first boot was with a kernel + user space which enabled quota. The second boot was with a kernel w/o Quota, and user space didn't change the quota settings on the filesystem. Unless I screwed something up, the theory in comment #26 is dead. :(
Cc: adurbin@chromium.org
My experiment on friday was wrong: my build which reverted the /sbin/chromeos_startup change did in fact NOT revert the change. So let me try that again.
TL;DR: confirmed enabling quota will cause older Chrome OS builds installed later to powerwash stateful. :(  Need to determine the "right way" to handle older OS images within the test lab. *sigh*


My 10808.0.2018_06_20_1640 build has Quota support enabled:
./R69-10802.0.2018_06_20_1640-a1/chromiumos_test_image.bin
CHROMEOS_RELEASE_DESCRIPTION=10802.0.2018_06_20_1640 (Test Build - grundler) developer-build atlas

localhost ~ # fgrep quota /sbin/chromeos_startup 
  # Enable/disable quota feature.
    if [ -d /proc/sys/fs/quota ]; then
      # Quota is enabled in the kernel, make sure that quota is enabled in the
           grep -qe "^Filesystem features:.* quota.*"; then
        tune2fs -Oquota -Qusrquota,grpquota "${STATE_DEV}" || :
      # Quota is not enabled in the kernel, make sure that quota is disabled in
           grep -qe "^Filesystem features:.* quota.*"; then
        tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :

localhost ~ # uname -a
Linux localhost 4.4.138 #7 SMP PREEMPT Wed Jun 20 15:47:07 PDT 2018 x86_64 Intel(R) Core(TM) i7-7Y75 CPU @ 1.30GHz GenuineIntel GNU/Linux


localhost ~ # dumpe2fs -h /dev/mmcblk0p1 | fgrep -i quota
dumpe2fs 1.44.1 (24-Mar-2018)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg encrypt sparse_super large_file huge_file dir_nlink extra_isize quota metadata_csum
User quota inode:         3
Group quota inode:        4


Build WITHOUT quota support in either kernel or chromeos_startup
./R69-10816.0.2018_06_25_1443-a1/chromiumos_test_image.bin

clobber.log after reboot:
2018/06/26 00:13:36 UTC (repair): /dev/mmcblk0p1 Self-repair corrupted stateful 
partition
dumpe2fs 1.44.1 (24-Mar-2018)

All "test account" info not present. Python not present. :(
Some context:

Current provision flow:
Shard calls devserver to trigger provision on a DUT.
Devserver runs all the commands in quick_provision on the DUT.

Richard will enable the new provision flow, which is:
Shard runs all the commands in quick_provision on the DUT.

So 2 options here:
1) Add this to quick_provision script (http://shortn/_rKwb4fp8Zv)
Pros: no need to make the same change twice
Cons: for the fallback of a failed quick provision, we lose this coverage.

2) Add this to machine_install_by_devserver (http://shortn/_yvTrbn0493) first, then move it to machine_install (http://shortn/_ufPIOcpjkl) once Richard enables the new code flow.
Pros: Full coverage everywhere.
Cons: Need to make this change twice.

I actually prefer 2), but I'll let the owner decide which is better :).
I need more help from jrbarnett and/or xixuan.

We need to run this command when /mnt/stateful is NOT mounted:
    tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :

Specifically:
  - when the target BUILD is older than 10756.0.0,
  - BEFORE booting the next build,
  - AFTER stateful is UNMOUNTED (or before it's mounted on next boot).
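Those conditions can be sketched as a shutdown-side check. Everything here is an assumption for illustration: the function names, the idea that a `.target_version` file holds a bare build number, and how the 10756 threshold is consulted; it is not actual ChromeOS code.

```shell
#!/bin/bash
# Hypothetical sketch of a shutdown-time check: after stateful is
# unmounted, turn quota back off if the build we are about to boot
# predates 10756.0.0. File name and format are assumptions.
QUOTA_MIN_BUILD=10756

needs_quota_disable() {
  local target_version_file="$1"
  # No file means no provisioning downgrade is in progress.
  [ -f "${target_version_file}" ] || return 1
  # Assume the file holds a bare build number, e.g. "10575".
  local target_build
  target_build="$(cat "${target_version_file}")"
  [ "${target_build}" -lt "${QUOTA_MIN_BUILD}" ]
}

maybe_disable_quota() {
  local state_dev="$1" target_version_file="$2"
  if needs_quota_disable "${target_version_file}"; then
    # Must run only while stateful is unmounted; tune2fs refuses otherwise.
    tune2fs -O^quota -Q^usrquota,^grpquota "${state_dev}" || :
  fi
}
```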
Cc: jrbarnette@chromium.org xixuan@chromium.org
For the record, here are the two changes that started this problem:
"init: Conditionally enables Quota for ext4"

   https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571
   (this was cherry-picked into M68 branch as well)

"CHROMIUM: config: Kernel config to enable quota"
   https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1016226
   landed in 10756.0.0 on Jun 5, 4:03pm PST(?)

> We need to run this command when /mnt/stateful is NOT mounted:
>     tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :

If the objective is to run this with the file system not mounted,
I think it would have to go into the shutdown code paths.  Unmounting
stateful on a running system is tricky, at best.  In particular, if
you want to run 'ssh', stateful has to be mounted, because the machine
identity data sshd must provide to clients is in stateful.
My understanding is that this command only needs to be run on the DUT before stateful is mounted by the next (old) build during provisioning, to make sure the old build doesn't mount stateful with quota enabled. The current condition of the DUT, i.e. its current file system and whether it is mounted or not, doesn't matter.
Cc: risan@chromium.org
#37, #36: You cannot run tune2fs by yourself outside of the installer/init code; you will change the flow the user experiences.

It is unfortunate that rolling back from R68+ triggers a clobber due to the quota change, but we can install the dev tool in the stateful image that matches the new rootfs. [via RestoreStateful in auto_updater.py]

It will take more time than quick autoupdate, but we won't change the flow. 
> #37, #36: You can not run tune2fs by yourself outside of the
> installer/init code, you will change the flow the user experiences.

That's not quite true.  The code to do this can be conditioned on "only
in test images, and only during updates."  We already have such conditions
in places like chromeos_startup; we can add them for chromeos_shutdown, too.

But if I understand comment #33 properly, we can only run tune2fs while
the target file system isn't mounted.  In practical terms, that means we
_must_ perform the operation during chromeos_shutdown.  That's the only
place in the system where it's practical to have stateful unmounted for
the necessary purpose.  We can't do it in chromeos_startup, because _that_
code will belong to some old image that we cannot change.


> [ ... ] but we can install the dev tool [ ... ]

Which dev tool do you mean, exactly?

> [ ... ] [via RestoreStateful in auto_updater.py] [ ... ]

I'm actively working to delete that part of the code.  Adding the
ability to restore stateful after update will require a non-trivial
amount of effort.  I'm skeptical of requiring this in any event.

> But if I understand comment #33 properly, we can only run tune2fs while
> the target file system isn't mounted.  [ ... ]

Cutting through the uncertainty, the answer is easy to determine:
    localhost ~ # tune2fs -O^quota -Q^usrquota,^grpquota "${STATE_DEV}" || :
    tune2fs 1.44.1 (24-Mar-2018)
    The quota feature may only be changed when the filesystem is unmounted.

So, the code changes must go into chromeos_shutdown.

grundler@ and I went through the source; the script already knows about
stateful_update, and the difference between test and non-test, so the
problem at this point is just a SMOP (a simple matter of programming).

Modifying the superblock at every shutdown does sound scary to me, but umount does it all the time, so it's possible.
The drawback is we are increasing the boot time, because we need to enable quota at every reboot from now on.
> Modifying the superblock at every shutdown [ ... ]

That's not the proposal.  The code will only be invoked when
we're downgrading to a version that requires the operation.

Sorry, I misunderstood. Grant pointed out chromeos_shutdown can guess if we are downgrading, so your proposal looks good.
Summary: Quota flag causing "stateful partition corruption" on older builds (was: Stateful partition corruption)
I've uploaded a change to chromeos_shutdown which implements 1/2 the proposal:
   https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1116286

The "other part" likely needs to be in the quick-provision script, which can create the .target_version file containing the build ID that is going to be used on reboot. I'm open to other mechanisms to communicate this. Feel free to modify/comment as you see fit.

I'd like to get this sorted out today please. Either we go with something like this ASAP or wait until Richard's planned changes land in a few weeks.
Summary: Quota flag causing "corrupted stateful partition" on older builds (was: Quota flag causing "stateful partition corruption" on older builds)
> [ ... ] likely needs to be in quick-provision script [ ... ]

Probably, putting it in quick-provision won't help...

The issue is that quick-provision doesn't know the version number,
it knows only a URL.  I'd be suspicious of any solution that required
quick-provision to parse the URL to extract the version number.

If we conclude that we don't want quick-provision to parse the URL for
a version string, then the remaining options are likely to be summarized
as "it's easier not to change quick-provision."

Comment 47 by risan@google.com, Jun 27 2018

Ugh, sorry: the effect on consumers was properly understood (https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1064571#message-d34c77e508cf6e75a4576dc7ac4a06a24850edac) but we missed the testing devices :(

Just to make sure I understand this correctly: your change only affects test devices and not the release product - is that correct?

Risan, correct. Consumer devices do the equivalent of powerwash when using recovery media to go back to an older release. They are not affected by this bug.

Chrome OS Enterprise is working on launching rollback features starting with R69 and this issue is primarily for releases older than R68 - so also not affected.
Cc: hunyadym@chromium.org
Actually rollback starts from R67, but since partial powerwash has not been implemented, and this causes a full powerwash anyway, this should be fine.
> The issue is that quick-provision doesn't know the version number,
> it knows only a URL.  I'd be suspicious of any solution that required
> quick-provision to parse the URL to extract the version number.

I've re-read the code, and as often happens with such things, my memory
was ... faulty.  The quick-provision script _does_ get a build name, plus
a base URL that it combines with the build name to produce the actual URL.
However, even then, the build name is a string that includes a version,
and not just the version itself.  So, the string would still have to be
parsed.

Parsing the build name may still be a dicey proposition.  I think we also
use quick-provisioning for parts of Paygen testing.  The strings to identify
builds used there will be quite different from the build names used in
ordinary provision tasks.  So, it still needs study/reflection.

I've included this thought as a comment to the proposed CL but I'll repeat here:

BTW, your use of the word "downgrade" reminds me that we should decide where to test if this downgrade crosses that build ID number (high to low) or not (low to low or high to high). The provisioning code can test this to decide if it should create .target_version or this code can look up what is in /etc/lsb-release and test that.

In fact, looking at this, seems like we should just copy /etc/lsb-release to ".target-lsb-release" and "." execute that file. Then just compare current ${CHROMEOS_RELEASE_BUILD_NUMBER} with the one in .target-lsb-release. WDYT?
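The comparison proposed above can be sketched as follows. `is_downgrade_below` is a hypothetical name, and actually producing the ".target-lsb-release" copy of the target image's /etc/lsb-release is assumed to have happened elsewhere; only the source-and-compare step is shown.

```shell
#!/bin/bash
# Sketch: source a saved copy of the target image's /etc/lsb-release
# and compare its build number against a threshold (e.g. 10756).
# The saved-copy file is an assumption; lsb-release uses KEY=VALUE
# shell syntax, so it can be sourced directly.
is_downgrade_below() {
  local target_lsb="$1" threshold="$2"
  local CHROMEOS_RELEASE_BUILD_NUMBER=""
  . "${target_lsb}"
  # Strip any ".x.y" suffix, then compare numerically.
  [ "${CHROMEOS_RELEASE_BUILD_NUMBER%%.*}" -lt "${threshold}" ]
}
```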
> [ ... ] test if this downgrade crosses that build ID number (high to low)
> or not (low to low or high to high) [ ... ]

As currently constructed, the code would only run on high build numbers.
For that reason, if the ".target_version" indicates that the target
is below the threshold, we can be sure that this is a downgrade that
requires adjusting stateful.


> In fact, looking at this, seems like we should just copy
> /etc/lsb-release to ".target-lsb-release" and "." execute that
> file. Then just compare current ${CHROMEOS_RELEASE_BUILD_NUMBER}
> with the one in .target-lsb-release. WDYT?

That would be nice.  However, we don't have ready access to
/etc/lsb-release as it would be installed in the new build.
> Parsing the build name may still be a dicey proposition.  I think we also
> use quick-provisioning for parts of Paygen testing.  The strings to identify
> builds used there will be quite different from the build names used in
> ordinary provision tasks.  So, it still needs study/reflection.

Looking it over, I think the "parsing" question is moot.  The bigger
issue is that there are, and will be for the indefinite future, code paths
that don't use quick-provisioning at all.  That's especially true for
Paygen testing, which is/will be the main source of installing old, pre-quota
kernels for the foreseeable future.

So, the code to create the ".target_version" file must live on the ssh
client side code (i.e. Autotest or devserver).

Comment 54 by bugdroid1@chromium.org, Jul 4

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/238cf8d8e41849b97515c8c6ae5ee11b381e6dd7

commit 238cf8d8e41849b97515c8c6ae5ee11b381e6dd7
Author: Grant Grundler <grundler@chromium.org>
Date: Wed Jul 04 01:18:23 2018

init: turn off ext4 quota for builds before 10756

Build 10756.0.0 enabled quota on /mnt/stateful_partition in order
to support ARC++. Unfortunately, in the test lab, we try to preserve
stateful on a host when installing the next test image. If that next
image is older than 10756.0.0:
1) the mount command will fail when the next image boots
2) the failure is reported as "corrupted stateful"
3) machine will then powerwash to recover
4) the provision step fails because python is not present on stateful
5) the machine will get "repaired" successfully

This has already happened to a pool of machines that was already low
on devices. The "temporary" loss of additional machines prevented the
corresponding paladin from running test suites.

Turning off quota has to be done before we boot the build image
which knows nothing about quota. And we only need/want to do
this for test lab machines. And we can only run tune2fs AFTER
the stateful partition is unmounted.

BUG= chromium:854278 
TEST=manual

Change-Id: Id3e86c482857cb67710441b697855e35b8404173
Reviewed-on: https://chromium-review.googlesource.com/1116286
Commit-Ready: Grant Grundler <grundler@chromium.org>
Tested-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Reviewed-by: Grant Grundler <grundler@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/238cf8d8e41849b97515c8c6ae5ee11b381e6dd7/init/chromeos_shutdown

Status: Fixed (was: Started)
I think everything needed to fix this issue has landed. Please re-open if powerwash still appears to be happening ("python not found") after provisioning an older release.

Comment 56 by bugdroid1@chromium.org, Jul 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff

commit 3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff
Author: Richard Barnette <jrbarnette@chromium.org>
Date: Mon Jul 09 19:42:38 2018

[autotest] Set a "target version" for stateful updates

In some cases in the lab, provisioning may downgrade a DUT.  Recent
OS changes have introduced a problem where downgrades from builds
after R69-10756.0.0 to builds before may fail and cause a powerwash.

This changes the provisioning flow to create a file in stateful that
indicates the target version of an update.  Shutdown code in the OS
uses this file to recognize when a downgrade is occurring, and
prevent the unwanted powerwash.

BUG= chromium:854278 
TEST=Run sanity suite on a local Autotest instance; examine logs

Change-Id: I757118ac94a6f4e590a961b654cd24fa220d633b
Reviewed-on: https://chromium-review.googlesource.com/1119197
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Richard Barnette <jrbarnette@chromium.org>
Reviewed-by: Grant Grundler <grundler@chromium.org>

[modify] https://crrev.com/3ef29a85ce1c2b734698fd6b9ae0118cbcd69bff/server/cros/autoupdater.py
