New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 788473 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

Blocking:
issue 779870



Sign in to add a comment

Quick provisioning postinst failure with segfault/Exec format error

Project Member Reported by davidri...@chromium.org, Nov 25 2017

Issue description

After running 10-20k successful provisions (with <10 provision related failures during that time), one round of 50 simultaneous quick provisions failed horrendously.  Specifically 5 provisions failed and 29 succeeded before I aborted the run.  (All from chromeos2-devserver9, a server currently dedicated to my test)

Of the 5 that failed, 4 were nyan_blaze provisioning nyan_blaze-release/R64-10145.0.0, and 1 was guado provisioning guado-release/R64-10147.0.0.  Those same versions had been provisioned many times before, running the exact same devserver code, on the same duts, and also successfully provisioned to some other duts during the same run.

It looked postinst was reproducible failing with a segfault/exec format error, with subsequent provisions on the device all failing in a similar way:
2017-11-25 01:21:43+00:00 INFO: Update next kernel to try (via postinst)
dm:dm bht[DEBUG] Setting block_count 317440
dm:dm bht[DEBUG] Setting depth to 3.
dm:dm bht[DEBUG] depth: 0 entries: 1
dm:dm bht[DEBUG] depth: 1 entries: 20
dm:dm bht[DEBUG] depth: 2 entries: 2480
Segmentation fault (core dumped)
2017-11-25 01:22:13+00:00 ERROR: FATAL: postinst failed.
2017-11-25 01:22:13+00:00 INFO: Updated status: FATAL: postinst failed.

or
2017-11-24 16:49:32-08:00 INFO: Update next kernel to try (via postinst)
/tmp/quick-provision: line 168: /tmp/tmp.SJm5HTH1d2/postinst: cannot execute binary file: Exec format error
2017-11-24 16:49:32-08:00 ERROR: FATAL: postinst failed.
2017-11-24 16:49:32-08:00 INFO: Updated status: FATAL: postinst failed.

A number of postinst tmpmnts were left in /tmp, and unmounting those and re-running the postinst script after that resulted in the problem going away.

One DUTs had rebooted and some how self corrected prior to me investigating, one has problems occur but rerunning postinst resulted in the system correctly updating.

I've tried md5suming contents and comparing things and haven't found any obvious issues from that, but my searching there hasn't been exhaustive.

It feels like there's at least two separate problems:
1. Something caused multiple simultaneous failures in provisioning (software bug, devserver memory or disk getting corrupted, network blip) which caused data corruption.  The exact same devserver software, provisioning the same DUTs with the same versions, with the same load factor had successfully run without issue many times.  dmesg doesn't show any obvious alerts about memory failures.  On-disk copies seem to be okay.
2. Quick provisioning bug where such failures persist across invocations until all mounts are released, likely from some Linux fs/block cache interaction.

As a design decision, quick provisioning trusts the data and data transport from the devserver.  If this is unreliable, this decision might need to be revisited.

I'm at a total loss of what could cause #1, hence a rather large CC list for ideas.
 
For #2: doing an unconditional umount should make this more resilient to previous failures.  Furthermore, if fallback had been enabled, this wouldn't have happened since the fallback provisioning would have rebooted which also cleans up the stat.  I've got fallback disabled to accommodate my testing better.

In order to debug #1 better, I'm trying to reproduce where I pipe the contents downloaded to md5sum as well to get a hash of the incoming data.

The two times I've seen this both seemed to happen at the start of an onrush of provisioning.  The second time I had 7 failures upfront, the 7 DUTs were retried and failed (due to #2 above) with my test harness blacklisting those 7 DUTs, followed by 986 successful provisions.
Owner: davidri...@chromium.org
Cc: haoweiw@chromium.org
Blocking: 779870
Cc: vapier@chromium.org
I can't think of any way that a corrupt image could result in postinst failing once time without failing repeatedly.

Bugs in ChromeOS (especially old kernels) could result in that behavior, mostly related to disk cache handling.

Long, long ago, we had kernel bugs related to cache corruption when doing block writes followed by rapid mount / filesystem reads. This would probably be related to how long the delay is between finishing the block writes and doing the initial mount.

If that was the cause, I would expect a steady stream of background failures, and expect them to be somewhat board specific. 

I can't think of how a Chrome OS bug would affect multiple DUTs all at the same time.  I agree with what you're saying about background failures and board specific in that case.


do we flush the page cache at all ?  if we're modifying the block device directly, then trying to mount it, and possibly keeping mounts open (from before the block update), i can see the block layers getting confused since caches aren't (iiuc) mount point specific.

having the quick provision script tear down known mounts like everything under /tmp, updating the disk, then flushing all caches sounds like a reasonable approach.  assuming it works ;).
With traditional updates, we only do block writes to the unused kernel/rootfs which are unmounted.

The bugs we saw way back when involved unflushed pages not being seen by the file system, even though the mount was done after the write calls had returned. I found that confusing.

Flushing the pages before mounting fixed this, but caused huge jank on a running system. David's system can afford to do a forced flush since it's not a background process.

However, the path we are going down doesn't match his symptoms very well. I just don't have anything better.
Cc: puneetster@chromium.org
Project Member

Comment 11 by bugdroid1@chromium.org, Dec 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/a3e96631a4ecee8c78e42f34700b97548546dc6c

commit a3e96631a4ecee8c78e42f34700b97548546dc6c
Author: David Riley <davidriley@chromium.org>
Date: Tue Dec 12 19:09:10 2017

quick-provision: Remove tmpdir if postinst fails

BUG=chromium:788473
TEST=repeat 20 autotest/files/contrib/loadtest.py $DS config.json --simultaneous 100 --total 2000
Change-Id: I3968dc3dfcb2faba7739cdc6362e203bb8b6fc0c
Reviewed-on: https://chromium-review.googlesource.com/804963
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/a3e96631a4ecee8c78e42f34700b97548546dc6c/quick-provision/quick-provision

Project Member

Comment 12 by bugdroid1@chromium.org, Dec 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/dev-util/+/cddf942c87d82519b2d4a7f8989fd368df1d8649

commit cddf942c87d82519b2d4a7f8989fd368df1d8649
Author: David Riley <davidriley@chromium.org>
Date: Tue Dec 12 19:09:10 2017

quick-provision: Log hashes of downloaded contents for debugging

BUG=chromium:788473
TEST=None

Change-Id: Ibd473e21b7b8972b8f0952c2c4cdf7682bad9d7c
Reviewed-on: https://chromium-review.googlesource.com/804964
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: David Riley <davidriley@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/cddf942c87d82519b2d4a7f8989fd368df1d8649/quick-provision/quick-provision

Provision to chromeos4-row2-rack12-host1 to kip-release/R64-10164.0.0 failed:
Set boot target to /dev/mmcblk0p5: Partition 5, Slot B
SetImage
KERNEL_CONFIG: console= loglevel=7 init=/sbin/init cros_secure oops=panic panic=-1 root=/dev/dm-0 rootwait ro dm_verity.error_behavior=3 dm_verity.max_bios=-1 dm_verity.dev_wait=1 dm="1 vroot none ro 1,0 3584000 verity payload=PARTUUID=%U/PARTNROFF=1 hashtree=PARTUUID=%U/PARTNROFF=1 hashstart=3584000 alg=sha1 root_hexdigest=824dd881b59f3d1838ebcaa97710577ddf4e1d98 salt=399bce03588b5c9eaa89a73c5da857465f37e7186809d2af87b9a5d36211149c" noinitrd cros_debug vt.global_cursor_default=0 kern_guid=%U add_efi_memmap boot=local noresume noswap i915.modeset=1 tpm_tis.force=1 tpm_tis.interrupts=0 nmi_watchdog=panic,lapic
Setting up verity.
Filesystem hash verification failed
Expected 824dd881b59f3d1838ebcaa97710577ddf4e1d98 != actual d0a9adc86de18d17800d3747367effa7add063c7
Finished after 56 seconds.
SetImage failed.
PostInstall Failed
Downloaded hashes prior to postinst failure:
70a1b0a6e27961c350e0ec6e1688ba48  -
33db9853a3628d3de87a6586c6a16d80  -
6df67de768f6e77d8a933d597e495fcd  -
2017-12-12 18:32:38-08:00 ERROR: FATAL: postinst failed.
2017-12-12 18:32:38-08:00 INFO: Updated status: FATAL: postinst failed.

/tmp/hashes reports successful HTTP transport:
70a1b0a6e27961c350e0ec6e1688ba48  -
33db9853a3628d3de87a6586c6a16d80  -
6df67de768f6e77d8a933d597e495fcd  -

Expected values:
chromeos-test@chromeos2-devserver9:~/images/kip-release/R64-10164.0.0$ md5sum full_dev_part_* stateful.tgz
70a1b0a6e27961c350e0ec6e1688ba48  full_dev_part_KERN.bin.gz
33db9853a3628d3de87a6586c6a16d80  full_dev_part_ROOT.bin.gz
6df67de768f6e77d8a933d597e495fcd  stateful.tgz

Contents written to disk do not match what should be expected:
localhost ~ # md5sum /dev/mmcblk0p4
3fc5909c8b642e36d309fdd9f9b476f4  /dev/mmcblk0p4
localhost ~ # md5sum /dev/mmcblk0p5
d602fc1dc333554bbd98d1dd2673c61c  /dev/mmcblk0p5

chromeos-test@chromeos2-devserver9:~/images/kip-release/R64-10164.0.0$ zcat full_dev_part_KERN.bin.gz | md5sum -
3fc5909c8b642e36d309fdd9f9b476f4  -
chromeos-test@chromeos2-devserver9:~/images/kip-release/R64-10164.0.0$ zcat full_dev_part_ROOT.bin.gz | md5sum -
62a78ef17df4487b468c376fbe73dfac  -
logs-chromeos4-row2-rack12-host1.tar.gz
102 KB Download
messages.chromeos4-row2-rack12-host1
109 KB Download
dmesg.chromeos4-row2-rack12-host1
47.3 KB Download
quick-provision.log.chromeos4-row2-rack12-host1
9.0 KB Download
Labels: provisionflake
Status: Assigned (was: Untriaged)

Sign in to add a comment