chromeos-init flaky loopback filesystem unit test
Issue description
Failed build: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929799010603371408
Log: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8929799010603371408/+/steps/UnitTest/0/stdout
Relevant messages:
chromeos-init-0.0.25-r3699: mke2fs 1.44.1 (24-Mar-2018)
chromeos-init-0.0.25-r3699: Suggestion: Use Linux kernel >= 3.18 for improved stability of the metadata and journal checksum features.
chromeos-init-0.0.25-r3699: Warning: could not erase sector 2: Input/output error
chromeos-init-0.0.25-r3699: Creating filesystem with 2048 1k blocks and 256 inodes
chromeos-init-0.0.25-r3699:
chromeos-init-0.0.25-r3699: Allocating group tables: 0/1 done
chromeos-init-0.0.25-r3699: Warning: could not read block 0: Input/output error
chromeos-init-0.0.25-r3699: Warning: could not erase sector 0: Input/output error
chromeos-init-0.0.25-r3699: Writing inode tables: 0/1 done
chromeos-init-0.0.25-r3699: ext2fs_update_bb_inode: Input/output error while setting bad block inode
chromeos-init-0.0.25-r3699: ../../../../../../../tmp/portage/chromeos-base/chromeos-init-0.0.25-r3699/work/chromeos-init-0.0.25/init/tests/clobber_state_test.cc:197: Failure
chromeos-init-0.0.25-r3699: Value of: MakeFilesystem("ext4", 5)
chromeos-init-0.0.25-r3699: Actual: false
chromeos-init-0.0.25-r3699: Expected: true
chromeos-init-0.0.25-r3699: terminating with uncaught exception of type testing::internal::GoogleTestFailureException: ../../../../../../../tmp/portage/chromeos-base/chromeos-init-0.0.25-r3699/work/chromeos-init-0.0.25/init/tests/clobber_state_test.cc:197: Failure
chromeos-init-0.0.25-r3699: Value of: MakeFilesystem("ext4", 5)
chromeos-init-0.0.25-r3699: Actual: false
chromeos-init-0.0.25-r3699: Expected: true
chromeos-init-0.0.25-r3699: Error: /var/cache/portage/chromeos-base/chromeos-init/out/Default/clobber_state_test: failed with signal SIGIOT|SIGABRT(6)
chromeos-init-0.0.25-r3699: * ERROR: chromeos-base/chromeos-init-0.0.25-r3699::chromiumos failed (test phase):
It looks like a disk error is preventing the mkfs.ext4 from succeeding, which causes a unit test failure.
,
Nov 15
I don't really understand what's going on.
,
Nov 15
That looks more like an image mount issue than a local drive failure; loop0/loop1 are not local disks. Looking at the logs, it appears this is a chroot/image being mounted. Not sure if something changed there or if it was a transient mount issue. I'd suggest firing off a tryjob with that config to follow a successful run through (and validate it works now). -- Mike
,
Nov 15
Potentially related failure on squawks-release: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929798909459031632
slightly different error message, but also a mkfs failure on a loop device:
chromeos-init-0.0.25-r3699: mke2fs 1.44.1 (24-Mar-2018)
chromeos-init-0.0.25-r3699: The file /dev/loop1p2 does not exist and no size was specified.
chromeos-init-0.0.25-r3699: ../../../../../../../tmp/portage/chromeos-base/chromeos-init-0.0.25-r3699/work/chromeos-init-0.0.25/init/tests/clobber_state_test.cc:195: Failure
chromeos-init-0.0.25-r3699: Value of: MakeFilesystem("ext2", 2)
chromeos-init-0.0.25-r3699: Actual: false
chromeos-init-0.0.25-r3699: Expected: true
,
Nov 15
I think we are convinced at this point that this is a flaky test (based on kernel logs). Over to sheriff to find an owner.
,
Nov 15
,
Nov 16
Another reproduction, this time on cyan-release: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929707885927364240 The previous sighting was on banon-release, so this doesn't seem to be tied to a specific builder. benchan@, any luck finding an owner?
,
Nov 16
-> fletcherw@chromium.org
,
Nov 16
This is a unit test I added in https://crrev.com/c/1297181. Would you like me to roll back the CL while I investigate? It's unclear why it's flaky; any idea what might be causing the call to mkfs.ext4 to fail?
,
Dec 4
,
Dec 4
we've long had issues on bots with loopbacks and kernel reliability. it's why a lot of code has explicit retries & syncs in them.
,
Dec 4
vapier: any idea why? Some ideas we had when brainstorming:
1) has someone altered the file we have mounted via loopback? changed permissions, deleted, moved, changed owner, etc?
2) is the file we are using for loopback failing a block allocation as we try and write to it, either because of a transient error or because the filesystem is full?
Do you have any other ideas of things we could check? Adding retries seems like a last resort. :-/
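For reference, the two ideas above could be checked from the test itself with something like the following. This is only a sketch under assumptions: BackingFileStatus and InspectBackingFile are made-up names that do not exist in clobber_state_test.cc, and the check assumes Linux, where st_blocks counts 512-byte units.

```cpp
#include <sys/stat.h>
#include <sys/statvfs.h>

#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical diagnostic record; none of these names come from the test.
struct BackingFileStatus {
  bool exists = false;              // stat() succeeded on the path
  bool fully_allocated = false;     // allocated bytes cover the file size
  std::uint64_t fs_free_bytes = 0;  // space available to unprivileged writers
};

// Gathers basic facts about the file backing a loop device.
BackingFileStatus InspectBackingFile(const std::string& path) {
  assert(!path.empty());
  BackingFileStatus status;

  struct stat st;
  if (stat(path.c_str(), &st) != 0)
    return status;  // Missing, moved, or unreadable: idea 1) above.
  status.exists = true;

  // A sparse file has fewer allocated bytes than its apparent size, so a
  // later write may still need a block allocation: idea 2) above.
  status.fully_allocated = static_cast<std::uint64_t>(st.st_blocks) * 512 >=
                           static_cast<std::uint64_t>(st.st_size);

  struct statvfs vfs;
  if (statvfs(path.c_str(), &vfs) == 0) {
    status.fs_free_bytes =
        static_cast<std::uint64_t>(vfs.f_bavail) * vfs.f_frsize;
  }
  return status;
}
```

Logging this right before and right after a failed mkfs call would show whether the backing file disappeared or the filesystem ran out of space, though it would not catch a transient kernel-side I/O error.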
,
Dec 6
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform2/+/a51b89fa44277358cd1d4ca4b0269e74ddb5151c

commit a51b89fa44277358cd1d4ca4b0269e74ddb5151c
Author: Fletcher Woodruff <fletcherw@chromium.org>
Date: Thu Dec 06 02:28:53 2018

init: tweak clobber-state test to stop flakes

clobber_state_test was failing occasionally on release builders. Reduce the size of the mock disk image and ensure that all blocks are allocated in order to (hopefully) ensure that the mkfs calls don't fail due to I/O errors.

BUG= chromium:905683
TEST=run unit tests

Change-Id: Icc3407dbb77ce77e26e57324e7681e7abef4bd91
Reviewed-on: https://chromium-review.googlesource.com/1361569
Commit-Ready: Fletcher Woodruff <fletcherw@chromium.org>
Tested-by: Fletcher Woodruff <fletcherw@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>
Reviewed-by: Ross Zwisler <zwisler@chromium.org>
Reviewed-by: Justin TerAvest <teravest@chromium.org>

[modify] https://crrev.com/a51b89fa44277358cd1d4ca4b0269e74ddb5151c/init/tests/clobber_state_test.cc
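One way to get a backing file whose blocks are all allocated up front is posix_fallocate(). The sketch below illustrates that general technique only; it is not taken from the CL, whose actual approach may differ, and PreallocateImage is a hypothetical name.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Hypothetical helper: creates (or reuses) `path` and reserves `size_bytes`
// of real disk blocks for it, so later writes through a loop device should
// not need a fresh block allocation. Returns false on any error.
bool PreallocateImage(const std::string& path, off_t size_bytes) {
  assert(size_bytes > 0);
  int fd = open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0600);
  if (fd < 0)
    return false;

  // posix_fallocate() returns 0 on success and an error number otherwise;
  // it does not set errno.
  bool ok = posix_fallocate(fd, 0, size_bytes) == 0;

  // Confirm the file really has at least the requested size.
  struct stat st;
  ok = ok && fstat(fd, &st) == 0 && st.st_size >= size_bytes;

  // Push the allocation to stable storage before a loop device is attached.
  ok = ok && fsync(fd) == 0;

  close(fd);
  return ok;
}
```

Unlike creating the image with ftruncate(), which leaves a sparse file, this fails early if the builder's disk is full instead of surfacing later as an I/O error inside mkfs.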
,
Dec 7
Still happening: https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8927835311615731024/+/steps/UnitTest/0/stdout
chromeos-init-0.0.25-r3708: mke2fs 1.44.1 (24-Mar-2018)
chromeos-init-0.0.25-r3708: Warning: could not erase sector 2: Input/output error
chromeos-init-0.0.25-r3708: Creating filesystem with 16384 1k blocks and 4096 inodes
chromeos-init-0.0.25-r3708: Filesystem UUID: 50a937ef-6711-4b76-b8c6-f5f5c55e9780
chromeos-init-0.0.25-r3708: Superblock backups stored on blocks:
chromeos-init-0.0.25-r3708: 8193
chromeos-init-0.0.25-r3708:
chromeos-init-0.0.25-r3708: Allocating group tables: 0/2 done
chromeos-init-0.0.25-r3708: Warning: could not read block 0: Input/output error
chromeos-init-0.0.25-r3708: Warning: could not erase sector 0: Input/output error
chromeos-init-0.0.25-r3708: Writing inode tables: 0/2 done
chromeos-init-0.0.25-r3708: Writing superblocks and filesystem accounting information: 0/2
chromeos-init-0.0.25-r3708: Warning, had trouble writing out superblocks.
chromeos-init-0.0.25-r3708: ../../../../../../../tmp/portage/chromeos-base/chromeos-init-0.0.25-r3708/work/chromeos-init-0.0.25/init/tests/clobber_state_test.cc:199: Failure
chromeos-init-0.0.25-r3708: Value of: MakeFilesystem("ext2", 2)
chromeos-init-0.0.25-r3708: Actual: false
chromeos-init-0.0.25-r3708: Expected: true
How can I go about adding retries here? Or would I be better served by just removing this test and doing everything on-device?
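A generic retry wrapper might look roughly like the sketch below. RetryWithDelay is a hypothetical helper, not an existing platform2 or googletest API, and the attempt count and delay are arbitrary values picked for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` until it returns true or `max_attempts`
// calls have been made, sleeping `delay` between consecutive attempts.
// Returns whether any attempt succeeded.
bool RetryWithDelay(const std::function<bool()>& op, int max_attempts,
                    std::chrono::milliseconds delay) {
  assert(max_attempts > 0);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (op())
      return true;
    // Do not sleep after the final failed attempt.
    if (attempt < max_attempts)
      std::this_thread::sleep_for(delay);
  }
  return false;
}
```

In the test this could wrap the failing call, for example RetryWithDelay([&] { return MakeFilesystem("ext2", 2); }, 3, std::chrono::milliseconds(100)), assuming MakeFilesystem is callable the way the log output suggests and is safe to call again after a failed attempt. Retries only lower the odds of a flake; they do not address whatever is causing the I/O errors.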
,
Dec 7
It seems bound to fail occasionally even with retries, so if the cause can't be found, it's probably best to run it on-device instead (assuming that that works reliably).
,
Dec 10
,
Dec 10
Why would we assume that this would work reliably on device? I don't understand why that should be better than running on a builder.
,
Dec 10
If we don't understand the source of the flake, the test should be made informational so that it stops killing builds.
,
Dec 10
I put up a revert CL, but if making it informational would work too, maybe that's better. How do I set a test as informational? None of the pages under https://www.chromium.org/chromium-os/testing even mention it.
,
Dec 12
Dan, can you answer the question in #19? I don't remember the details.
,
Dec 12
I don't think there's such a thing as an informational unit test. As I understand it, if something that runs during src_test fails, then the package fails. Mike would know for sure. You can add a DISABLED_ prefix to the test to make it be skipped, I think: https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#temporarily-disabling-tests In general, running this test but ignoring its failures probably isn't something that we'd want to do, I think. It's unlikely to ever be fixed, so we'll probably just pay the maintenance and time cost to keep the code compiling without getting any benefits from it.
,
Dec 13
Reverted test.
Comment 1 by dgarr...@chromium.org
, Nov 15