vgchange returning error during CleanupChroot step
Issue description

A recent leon-paladin build failed cleaning its chroot: https://luci-milo.appspot.com/buildbot/chromeos/leon-paladin/17735
The subsequent build was fine (probably because we restarted the build machine after fixing crbug.com/851120). This seems similar to, but not the same as, crbug.com/843441.

--- Error from log ---
00:02:52: INFO: Cleaning up chroot at /b/c/cbuild/repository/chroot
00:02:52: INFO: Waiting for ts_mon flushing process to finish...
00:03:06: INFO: Finished waiting for ts_mon process.
cbuildbot_launch: Unhandled exception:
Traceback (most recent call last):
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch", line 169, in <module>
    DoMain()
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 514, in main
    return _main(argv)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 506, in _main
    CleanupChroot(buildroot)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 72, in wrapped_functor
    result = functor(*args, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 415, in CleanupChroot
    cros_sdk_lib.CleanupChrootMount(chroot_dir, delete=False)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper
    return func(*args, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_sdk_lib.py", line 406, in CleanupChrootMount
    cros_build_lib.SudoRunCommand(cmd, capture_output=True, print_cmd=False)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_build_lib.py", line 325, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_build_lib.py", line 669, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 5; command: sudo -- vgchange -an cros_b+c+cbuild+repository+chroot_000
/dev/loop1p1: read failed after 0 of 4096 at 2147418112: Input/output error
/dev/loop1p1: read failed after 0 of 4096 at 2147475456: Input/output error
/dev/loop1p1: read failed after 0 of 4096 at 0: Input/output error
....
/dev/loop2p12: read failed after 0 of 4096 at 0: Input/output error
/dev/loop2p12: read failed after 0 of 4096 at 4096: Input/output error
Can't deactivate volume group "cros_b+c+cbuild+repository+chroot_000" with 2 open logical volume(s)
cmd=['sudo', '--', 'vgchange', '-an', 'cros_b+c+cbuild+repository+chroot_000']
Jul 10
Jul 10
We're rolling out a kernel upgrade in issue 860017. Need to check if this is only happening on newly upgraded machines.
Jul 10
Many new paladins are hitting the same failure. Here are just a few examples:
auron_yuna-paladin: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941378105670995744
betty-arc64-paladin: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941378104802371376
cyan-paladin: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941378098389737888
eve-paladin
......
all from https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941379007420212128
These errors don't seem to fail the overall build, for some reason.
Jul 10
I don't know if those machines are/were on the new kernel during that failure.
Jul 10
From auron_yuna-paladin's machine's kernel serial console log (the syslog framing interleaved with the kernel messages has been removed for readability):
Jul 10 11:05:29 cros-beefy10-c2 ModemManager[15612]: <info> [1531245929.370236] ModemManager is shut down
[ 6450.371874] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.378500] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.384985] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.392916] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.400658] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.407162] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.416417] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.424197] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.430688] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.443647] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.450630] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.458150] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.467435] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.474943] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.482559] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.490452] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.501305] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.509789] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.518750] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.529647] Buffer I/O error on dev loop1p2, logical block 4080, async page read

So this isn't a case of incompatibility between the LVM tools and the kernel version: the new kernel itself has a bug.
Jul 10
I'm starting to think that the Linux kernel's handling of loopback mounting is just too flaky for chroot.img. It's conceptually great, but still not robust after many attempts to improve it.
Jul 10
Working on getting the kernel rolled back: https://bugs.chromium.org/p/chromium/issues/detail?id=860017
Jul 10
Jul 10
From the metadata.json logs (http://shortn/_V7IYgQAdoU), it appears that the kernel version is 4.4.139-r1593 on many (possibly all) of the failing machines. Non-failing machines do not appear to be running that version of the kernel.
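For triage, the observation above can be sketched as a small helper that flags the implicated release string. This is a hypothetical helper (the name `is_suspect_kernel` is made up); it only matches the exact 4.4.139 version seen failing, it is not a general compatibility check:

```python
def is_suspect_kernel(uname_release):
    """Return True if the release string matches the kernel implicated above.

    `uname_release` is the string reported by `uname -r` (or recorded in
    metadata.json), e.g. "4.4.139-r1593".
    """
    return uname_release.startswith('4.4.139')
```

For example, `is_suspect_kernel('4.4.139-r1593')` is True, while the builder kernel mentioned later, '4.4.0-130-generic', is not flagged.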
Jul 11
I just saw another recent failure for this on nyan-full-compile-paladin:
https://luci-logdog.appspot.com/v/?s=chromeos/bb/chromeos/nyan-full-compile-paladin/13111/+/recipes/steps/CleanupChroot/0/stdout
(https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941295403691317088)
It does *not* appear to be using the 4.4 kernel: https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/nyan-full-compile-paladin/R69-10866.0.0-rc3
Jul 11
That's a new error that appears to be the result of not unmounting filesystems before calling vgchange -an.
Jul 11
Comparing the logs in comment 1 and comment 11, the stack trace and error code are the same. The only difference I see is the "/dev/loop..." read errors that appear between the stack trace and the final error: "Can't deactivate volume group "cros_b+c+cbuild+repository+chroot_000" with 2 open logical volume(s)".
Jul 11
The latest nyan-full-compile-paladin build passed without the CleanupChroot failure; it was running on cros-beefy8-c2, which has kernel 4.4.0-130-generic. https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941282998211258032
Jul 13
We can create a new bug for this if we need to, but tricky-paladin failed in the CleanupChroot step with a different (but related) error:
umount: /b/c/cbuild/repository/chroot: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
06:37:51: INFO: Waiting for ts_mon flushing process to finish...
06:38:43: INFO: Finished waiting for ts_mon process.
cbuildbot_launch: Unhandled exception:
Traceback (most recent call last):
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch", line 169, in <module>
    DoMain()
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 488, in main
    return _main(argv)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 480, in _main
    CleanupChroot(buildroot)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 71, in wrapped_functor
    result = functor(*args, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 391, in CleanupChroot
    cros_sdk_lib.CleanupChrootMount(chroot_dir, delete=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper
    return func(*args, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_sdk_lib.py", line 376, in CleanupChrootMount
    osutils.UmountTree(chroot)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_build_lib.py", line 321, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_build_lib.py", line 665, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: sudo -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
from https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941115042433502560
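One workaround for a transient "device is busy" umount failure like the one above would be to retry a few times and then fall back to a lazy unmount. A minimal sketch (this is not the chromite implementation; `run` is an injected command runner, e.g. a wrapper around subprocess, so the logic can be exercised without root):

```python
import time

def umount_with_fallback(run, mount_point, attempts=3, delay=1.0):
    """Try `umount -d`; after `attempts` failures, fall back to `umount -l`.

    `run` is any callable taking an argv list and returning the exit code.
    Returns the final exit code.
    """
    for i in range(attempts):
        if run(['umount', '-d', mount_point]) == 0:
            return 0
        if i < attempts - 1:
            time.sleep(delay)
    # Lazy unmount (-l) detaches the mount point immediately and finishes
    # cleanup when the last user exits -- it hides, rather than fixes,
    # whatever process is keeping the device busy.
    return run(['umount', '-l', mount_point])
```

Note the trade-off: a lazy unmount would make this step stop failing, but the underlying loop device may stay leaked until the busy process goes away.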
Jul 16
Copying current sheriffs
Oct 17
Ben, when you get some time, can you investigate this? It's blocking the kernel upgrade to 4.4. It looks like being more aggressive with the vgchange commands might work around the issue.
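"More aggressive" could simply mean retrying `vgchange -an`, since the "open logical volume(s)" count can drop to zero once other teardown (unmounts, loop detach) finishes. A hedged sketch of that idea, with the command runner injected so it can be tested without root or LVM (the helper name is made up):

```python
import time

def deactivate_vg(run, vg_name, attempts=5, delay=2.0):
    """Retry `vgchange -an <vg>` until it succeeds or attempts run out.

    `run` takes an argv list and returns the exit code. vgchange returned 5
    in the failures above ("Can't deactivate volume group ... with 2 open
    logical volume(s)"), which may clear up once lingering users go away.
    Returns True if the volume group was deactivated.
    """
    for i in range(attempts):
        if run(['vgchange', '-an', vg_name]) == 0:
            return True
        if i < attempts - 1:
            time.sleep(delay)
    return False
```

This would paper over a race during cleanup, but it would not help if the kernel bug leaves the loop device returning I/O errors indefinitely.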
Oct 17
BTW, I also just investigated on a bot and confirmed that there's no newer version of the lvm2 package available.
Oct 17
Reassigning to Mike, who will be working on our next big image upgrade. He is going to look at upgrading to a newer Ubuntu LTS (probably Xenial), which would likely solve this problem.
Oct 17
Looking at the errors, the failing loopback devices aren't the ones associated with chroot.img. I suspect LVM has cached the loopback devices created during BuildImage and is trying to re-scan them. Adding a call to pvscan before the vgchange might clear up the cache. A similar bug was the reason why _RescanDeviceLvmMetadata is called on chroot.img's loopback device after detaching it.

Another short-term possibility: if the config calls for the machine to be rebooted after the run, we can just skip this stage. The cleanup is only there so that the machine can run repeated builds without leaking loopback devices.
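The suggested ordering (rescan physical volumes first, then deactivate) might look like this sketch. The function name is hypothetical, and this is only the command sequence, not the full cleanup (the real CleanupChrootMount also unmounts the tree and detaches loop devices):

```python
def cleanup_vg_commands(vg_name):
    """Return the command sequence suggested above: pvscan so LVM drops
    stale cached loopback devices, then vgchange to deactivate the VG.
    """
    return [
        ['pvscan'],                    # refresh LVM's view of physical volumes
        ['vgchange', '-an', vg_name],  # then deactivate the volume group
    ]
```

Each command would be run under sudo in practice; the point is only that the rescan happens before the deactivate.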
Nov 6
We're working on upgrading to Xenial which may resolve this issue due to the newer kernel.
Nov 8
Another instance of the failure in comment #13: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8930424050992997072
Nov 16
Comment 1 by jettrink@chromium.org, Jul 10