
Issue 862136

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

Blocked on:
issue 851138




vgchange returning error during CleanupChroot step

Project Member Reported by jettrink@chromium.org, Jul 10

Issue description

A recent leon-paladin failed cleaning its chroot: https://luci-milo.appspot.com/buildbot/chromeos/leon-paladin/17735

The subsequent build was fine (probably because we restarted the build machine after fixing crbug.com/851120).
This seems similar to, but not the same as, crbug.com/843441.

--- Error from log ---

00:02:52: INFO: Cleaning up chroot at /b/c/cbuild/repository/chroot
00:02:52: INFO: Waiting for ts_mon flushing process to finish...
00:03:06: INFO: Finished waiting for ts_mon process.
cbuildbot_launch: Unhandled exception:
Traceback (most recent call last):
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch", line 169, in <module>
    DoMain()
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 514, in main
    return _main(argv)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 506, in _main
    CleanupChroot(buildroot)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 72, in wrapped_functor
    result = functor(*args, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/scripts/cbuildbot_launch.py", line 415, in CleanupChroot
    cros_sdk_lib.CleanupChrootMount(chroot_dir, delete=False)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper
    return func(*args, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_sdk_lib.py", line 406, in CleanupChrootMount
    cros_build_lib.SudoRunCommand(cmd, capture_output=True, print_cmd=False)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_build_lib.py", line 325, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/rr/tmpwUXp4A/w/chromite/lib/cros_build_lib.py", line 669, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 5; command: sudo -- vgchange -an cros_b+c+cbuild+repository+chroot_000
  /dev/loop1p1: read failed after 0 of 4096 at 2147418112: Input/output error
  /dev/loop1p1: read failed after 0 of 4096 at 2147475456: Input/output error
  /dev/loop1p1: read failed after 0 of 4096 at 0: Input/output error
....
  /dev/loop2p12: read failed after 0 of 4096 at 0: Input/output error
  /dev/loop2p12: read failed after 0 of 4096 at 4096: Input/output error
  Can't deactivate volume group "cros_b+c+cbuild+repository+chroot_000" with 2 open logical volume(s)
cmd=['sudo', '--', 'vgchange', '-an', 'cros_b+c+cbuild+repository+chroot_000']
 
Cc: seanpaul@chromium.org vapier@chromium.org
Cc: dgarr...@chromium.org
Components: -Infra>Client>ChromeOS>Build Infra>Client>ChromeOS>CI
Labels: -Pri-2 Pri-0
Owner: gmeinke@chromium.org
Status: Assigned (was: Untriaged)
We're rolling out a kernel upgrade in issue 860017. Need to check if this is only happening on newly upgraded machines.

I don't know whether those machines were on the new kernel during that failure.
From the kernel serial console log of auron_yuna-paladin's machine (de-interleaved for readability):

Jul 10 11:05:29 cros-beefy10-c2 ModemManager[15612]: <info>  [1531245929.370236] ModemManager is shut down
[ 6450.371874] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.378500] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.384985] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.392916] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.400658] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.407162] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.416417] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.424197] blk_update_request: I/O error, dev loop1, sector 8609664
[ 6450.430688] Buffer I/O error on dev loop1p1, logical block 524272, async page read
[ 6450.443647] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.450630] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.458150] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.467435] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.474943] blk_update_request: I/O error, dev loop1, sector 16320
[ 6450.482559] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.490452] Buffer I/O error on dev loop1p11, logical block 2032, async page read
[ 6450.501305] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.509789] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.518750] Buffer I/O error on dev loop1p12, logical block 8176, async page read
[ 6450.529647] Buffer I/O error on dev loop1p2, logical block 4080, async page read

So this isn't an incompatibility between the LVM tools and the kernel: the new kernel itself has a bug.

I'm starting to think that the Linux kernel's handling of loopback mounts is just too flaky for chroot.img. It's conceptually great, but still not robust after many attempts to improve it.

Working on getting the kernel rolled back: https://bugs.chromium.org/p/chromium/issues/detail?id=860017
Labels: -Pri-0 Pri-2
From the metadata.json logs (http://shortn/_V7IYgQAdoU), it appears that the kernel version is 4.4.139-r1593 on many (possibly all) of the failing machines. Non-failing machines do not appear to be running that kernel version.
That's a new error that appears to be the result of not unmounting filesystems before calling vgchange -an.
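For reference, the ordering that has to hold is roughly this (a sketch only, using the chromite helpers visible in the tracebacks above, not the real CleanupChrootMount body):

# Sketch only, not the real cros_sdk_lib.CleanupChrootMount: the point is the
# ordering -- unmount everything under the chroot before deactivating the VG,
# otherwise vgchange exits 5 with "Can't deactivate volume group ... open
# logical volume(s)".
from chromite.lib import cros_build_lib
from chromite.lib import osutils

def cleanup_order_sketch(chroot_dir, vg_name):
  # Drop every mount under the chroot so no logical volume stays open.
  osutils.UmountTree(chroot_dir)
  # Only after that is it safe to deactivate the volume group.
  cros_build_lib.SudoRunCommand(
      ['vgchange', '-an', vg_name], capture_output=True, print_cmd=False)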
Comparing the logs in Comment 1 and Comment 11: the stack trace and error code are the same. The only difference I see is the "/dev/loop..." errors between the stack trace and the final error, "Can't deactivate volume group "cros_b+c+cbuild+repository+chroot_000" with 2 open logical volume(s)".
The latest nyan-full-compile-paladin passed without the CleanupChroot failure; it was running on cros-beefy8-c2, which has kernel 4.4.0-130-generic.

https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941282998211258032
We can open a new bug for this if needed, but tricky-paladin failed in the CleanupChroot step with a different (though related) error:

umount: /b/c/cbuild/repository/chroot: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
06:37:51: INFO: Waiting for ts_mon flushing process to finish...
06:38:43: INFO: Finished waiting for ts_mon process.
cbuildbot_launch: Unhandled exception:
Traceback (most recent call last):
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch", line 169, in <module>
    DoMain()
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 488, in main
    return _main(argv)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 480, in _main
    CleanupChroot(buildroot)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 71, in wrapped_functor
    result = functor(*args, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/scripts/cbuildbot_launch.py", line 391, in CleanupChroot
    cros_sdk_lib.CleanupChrootMount(chroot_dir, delete=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/timeout_util.py", line 191, in TimeoutWrapper
    return func(*args, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_sdk_lib.py", line 376, in CleanupChrootMount
    osutils.UmountTree(chroot)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_build_lib.py", line 321, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/rr/tmpJdNbTQ/w/chromite/lib/cros_build_lib.py", line 665, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: sudo -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']

from https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8941115042433502560
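
If someone catches this live on a bot, the umount error itself points at fuser(1)/lsof(8); a quick diagnostic along these lines (plain sketch, with the path copied from the failing command above) would show what is still holding the chroot open:

# Diagnostic sketch: list the processes still holding the chroot mount open,
# then retry the unmount. fuser exits non-zero when nothing matches, so its
# return code is ignored here.
import subprocess

CHROOT = '/b/c/cbuild/repository/chroot'
# -v: verbose (show user/pid/command), -m: everything on the same mount.
subprocess.call(['sudo', 'fuser', '-vm', CHROOT])
# Retry the same unmount the cleanup step runs.
subprocess.call(['sudo', 'umount', '-d', CHROOT])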
Cc: rrangel@chromium.org pmalani@chromium.org
Copying current sheriffs
Owner: bmgordon@chromium.org
Ben, when you get some time, can you investigate this? It's blocking the kernel upgrade to 4.4. It looks like being more aggressive with the vgchange commands might work around the issue.
BTW, I also just investigated on a bot and confirmed that there's no newer version of the lvm2 package available.
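
One reading of "more aggressive with the vgchange commands" would be a simple retry loop around the existing call, along these lines (sketch only; the helper name, attempt count, and delay are arbitrary):

# Sketch: retry the VG deactivation a few times in case the open-LV count
# drops to zero once in-flight I/O settles.
import time
from chromite.lib import cros_build_lib

def deactivate_vg_with_retries(vg_name, attempts=3, delay=5):
  for attempt in range(attempts):
    try:
      cros_build_lib.SudoRunCommand(
          ['vgchange', '-an', vg_name], capture_output=True, print_cmd=False)
      return
    except cros_build_lib.RunCommandError:
      if attempt == attempts - 1:
        raise
      time.sleep(delay)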
Owner: mikenichols@chromium.org
Reassigning to Mike, who will be working on our next big image upgrade. He is going to look at upgrading to a newer Ubuntu LTS (probably Xenial), which would likely solve this problem.
Looking at the errors, the failing loopback devices aren't the ones associated with chroot.img.  I suspect there's something going on where LVM has cached the loopback devices created during BuildImage and is trying to re-scan them.  Adding a call to pvscan before the vgchange might clear up the cache.  A similar bug was also the reason why _RescanDeviceLvmMetadata is called on chroot.img's loopback after detaching it.
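
Concretely, the tweak I have in mind looks roughly like this (sketch only, not a reviewed change; the helper name is made up, and the vgchange call is the one from the traceback):

# Sketch of inserting a pvscan ahead of the existing vgchange call so LVM
# re-reads the devices instead of trusting cached state about stale
# loopback devices. Helper name is illustrative only.
from chromite.lib import cros_build_lib

def deactivate_vg_sketch(vg_name):
  # Force LVM to rescan physical volumes and refresh its cache.
  cros_build_lib.SudoRunCommand(
      ['pvscan'], capture_output=True, print_cmd=False)
  cros_build_lib.SudoRunCommand(
      ['vgchange', '-an', vg_name], capture_output=True, print_cmd=False)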

Another short-term possibility: if the config calls for the machine to be rebooted after the run, we can skip this stage entirely. The cleanup exists only so that the machine can run repeated builds without leaking loopback devices. A sketch of what I mean follows below.
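
The reboot flag name below is a placeholder for whatever the builder config actually exposes:

# Sketch of skipping chroot cleanup when the machine reboots after the run
# anyway (the reboot tears the loopback devices down for us).
# 'options.reboot' is a placeholder flag name; CleanupChroot(buildroot) is
# the call that appears in the tracebacks above.
def maybe_cleanup_chroot(buildroot, options):
  if getattr(options, 'reboot', False):
    return
  CleanupChroot(buildroot)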
Blockedon: 851138
We're working on upgrading to Xenial, which may resolve this issue thanks to the newer kernel.
Cc: jclinton@google.com
Issue 906123 has been merged into this issue.
