Issue 818874

Starred by 6 users

Issue metadata

Status: Fixed
Merged: issue 855151
Owner:
Closed: Jul 5
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Release builders failing CleanUp stage, cannot unmount chroot

Project Member Reported by bhthompson@google.com, Mar 5 2018

Issue description

https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/1996

The first build that failed in this fashion was https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/1974

...
umount: /b/c/cbuild/repository/chroot: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
18:10:31: ERROR: <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/failures_lib.py", line 229, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/build_stages.py", line 190, in PerformStage
    cros_build_lib.CleanupChrootMount(buildroot=self._build_root)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 1730, in CleanupChrootMount
    osutils.UmountTree(chroot)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 333, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 658, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']

18:10:31: INFO: Translating result <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/failures_lib.py", line 229, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/build_stages.py", line 190, in PerformStage
    cros_build_lib.CleanupChrootMount(buildroot=self._build_root)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 1730, in CleanupChrootMount
    osutils.UmountTree(chroot)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 333, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 658, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
 to fail.
...

Looks like the chroot mount is borked somehow?

I am wondering if just rebooting the builder would resolve it... or maybe even a clobber.

Passing to deputy for thoughts.
 
Clobber first, ask later ;)

I've kicked a clobber build. Will dig if that doesn't help.
Labels: Hotlist-Deputy
Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
https://luci-milo.appspot.com/buildbot/chromeos/kevin-release/2028
Cc: bmgordon@chromium.org
A reboot should fix it, but is there anything in the logs on those builders to hint at what is pinning the chroot?
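
A minimal diagnostic sketch (not chromite code), assuming the chroot path from the log above and that fuser/lsof are installed on the builder; run it as root on the affected bot to see what is still holding the mount:

# Minimal diagnostic sketch (not chromite code): list processes that keep a
# mount point busy, using fuser(1)/lsof(8) as suggested by the umount error above.
# Assumes fuser and lsof are installed; run as root to see other users' processes.
import subprocess
import sys

MOUNT_POINT = '/b/c/cbuild/repository/chroot'  # path taken from the log above

def report_busy_processes(mount_point):
    """Print whatever fuser/lsof know about processes using mount_point."""
    for cmd in (['fuser', '-vm', mount_point],
                ['lsof', '+D', mount_point]):
        print('$ %s' % ' '.join(cmd))
        # Both tools return non-zero when nothing is found, so don't abort
        # the report; just show whatever output came back and keep going.
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr or '(no output)')

if __name__ == '__main__':
    report_busy_processes(sys.argv[1] if len(sys.argv) > 1 else MOUNT_POINT)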
Summary: Relase builders failing CleanUp stage, cannot unmount chroot (was: Reef builder is failing CleanUp stage)
Summary: Release builders failing CleanUp stage, cannot unmount chroot (was: Relase builders failing CleanUp stage, cannot unmount chroot)
Cc: jbudorick@chromium.org
jbudorick, can you take a quick look at one of the builders? (e.g., cros-beefy39-c2 for kevin-release).  The docs say only chrome troopers can SSH.
Cc: -jbudorick@chromium.org
Labels: Infra-Troopers
When engaging w/ troopers, please use the Infra-Troopers label rather than CCing the current trooper directly; doing so helps ensure that troopers see it across shift boundaries.

I can look at this in a bit, but SSH isn't limited to chrome-troopers. Instructions are here: https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md#overview
jbudorick, the "Connecting to cros-beefy*-c2 machines" section says:

If you are not a trooper, then there is not yet a standard method to connect to ChromeOS GCE bots yet. 

Comment 10 Deleted

Issue 821269 has been merged into this issue.
Issue 821281 has been merged into this issue.
I'm going to restart kevin, nautilus, and scarlet builders to see if that fixes it.

I want to keep peach_pi up so a trooper can go look at it; I'd really like some diagnostics, and peach_pi has drawn the fewest complaints.
Cc: dgarr...@chromium.org
+dgarrett FYI: I granted ayatane SSH access to chromeos-bot.
Owner: bmgordon@chromium.org
bmgordon should probably look, since the chroot loopback mounting is part of his project.
+hinoka: I thought the deputy rotation already had access?
Re #18 it might just be me.  I assumed it was an ACL issue since the documentation says only troopers have access, and my SSH connections get rejected by the remote host.
I tried to ssh to cros-beefy54-c2 this morning to look at the peach_pi failures, but I'm not able to get in.  When I click the ssh button in pantheon, I get "Transferring SSH keys to the VM." and then "Establishing connection to SSH server..." and then a generic "Failed to connect" message.  Is that still the right way to log in remotely?
It's how I do it. I do have an "SSH for Google Cloud Platform" extension installed. It seems weird that you aren't being prompted, if that's a hard requirement.

There is also a command line way to do it, and the web view will show you the exact command line to use (somewhere).
The command-line version gives a little more detail:

$ gcloud compute --project "chromeos-bot" ssh --zone "us-east1-d" "cros-beefy54-c2"
ssh: connect to host 104.196.64.173 port 22: Connection refused
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Is ssh not running, or am I maybe blocked by a firewall?
It's possible. I'd try a traceroute to the IP and see how far you get, but if it's not a local firewall, I'd file a bug/ticket with the gcloud team.

This is an external service; this kind of connection should "just work".
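
Before filing that ticket, a quick way to tell "connection refused" apart from "packets silently dropped" is a plain TCP probe. A small sketch (illustrative only), using the IP from the gcloud output in the previous comment:

# Quick connectivity probe (illustrative only): distinguish a refused port 22
# from a filtered one before escalating to the gcloud/firewall folks.
import socket

HOST = '104.196.64.173'  # IP from the gcloud output above
PORT = 22

def probe(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 'open'
    except ConnectionRefusedError:
        # An RST came back: the host is reachable but nothing is listening
        # (or a firewall is actively rejecting), matching the error above.
        return 'refused'
    except socket.timeout:
        # No answer at all: more consistent with packets being dropped.
        return 'filtered/no response'
    except OSError as e:
        return 'error: %s' % e

print('%s:%d -> %s' % (HOST, PORT, probe(HOST, PORT)))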

Comment 24 by ihf@chromium.org, Mar 19 2018

Cc: ihf@chromium.org
Components: Infra>Client>ChromeOS
Labels: M-67
I still haven't been able to ssh into beefy54, but the serial console looks like it's stuck in the middle of a reboot.

A bunch of I/O-related stuff seems to hang with stack traces like this:

Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654238] jbd2/dm-4-8     D ffff88336f2f3b00     0 10257      2 0x00000000
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654244]  ffff88313e401a98 0000000000000046 ffff883319e99800 0000000000013b00
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654248]  ffff88313e401fd8 0000000000013b00 ffff883319e99800 ffff88336f2f4398
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654251]  ffff88343ffa52e8 0000000000000002 ffffffff81155bd0 ffff88313e401b10
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654254] Call Trace:
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654262]  [<ffffffff81155bd0>] ? wait_on_page_read+0x60/0x60
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654266]  [<ffffffff8173b5bd>] io_schedule+0x9d/0x130
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654269]  [<ffffffff81155bde>] sleep_on_page+0xe/0x20
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654271]  [<ffffffff8173ba34>] __wait_on_bit+0x64/0x90
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654274]  [<ffffffff8115599f>] wait_on_page_bit+0x7f/0x90
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654277]  [<ffffffff810b0c20>] ? autoremove_wake_function+0x40/0x40
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654282]  [<ffffffff811633b1>] ? pagevec_lookup_tag+0x21/0x30
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654285]  [<ffffffff81160faa>] write_cache_pages+0x31a/0x4c0
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654290]  [<ffffffff8109db09>] ? ttwu_do_wakeup+0x19/0x100
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654293]  [<ffffffff81160500>] ? global_dirtyable_memory+0x50/0x50
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654296]  [<ffffffff81161190>] generic_writepages+0x40/0x60
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654302]  [<ffffffff81294205>] jbd2_journal_commit_transaction+0x505/0x1ba0
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654307]  [<ffffffff8107a87f>] ? try_to_del_timer_sync+0x4f/0x70
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654309]  [<ffffffff81299afd>] kjournald2+0xbd/0x240
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654312]  [<ffffffff810b0be0>] ? prepare_to_wait_event+0x100/0x100
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654314]  [<ffffffff81299a40>] ? commit_timeout+0x10/0x10
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654318]  [<ffffffff81090bcb>] kthread+0xcb/0xf0
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654320]  [<ffffffff81090b00>] ? kthread_create_on_node+0x1c0/0x1c0
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654325]  [<ffffffff81747e4e>] ret_from_fork+0x6e/0xa0
Mar  9 05:34:20 cros-beefy54-c2 kernel: [30368.654328]  [<ffffffff81090b00>] ? kthread_create_on_node+0x1c0/0x1c0

That timestamp corresponds to a few minutes after the start of the failed stage on the first failed build.  Eventually, the machine tried to reboot, but it looks like it got stuck there.  That explains why I can't ssh in, but I'm surprised any subsequent builds managed to get started at all.

I'm going to reboot that machine and wait to see if the same errors come back.  If not, it might be a kernel bug.  If they do, it might be a real storage error somewhere.
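
To compare builders more quickly, a rough log-scanning sketch (not chromite code) that pulls blocked-task names out of a saved syslog/serial-console dump, assuming the "kernel: [...] <task> D ..." format shown in the trace above:

# Rough log-scanning sketch (not chromite code): list kernel tasks logged in
# state 'D' (uninterruptible sleep), the signature of the stuck I/O above.
import re
import sys

TASK_RE = re.compile(r'kernel: \[[\d. ]+\] (\S+)\s+D\s')

def hung_tasks(path):
    tasks = []
    with open(path, errors='replace') as log:
        for line in log:
            match = TASK_RE.search(line)
            if match:
                tasks.append(match.group(1))
    return tasks

if __name__ == '__main__':
    for name in hung_tasks(sys.argv[1]):
        print(name)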

Comment 26 by ihf@chromium.org, Mar 20 2018

Looks like a lot of builds came back.
The initial failure in PaygenBuildCanary happened on several other builders this morning:

https://luci-milo.appspot.com/buildbot/chromeos/gandof-release/2028
https://luci-milo.appspot.com/buildbot/chromeos/sentry-release/2039
https://luci-milo.appspot.com/buildbot/chromeos/celes-release/2039
https://luci-milo.appspot.com/buildbot/chromeos/clapper-release/2460
https://luci-milo.appspot.com/buildbot/chromeos/veyron_jerry-release/2031

All of them show I/O to the chroot loopback image in the stack traces.  That might be a sign that there's something going on with the loopback, or it might just be an error that happens to show on the loopback because that's where all the I/O is going.  I'll try to debug further.



Issue 823873 has been merged into this issue.
Ben, looks like the following need resets this AM:
mccloud-release
nyan_kitty-release
reef-release
scarlet-release

And the following are failing on the CleanupChroot:
kahlee-release
winky-release
wizpig-release

The following appear to have been restarted recently:
falco_li-release

All of the builders listed in #29 had the same underlying stuck I/O.  A couple of them were stalled in different processes, but most of them locked up while running the PaygenBuildCanary stage.  Except for reef-release, most of them had actually failed about a week ago and had been sitting since then without completing any builds.
I wonder if we are tickling a kernel bug?

I've been told by people on our kernel team that mount/umount are known to have a number of bugs which can happen under heavy system load.

This morning's list of needed restarts:
clapper-release
guado-release
veyron_rialto-release

Note that the bug that was dup'ed into this one (crbug/823873) included the following (listing them here so this bug is comprehensive):
gandof-release
sentry-release
veyron_jerry-release
celes-release
clapper-release

So clapper-release re-appeared within a few days...


Cc: djkurtz@chromium.org sjg@chromium.org
 Issue 824801  has been merged into this issue.
Not offline but failing CleanupChroot:
rainier-release
veyron_jerry-release
veyron_minnie-release
wizpig-release

The builds that failed CleanupChroot self-recovered from reboots.  I've opened b/76204932 with the GCE team for help tracking this down.
Cc: gwendal@chromium.org
+gwendal@ who has helped with loopback and umount kernel issues before.
Cc: josa...@chromium.org
Ben, can we escalate?  This is having a dramatic impact: boards are missing during DEV RC creation / handoff....

In case it's helpful: I'm not seeing similar failures on the beta (M66) or stable (M65) builders.

Offline; needs resetting this AM:
auron_paine-release
clapper-release
falco-release
falco_li-release
glimmer-release
gnawty-release
guado-release
hana-release
leon-release
lulu-release
nautilus-release
oak-release
veyron_rialto-release


Failed CleanupChroot (I'll add to the bug per #35):
orco-release
gru-release
daisy_skate-release
asuka-release
Should we disable chroot.img for now?
We could try it and see what happens, but I don't think that's actually the problem.  These hangs started about 3 weeks after we turned on chroot.img on the builders, and they're only hitting the canary builders.  It seems like we should be seeing this everywhere if it's being caused by the chroot.
I'm thinking that an updated kernel was pushed to the builders that is misbehaving with loopback mounts under heavy load. Not your fault, but turning chroot.img on/off is the easiest lever to test with.
Cc: akes...@chromium.org shuqianz@chromium.org
+shuqianz, +akeshet
I've restarted all of today's failed builders.  Here's a query that I've been running to make sure there aren't more.  It corresponds pretty well with Kevin's list:

SELECT builder_name,
       build.status AS build_status,
       min(build.start_time) AS first_fail,
       max(build.start_time) AS last_fail,
       min(build_number) AS first_build,
       max(build_number) AS last_build,
       stage.name AS stage_name,
       stage.status AS stage_status,
       substring_index(bot_hostname, '.', 1) AS builder
FROM buildTable build
JOIN buildStageTable stage ON (stage.build_id=build.id)
WHERE build_config LIKE '%-release'
  AND date(build.start_time) >= date_sub(curdate(), interval 3 DAY)
  AND date(build.start_time) < date_add(curdate(), interval 1 DAY)
  AND build.status <> 'pass'
  AND ((stage.name='CleanUp'
        AND stage.status NOT IN ('pass',
                                 'skipped')
        AND build.start_time >= curdate())
       OR (build.status='inflight'
           AND build.finish_time < build.start_time
           AND build.deadline < now()
           AND stage.name='PaygenBuildCanary')
       AND stage.status <> 'pass')
GROUP BY builder_name,
         build.status,
         stage.name,
         stage.status,
         builder
ORDER BY builder_name,
         first_fail;
crrev.com/c/981435 will turn off the chroot loopbacks to see if that helps.  I'll submit as soon as trybots finish.
Today's list; thanks Ben!

falco_li-release
gru-release
novato-release
veyron_rialto-release
I restarted falco_li, gru, and veyron_rialto.  novato isn't running on a GCE instance, so I don't have access to look at the console output or reboot it.  

shuqianz@, could you take a look at build197-m2 and see what's going on?
shuqianz@ I can't ssh into that builder. That means you need to file a bug with the Golo team. The deputy docs say how to do that, and they are usually really responsive.
Adding daisy_skate-release, leon-release, novato-release
Project Member

Comment 48 by bugdroid1@chromium.org, Mar 28 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9912024a6e9017f907010db9edc738940a3bc23e

commit 9912024a6e9017f907010db9edc738940a3bc23e
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Wed Mar 28 03:35:11 2018

cbuildbot: Disable image-backed chroot

We've been seeing hangs with kernel crashes on release builders for a
couple of weeks.  The chroot loopback isn't the root cause, but
disabling it might avoid tickling whatever kernel bug we're seeing.

BUG= chromium:818874 
TEST=cros tryjob --local success-build

Change-Id: Ifb9ff78308636e60be4f8b2f426ddc6de764dd2a
Reviewed-on: https://chromium-review.googlesource.com/981435
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/9912024a6e9017f907010db9edc738940a3bc23e/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/9912024a6e9017f907010db9edc738940a3bc23e/cbuildbot/commands.py

Project Member

Comment 49 by bugdroid1@chromium.org, Mar 28 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/7379d186db927f13e8c11991fe1eed9e52179149

commit 7379d186db927f13e8c11991fe1eed9e52179149
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Wed Mar 28 07:45:38 2018

Roll src/third_party/chromite/ c145d2120..9912024a6 (3 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/c145d2120c34..9912024a6e90

$ git log c145d2120..9912024a6 --date=short --no-merges --format='%ad %ae %s'
2018-03-26 bmgordon cbuildbot: Disable image-backed chroot
2018-03-26 gmeinke chromite: default unibuild RW version to RO
2018-03-27 dgarrett chromeos_config: Move master-pi-android-pfq to beefy.

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:818874 


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: Ife6e5f8c77be951bf604b4749835ffb939d77e93
Reviewed-on: https://chromium-review.googlesource.com/982574
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#546432}
[modify] https://crrev.com/7379d186db927f13e8c11991fe1eed9e52179149/DEPS

It looks like none of the release builders that started after crrev.com/c/981435 went in have triggered the I/O hangs.  We still need to get build197-m2 restarted to get novato going again.

I'll try to find a reproducible case for this so we can get the root cause fixed and re-enable loopbacks later.
shuqianz@ did you hear back from the golo team about build197-m2?
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS

Comment 54 by kerl@google.com, Apr 4 2018

Guado releases (http://uberchromegw/i/chromeos/builders/guado-release) have been failing for quite some time now. Most recently with:

NotEnoughDutsError: Not enough DUTs for board: guado, pool: bvt; required: 4, found: 3
  Will return from run_suite with status: INFRA_FAILURE

And earlier with:

provision     FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row3-rack8-host4: 0) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 1) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',)

Is this bug the underlying cause or are we hitting something new?
It's totally unrelated.
Project Member

Comment 56 by bugdroid1@chromium.org, Apr 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/20586d941c23ececd2327c67b7e13910a24479d9

commit 20586d941c23ececd2327c67b7e13910a24479d9
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Thu Apr 12 19:38:24 2018

cbuildbot: Make chroot.img usage optional

We still haven't tracked down the root cause for  crbug.com/818874 .
Since the bug only seems to affect release builders and they don't
currently benefit from chroot.img use anyway, let's re-enable image use
for other builders.

This makes the use of chroot.img controlled by a new chroot_use_image
config option.  The new options defaults to True by default, but is
turned off for release builders.

BUG= chromium:818874 ,chromium:730144
TEST=Local tryjobs for incremental and non-incremental builds

Change-Id: I058e3b51a16e058729b266165040ff2e0e7c9e75
Reviewed-on: https://chromium-review.googlesource.com/998981
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/config_dump.json
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/lib/config_lib.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/commands.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/chromeos_config.py
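
For context on what the option does, the authoritative change is the chromite commit above; the names below are hypothetical and only illustrate the default-plus-override pattern described in the commit message (a flag that defaults on but is switched off for release configs):

# Illustrative only: not chromite's actual config API.
DEFAULT_CONFIG = {
    'chroot_use_image': True,   # image-backed chroot on by default
}

RELEASE_OVERRIDES = {
    'chroot_use_image': False,  # release builders opt out while the hang is investigated
}

def build_config(is_release):
    """Merge the default config with release-specific overrides."""
    config = dict(DEFAULT_CONFIG)
    if is_release:
        config.update(RELEASE_OVERRIDES)
    return config

print(build_config(is_release=True))   # {'chroot_use_image': False}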

Project Member

Comment 57 by bugdroid1@chromium.org, Apr 12 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ef31230778974528e91d46984c591249fd3cfeaa

commit ef31230778974528e91d46984c591249fd3cfeaa
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Thu Apr 12 23:31:13 2018

Roll src/third_party/chromite/ deb0ebc07..20586d941 (2 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/deb0ebc0777b..20586d941c23

$ git log deb0ebc07..20586d941 --date=short --no-merges --format='%ad %ae %s'
2018-04-06 bmgordon cbuildbot: Make chroot.img usage optional
2018-04-12 pwang chromeos_config: swap bob/kevin in paladin

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:818874 ,chromium:730144


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: I89a6597c296c8171222b3ea8e4595d07e26f69ef
Reviewed-on: https://chromium-review.googlesource.com/1011246
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#550409}
[modify] https://crrev.com/ef31230778974528e91d46984c591249fd3cfeaa/DEPS

Project Member

Comment 58 by bugdroid1@chromium.org, Apr 17 2018

Labels: merge-merged-testbranch
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ef31230778974528e91d46984c591249fd3cfeaa

commit ef31230778974528e91d46984c591249fd3cfeaa
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Thu Apr 12 23:31:13 2018

Roll src/third_party/chromite/ deb0ebc07..20586d941 (2 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/deb0ebc0777b..20586d941c23

$ git log deb0ebc07..20586d941 --date=short --no-merges --format='%ad %ae %s'
2018-04-06 bmgordon cbuildbot: Make chroot.img usage optional
2018-04-12 pwang chromeos_config: swap bob/kevin in paladin

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:818874 ,chromium:730144


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: I89a6597c296c8171222b3ea8e4595d07e26f69ef
Reviewed-on: https://chromium-review.googlesource.com/1011246
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#550409}
[modify] https://crrev.com/ef31230778974528e91d46984c591249fd3cfeaa/DEPS

I hit another umount failure today. Is this the same issue? It's in another stage, from a CQ bot:

https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2885
The hana-paladin failure isn't the same cause.  I took a look at the builder and there aren't any I/O hangs in the logs.  This one just had something that didn't exit the chroot when it wanted to unmount.
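
As a sketch of how one might confirm that diagnosis (illustrative only, not chromite code; assumes the chroot path used throughout this bug and root access on the builder), walking /proc for processes whose cwd or root still points under the chroot:

# Illustrative /proc walk (not chromite code): find processes that never left
# the chroot, i.e. whose cwd or root still points under the chroot path.
# Needs root to read other users' /proc entries.
import os

CHROOT = '/b/c/cbuild/repository/chroot'  # path from the logs in this bug

def processes_inside(chroot):
    offenders = []
    for pid in filter(str.isdigit, os.listdir('/proc')):
        for link in ('cwd', 'root'):
            try:
                target = os.readlink('/proc/%s/%s' % (pid, link))
            except OSError:
                continue  # process exited or permission denied
            if target.startswith(chroot):
                offenders.append((int(pid), link, target))
    return offenders

if __name__ == '__main__':
    for pid, link, target in processes_inside(CHROOT):
        print(pid, link, target)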
Looks like this hit grunt-paladin a couple of days ago.  b/76204932 has a potential fix that we're waiting for GCE to roll out.
Cc: aaboagye@chromium.org hywu@chromium.org cra...@chromium.org xzhou@chromium.org
This happened again today with elm-paladin and fizz-paladin.

http://shortn/_wnZyg9luTM
http://shortn/_CQqkcMMIgp

06:26:38: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --snapshot-list in /b/c/cbuild/repository
06:26:38: WARNING: could not read /b/c/cbuild/repository/chroot/etc/cros_chroot_version
06:26:53: WARNING: Failed to activate VG on try 1.
06:26:54: NOTICE: Mounted existing image /b/c/cbuild/repository/chroot.img on chroot
06:26:54: NOTICE: /b/c/cbuild/repository/chroot.img is using 59 GiB more than needed.  Running fstrim.
06:28:31: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --snapshot-restore clean-chroot in /b/c/cbuild/repository
umount: /b/c/cbuild/repository/chroot: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
cros_sdk: Unhandled exception:
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/bin/cros_sdk", line 169, in <module>
    DoMain()
  File "/b/c/cbuild/repository/chromite/bin/cros_sdk", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/c/cbuild/repository/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/c/cbuild/repository/chromite/scripts/cros_sdk.py", line 968, in main
    osutils.UmountTree(options.chroot)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 282, in SudoRunCommand
    return RunCommand(cmd, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 665, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: umount -d /b/c/cbuild/repository/chroot
cmd=['umount', '-d', '/b/c/cbuild/repository/chroot']

Looks like #62 isn't the same cause.  pantheon doesn't show any sign of the kernel hangs on either of those two machines.

Unrelated update: The GCE rollout of b/76204932 is at about 80% now, so we should be able to try switching the release builders back to image-backed chroots soon.
Mergedinto: 855151
Status: Duplicate (was: Assigned)
https://b.corp.google.com/issues/76204932 graphs show 100% rollout for 2018-05-21. That doesn't seem to have solved the problem, so we are going to try some other options in crbug.com/855151.
Status: Fixed (was: Duplicate)
Actually, based on analysis in issue 860508, this specific issue (the kernel problem) was resolved, and we're now facing a new issue that has nothing to do with the chroot at all.
