Release builders failing CleanUp stage, cannot unmount chroot
Issue description:

https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/1996

The first build that failed in this fashion was https://uberchromegw.corp.google.com/i/chromeos/builders/reef-release/builds/1974

...
umount: /b/c/cbuild/repository/chroot: device is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
18:10:31: ERROR: <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/lib/failures_lib.py", line 229, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/build_stages.py", line 190, in PerformStage
    cros_build_lib.CleanupChrootMount(buildroot=self._build_root)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 1730, in CleanupChrootMount
    osutils.UmountTree(chroot)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 333, in SudoRunCommand
    return RunCommand(sudo_cmd, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 658, in RunCommand
    raise RunCommandError(msg, cmd_result)
RunCommandError: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot
cmd=['sudo', '-n', 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache', 'CROS_SUDO_KEEP_ALIVE=unknown', '--', 'umount', '-d', '/b/c/cbuild/repository/chroot']
18:10:31: INFO: Translating result <class 'chromite.lib.cros_build_lib.RunCommandError'>: return code: 1; command: sudo -n 'CROS_CACHEDIR=/b/c/cbuild/repository/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' -- umount -d /b/c/cbuild/repository/chroot [traceback identical to the one above] to fail.
...

Looks like the chroot mount is borked somehow? I am wondering if just rebooting the builder would resolve it... or maybe even a clobber. Passing to deputy for thoughts.
,
Mar 12 2018
https://luci-milo.appspot.com/buildbot/chromeos/kevin-release/2028
,
Mar 12 2018
A reboot should fix it, but is there anything in the logs on those builders to hint at what is pinning the chroot?
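For reference, a minimal sketch of how one could enumerate whatever is pinning the chroot on a builder, roughly what fuser -vm does. This is illustrative only, not chromite code; it needs root to see other users' processes, and the mount point below is the path from the failing umount.

# Illustrative only, not chromite code: roughly what `fuser -vm <mount>` does.
# Walks /proc and reports processes whose cwd, root, exe, or open fds point
# under the mount point that umount says is busy. Run as root.
import os

MOUNT_POINT = '/b/c/cbuild/repository/chroot'  # path from the failing umount

def paths_for_pid(pid):
  """Yields the /proc links for everything a process currently holds open."""
  proc = '/proc/%s' % pid
  for link in ('cwd', 'root', 'exe'):
    yield os.path.join(proc, link)
  try:
    for fd in os.listdir(os.path.join(proc, 'fd')):
      yield os.path.join(proc, 'fd', fd)
  except OSError:
    pass  # process exited, or we lack permission

def pinners(mount_point):
  for pid in (p for p in os.listdir('/proc') if p.isdigit()):
    for link in paths_for_pid(pid):
      try:
        target = os.readlink(link)
      except OSError:
        continue
      if target == mount_point or target.startswith(mount_point + '/'):
        yield pid, target
        break

for pid, target in pinners(MOUNT_POINT):
  print('pid %s is holding %s' % (pid, target))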
,
Mar 12 2018
jbudorick, can you take a quick look at one of the builders? (e.g., cros-beefy39-c2 for kevin-release). The docs say only chrome troopers can SSH.
,
Mar 12 2018
When engaging w/ troopers, please use the Infra-Troopers label rather than CCing the current trooper directly; doing so helps ensure that troopers see it across shift boundaries. I can look at this in a bit, but SSH isn't limited to chrome-troopers. Instructions are here: https://chrome-internal.googlesource.com/infra/infra_internal/+/master/doc/ssh.md#overview
,
Mar 12 2018
jbudorick, the "Connecting to cros-beefy*-c2 machines" section says: "If you are not a trooper, then there is not yet a standard method to connect to ChromeOS GCE bots."
,
Mar 13 2018
AFAICT this bug affects Nautilus and Kevin too: https://uberchromegw.corp.google.com/i/chromeos/builders/nautilus-release/builds/453 https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-release/builds/2033
,
Mar 14 2018
Issue 821269 has been merged into this issue.
,
Mar 14 2018
Issue 821281 has been merged into this issue.
,
Mar 14 2018
I'm going to restart the kevin, nautilus, and scarlet builders to see if that fixes it. I want to leave peach_pi alone so a trooper can go look at it; I'd really like some diagnostics, and peach_pi has been complained about the least.
,
Mar 14 2018
+dgarrett FYI: I granted ayatane SSH access to chromeos-bot.
,
Mar 14 2018
bmgordon should probably look, since the chroot loopback mounting is part of his project.
,
Mar 14 2018
+hinoka: I thought the deputy rotation already had access?
,
Mar 14 2018
Re #18: it might just be me. I assumed it was an ACL issue, since the documentation says only troopers have access and my SSH connections get rejected by the remote host.
,
Mar 19 2018
I tried to ssh to cros-beefy54-c2 this morning to look at the peach_pi failures, but I'm not able to get in. When I click the ssh button in pantheon, I get "Transferring SSH keys to the VM." and then "Establishing connection to SSH server..." and then a generic "Failed to connect" message. Is that still the right way to log in remotely?
,
Mar 19 2018
It's how I do it. I do have an "SSH for Google Cloud Platform" extension installed. It seems weird that you aren't being prompted, if that's a hard requirement. There is also a command line way to do it, and the web view will show you the exact command line to use (somewhere).
,
Mar 19 2018
The command-line version gives a little more detail:

$ gcloud compute --project "chromeos-bot" ssh --zone "us-east1-d" "cros-beefy54-c2"
ssh: connect to host 104.196.64.173 port 22: Connection refused
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].

Is ssh not running, or am I maybe blocked by a firewall?
,
Mar 19 2018
It's possible. I might try a traceroute to the IP and see how far you get, but if it's not a local firewall, I'd file a bug/ticket with the gcloud team. This is an external service; this kind of connection should "just work".
,
Mar 19 2018
I still haven't been able to ssh into beefy54, but the serial console looks like it's stuck in the middle of a reboot. A bunch of I/O-related stuff seems to hang with stack traces like this:

Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654238] jbd2/dm-4-8 D ffff88336f2f3b00 0 10257 2 0x00000000
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654244] ffff88313e401a98 0000000000000046 ffff883319e99800 0000000000013b00
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654248] ffff88313e401fd8 0000000000013b00 ffff883319e99800 ffff88336f2f4398
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654251] ffff88343ffa52e8 0000000000000002 ffffffff81155bd0 ffff88313e401b10
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654254] Call Trace:
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654262] [<ffffffff81155bd0>] ? wait_on_page_read+0x60/0x60
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654266] [<ffffffff8173b5bd>] io_schedule+0x9d/0x130
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654269] [<ffffffff81155bde>] sleep_on_page+0xe/0x20
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654271] [<ffffffff8173ba34>] __wait_on_bit+0x64/0x90
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654274] [<ffffffff8115599f>] wait_on_page_bit+0x7f/0x90
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654277] [<ffffffff810b0c20>] ? autoremove_wake_function+0x40/0x40
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654282] [<ffffffff811633b1>] ? pagevec_lookup_tag+0x21/0x30
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654285] [<ffffffff81160faa>] write_cache_pages+0x31a/0x4c0
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654290] [<ffffffff8109db09>] ? ttwu_do_wakeup+0x19/0x100
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654293] [<ffffffff81160500>] ? global_dirtyable_memory+0x50/0x50
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654296] [<ffffffff81161190>] generic_writepages+0x40/0x60
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654302] [<ffffffff81294205>] jbd2_journal_commit_transaction+0x505/0x1ba0
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654307] [<ffffffff8107a87f>] ? try_to_del_timer_sync+0x4f/0x70
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654309] [<ffffffff81299afd>] kjournald2+0xbd/0x240
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654312] [<ffffffff810b0be0>] ? prepare_to_wait_event+0x100/0x100
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654314] [<ffffffff81299a40>] ? commit_timeout+0x10/0x10
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654318] [<ffffffff81090bcb>] kthread+0xcb/0xf0
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654320] [<ffffffff81090b00>] ? kthread_create_on_node+0x1c0/0x1c0
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654325] [<ffffffff81747e4e>] ret_from_fork+0x6e/0xa0
Mar 9 05:34:20 cros-beefy54-c2 kernel: [30368.654328] [<ffffffff81090b00>] ? kthread_create_on_node+0x1c0/0x1c0

That timestamp corresponds to a few minutes after the start of the failed stage on the first failed build. Eventually, the machine tried to reboot, but it looks like it got stuck there. That explains why I can't ssh in, but I'm surprised any subsequent builds managed to get started at all. I'm going to reboot that machine and wait to see if the same errors come back. If not, it might be a kernel bug. If they do, it might be a real storage error somewhere.
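For anyone else who can't ssh in, a sketch of pulling the serial console output via gcloud instead. The project/zone/instance are the values from the gcloud command quoted earlier in this bug; the string matches are just illustrative of what to look for.

# Sketch: fetch the GCE serial console for a stuck builder without ssh and
# look for hung-task reports. Adjust instance/zone/project for other builders.
import subprocess

def serial_console(instance, zone='us-east1-d', project='chromeos-bot'):
  """Returns the serial port output for a GCE instance as text."""
  return subprocess.check_output(
      ['gcloud', 'compute', 'instances', 'get-serial-port-output',
       instance, '--zone', zone, '--project', project],
      universal_newlines=True)

for line in serial_console('cros-beefy54-c2').splitlines():
  # Hung-task warnings and call traces like the ones above are the interesting bits.
  if 'blocked for more than' in line or 'Call Trace' in line:
    print(line)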
,
Mar 20 2018
Looks like a lot of builds came back.
,
Mar 20 2018
The initial failure that happened in PaygenBuildCanary happened on several other builders this morning:

https://luci-milo.appspot.com/buildbot/chromeos/gandof-release/2028
https://luci-milo.appspot.com/buildbot/chromeos/sentry-release/2039
https://luci-milo.appspot.com/buildbot/chromeos/celes-release/2039
https://luci-milo.appspot.com/buildbot/chromeos/clapper-release/2460
https://luci-milo.appspot.com/buildbot/chromeos/veyron_jerry-release/2031

All of them show I/O to the chroot loopback image in the stack traces. That might be a sign that there's something going on with the loopback, or it might just be an error that happens to show on the loopback because that's where all the I/O is going. I'll try to debug further.
,
Mar 20 2018
Issue 823873 has been merged into this issue.
,
Mar 21 2018
Ben, looks like the following need resets this AM:
mccloud-release
nyan_kitty-release
reef-release
scarlet-release

And the following are failing on the CleanupChroot:
kahlee-release
winky-release
wizpig-release

The following appear to have been restarted recently:
falco_li-release
,
Mar 21 2018
All of the builders listed in #29 had the same underlying stuck I/O. A couple of them were stalled in different processes, but most of them locked up while running the PaygenBuildCanary stage. Most of them (except reef-release) had actually failed about a week ago and had been sitting since then without completing any builds.
,
Mar 21 2018
I wonder if we are tickling a kernel bug? I've been told by people on our kernel team that mount/umount are known to have a number of bugs which can happen under heavy system load.
,
Mar 22 2018
This morning's list of needed restarts:
clapper-release
guado-release
veyron_rialto-release

Note that the bug that was dup'ed (crbug/823873) included (getting them on this bug to be comprehensive):
gandof-release
sentry-release
veyron_jerry-release
celes-release
clapper-release

So clapper-release re-appeared within a few days...
,
Mar 23 2018
Not offline but failing CleanupChroot:
rainier-release
veyron_jerry-release
veyron_minnie-release
wizpig-release
,
Mar 23 2018
The builds that failed CleanupChroot self-recovered from reboots. I've opened b/76204932 with the GCE team for help tracking this down.
,
Mar 23 2018
+gwendal@ who has helped with loopback and umount kernel issues before.
,
Mar 26 2018
Ben, can we escalate? This is having a dramatic impact, with missing boards during DEV RC creation / handoff. If helpful: I'm not seeing similar failures for the beta (M66) or stable (M65) builders.

Offline; needs resetting this AM:
auron_paine-release
clapper-release
falco-release
falco_li-release
glimmer-release
gnawty-release
guado-release
hana-release
leon-release
lulu-release
nautilus-release
oak-release
veyron_rialto-release

Failed CleanupChroot (I'll add to the bug per #35):
orco-release
gru-release
daisy_skate-release
asuka-release
,
Mar 26 2018
Should we disable chroot.img for now?
,
Mar 26 2018
We could try it and see what happens, but I don't think that's actually the problem. These hangs started about 3 weeks after we turned on chroot.img on the builders, and they're only hitting the canary builders. It seems like we should be seeing this everywhere if it's being caused by the chroot.
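For anyone following along, here is a rough sketch of what "image-backed chroot" means in this bug. This is a simplification for illustration only, not the real cros_sdk code (which also manages LVM and snapshots on top of the loop device): the chroot filesystem lives inside chroot.img and is loopback-mounted, and CleanUp is the stage that later has to unmount it.

# Simplified illustration of an image-backed chroot (not the real cros_sdk
# logic, which layers LVM/snapshots on the loop device).
import subprocess

IMG = '/b/c/cbuild/repository/chroot.img'   # image file seen in the build logs
MNT = '/b/c/cbuild/repository/chroot'       # mount point umount complains about

def mount_image_chroot():
  # Attach the image on a loop device and mount it at the chroot path.
  subprocess.check_call(['sudo', 'mount', '-o', 'loop', IMG, MNT])

def unmount_image_chroot():
  # Roughly the step CleanUp performs; '-d' also detaches the loop device.
  # This can fail with 'device is busy' if something still holds the mount
  # or the underlying I/O has wedged, as in the traces above.
  subprocess.check_call(['sudo', 'umount', '-d', MNT])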
,
Mar 26 2018
I'm thinking that an updated kernel was pushed to the builders that is misbehaving with loopback mounts under heavy load. Not your fault, but turning chroot.img on/off is the easiest lever to test with.
,
Mar 26 2018
+shuqianz, +akeshet
,
Mar 26 2018
I've restarted all of today's failed builders. Here's a query that I've been running to make sure there aren't more. It corresponds pretty well with Kevin's list:
SELECT builder_name,
build.status AS build_status,
min(build.start_time) AS first_fail,
max(build.start_time) AS last_fail,
min(build_number) AS first_build,
max(build_number) AS last_build,
stage.name AS stage_name,
stage.status AS stage_status,
substring_index(bot_hostname, '.', 1) AS builder
FROM buildTable build
JOIN buildStageTable stage ON (stage.build_id=build.id)
WHERE build_config LIKE '%-release'
AND date(build.start_time) >= date_sub(curdate(), interval 3 DAY)
AND date(build.start_time) < date_add(curdate(), interval 1 DAY)
AND build.status <> 'pass'
AND ((stage.name='CleanUp'
AND stage.status NOT IN ('pass',
'skipped')
AND build.start_time >= curdate())
OR (build.status='inflight'
AND build.finish_time < build.start_time
AND build.deadline < now()
AND stage.name='PaygenBuildCanary')
AND stage.status <> 'pass')
GROUP BY builder_name,
build.status,
stage.name,
stage.status,
builder
ORDER BY builder_name,
first_fail;
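In case anyone else wants to run it programmatically, a hypothetical wrapper; the connection details below are placeholders rather than the real cidb credentials, and it assumes the pymysql package is available.

# Hypothetical helper for running the query above against the build database.
# Host/user/password/db are placeholders -- substitute the real cidb
# connection parameters.
import pymysql

QUERY = """ ...paste the SELECT statement above here... """

conn = pymysql.connect(host='cidb.example.internal', user='readonly',
                       password='...', db='cidb')
try:
  with conn.cursor() as cur:
    cur.execute(QUERY)
    for row in cur.fetchall():
      # builder_name, build_status, first_fail, last_fail, first_build,
      # last_build, stage_name, stage_status, builder
      print(row)
finally:
  conn.close()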
,
Mar 26 2018
crrev.com/c/981435 will turn off the chroot loopbacks to see if that helps. I'll submit as soon as trybots finish.
,
Mar 27 2018
Today's list; thanks Ben!
falco_li-release
gru-release
novato-release
veyron_rialto-release
,
Mar 27 2018
I restarted falco_li, gru, and veyron_rialto. novato isn't running on a GCE instance, so I don't have access to look at the console output or reboot it. shuqianz@, could you take a look at build197-m2 and see what's going on?
,
Mar 27 2018
shuqianz@ I can't ssh into that builder. That means you need to file a bug with the Golo team. The deputy docs say how to do that, and they are usually really responsive.
,
Mar 27 2018
Adding daisy_skate-release, leon-release, novato-release
,
Mar 28 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/9912024a6e9017f907010db9edc738940a3bc23e

commit 9912024a6e9017f907010db9edc738940a3bc23e
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Wed Mar 28 03:35:11 2018

cbuildbot: Disable image-backed chroot

We've been seeing hangs with kernel crashes on release builders for a couple of weeks. The chroot loopback isn't the root cause, but disabling it might avoid tickling whatever kernel bug we're seeing.

BUG=chromium:818874
TEST=cros tryjob --local success-build

Change-Id: Ifb9ff78308636e60be4f8b2f426ddc6de764dd2a
Reviewed-on: https://chromium-review.googlesource.com/981435
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/9912024a6e9017f907010db9edc738940a3bc23e/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/9912024a6e9017f907010db9edc738940a3bc23e/cbuildbot/commands.py
,
Mar 28 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/7379d186db927f13e8c11991fe1eed9e52179149

commit 7379d186db927f13e8c11991fe1eed9e52179149
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Wed Mar 28 07:45:38 2018

Roll src/third_party/chromite/ c145d2120..9912024a6 (3 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/c145d2120c34..9912024a6e90

$ git log c145d2120..9912024a6 --date=short --no-merges --format='%ad %ae %s'
2018-03-26 bmgordon cbuildbot: Disable image-backed chroot
2018-03-26 gmeinke chromite: default unibuild RW version to RO
2018-03-27 dgarrett chromeos_config: Move master-pi-android-pfq to beefy.

Created with: roll-dep src/third_party/chromite

BUG=chromium:818874

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org
Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary.

TBR=chrome-os-gardeners@chromium.org

Change-Id: Ife6e5f8c77be951bf604b4749835ffb939d77e93
Reviewed-on: https://chromium-review.googlesource.com/982574
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#546432}

[modify] https://crrev.com/7379d186db927f13e8c11991fe1eed9e52179149/DEPS
,
Mar 28 2018
It looks like none of the release builders that started after crrev.com/c/981435 went in have triggered the I/O hangs. We still need to get build197-m2 restarted to get novato going again. I'll try to find a reproducible case for this so we can get the root cause fixed and re-enable loopbacks later.
,
Mar 28 2018
shuqianz@ did you hear back from the golo team about build197-m2?
,
Apr 4 2018
Guado releases (http://uberchromegw/i/chromeos/builders/guado-release) have been failing for quite some time now. Most recently with:

NotEnoughDutsError: Not enough DUTs for board: guado, pool: bvt; required: 4, found: 3
Will return from run_suite with status: INFRA_FAILURE

And earlier with:

provision FAIL: Unhandled DevServerException: CrOS auto-update failed for host chromeos4-row3-rack8-host4: 0) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',), 1) RootfsUpdateError: Failed to perform rootfs update: RootfsUpdateError('Update failed with unexpected update status: UPDATE_STATUS_IDLE',)

Is this bug the underlying cause or are we hitting something new?
,
Apr 4 2018
It's totally unrelated.
,
Apr 12 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/20586d941c23ececd2327c67b7e13910a24479d9

commit 20586d941c23ececd2327c67b7e13910a24479d9
Author: Benjamin Gordon <bmgordon@chromium.org>
Date: Thu Apr 12 19:38:24 2018

cbuildbot: Make chroot.img usage optional

We still haven't tracked down the root cause for crbug.com/818874. Since the bug only seems to affect release builders and they don't currently benefit from chroot.img use anyway, let's re-enable image use for other builders.

This makes the use of chroot.img controlled by a new chroot_use_image config option. The new option defaults to True, but is turned off for release builders.

BUG=chromium:818874,chromium:730144
TEST=Local tryjobs for incremental and non-incremental builds

Change-Id: I058e3b51a16e058729b266165040ff2e0e7c9e75
Reviewed-on: https://chromium-review.googlesource.com/998981
Commit-Ready: Benjamin Gordon <bmgordon@chromium.org>
Tested-by: Benjamin Gordon <bmgordon@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/config_dump.json
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/lib/config_lib.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/commands.py
[modify] https://crrev.com/20586d941c23ececd2327c67b7e13910a24479d9/cbuildbot/chromeos_config.py
,
Apr 12 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/ef31230778974528e91d46984c591249fd3cfeaa

commit ef31230778974528e91d46984c591249fd3cfeaa
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Thu Apr 12 23:31:13 2018

Roll src/third_party/chromite/ deb0ebc07..20586d941 (2 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/deb0ebc0777b..20586d941c23

$ git log deb0ebc07..20586d941 --date=short --no-merges --format='%ad %ae %s'
2018-04-06 bmgordon cbuildbot: Make chroot.img usage optional
2018-04-12 pwang chromeos_config: swap bob/kevin in paladin

Created with: roll-dep src/third_party/chromite

BUG=chromium:818874,chromium:730144

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org
Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary.

TBR=chrome-os-gardeners@chromium.org

Change-Id: I89a6597c296c8171222b3ea8e4595d07e26f69ef
Reviewed-on: https://chromium-review.googlesource.com/1011246
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#550409}

[modify] https://crrev.com/ef31230778974528e91d46984c591249fd3cfeaa/DEPS
,
Apr 19 2018
I hit another umount failure today. Is this the same issue? It's in another stage, from a CQ bot: https://luci-milo.appspot.com/buildbot/chromeos/hana-paladin/2885
,
Apr 19 2018
The hana-paladin failure isn't the same cause. I took a look at the builder and there aren't any I/O hangs in the logs. This one just had something that didn't exit the chroot when it wanted to unmount.
,
Jun 5 2018
Looks like this hit grunt-paladin a couple of days ago. b/76204932 has a potential fix that we're waiting for GCE to roll out.
,
Jun 13 2018
This happened again today with elm-paladin and fizz-paladin. http://shortn/_wnZyg9luTM http://shortn/_CQqkcMMIgp

06:26:38: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --snapshot-list in /b/c/cbuild/repository
06:26:38: WARNING: could not read /b/c/cbuild/repository/chroot/etc/cros_chroot_version
06:26:53: WARNING: Failed to activate VG on try 1.
06:26:54: NOTICE: Mounted existing image /b/c/cbuild/repository/chroot.img on chroot
06:26:54: NOTICE: /b/c/cbuild/repository/chroot.img is using 59 GiB more than needed. Running fstrim.
06:28:31: INFO: RunCommand: /b/c/cbuild/repository/chromite/bin/cros_sdk --snapshot-restore clean-chroot in /b/c/cbuild/repository
umount: /b/c/cbuild/repository/chroot: device is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
cros_sdk: Unhandled exception:
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/bin/cros_sdk", line 169, in <module>
    DoMain()
  File "/b/c/cbuild/repository/chromite/bin/cros_sdk", line 165, in DoMain
    commandline.ScriptWrapperMain(FindTarget)
  File "/b/c/cbuild/repository/chromite/lib/commandline.py", line 911, in ScriptWrapperMain
    ret = target(argv[1:])
  File "/b/c/cbuild/repository/chromite/scripts/cros_sdk.py", line 968, in main
    osutils.UmountTree(options.chroot)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 866, in UmountTree
    UmountDir(mount_pt, lazy=False, cleanup=False)
  File "/b/c/cbuild/repository/chromite/lib/osutils.py", line 828, in UmountDir
    runcmd(cmd, print_cmd=False)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 282, in SudoRunCommand
    return RunCommand(cmd, **kwargs)
  File "/b/c/cbuild/repository/chromite/lib/cros_build_lib.py", line 665, in RunCommand
    raise RunCommandError(msg, cmd_result)
chromite.lib.cros_build_lib.RunCommandError: return code: 1; command: umount -d /b/c/cbuild/repository/chroot
cmd=['umount', '-d', '/b/c/cbuild/repository/chroot']
,
Jun 13 2018
Looks like #62 isn't the same cause. pantheon doesn't show any sign of the kernel hangs on either of those two machines. Unrelated update: The GCE rollout of b/76204932 is at about 80% now, so we should be able to try switching the release builders back to image-backed chroots soon.
,
Jul 5
https://b.corp.google.com/issues/76204932 graphs show 100% rollout as of 2018-05-21. That doesn't seem to have solved the problem, so we are going to try some other options in crbug.com/855151.
,
Jul 7
Actually, based on analysis in issue 860508, this specific issue (the kernel problem) was resolved, and we're now facing a new issue that has nothing to do with the chroot at all.