skylab-drones getting restarted after failed task on a single bot
Issue description
Swarming bots by default restart the server they are running on when a task fails. For Skylab, we have multiple bots running on a drone, and we don't want to restart the drone when a bot's task fails. See issue 843783 for how this behaviour was implemented for skylab bots. But... I've seen some skylab-drones restart in the last two days on task failure.
May 17 2018
Here's a bot that has been rebooting the drone it is on: https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-d25b28c0-7b39-43cc-bfad-dbf829f510fb&selected=1&sort_stats=total%3Adesc I see bot_cleanup errors in the bot events, which _will_ lead to a server reboot right now.
May 17 2018
May 17 2018
+bpastene: I'd like to disable server reboot even on bot internal errors. Posted https://chrome-internal-review.googlesource.com/c/infradata/config/+/627953
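To make the intended policy concrete, here is a minimal sketch of the reboot decision discussed above. It is not the code in bot_config.py or in the linked CL; the function name and its parameters are hypothetical and exist only for illustration. It assumes the caller already knows whether the task failed, whether the failure was internal to the bot (such as a bot_cleanup error), and whether the bot shares its host with other bots.

```python
def should_reboot_host(task_failed, internal_failure, shares_host):
    """Decide whether a bot should reboot the machine it runs on.

    Hypothetical helper for illustration only; it is not part of Swarming.
    """
    if shares_host:
        # Several bots run on one drone, so a single bot's failure
        # (task failure or bot-internal error) must not reboot the host.
        return False
    # Default behaviour described in the issue: reboot after a failure.
    return task_failed or internal_failure
```

With this policy, a Skylab bot on a shared drone never triggers a reboot, while a bot that owns its host keeps the default behaviour.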
May 17 2018
Really +bpastene
May 17 2018
I think the root cause is that the new Chromium-specific cleanup code is failing somewhere: https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chrome-swarming/scripts/bot_config.py#171 But I can't easily confirm this, because the host reboots as a consequence of the error, and the skylab bot manager cleans up the old bot directories after the reboot.
May 17 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/infra/lucifer/+/04f732cf34107d96211b674f27592c90e8aa74f0

commit 04f732cf34107d96211b674f27592c90e8aa74f0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu May 17 20:41:09 2018

lucifer_run_job: Store absolute path to provision results_dir

BUG= chromium:844088
TEST=manual, on skylab-drone

Change-Id: Ie6aab51371d0856c2194ba42a282f1ba86a9e6f6
Reviewed-on: https://chromium-review.googlesource.com/1064749
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/04f732cf34107d96211b674f27592c90e8aa74f0/src/lucifer/cmd/lucifer_run_job/autotest.go
May 18 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/d08c30d39f77eb2fc802c40f06d491847efd7dcb

commit d08c30d39f77eb2fc802c40f06d491847efd7dcb
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri May 18 00:38:07 2018
May 18 2018
Err #7 was for issue 843776
May 18 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/09f52d282f4e2460a1c3e2166c920ce32f7764b0

commit 09f52d282f4e2460a1c3e2166c920ce32f7764b0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri May 18 20:39:04 2018
May 18 2018
Comment 1 by pprabhu@chromium.org, May 17 2018