
Issue 844088


Issue metadata

Status: Duplicate
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




skylab-drones getting restarted after failed task on a single bot

Project Member Reported by pprabhu@chromium.org, May 17 2018

Issue description

Swarming bots by default restart the server they are running on when a task fails.

For Skylab, we have multiple bots running on a drone, and we don't want to restart the drone when a bot's task fails.
See issue 843783 for how this behaviour was implemented for skylab bots.
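The intended policy described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the implementation from issue 843783: `BotInfo`, `shares_host`, and `should_reboot_host` are hypothetical names invented for this example and are not part of Swarming or Skylab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotInfo:
    """Hypothetical description of a bot; not a real Swarming type."""
    bot_id: str
    # True when several bots run on the same host, as on a skylab-drone.
    shares_host: bool


def should_reboot_host(bot: BotInfo, task_failed: bool) -> bool:
    """Decide whether the host should be restarted after a task finishes.

    Default behaviour: restart the host when a task fails.
    Shared hosts: never restart, since that would interrupt the other bots.
    """
    if not task_failed:
        return False
    return not bot.shares_host


if __name__ == "__main__":
    standalone = BotInfo("standalone-bot", shares_host=False)
    skylab = BotInfo("skylab-bot-example", shares_host=True)
    print(should_reboot_host(standalone, task_failed=True))
    print(should_reboot_host(skylab, task_failed=True))
```

Under this sketch, a failed task on a bot that shares its drone never asks for a host restart, which is the behaviour this issue expects but is not observing.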

However, I've seen some skylab-drones restart on task failure in the last two days.
 
Labels: -Pri-3 Pri-1
Here's a bot that has been rebooting the drone it is on: https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-d25b28c0-7b39-43cc-bfad-dbf829f510fb&selected=1&sort_stats=total%3Adesc

I see bot_cleanup errors in the bot events, which _will_ lead to a server reboot right now.
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
+bpastene: I'd like to disable server reboot even on bot internal errors.
Posted https://chrome-internal-review.googlesource.com/c/infradata/config/+/627953
Cc: bpastene@chromium.org
Really +bpastene
I think the root cause is that the new Chromium-specific cleanup code is failing somewhere: https://chrome-internal.googlesource.com/infradata/config/+/master/configs/chrome-swarming/scripts/bot_config.py#171

I can't easily confirm this, though, because the host reboots as a consequence of the error, and the skylab bot manager cleans up the old bot directories after the reboot.
Project Member

Comment 7 by bugdroid1@chromium.org, May 17 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/infra/lucifer/+/04f732cf34107d96211b674f27592c90e8aa74f0

commit 04f732cf34107d96211b674f27592c90e8aa74f0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu May 17 20:41:09 2018

lucifer_run_job: Store absolute path to provision results_dir

BUG= chromium:844088 
TEST=manual, on skylab-drone

Change-Id: Ie6aab51371d0856c2194ba42a282f1ba86a9e6f6
Reviewed-on: https://chromium-review.googlesource.com/1064749
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/04f732cf34107d96211b674f27592c90e8aa74f0/src/lucifer/cmd/lucifer_run_job/autotest.go

Project Member

Comment 8 by bugdroid1@chromium.org, May 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/d08c30d39f77eb2fc802c40f06d491847efd7dcb

commit d08c30d39f77eb2fc802c40f06d491847efd7dcb
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri May 18 00:38:07 2018

Err, #7 was for issue 843776.
Project Member

Comment 10 by bugdroid1@chromium.org, May 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/09f52d282f4e2460a1c3e2166c920ce32f7764b0

commit 09f52d282f4e2460a1c3e2166c920ce32f7764b0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri May 18 20:39:04 2018

Mergedinto: 843783
Status: Duplicate (was: Started)
