Issue 852023

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----




Skylab bot gets stuck when its swarming cache dir is missing

Project Member Reported by xixuan@chromium.org, Jun 12 2018

Issue description

https://chrome-swarming.appspot.com/task?id=3e0dee9c8e50f410&refresh=10&request_detail=true&show_raw=1

skylab_swarming_worker: 2018/06/12 10:50:24 skylab_swarming_worker starting with args: [/opt/infra-tools/usr/bin/skylab_swarming_worker -client-test -task-name dummy_Pass -provision-labels cros-version:lumpy-release/R65-10323.58.0]
skylab_swarming_worker: 2018/06/12 10:50:24 Swarming bot config: &swarming.Bot{AutotestPath:"/usr/local/autotest", Env:"staging", DUTID:"bb844436-c345-4c46-9200-af1547c7208d", Inventory:swarming.Inventory{ToolsDir:"/usr/local/google/home/chromeos-test/chromiumos/infra/skylab_inventory/bin", DataDir:"/opt/infra-data/skylab_inventory/latest/data/skylab"}, LuciferBinDir:"/opt/infra-tools/usr/bin", Task:swarming.Task{ID:"3e0dee9c8e50f411"}}
skylab_swarming_worker: 2018/06/12 10:50:24 Created results directory /usr/local/autotest/results/swarming-3e0dee9c8e50f411/chromeos6-row2-rack8-host8
skylab_swarming_worker: 2018/06/12 10:50:24 Error running test: open /usr/local/autotest/swarming_state/bb844436-c345-4c46-9200-af1547c7208d.json: no such file or directory
load bot dimensions failed
main.runTest
	/var/tmp/portage/chromeos-base/lucifer-0.0.1-r91/work/lucifer-0.0.1/src/lucifer/cmd/skylab_swarming_worker/main.go:143
main.main
	/var/tmp/portage/chromeos-base/lucifer-0.0.1-r91/work/lucifer-0.0.1/src/lucifer/cmd/skylab_swarming_worker/main.go:68
runtime.main
	/usr/lib/go/src/runtime/proc.go:198
runtime.goexit
	/usr/lib/go/src/runtime/asm_amd64.s:2361


 
Labels: Hotlist-Skylab
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Summary: Skylab bot gets stuck when its swarming cache dir is missing (was: [SKYLAB] load bot dimensions failed)
There are a few problems here:

(1) The bot should have been quarantined.
(2) The bot would have gotten unstuck if the bot process had been restarted, but it was never restarted.
(3) It is still unknown why the skylab_swarming cache file was removed.
Project Member

Comment 4 by bugdroid1@chromium.org, Jun 14 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/941285114ba2543e5f225e46cb460f1c5b5aee6c

commit 941285114ba2543e5f225e46cb460f1c5b5aee6c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 14 21:00:21 2018

Status: Started (was: Assigned)
Issue 852969 is a related problem. Our bots don't even die when their DUT is removed from the inventory.
Cc: xixuan@chromium.org ayatane@chromium.org
The bot got quarantined, as expected:
https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-bb844436-c345-4c46-9200-af1547c7208d&sort_stats=total%3Adesc

This validates #4
Project Member

Comment 8 by bugdroid1@chromium.org, Jun 14 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/8201d4b1a5cba49a566cc7d82905920f7eb89792

commit 8201d4b1a5cba49a566cc7d82905920f7eb89792
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 14 21:33:40 2018

Project Member

Comment 9 by bugdroid1@chromium.org, Jun 14 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/16abb042df298f8db0625b72f0abf042b65c5e76

commit 16abb042df298f8db0625b72f0abf042b65c5e76
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Thu Jun 14 21:52:58 2018

In conjunction with the fix for 852969, the bot has been recovered by removing (#8) and re-adding (#9) the DUT to the drone in the inventory. This addresses concern (3) in (#2): if bots get stuck like this, they can be refreshed by removing them from the inventory and re-adding them.
Status: Fixed (was: Started)
Any bot that enters this state now will be quarantined. I think we shouldn't automatically recover such bots; they should fail obviously so that the root cause can be determined.
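
For illustration only, here is a minimal Go sketch of the "fail obviously" behavior described above: detect the missing state file and report a quarantine reason instead of recreating the file. It is not the code from the referenced commits. The names botState, statePath, loadBotState and quarantineReason are hypothetical, and the JSON layout of the state file is an assumption. Only the <state dir>/<DUT ID>.json path shape and the "load bot dimensions failed" message are taken from the log in the issue description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// botState is a hypothetical stand-in for whatever the worker caches per DUT.
// The real file format is not shown in this issue.
type botState struct {
	Dimensions map[string][]string `json:"dimensions"`
}

// statePath mirrors the layout seen in the log: <dir>/<DUT ID>.json.
func statePath(dir, dutID string) string {
	return filepath.Join(dir, dutID+".json")
}

// loadBotState reads and decodes the cached state file. The underlying error
// is wrapped so callers can still test for fs.ErrNotExist.
func loadBotState(dir, dutID string) (*botState, error) {
	data, err := os.ReadFile(statePath(dir, dutID))
	if err != nil {
		return nil, fmt.Errorf("load bot dimensions failed: %w", err)
	}
	var s botState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("load bot dimensions failed: %w", err)
	}
	return &s, nil
}

// quarantineReason returns a non-empty message when the bot should stop
// accepting tasks. It deliberately does not try to recreate the file, so the
// failure stays visible until someone investigates.
func quarantineReason(dir, dutID string) string {
	_, err := loadBotState(dir, dutID)
	switch {
	case err == nil:
		return ""
	case errors.Is(err, fs.ErrNotExist):
		return "state file missing: " + statePath(dir, dutID)
	default:
		return "state file unreadable: " + err.Error()
	}
}

func main() {
	reason := quarantineReason("/usr/local/autotest/swarming_state",
		"bb844436-c345-4c46-9200-af1547c7208d")
	if reason != "" {
		fmt.Println("quarantine:", reason)
	}
}
```

How the reason string reaches Swarming (for example, through the bot's reported state) is left out here, since that depends on the bot config and is not described in this thread.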
