Skylab bot gets stuck when its swarming cache dir is missing
Issue description
https://chrome-swarming.appspot.com/task?id=3e0dee9c8e50f410&refresh=10&request_detail=true&show_raw=1
skylab_swarming_worker: 2018/06/12 10:50:24 skylab_swarming_worker starting with args: [/opt/infra-tools/usr/bin/skylab_swarming_worker -client-test -task-name dummy_Pass -provision-labels cros-version:lumpy-release/R65-10323.58.0]
skylab_swarming_worker: 2018/06/12 10:50:24 Swarming bot config: &swarming.Bot{AutotestPath:"/usr/local/autotest", Env:"staging", DUTID:"bb844436-c345-4c46-9200-af1547c7208d", Inventory:swarming.Inventory{ToolsDir:"/usr/local/google/home/chromeos-test/chromiumos/infra/skylab_inventory/bin", DataDir:"/opt/infra-data/skylab_inventory/latest/data/skylab"}, LuciferBinDir:"/opt/infra-tools/usr/bin", Task:swarming.Task{ID:"3e0dee9c8e50f411"}}
skylab_swarming_worker: 2018/06/12 10:50:24 Created results directory /usr/local/autotest/results/swarming-3e0dee9c8e50f411/chromeos6-row2-rack8-host8
skylab_swarming_worker: 2018/06/12 10:50:24 Error running test: open /usr/local/autotest/swarming_state/bb844436-c345-4c46-9200-af1547c7208d.json: no such file or directory
load bot dimensions failed
main.runTest
    /var/tmp/portage/chromeos-base/lucifer-0.0.1-r91/work/lucifer-0.0.1/src/lucifer/cmd/skylab_swarming_worker/main.go:143
main.main
    /var/tmp/portage/chromeos-base/lucifer-0.0.1-r91/work/lucifer-0.0.1/src/lucifer/cmd/skylab_swarming_worker/main.go:68
runtime.main
    /usr/lib/go/src/runtime/proc.go:198
runtime.goexit
    /usr/lib/go/src/runtime/asm_amd64.s:2361
Jun 14 2018
There are a few problems here:
(1) The bot should have been quarantined.
(2) In this case, the bot would have gotten unstuck if the bot process had been restarted, but the bot itself was never restarted.
(3) Why the skylab_swarming cache file was removed is still a mystery.
Jun 14 2018
Jun 14 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/infradata/config/+/941285114ba2543e5f225e46cb460f1c5b5aee6c
commit 941285114ba2543e5f225e46cb460f1c5b5aee6c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 14 21:00:21 2018
Jun 14 2018
Jun 14 2018
Issue 852969 is a related problem. Our bots don't even die when their DUT is removed from the inventory.
Jun 14 2018
The bot got quarantined, as expected:
https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-bb844436-c345-4c46-9200-af1547c7208d&sort_stats=total%3Adesc
This validates #4.
Jun 14 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/8201d4b1a5cba49a566cc7d82905920f7eb89792
commit 8201d4b1a5cba49a566cc7d82905920f7eb89792
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 14 21:33:40 2018
Jun 14 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/16abb042df298f8db0625b72f0abf042b65c5e76
commit 16abb042df298f8db0625b72f0abf042b65c5e76
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Thu Jun 14 21:52:58 2018
Jun 14 2018
In conjunction with the fix for 852969, the bot has been recovered by removing (#8) and re-adding (#9) the DUT to the drone in the inventory. This addresses concern (3) in (#2): if bots get stuck like this, they can be refreshed by removing the DUT from the inventory and re-adding it.
Jun 15 2018
Any bot that enters this state now will be quarantined. I think we shouldn't automatically recover such bots. Fail obviously so that the root cause can be determined.
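For illustration only, here is a minimal sketch of the "fail obviously" behavior discussed in this thread: if a bot's state file is missing, report a quarantine reason instead of recreating the file. The `<state_dir>/<DUT ID>.json` layout is taken from the path in the log above. Everything else is an assumption made for this sketch and not the actual Skylab or Swarming code: the function names `load_bot_state` and `quarantine_status`, the shape of the returned dict, and the `dimensions` key inside the state file.

```python
import json
import os


def load_bot_state(state_dir, dut_id):
    """Return (state, error); exactly one of the two is None.

    Hypothetical helper: reads <state_dir>/<dut_id>.json and never
    creates or repairs the file.
    """
    path = os.path.join(state_dir, dut_id + ".json")
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        return None, "state file missing: " + path
    except (OSError, ValueError) as e:
        # ValueError also covers json.JSONDecodeError.
        return None, "state file unreadable: %s: %s" % (path, e)
    if not isinstance(state, dict):
        return None, "state file malformed: " + path
    return state, None


def quarantine_status(state_dir, dut_id):
    """Build a status dict (assumed shape, for this sketch only).

    A non-empty 'quarantined' string means the bot should take no tasks
    until a human has looked at it. No automatic recovery is attempted.
    """
    state, error = load_bot_state(state_dir, dut_id)
    if error is not None:
        return {"quarantined": error}
    return {"quarantined": "", "dimensions": state.get("dimensions", {})}
```

With the paths from the log, `quarantine_status("/usr/local/autotest/swarming_state", "bb844436-c345-4c46-9200-af1547c7208d")` would return a non-empty `quarantined` message while the file is absent, which keeps the failure visible rather than masking the root cause.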
Comment 1 by pprabhu@chromium.org, Jun 14 2018
Owner: pprabhu@chromium.org
Status: Assigned (was: Untriaged)
Summary: Skylab bot gets stuck when it's swarming cache dir is missing (was: [SKYLAB] load bot dimensions failed)