some reef bots can't be started on skylab-drone
Issue description
I sent the following DUTs for fixing without migrating them back:
chromeos6-row4-rack10-host20
chromeos6-row3-rack12-host17
chromeos6-row4-rack10-host1
chromeos6-row3-rack12-host13
chromeos6-row4-rack10-host3
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=dut_state&c=label-board&c=label-pool&c=dut_name&f=pool%3AChromeOSSkylab&f=label-board%3Areef&l=100&s=label-pool%3Aasc
Currently there are 5 missing reef DUTs. However, I just found that the missing DUTs do not match the ones sent for fixing: chromeos6-row4-rack10-host1 was sent for fixing, but it shows up correctly on the page.
Copied from /var/log/upstart/skylab_swarming.log for DUT chromeos6-row3-rack12-host13:
128980 2018-07-19 22:11:19.999 I: on_before_poll(): 0s
128980 2018-07-19 22:11:20.298 I: ts_mon hook_name='get_dimensions' pool=u'cores:32|cpu:x86-64-E5-2699_v4|cpu:x86-64-avx2|gpu:none|inside_docker:0|kvm:1|machine_type:n1-highcpu-32|os:Linux|os:Ubuntu-14.04|pool:ChromeOSSkylab|python:2.7.6|quarantined:1'
128980 2018-07-19 22:11:20.299 I: get_dimensions(): 0.298s
128980 2018-07-19 22:11:20.299 E: get_dimensions() threw
Traceback (most recent call last):
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 293, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 164, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 260, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "injected.py", line 63, in get_dimensions
  File "injected.py", line 85, in _get_cached_dimensions
  File "injected.py", line 161, in _get_cached_botinfo
  File "/usr/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
,
Jul 19
This is because the cached swarming state for the bot is empty:
chromeos-test@pprabhu-skylab-drone-2:/usr/local/autotest/swarming_state$ cat cbece344-04a4-463a-b440-7d105b6cc845.json
The last task to run on the bot was https://chrome-swarming.appspot.com/task?id=3ec98641bce0dd10&refresh=10&show_raw=1
This task was not able to persist the swarming state on the bot correctly because of issue 865561.
To recover all the quarantined bots, we simply need to stop each bot. It will then be restarted by skylab_swarming_manager with fresh cached bot state.
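For anyone who wants to check a drone for other bots in the same condition, here is a rough sketch of how the empty state files could be found. It assumes the state directory holds one .json file per bot, as the path above suggests. The function names are made up for illustration and are not part of the bot code or of skylab_swarming_manager.

```python
import json
import os


def is_valid_state_file(path):
    """Return True if the file at path contains decodable JSON.

    An empty file makes json.load raise ValueError, which is the
    failure shown in the traceback above.
    """
    try:
        with open(path) as f:
            json.load(f)
    except (IOError, OSError, ValueError):
        return False
    return True


def find_bad_state_files(state_dir):
    """List the .json files in state_dir that are empty or not valid JSON.

    Assumption: one <id>.json cached-state file per bot in state_dir.
    """
    bad = []
    for name in sorted(os.listdir(state_dir)):
        if not name.endswith(".json"):
            continue
        if not is_valid_state_file(os.path.join(state_dir, name)):
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Directory taken from the shell prompt in this comment.
    for name in find_bad_state_files("/usr/local/autotest/swarming_state"):
        print(name)
```

This only reports the affected files. It does not stop or restart any bot.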
,
Jul 19
Sadly, the "shut down bot gracefully" functionality simply creates a task that would quit the bot. But that task will not run because the bot is quarantined: https://chrome-swarming.appspot.com/task?id=3ecdb4981645a810&refresh=10&show_raw=1
,
Jul 19
Another "clean" way to recover these bots is to remove them from the inventory, then re-add them once skylab_swarming_manager shuts down the bots.
,
Jul 19
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/1201ace97fdaa6e4f17059558f6118bc1ba2b622
commit 1201ace97fdaa6e4f17059558f6118bc1ba2b622
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 19 23:37:00 2018
,
Jul 19
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/ebebe4fb41684627faffc270bc8fe54190638be7
commit ebebe4fb41684627faffc270bc8fe54190638be7
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Thu Jul 19 23:43:14 2018
,
Jul 19
After #5 and #6, the bots are all back. So removing the DUTs from the inventory and adding them back works.
Comment 1 by pprabhu@chromium.org
, Jul 19