Issue 865759

Issue metadata

Status: Fixed
Owner:
Closed: Jul 19
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----


some reef bots can't be started on skylab-drone

Project Member Reported by xixuan@chromium.org, Jul 19

Issue description

I sent the following DUTs for fixing without migrating them back:
    chromeos6-row4-rack10-host20
    chromeos6-row3-rack12-host17
    chromeos6-row4-rack10-host1
    chromeos6-row3-rack12-host13
    chromeos6-row4-rack10-host3

https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=dut_state&c=label-board&c=label-pool&c=dut_name&f=pool%3AChromeOSSkylab&f=label-board%3Areef&l=100&s=label-pool%3Aasc

Currently, 5 reef DUTs are missing from the bot list. However, I just found that the missing DUTs do not match the ones sent for fixing: not all of the missing DUTs were sent for fixing, and chromeos6-row4-rack10-host1 was sent for fixing but shows up correctly on the page.


Copied from /var/log/upstart/skylab_swarming.log for DUT chromeos6-row3-rack12-host13:

128980 2018-07-19 22:11:19.999 I: on_before_poll(): 0s
128980 2018-07-19 22:11:20.298 I: ts_mon hook_name='get_dimensions' pool=u'cores:32|cpu:x86-64-E5-2699_v4|cpu:x86-64-avx2|gpu:none|inside_docker:0|kvm:1|machine_type:n1-highcpu-32|os:Linux|os:Ubuntu-14.04|pool:ChromeOSSkylab|python:2.7.6|quarantined:1'
128980 2018-07-19 22:11:20.299 I: get_dimensions(): 0.298s
128980 2018-07-19 22:11:20.299 E: get_dimensions() threw
Traceback (most recent call last):
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 293, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 164, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 260, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "injected.py", line 63, in get_dimensions
  File "injected.py", line 85, in _get_cached_dimensions
  File "injected.py", line 161, in _get_cached_botinfo
  File "/usr/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

 
Status: Started (was: Assigned)
The bot for chromeos6-row3-rack12-host13 (https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-cbece344-04a4-463a-b440-7d105b6cc845) has been quarantined, which is working as intended (WAI).

Bot events show the same failure as in the OP, here raised from the get_state hook:
Failed to call hook get_state(): No JSON object could be decoded
Traceback (most recent call last):
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 293, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 164, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "/usr/local/google/home/chromeos-test/skylab_bots/d7863/swarming_bot.1.zip/bot_code/bot_main.py", line 260, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "injected.py", line 110, in get_state
  File "injected.py", line 132, in _get_cached_state
  File "injected.py", line 161, in _get_cached_botinfo
  File "/usr/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Calling stack:
0 api/bot.py:230:post_error()
1 bot_code/bot_main.py:298:_call_hook_safe()
2 bot_code/bot_main.py:366:_get_state()
3 bot_code/bot_main.py:1066:_run_bot_inner()
4 bot_code/bot_main.py:966:_run_bot()
5 bot_code/bot_main.py:1402:main()
6 __main__.py:166:CMDstart_bot()
7 __main__.py:254:main()
8 __main__.py:266:<module>()
9 runpy.py:72:_run_code()
10 runpy.py:162:_run_module_as_main()
This happens because the cached swarming state file for the bot is empty, so json.load() has nothing to decode:

chromeos-test@pprabhu-skylab-drone-2:/usr/local/autotest/swarming_state$ cat cbece344-04a4-463a-b440-7d105b6cc845.json
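
For illustration, the failure mode can be reproduced outside the bot: json.load() on an empty file raises ValueError. The sketch below is written for Python 3, where the exception is json.JSONDecodeError (a ValueError subclass) and the message differs from the Python 2.7 one in the traceback. load_cached_botinfo is a hypothetical helper, not the actual injected.py code, which is not shown in this issue.

```python
import json
import os
import tempfile


def load_cached_botinfo(path):
    """Return the cached bot state as a dict, or None if it cannot be decoded.

    Hypothetical helper for illustration only; the real loader lives in
    injected.py and may behave differently.
    """
    try:
        with open(path) as f:
            return json.load(f)
    except ValueError:
        # An empty or truncated state file ends up here. A missing file
        # still raises an OSError, which is deliberately not caught.
        return None


# Reproduce the situation from this issue with an empty state file.
with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.json")
    open(empty, "w").close()
    print(load_cached_botinfo(empty))  # prints None
```

Whether a bot should fall back to a default state or stay quarantined when this happens is a policy choice; the sketch only shows where the error could be caught.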

The last task to run on the bot was https://chrome-swarming.appspot.com/task?id=3ec98641bce0dd10&refresh=10&show_raw=1

This task was not able to correctly persist the swarming state on the bot due to issue 865561.
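
Issue 865561 is not described here, so the actual cause of the failed write is unknown. Purely as an illustration, one common way to keep a state file from ever being observed empty or half-written is to write to a temporary file and rename it into place. The function name and the sample payload below are assumptions, not part of the Skylab code.

```python
import json
import os
import tempfile


def save_botinfo_atomically(path, botinfo):
    """Write botinfo as JSON so readers never see an empty or partial file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so that the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(botinfo, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace() atomically swaps the new file in on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, drop the temporary file and keep the old state.
        os.unlink(tmp_path)
        raise
```

With this pattern, a write that fails partway leaves the previous state file untouched instead of an empty one.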

To recover the quarantined bots, we simply need to stop each bot. skylab_swarming_manager will then restart it with fresh cached bot state.
Sadly, the "shut down bot gracefully" functionality only creates a task that quits the bot, and that task will not run because the bot is quarantined: https://chrome-swarming.appspot.com/task?id=3ecdb4981645a810&refresh=10&show_raw=1
Another "clean" way to recover these bots is to remove them from the inventory, then re-add them once skylab_swarming_manager has shut down the bots.
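
To work out which bots on a drone need this recovery, the state directory can be scanned for files that are empty or otherwise undecodable. This is a minimal sketch: the directory path comes from the shell prompt above, find_broken_state_files is a hypothetical name, and the assumption that every bot has one <id>.json file there is based only on the single file shown in this issue.

```python
import json
import os


def find_broken_state_files(state_dir):
    """Return sorted names of *.json files that are empty or not valid JSON."""
    broken = []
    for name in sorted(os.listdir(state_dir)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(state_dir, name)
        try:
            with open(path) as f:
                json.load(f)
        except ValueError:
            # Covers empty files as well as truncated or corrupt JSON.
            broken.append(name)
    return broken


if __name__ == "__main__":
    # Path taken from the shell session quoted in this issue.
    state_dir = "/usr/local/autotest/swarming_state"
    if os.path.isdir(state_dir):
        for name in find_broken_state_files(state_dir):
            print(name)
```

Each reported file name would then point at a bot whose DUT is a candidate for the remove and re-add procedure.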
Project Member

Comment 5 by bugdroid1@chromium.org, Jul 19

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/1201ace97fdaa6e4f17059558f6118bc1ba2b622

commit 1201ace97fdaa6e4f17059558f6118bc1ba2b622
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jul 19 23:37:00 2018

Project Member

Comment 6 by bugdroid1@chromium.org, Jul 19

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/infra_internal/skylab_inventory/+/ebebe4fb41684627faffc270bc8fe54190638be7

commit ebebe4fb41684627faffc270bc8fe54190638be7
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Thu Jul 19 23:43:14 2018

Status: Fixed (was: Started)
After the changes in comments 5 and 6, all the bots are back.
So removing the DUTs from the inventory and re-adding them works.
