10+ build*-m5 bots are quarantined
Issue description

Last Thursday (2/23) I noticed that 17 build*-m5 bots were quarantined with the message:

"Quarantined: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None"."

Restarting them seemed to bring them back. I thought it was some strange one-off problem, so I did not file a bug. But today I again see 11 bots quarantined with the same message:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=last_seen&f=pool%3ACT&f=status%3Aquarantined&l=1000&s=last_seen%3Aasc

As an example of the complete message, the status on build46-m5 is:

Quarantined: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None".
Traceback (most recent call last):
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 213, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 162, in hook
    return fn(*args, **kwargs)
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 187, in _call_hook
    return hook(botobj, *args)
  File "swarming_bot.1.zip/config/bot_config.py", line 119, in hook
    hooks_durations.add(time.time() - start, fields=fields)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 589, in add
    self._incr(fields, target_fields, value, modify_fn=modify_fn)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 321, in _incr
    target_fields, delta, modify_fn=modify_fn)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metric_store.py", line 281, in incr
    raise errors.MonitoringDecreasingValueError(name, None, delta)
MonitoringDecreasingValueError: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None".
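The traceback points at `hooks_durations.add(time.time() - start, fields=fields)`, so the negative value is a difference of two wall-clock reads. One way that difference can go negative is the system clock being stepped back between the reads; that is an assumption about the cause, not something this report confirms. A minimal sketch of the effect and two possible guards follows. The helper names `elapsed_wall`, `elapsed_clamped` and `timed` are hypothetical and not part of the bot code, and `time.monotonic()` needs Python 3.3+, which may not match the interpreter the bots run.

```python
import time


def elapsed_wall(start, now=None):
    """Duration from two wall-clock reads; negative if the clock was set back."""
    if now is None:
        now = time.time()
    return now - start


def elapsed_clamped(start, now=None):
    """Same measurement, clamped so a backwards clock step never goes below zero."""
    return max(0.0, elapsed_wall(start, now))


def timed(fn, *args, **kwargs):
    """Run fn and return (result, duration) using a clock that cannot go backwards."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


# Simulate a wall clock that was stepped back by about 24 ms between the reads.
start = 1000.0
now_after_step = start - 0.024
print(elapsed_wall(start, now_after_step))     # a small negative number
print(elapsed_clamped(start, now_after_step))  # 0.0
```

Clamping hides the clock step but keeps the metric valid; a monotonic clock avoids the problem at the source where it is available.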
,
Feb 27 2017
This is super weird because the lines in the traceback do not match the current file:
https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/scripts/bot_config.py#119
It's an old version of bot_config.py. There's something VERY wrong.
,
Feb 27 2017
This was fixed with:
https://github.com/luci/luci-py/commit/8f6e97dd03c75e8e42b92049eb4b18adaef3dce9
The live version of chrome-swarming doesn't have that picked up. It needs to be redeployed with a later version.
,
Feb 27 2017
Wow I'm blind; I hadn't realized it was chrome-swarming, sorry about that. I'll push it relatively soon.
,
Mar 6 2017
Checked again this morning and it now looks like 13 bots are quarantined:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3ACT&f=status%3Aquarantined&l=100&q=CT&s=status%3Adesc

Error message: "Quarantined: get_dimensions(): expected a dict, got None"

The following message is under events here:
https://chrome-swarming.appspot.com/bot?id=build1-m5&selected=1&sort_stats=total%3Adesc

Failed to call hook get_dimensions(): Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.0809741020203", which is not greater than or equal to "None".
Traceback (most recent call last):
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 238, in _call_hook_safe
    return _call_hook(chained, botobj, name, *args)
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 130, in hook
    return func(chained, botobj, name, *args, **kwargs)
  File "swarming_bot.1.zip/bot_code/bot_main.py", line 212, in _call_hook
    return hook(botobj, *args, **kwargs)
  File "swarming_bot.1.zip/config/bot_config.py", line 119, in hook
    hooks_durations.add(time.time() - start, fields=fields)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 589, in add
    self._incr(fields, target_fields, value, modify_fn=modify_fn)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 321, in _incr
    target_fields, delta, modify_fn=modify_fn)
  File "swarming_bot.1.zip/infra_libs/ts_mon/common/metric_store.py", line 281, in incr
    raise errors.MonitoringDecreasingValueError(name, None, delta)
MonitoringDecreasingValueError: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.0809741020203", which is not greater than or equal to "None".

Calling stack:
  0 swarming_bot.1.zip/api/bot.py:207:post_error()
  1 swarming_bot.1.zip/bot_code/bot_main.py:243:_call_hook_safe()
  2 swarming_bot.1.zip/bot_code/bot_main.py:252:_get_dimensions()
  3 swarming_bot.1.zip/bot_code/bot_main.py:743:_run_bot_inner()
  4 swarming_bot.1.zip/bot_code/bot_main.py:667:_run_bot()
  5 swarming_bot.1.zip/bot_code/bot_main.py:1220:main()
  6 swarming_bot.1.zip/__main__.py:164:CMDstart_bot()
  7 swarming_bot.1.zip/__main__.py:247:main()
  8 swarming_bot.1.zip/__main__.py:259:<module>()
  9 runpy.py:72:_run_code()
  10 runpy.py:162:_run_module_as_main()
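Reading the traceback and calling stack together, the quarantine message looks like a knock-on effect: the metric raises inside the hook wrapper, `_call_hook_safe` reports the error, and `_get_dimensions` then sees `None` instead of a dict. The following is a simplified model of that chain, not the real `bot_main.py` or ts_mon code; every class and function name in it is a hypothetical stand-in.

```python
class DecreasingValueError(Exception):
    """Stand-in for the monitoring error seen in the traceback."""


class CumulativeMetric(object):
    """Toy cumulative metric that rejects negative increments."""

    def __init__(self, name):
        self.name = name
        self.total = None  # no value recorded yet

    def add(self, delta):
        if delta < 0:
            raise DecreasingValueError(
                'metric "%s" was given value "%s", which is not >= "%s"'
                % (self.name, delta, self.total))
        self.total = (self.total or 0.0) + delta


def call_hook_safe(hook, errors):
    """Call a hook; on any exception, record the error and return None."""
    try:
        return hook()
    except Exception as e:
        errors.append('Failed to call hook %s(): %s' % (hook.__name__, e))
        return None


def dimensions_or_quarantine(hook, errors):
    """Return the hook's dict, or a quarantine message for any other result."""
    out = call_hook_safe(hook, errors)
    if not isinstance(out, dict):
        return 'Quarantined: %s(): expected a dict, got %r' % (hook.__name__, out)
    return out


durations = CumulativeMetric('hooks/durations')


def get_dimensions():
    dims = {'pool': ['CT']}
    durations.add(-0.08)  # simulated negative duration; raises before dims is returned
    return dims


errors = []
result = dimensions_or_quarantine(get_dimensions, errors)
print(result)
print(errors[0])
```

In this model the hook itself computed valid dimensions; only the duration bookkeeping failed, which is why the user-visible message mentions `get_dimensions()` rather than the metric.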
,
Mar 7 2017
I manually rebooted all 13 quarantined bots and they seem to be happy for now:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=last_seen&f=pool%3ACT&f=status%3Aquarantined&f=status%3Adead&l=1000&s=last_seen%3Aasc
I remember doing the same 2 weeks ago, but bots slowly started showing up as quarantined again after that. Let's see what happens this time.
,
Mar 7 2017
Looks like there is already one quarantined bot with the same message: https://chrome-swarming.appspot.com/bot?id=build76-m5&selected=1&sort_stats=total%3Adesc
,
Mar 7 2017
chrome-swarming is uploading hook durations to ts_mon from both bot_config.py and bot_main.py at the moment. I'm not surprised duplicate reporting is causing problems. maruel@ removed uploading from bot_config.py for chromium-swarm. Sounds like the same needs to be done for chrome-swarming.
,
Mar 7 2017
Oops, can someone take care of this? Currently overbooked. :(
,
Mar 13 2017
5 bots currently quarantined with this problem:
ct-vm-089   idle  2m ago   Ubuntu-14.04  Quarantined: get_dimensions(): expected a dict, got None
build76-m5  idle  59s ago  Android       Quarantined: get_dimensions(): expected a dict, got None
build4-m5   idle  54s ago  Android       Quarantined: get_dimensions(): expected a dict, got None
ct-vm-078   idle  29s ago  Ubuntu-14.04  Quarantined: get_dimensions(): expected a dict, got None
build79-m5  idle  19s ago  Android       Quarantined: get_dimensions(): expected a dict, got None
,
Mar 27 2017
On 3/27, 3 bots were quarantined with this problem:
build49-m5  idle  24s ago  Android  Quarantined: get_dimensions(): expected a dict, got None
build19-m5  idle  12s ago  Android  Quarantined: get_dimensions(): expected a dict, got None
build22-m5  idle  8s ago   Android  Quarantined: get_dimensions(): expected a dict, got None
Going to manually fix them.
,
Apr 3 2017
On 4/3, 5 bots were quarantined with this problem:
ct-vm-204   idle  1m ago   Ubuntu-14.04  Quarantined: get_state(): expected a dict, got None
build67-m5  idle  43s ago  Android       Quarantined: get_dimensions(): expected a dict, got None
ct-vm-039   idle  35s ago  Ubuntu-14.04  Quarantined: get_dimensions(): expected a dict, got None
ct-vm-002   idle  19s ago  Ubuntu-14.04  Quarantined: get_dimensions(): expected a dict, got None
build5-m5   idle  2s ago   Android       Quarantined: get_dimensions(): expected a dict, got None
Going to manually fix them.
,
May 1 2017
M-A fixed this problem a few weeks ago.
Comment 1 by mar...@chromium.org, Feb 27 2017