Issue 696644


Issue metadata

Status: Fixed
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




10+ build*-m5 bots are quarantined

Project Member Reported by rmis...@google.com, Feb 27 2017

Issue description


Last Thursday (2/23) I noticed that 17 build*-m5 bots were quarantined with the message:
"Quarantined: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None"."

Restarting them seemed to bring them back. I assumed it was a strange one-off problem, so I did not file a bug.

But today I again see 11 bots quarantined with the same message:
https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=last_seen&f=pool%3ACT&f=status%3Aquarantined&l=1000&s=last_seen%3Aasc


As an example of the complete message, the status on build46-m5 is:

Quarantined: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None".
Traceback (most recent call last):
File "swarming_bot.1.zip/bot_code/bot_main.py", line 213, in _call_hook_safe
return _call_hook(chained, botobj, name, *args)
File "swarming_bot.1.zip/bot_code/bot_main.py", line 162, in hook
return fn(*args, **kwargs)
File "swarming_bot.1.zip/bot_code/bot_main.py", line 187, in _call_hook
return hook(botobj, *args)
File "swarming_bot.1.zip/config/bot_config.py", line 119, in hook
hooks_durations.add(time.time() - start, fields=fields)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 589, in add
self._incr(fields, target_fields, value, modify_fn=modify_fn)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 321, in _incr
target_fields, delta, modify_fn=modify_fn)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metric_store.py", line 281, in incr
raise errors.MonitoringDecreasingValueError(name, None, delta)
MonitoringDecreasingValueError: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.023973941803", which is not greater than or equal to "None".
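
For readers following along, here is a minimal sketch of the failure mode in the traceback above: a duration computed as `time.time() - start` comes out slightly negative, and a cumulative metric refuses it. One plausible way to get a negative difference is a wall clock that is adjusted backward between the two readings; the thread does not confirm that this is what happened here. `CumulativeStore`, `DecreasingValueError`, and `elapsed_since` are hypothetical names for illustration only, not the ts_mon API or the actual bot code.

```python
import time


class DecreasingValueError(Exception):
    """Hypothetical stand-in for ts_mon's MonitoringDecreasingValueError."""


class CumulativeStore(object):
    """Toy store for monotonically increasing metrics (not the ts_mon API)."""

    def __init__(self):
        self._values = {}

    def incr(self, name, delta):
        # A cumulative metric may only grow, so a negative delta is rejected.
        if delta < 0:
            raise DecreasingValueError(
                'metric %r was given value %r' % (name, delta))
        self._values[name] = self._values.get(name, 0.0) + delta
        return self._values[name]


def elapsed_since(start, clock=time.time):
    """Returns the elapsed time, clamped at zero if the clock stepped back."""
    return max(0.0, clock() - start)


if __name__ == '__main__':
    store = CumulativeStore()
    # Simulate a wall clock that moves backward between two readings.
    readings = iter([100.0, 99.976])
    fake_clock = lambda: next(readings)
    start = fake_clock()
    try:
        store.incr('hooks/durations', fake_clock() - start)
    except DecreasingValueError as e:
        print('rejected: %s' % e)
    # The clamped helper never hands the store a negative value.
    store.incr('hooks/durations', elapsed_since(100.0, clock=lambda: 99.976))
```

Clamping is only one option. On Python 3, `time.monotonic()` avoids the problem at the source because it never goes backward.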

 

Comment 1 by mar...@chromium.org, Feb 27 2017

Cc: sergeybe...@chromium.org

Comment 2 by mar...@chromium.org, Feb 27 2017

Cc: bpastene@chromium.org vadimsh@chromium.org
Labels: -Pri-2 Pri-1
Status: Available (was: Untriaged)
This is super weird because the lines do not match:
https://chrome-internal.googlesource.com/infradata/config.git/+/master/configs/chromium-swarm/scripts/bot_config.py#119

It's an old version of bot_config.py. There's something VERY wrong.
This was fixed with https://github.com/luci/luci-py/commit/8f6e97dd03c75e8e42b92049eb4b18adaef3dce9

The live version of chrome-swarming hasn't picked that up yet. It needs to be redeployed with a later version.

Comment 4 by mar...@chromium.org, Feb 27 2017

Cc: -mar...@chromium.org
Components: -Infra>Labs Infra>Platform>Swarming
Owner: mar...@chromium.org
Status: Assigned (was: Available)
Wow I'm blind; I hadn't realized it was chrome-swarming, sorry about that. I'll push it relatively soon.

Comment 5 by rmis...@google.com, Mar 6 2017

I checked again this morning and it now looks like 13 bots are quarantined: https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3ACT&f=status%3Aquarantined&l=100&q=CT&s=status%3Adesc

Error message:
"Quarantined: get_dimensions(): expected a dict, got None"


The following message is under events here: https://chrome-swarming.appspot.com/bot?id=build1-m5&selected=1&sort_stats=total%3Adesc


Failed to call hook get_dimensions(): Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.0809741020203", which is not greater than or equal to "None".
Traceback (most recent call last):
File "swarming_bot.1.zip/bot_code/bot_main.py", line 238, in _call_hook_safe
return _call_hook(chained, botobj, name, *args)
File "swarming_bot.1.zip/bot_code/bot_main.py", line 130, in hook
return func(chained, botobj, name, *args, **kwargs)
File "swarming_bot.1.zip/bot_code/bot_main.py", line 212, in _call_hook
return hook(botobj, *args, **kwargs)
File "swarming_bot.1.zip/config/bot_config.py", line 119, in hook
hooks_durations.add(time.time() - start, fields=fields)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 589, in add
self._incr(fields, target_fields, value, modify_fn=modify_fn)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metrics.py", line 321, in _incr
target_fields, delta, modify_fn=modify_fn)
File "swarming_bot.1.zip/infra_libs/ts_mon/common/metric_store.py", line 281, in incr
raise errors.MonitoringDecreasingValueError(name, None, delta)
MonitoringDecreasingValueError: Monotonically increasing metric "swarming/bots/hooks/durations" was given value "-0.0809741020203", which is not greater than or equal to "None".
Calling stack:
0 swarming_bot.1.zip/api/bot.py:207:post_error()
1 swarming_bot.1.zip/bot_code/bot_main.py:243:_call_hook_safe()
2 swarming_bot.1.zip/bot_code/bot_main.py:252:_get_dimensions()
3 swarming_bot.1.zip/bot_code/bot_main.py:743:_run_bot_inner()
4 swarming_bot.1.zip/bot_code/bot_main.py:667:_run_bot()
5 swarming_bot.1.zip/bot_code/bot_main.py:1220:main()
6 swarming_bot.1.zip/__main__.py:164:CMDstart_bot()
7 swarming_bot.1.zip/__main__.py:247:main()
8 swarming_bot.1.zip/__main__.py:259:<module>()
9 runpy.py:72:_run_code()
10 runpy.py:162:_run_module_as_main()
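
The link between the exception above and the shorter "expected a dict, got None" quarantine message can be sketched roughly as follows. This assumes that the safe wrapper swallows the hook's exception and returns None, and that the caller then validates the result; that reading is inferred from the messages in this thread, not taken from bot_main.py. `call_hook_safe` and `validated_dict_hook` are hypothetical names.

```python
import logging


def call_hook_safe(name, hook, *args):
    """Calls hook and returns its result, or None if the hook raises."""
    try:
        return hook(*args)
    except Exception as e:
        # The real bot posts an error event to the server; here we only log.
        logging.error('Failed to call hook %s(): %s', name, e)
        return None


def validated_dict_hook(name, hook):
    """Returns (result, quarantine_message) for a hook expected to return a dict."""
    out = call_hook_safe(name, hook)
    if isinstance(out, dict):
        return out, None
    # Any failure inside the hook surfaces here as a None result.
    return {}, '%s(): expected a dict, got %r' % (name, out)


if __name__ == '__main__':
    def broken_hook():
        raise ValueError('metric was given a negative value')

    print(validated_dict_hook('get_dimensions', broken_hook)[1])
```

Under this assumption, any exception raised while recording the hook duration is enough to quarantine the bot, even though the hook's own logic is fine.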

Comment 6 by rmis...@google.com, Mar 7 2017

I manually rebooted all 13 quarantined bots and they seem to be happy for now: https://chrome-swarming.appspot.com/botlist?c=id&c=os&c=task&c=status&c=last_seen&f=pool%3ACT&f=status%3Aquarantined&f=status%3Adead&l=1000&s=last_seen%3Aasc

I remember doing the same 2 weeks ago, but bots slowly started showing up as quarantined again after that. Let's see what happens this time.

Comment 7 by rmis...@google.com, Mar 7 2017

Looks like there is already one quarantined bot with the same message:
https://chrome-swarming.appspot.com/bot?id=build76-m5&selected=1&sort_stats=total%3Adesc
chrome-swarming is uploading hook durations to ts_mon from both bot_config.py and bot_main.py at the moment.

I'm not surprised duplicate reporting is causing problems. maruel@ removed uploading from bot_config.py for chromium-swarm. Sounds like the same needs to be done for chrome-swarming.
Cc: mar...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Oops, can someone take care of this? Currently overbooked. :(

Comment 10 by rmis...@google.com, Mar 13 2017

5 bots currently quarantined with this problem:

ct-vm-089	idle	2m ago	Ubuntu-14.04	Quarantined: get_dimensions(): expected a dict, got None
build76-m5	idle	59s ago	Android	Quarantined: get_dimensions(): expected a dict, got None
build4-m5	idle	54s ago	Android	Quarantined: get_dimensions(): expected a dict, got None
ct-vm-078	idle	29s ago	Ubuntu-14.04	Quarantined: get_dimensions(): expected a dict, got None
build79-m5	idle	19s ago	Android	Quarantined: get_dimensions(): expected a dict, got None

Comment 11 by rmis...@google.com, Mar 27 2017

On 3/27, 3 bots are quarantined with this problem:

build49-m5	idle	24s ago	Android	Quarantined: get_dimensions(): expected a dict, got None
build19-m5	idle	12s ago	Android	Quarantined: get_dimensions(): expected a dict, got None
build22-m5	idle	8s ago	Android	Quarantined: get_dimensions(): expected a dict, got None

Going to manually fix them.
On 4/3, 5 bots are quarantined with this problem:


ct-vm-204	idle	1m ago	Ubuntu-14.04	Quarantined: get_state(): expected a dict, got None
build67-m5	idle	43s ago	Android	Quarantined: get_dimensions(): expected a dict, got None
ct-vm-039	idle	35s ago	Ubuntu-14.04	Quarantined: get_dimensions(): expected a dict, got None
ct-vm-002	idle	19s ago	Ubuntu-14.04	Quarantined: get_dimensions(): expected a dict, got None
build5-m5	idle	2s ago	Android	Quarantined: get_dimensions(): expected a dict, got None


Going to manually fix them.
Owner: mar...@chromium.org
Status: Fixed (was: Available)
M-A fixed this problem a few weeks ago.
