All shards: heartbeat is consistently failing
Issue description
12/08 17:12:46.978 ERROR| shard_client:0313| Heartbeat failed. JSONRPCException: DoesNotExist: Shard matching query does not exist. Lookup parameters were {'hostname': '${::fqdn}'}
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2045, in shard_heartbeat
    shard_obj = rpc_utils.retrieve_shard(shard_hostname=shard_hostname)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 922, in retrieve_shard
    return models.Shard.smart_get(shard_hostname)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 389, in get
    (self.model._meta.object_name, kwargs))
DoesNotExist: Shard matching query does not exist. Lookup parameters were {'hostname': '${::fqdn}'}
,
Dec 9 2017
We should at least report success / failure here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/shard/shard_client.py?l=367
My reading of this is that the shard_clients aren't updating state on the shards because the shard_client heartbeat RPC fails. That should cause all tests to fail. Is that happening?
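A minimal sketch of what reporting success / failure around the heartbeat could look like. This is not the actual shard_client code: `HeartbeatStats` and `run_heartbeat` are hypothetical names, and the counters are kept in memory only, where a real client would presumably export them to whatever metrics backend is in use.

```python
import logging


class HeartbeatStats:
    """Hypothetical in-memory counters for heartbeat outcomes."""

    def __init__(self):
        self.success = 0
        self.failure = 0
        self.consecutive_failures = 0

    def record(self, ok):
        """Update the counters for one heartbeat attempt."""
        if ok:
            self.success += 1
            self.consecutive_failures = 0
        else:
            self.failure += 1
            self.consecutive_failures += 1


def run_heartbeat(do_heartbeat, stats):
    """Call do_heartbeat() once and record whether it raised.

    do_heartbeat is any zero-argument callable that performs the RPC
    (an assumption; the real client's interface may differ).
    """
    try:
        do_heartbeat()
    except Exception:
        # Log the traceback, but also count the failure so it can be alerted on.
        logging.exception('Heartbeat failed.')
        stats.record(False)
        return False
    stats.record(True)
    return True
```

With counters like these, an alert on `consecutive_failures` exceeding some threshold would have flagged this outage without anyone having to read the logs.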
,
Dec 9 2017
Indeed, that shard hasn't received any new jobs since around noon: http://shortn/_eLrGTVZeMx
,
Dec 9 2017
Wait, NONE of the shards have received any new jobs since noon. This is a full scale outage, it seems. http://shortn/_73sSFQSdJW
,
Dec 9 2017
We get the shard hostname in ShardClient here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/shard/shard_client.py?type=cs&q=ShardClient+case:yes&sq=package:chromeos+file:src/third_party/autotest/files&l=398
This is deployed by Puppet, and the corrupted name (the literal '${::fqdn}' in the lookup parameters) makes me think this is a Puppet bug.
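One way to fail fast on this class of bug is to reject a hostname that still contains an uninterpolated template variable before sending any RPC. The sketch below is an assumption about how such a check might look, not something taken from shard_client.py; `validate_shard_hostname` is a hypothetical helper.

```python
import re

# Matches a leftover template variable such as ${::fqdn} or ${hostname}.
_UNINTERPOLATED = re.compile(r'\$\{[^}]*\}')


def validate_shard_hostname(hostname):
    """Return the stripped hostname, or raise ValueError if it looks corrupted."""
    if not hostname or not hostname.strip():
        raise ValueError('shard hostname is empty')
    if _UNINTERPOLATED.search(hostname):
        raise ValueError(
            'shard hostname looks like an uninterpolated template '
            'variable: %r' % hostname)
    return hostname.strip()
```

Calling this at startup with the configured value would turn the silent heartbeat failures into an immediate, clearly worded error.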
,
Dec 9 2017
Filed issue 793538 for alerting follow up.
,
Dec 9 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/de9780c7ec42b698aa997e2cb47358b9cc206b88
commit de9780c7ec42b698aa997e2cb47358b9cc206b88
Author: Allen Li <ayatane@chromium.org>
Date: Sat Dec 09 01:42:45 2017
,
Dec 9 2017
Landed fix: https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/525540
Ran Puppet and restarted shard-client.
,
Dec 9 2017
Shard heartbeats are dropping, looks fixed
Comment 1 by pprabhu@chromium.org
, Dec 9 2017