Issue 793532

Issue metadata

Status: Duplicate
Merged: issue 793538
Owner: ----
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




All shards: heartbeat is consistently failing

Project Member Reported by pprabhu@chromium.org, Dec 9 2017

Issue description

12/08 17:12:46.978 ERROR|      shard_client:0313| Heartbeat failed. JSONRPCException: DoesNotExist: Shard matching query does not exist. Lookup parameters were {'hostname': '${::fqdn}'}
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2045, in shard_heartbeat
    shard_obj = rpc_utils.retrieve_shard(shard_hostname=shard_hostname)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 922, in retrieve_shard
    return models.Shard.smart_get(shard_hostname)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 389, in get
    (self.model._meta.object_name, kwargs))
DoesNotExist: Shard matching query does not exist. Lookup parameters were {'hostname': '${::fqdn}'}
 
Summary: chromeos-server5.hot, chromeos-server12.cbf shard heartbeat is consistently failing (was: chromeos-server5.hot shard heartbeat is consistently failing)
Observed on another shard. All shards?

This problem isn't reflected in our viceroy dashboard: https://viceroy.corp.google.com/chromeos/capacity_health
Labels: -Pri-2 Pri-1
We should at least report success / failure here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/shard/shard_client.py?l=367
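A minimal sketch of what reporting success / failure around the heartbeat could look like. The names HeartbeatStats and heartbeat_with_reporting are hypothetical and are not taken from shard_client.py; a real change would presumably emit to the existing metrics backend rather than an in-memory counter.

```python
import collections
import logging


class HeartbeatStats(object):
    """Hypothetical in-memory tally of heartbeat outcomes."""

    def __init__(self):
        self.counts = collections.Counter()

    def record(self, success):
        # One counter per outcome, so a dashboard can alert on failures.
        self.counts['success' if success else 'failure'] += 1


def heartbeat_with_reporting(do_heartbeat, stats):
    """Run one heartbeat and record whether it succeeded.

    do_heartbeat is any zero-argument callable that raises on failure
    (an assumption made for this sketch). Returns True on success and
    False on failure.
    """
    try:
        do_heartbeat()
    except Exception:
        logging.exception('Heartbeat failed.')
        stats.record(False)
        return False
    stats.record(True)
    return True
```

With both outcomes counted, a run of consecutive failures like the one in this bug would show up as a rising failure count instead of only as log lines.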

My reading of this is that shard_clients aren't updating the state on the shard because the shard_client heartbeat RPC fails.
This should cause all tests to fail. Is that happening?
Indeed, that shard hasn't received any new jobs since around noon: http://shortn/_eLrGTVZeMx
Labels: -Pri-1 Pri-0
Wait, NONE of the shards have received any new jobs since noon.

This is a full scale outage, it seems.
http://shortn/_73sSFQSdJW
Summary: All shards: heartbeat is consistently failing (was: chromeos-server5.hot, chromeos-server12.cbf shard heartbeat is consistently failing)
We get the shard hostname in ShardClient here: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/shard/shard_client.py?type=cs&q=ShardClient+case:yes&sq=package:chromeos+file:src/third_party/autotest/files&l=398
This is deployed by Puppet, and the uninterpolated hostname '${::fqdn}' in the lookup parameters makes me think this is a Puppet bug.
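As an illustration only, the shard client could refuse to start when the configured hostname still looks like an unrendered template. The helper names below are hypothetical, and the pattern only covers the '${...}' and '<%...%>' placeholder styles; this is a sketch, not what the landed fix does.

```python
import re

# Matches leftover template placeholders such as '${::fqdn}' or '<%= x %>'.
_PLACEHOLDER_RE = re.compile(r'\$\{[^}]*\}|<%.*?%>')


def looks_uninterpolated(hostname):
    """Return True if hostname still contains a template placeholder."""
    return bool(_PLACEHOLDER_RE.search(hostname))


def validate_shard_hostname(hostname):
    """Return hostname unchanged, or raise ValueError if it looks unrendered."""
    if not hostname or looks_uninterpolated(hostname):
        raise ValueError('Shard hostname looks unrendered: %r' % (hostname,))
    return hostname
```

Failing fast at startup would turn a config rendering error into one clear message instead of a DoesNotExist on every heartbeat.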
Owner: ayatane@chromium.org
Status: Started (was: Untriaged)
Filed issue 793538 for alerting follow-up.
Project Member

Comment 9 by bugdroid1@chromium.org, Dec 9 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/de9780c7ec42b698aa997e2cb47358b9cc206b88

commit de9780c7ec42b698aa997e2cb47358b9cc206b88
Author: Allen Li <ayatane@chromium.org>
Date: Sat Dec 09 01:42:45 2017

Landed fix https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/525540

Ran Puppet and restarted shard-client
Labels: -Pri-0 Chase-Pending Pri-1
Status: Assigned (was: Started)
Shard heartbeat failures are dropping; looks fixed.
Cc: ayatane@chromium.org
Owner: ----
Status: Available (was: Assigned)
Mergedinto: 793538
Status: Duplicate (was: Available)
