Explain/mitigate recent shard database corruption
Reported by jrbarnette@chromium.org, Mar 23 2018
Issue description
Recently, a host entry on a shard acquired a number of problematic
properties:
* There was more than one attribute with the key "HWID"
* The host was labeled with both board:stumpy and board:whirlwind.
* Various other labels on the host suggested it might be either
stumpy or whirlwind.
All of these characteristics were restricted to the shard database;
the master database did not have the errors.
The presence of two different attributes both named "HWID" caused a
failure loop in shard-client due to this error:
MultipleObjectsReturned: get() returned more than one HostAttribute -- it returned 2! Lookup parameters were {'attribute': u'HWID', 'host': <Host: chromeos3-row1-rack2-host8>}
The shard-client failure loop took out at least one CQ run.
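A duplicate attribute key like the one described above could be detected with a grouped query. The following is only a sketch against an in-memory SQLite database: the table name `afe_host_attributes`, its columns, and the attribute values are assumptions made for illustration, not a confirmed description of the shard schema.

```python
import sqlite3

# Assumed schema for illustration only; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE afe_host_attributes ("
    "id INTEGER PRIMARY KEY, host_id INTEGER, attribute TEXT, value TEXT)"
)
# Placeholder rows: host 785 carries two HWID attributes, as in the report.
conn.executemany(
    "INSERT INTO afe_host_attributes (host_id, attribute, value) "
    "VALUES (?, ?, ?)",
    [
        (785, "HWID", "placeholder-hwid-a"),
        (785, "HWID", "placeholder-hwid-b"),
        (785, "serial_number", "placeholder-serial"),
        (786, "HWID", "placeholder-hwid-c"),
    ],
)


def find_duplicate_attributes(connection):
    """Return (host_id, attribute, count) for keys stored more than once."""
    query = (
        "SELECT host_id, attribute, COUNT(*) "
        "FROM afe_host_attributes "
        "GROUP BY host_id, attribute "
        "HAVING COUNT(*) > 1 "
        "ORDER BY host_id, attribute"
    )
    return connection.execute(query).fetchall()


duplicates = find_duplicate_attributes(conn)
print(duplicates)
```

Running a check of this kind periodically on each shard might surface the corruption before shard-client trips over it.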
Fixing the problem required manually editing the database on the
shard to mark the host entry invalid, effectively deleting the
host:
* Deleting the host on the master didn't work, because with
shard-client down, the shard couldn't pick up changes in
the master DB.
* Deleting the host on the shard didn't work, because the
same error that killed shard-client also killed the `atest`
command.
We should consider several corrective actions for this failure:
* Ideally, identify the cause of the corruption, and change the
code to prevent recurrence.
* Consider whether sentinel could be adjusted to be more aggressive
about forcing shard databases to match with the master.
* Consider whether shard-client can be adjusted to ignore (or
correct) hosts with errors of this sort.
* Consider whether we can implement a simple command to force the
shard DB to match the master DB, at least for simple cases like
a single host.
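As a rough illustration of the last item, a single-host resync could start by computing the difference between the two views of the host. This is a sketch only: `plan_host_resync` and the dict layout are hypothetical, and the sample data is invented (the report does not say which board the master recorded, so stumpy is chosen arbitrarily).

```python
def plan_host_resync(master, shard):
    """Compute the changes needed to make the shard's host match the master's.

    Both arguments are dicts with 'labels' (a set of strings) and
    'attributes' (a list of (key, value) pairs, so duplicates can be shown).
    """
    master_attrs = dict(master["attributes"])
    plan = {
        "labels_to_remove": sorted(shard["labels"] - master["labels"]),
        "labels_to_add": sorted(master["labels"] - shard["labels"]),
        "attributes_to_delete": [],
        "attributes_to_set": {},
    }
    kept = set()
    for key, value in shard["attributes"]:
        # Drop rows that duplicate a kept key or disagree with the master.
        if key in kept or master_attrs.get(key) != value:
            plan["attributes_to_delete"].append((key, value))
        else:
            kept.add(key)
    for key, value in master_attrs.items():
        if key not in kept:
            plan["attributes_to_set"][key] = value
    return plan


# Invented sample data; the master's real contents are not in the report.
master_host = {
    "labels": {"board:stumpy"},
    "attributes": [("HWID", "placeholder-hwid-a")],
}
shard_host = {
    "labels": {"board:stumpy", "board:whirlwind"},
    "attributes": [
        ("HWID", "placeholder-hwid-a"),
        ("HWID", "placeholder-hwid-b"),
    ],
}
plan = plan_host_resync(master_host, shard_host)
print(plan)
```

Separating the plan from its application would let an operator review the changes before anything is written to the shard DB.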
,
Mar 23 2018
Please include the stacktrace that shard_client was failing with. Will make it much easier to identify and fix the crashloop.
,
Mar 23 2018
,
Mar 26 2018
Full stack trace with surrounding context:
03/22 10:40:51.565 INFO | shard_client:0184| Heartbeat response contains incorrect_host_ids [785] which will be deleted.
03/22 10:40:51.575 INFO | models:0753| Preconditions for deleting host chromeos3-row1-rack2-host8...
03/22 10:40:51.579 INFO | models:0760| Deleting attribute HostAttribute object...
03/22 10:40:51.582 ERROR| email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 510, in main
main_without_exception_handling(options)
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 534, in main_without_exception_handling
_heartbeat_client.loop(options.lifetime_hours)
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 412, in loop
self.tick()
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 400, in tick
success = self.do_heartbeat()
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 490, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 393, in do_heartbeat
self.process_heartbeat_response(response)
File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 490, in wrapper
return fn(*args, **kwargs)
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 185, in process_heartbeat_response
self._remove_incorrect_hosts(incorrect_host_ids)
File "/usr/local/autotest/scheduler/shard/shard_client.py", line 212, in _remove_incorrect_hosts
models.Host.objects.filter(id__in=incorrect_host_ids).delete()
File "/usr/local/autotest/frontend/afe/model_logic.py", line 477, in delete
model.delete()
File "/usr/local/autotest/frontend/afe/models.py", line 761, in delete
self.delete_attribute(host_attribute.attribute)
File "/usr/local/autotest/frontend/afe/model_logic.py", line 1344, in delete_attribute
attribute_model.objects.get(**get_args).delete()
File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
return self.get_query_set().get(*args, **kwargs)
File "/usr/local/autotest/site-packages/django/db/models/query.py", line 393, in get
(self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one HostAttribute -- it returned 2! Lookup parameters were {'attribute': u'HWID', 'host': <Host: chromeos3-row1-rack2-host8>}
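The trace shows the failure coming from a `get(...)` lookup that expects exactly one matching row. The sketch below contrasts that strict behavior with deleting every match. `AttributeStore` is a hypothetical in-memory stand-in written for this example; it is neither Django's API nor the autotest code.

```python
class MultipleObjectsReturned(Exception):
    """Stand-in for the Django exception of the same name."""


class AttributeStore:
    """Tiny in-memory stand-in for a model manager (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a (host, attribute, value) tuple

    def filter(self, host, attribute):
        return [r for r in self.rows if r[0] == host and r[1] == attribute]

    def get(self, host, attribute):
        matches = self.filter(host, attribute)
        if not matches:
            raise LookupError("no matching attribute")
        if len(matches) > 1:
            raise MultipleObjectsReturned(
                "get() returned %d rows" % len(matches))
        return matches[0]

    def delete_strict(self, host, attribute):
        """Mirrors get(...).delete(): fails when duplicates exist."""
        self.rows.remove(self.get(host, attribute))

    def delete_all_matching(self, host, attribute):
        """Removes every matching row and returns how many were removed."""
        matches = self.filter(host, attribute)
        self.rows = [r for r in self.rows if r not in matches]
        return len(matches)


host = "chromeos3-row1-rack2-host8"
store = AttributeStore([
    (host, "HWID", "placeholder-hwid-a"),
    (host, "HWID", "placeholder-hwid-b"),
])

strict_failed = False
try:
    store.delete_strict(host, "HWID")
except MultipleObjectsReturned:
    strict_failed = True  # the duplicate rows are still present

removed = store.delete_all_matching(host, "HWID")
print(strict_failed, removed, store.rows)
```

Whether deleting all duplicates is the right behavior for the real `delete_attribute` is a design decision for the owners of that code; the sketch only shows that the strict lookup is what turns a duplicate row into a crash.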
,
Mar 26 2018
A simple mitigation is to handle the exception without crashing the service. File follow-up bugs for more detailed work.
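One way to read this mitigation is to isolate each host deletion so that a single bad host cannot terminate the loop. The sketch below is an assumption about the shape of such a fix, not the actual change: `remove_hosts_tolerantly` and `fake_delete` are hypothetical names, and host ids 784 and 786 are invented alongside the reported 785.

```python
import logging


def remove_hosts_tolerantly(host_ids, delete_host):
    """Delete each host independently; return ids that could not be deleted."""
    failed = []
    for host_id in host_ids:
        try:
            delete_host(host_id)
        except Exception:
            # Log with traceback and keep going instead of crashing the service.
            logging.exception(
                "Failed to delete host %s; continuing", host_id)
            failed.append(host_id)
    return failed


deleted = []


def fake_delete(host_id):
    """Hypothetical deleter that fails for the corrupted host."""
    if host_id == 785:
        raise RuntimeError("get() returned more than one HostAttribute")
    deleted.append(host_id)


failed = remove_hosts_tolerantly([784, 785, 786], fake_delete)
print(deleted, failed)
```

Returning the failed ids rather than swallowing them silently leaves room to report them in a metric or alert, so the underlying corruption still gets attention.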
,
Mar 29 2018
,
Apr 2 2018
Comment 1 by jrbarnette@chromium.org
, Mar 23 2018