
Issue 825051

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 827330




Explain/mitigate recent shard database corruption

Reported by jrbarnette@chromium.org, Mar 23 2018

Issue description

Recently, a host entry on a shard acquired a number of problematic
properties:
  * There was more than one attribute with the key "HWID"
  * The host was labeled with both board:stumpy and board:whirlwind.
  * Various other labels on the host suggested it might be either
    stumpy or whirlwind.

All of these characteristics were restricted to the shard database;
the master database did not have the errors.

The presence of two different attributes both named "HWID" caused a
failure loop in shard-client due to this error:
MultipleObjectsReturned: get() returned more than one HostAttribute -- it returned 2! Lookup parameters were {'attribute': u'HWID', 'host': <Host: chromeos3-row1-rack2-host8>}
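
One way to spot this kind of corruption before it trips shard-client is to query for (host, attribute) pairs that are stored more than once. The following is only a sketch: it uses an in-memory SQLite database as a stand-in, and the table name `host_attributes`, its columns, and the attribute values are assumptions for illustration, not the real shard schema (which lives in MySQL).

```python
import sqlite3

# Hypothetical stand-in schema; the real shard database is MySQL and its
# table and column names may differ.
SCHEMA = """
CREATE TABLE host_attributes (
    id INTEGER PRIMARY KEY,
    host_id INTEGER NOT NULL,
    attribute TEXT NOT NULL,
    value TEXT NOT NULL
)
"""

# Any (host_id, attribute) pair that appears more than once would make a
# single-row lookup on that pair ambiguous.
FIND_DUPLICATES = """
SELECT host_id, attribute, COUNT(*) AS n
FROM host_attributes
GROUP BY host_id, attribute
HAVING COUNT(*) > 1
ORDER BY host_id, attribute
"""


def find_duplicate_attributes(conn):
    """Return (host_id, attribute, count) for every key stored more than once."""
    return conn.execute(FIND_DUPLICATES).fetchall()


def demo():
    """Build a tiny database with one duplicated key and report it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO host_attributes (host_id, attribute, value) "
        "VALUES (?, ?, ?)",
        [
            (785, "HWID", "value-a"),  # made-up values
            (785, "HWID", "value-b"),  # second row with the same key
            (785, "serial", "x"),
            (786, "HWID", "value-c"),
        ],
    )
    return find_duplicate_attributes(conn)
```

With the made-up rows above, `demo()` reports the single duplicated pair for host 785.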

The shard-client failure loop took out at least one CQ run.

Fixing the problem required manually editing the database on the
shard to mark the host entry invalid, effectively deleting the
host:
  * Deleting the host on the master didn't work, because with
    shard-client down, the shard couldn't pick up changes in
    the master DB.
  * Deleting the host on the shard didn't work, because the
    same error that killed shard-client also killed the `atest`
    command.

We should consider several corrective actions for this failure:
  * Ideally, identify the cause of the corruption, and change the
    code to prevent recurrence.
  * Consider whether sentinel could be adjusted to be more aggressive
    about forcing shard databases to match with the master.
  * Consider whether shard-client can be adjusted to ignore (or
    correct) hosts with errors of this sort.
  * Consider whether we can implement a simple command to force the
    shard DB to match the master DB, at least for simple cases like
    a single host.
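
As a rough illustration of the last item, the comparison step of a "make this one host match the master" command could look like the sketch below. The function name `diff_host` and the dict layout are hypothetical, and the example data (including which board the master recorded) is invented purely for illustration; a real tool would read both sides from the databases and then apply the resulting changes.

```python
from collections import Counter


def diff_host(master, shard):
    """Compare one host's labels and attributes between master and shard.

    Both arguments are plain dicts of the hypothetical form
    {"labels": set of str, "attributes": list of (key, value) pairs}.
    Attributes are kept as a list so duplicate keys on the shard stay visible.
    """
    master_attrs = Counter(master["attributes"])
    shard_attrs = Counter(shard["attributes"])
    return {
        "labels_to_add": sorted(master["labels"] - shard["labels"]),
        "labels_to_remove": sorted(shard["labels"] - master["labels"]),
        # Counter subtraction keeps only positive counts, so an extra copy
        # of an identical (key, value) row also shows up as removable.
        "attributes_to_add": sorted((master_attrs - shard_attrs).elements()),
        "attributes_to_remove": sorted((shard_attrs - master_attrs).elements()),
    }


# Invented example data shaped like the corruption described above.
master_view = {
    "labels": {"board:stumpy"},
    "attributes": [("HWID", "hwid-a")],
}
shard_view = {
    "labels": {"board:stumpy", "board:whirlwind"},
    "attributes": [("HWID", "hwid-a"), ("HWID", "hwid-b")],
}
changes = diff_host(master_view, shard_view)
```

Treating the master as authoritative keeps the logic one-directional: anything present only on the shard is a candidate for removal, and anything present only on the master is a candidate for addition.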

 
Labels: Chase-Pending
Labels: -Pri-3 Pri-1
Please include the stacktrace that shard_client was failing with. Will make it much easier to identify and fix the crashloop.
Owner: jrbarnette@chromium.org
Owner: ----
Full stack trace with surrounding context:

03/22 10:40:51.565 INFO |      shard_client:0184| Heartbeat response contains incorrect_host_ids [785] which will be deleted.
03/22 10:40:51.575 INFO |            models:0753| Preconditions for deleting host chromeos3-row1-rack2-host8...
03/22 10:40:51.579 INFO |            models:0760|   Deleting attribute HostAttribute object...
03/22 10:40:51.582 ERROR|     email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 510, in main
    main_without_exception_handling(options)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 534, in main_without_exception_handling
    _heartbeat_client.loop(options.lifetime_hours)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 412, in loop
    self.tick()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 400, in tick
    success = self.do_heartbeat()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 490, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 393, in do_heartbeat
    self.process_heartbeat_response(response)
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 490, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 185, in process_heartbeat_response
    self._remove_incorrect_hosts(incorrect_host_ids)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 212, in _remove_incorrect_hosts
    models.Host.objects.filter(id__in=incorrect_host_ids).delete()
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 477, in delete
    model.delete()
  File "/usr/local/autotest/frontend/afe/models.py", line 761, in delete
    self.delete_attribute(host_attribute.attribute)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 1344, in delete_attribute
    attribute_model.objects.get(**get_args).delete()
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 393, in get
    (self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one HostAttribute -- it returned 2! Lookup parameters were {'attribute': u'HWID', 'host': <Host: chromeos3-row1-rack2-host8>}

Labels: -Chase-Pending Chase
Owner: nxia@chromium.org
A simple mitigation is to handle the exception without crashing the service.

File follow-up bugs for more detailed work.
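
The mitigation described in this comment could be sketched roughly as follows. This is not necessarily the change that landed: `remove_incorrect_hosts` and the `delete_host` callable are hypothetical stand-ins for the deletion step seen in the stack trace. The idea is to delete hosts one at a time, log any failure, and keep going, so that one corrupt host entry cannot terminate the whole process.

```python
import logging


def remove_incorrect_hosts(host_ids, delete_host):
    """Delete each host independently; return the ids that could not be deleted.

    `delete_host` is a hypothetical callable that deletes a single host by id
    and may raise if the host's rows are inconsistent.
    """
    failed = []
    for host_id in host_ids:
        try:
            delete_host(host_id)
        except Exception:
            # Log with traceback and continue, rather than letting the
            # exception propagate and terminate the caller's loop.
            logging.exception("Failed to delete host %s; continuing", host_id)
            failed.append(host_id)
    return failed
```

Returning the failed ids gives the caller something to report or retry, so the corruption is still surfaced instead of being silently swallowed.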

Comment 6 by nxia@chromium.org, Mar 29 2018

Blockedon: 827330
Status: Fixed (was: Untriaged)
