Issue 770865

Issue metadata

Status: Fixed
Closed: Nov 2017
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking: issue 780741

shard db's afe_shards table was inconsistent with master, causing shard_client crashloop

Project Member Reported by, Oct 2 2017

Issue description

10/02 14:01:09.009 ERROR|     email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/shard/", line 431, in main
  File "/usr/local/autotest/scheduler/shard/", line 458, in main_without_exception_handling
  File "/usr/local/autotest/scheduler/shard/", line 373, in loop
  File "/usr/local/autotest/scheduler/shard/", line 366, in tick
  File "/usr/local/autotest/site-packages/chromite/lib/", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/", line 359, in do_heartbeat
    self._mark_jobs_as_uploaded([job['id'] for job in packet['jobs']])
  File "/usr/local/autotest/scheduler/shard/", line 251, in _mark_jobs_as_uploaded
  File "/usr/local/autotest/scheduler/shard/", line 223, in shard
    self._shard = models.Shard.smart_get(self.hostname)
  File "/usr/local/autotest/frontend/afe/", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/", line 393, in get
    (self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one Shard -- it returned 2! Lookup parameters were {'hostname': ''}

This is a production shard.

akeshet@akeshet:~/chromiumos/src$ atest shard list | grep chromeos-skunk-1
208    board:auron_paine
akeshet@akeshet:~/chromiumos/src$ atest server list | grep chromeos-skunk-1
Hostname     :

(Was there an alert about this shard? I didn't see one.)
I'm perplexed by the failure. There is only 1 chromeos-skunk-1 entry in the database.

There is also a chromeos-skunk1. Are we somehow emitting a wildcard query that matches both?
Also, why is shard_client emitting a django query at all? Which db is it querying?

I thought shard_client is only supposed to use RPCs to interact with master.
shard_client is hitting the local database, where there are indeed two entries in the shard table.

I don't know why we are even hitting the local shard table; it makes no sense. I will fix this incident by deleting the errant entry.
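
For reference, the condition behind the MultipleObjectsReturned error (two afe_shards rows sharing one hostname) can be checked for directly. This is only a minimal sketch: it uses an in-memory sqlite3 database as a stand-in for the shard's MySQL db, the hostname "shard.example" is a placeholder rather than the real one, and find_duplicate_hostnames is a hypothetical helper, not part of autotest.

```python
import sqlite3


def find_duplicate_hostnames(conn):
    """Return {hostname: [ids]} for hostnames that appear in more than one row."""
    rows = conn.execute(
        "SELECT hostname, id FROM afe_shards ORDER BY hostname, id"
    ).fetchall()
    by_host = {}
    for hostname, shard_id in rows:
        by_host.setdefault(hostname, []).append(shard_id)
    return {host: ids for host, ids in by_host.items() if len(ids) > 1}


def make_demo_db():
    """Build a toy afe_shards table with the same shape as the incident.

    Assumption: only the id and hostname columns matter here; the hostname
    is a placeholder.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
    conn.executemany(
        "INSERT INTO afe_shards (id, hostname) VALUES (?, ?)",
        [(207, "shard.example"), (208, "shard.example")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(find_duplicate_hostnames(make_demo_db()))
```

Any hostname returned here would make a get()-style lookup by hostname fail the same way smart_get did above.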

akeshet@akeshet:~/chromiumos/src/third_party/autotest/files$ autotest-db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3761212
Server version: 5.5.57-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from afe_shards;
| id  | hostname                             |
| 207 | |
| 208 | |
2 rows in set (0.00 sec)

Entry #207 is the wrong one, deleting it.
Innnteresting... one of the DUTs in the shard-db is still stuck on id=207.

mysql> DELETE FROM afe_shards where id=207;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`chromeos_autotest_db`.`afe_hosts`, CONSTRAINT `hosts_to_shard_ibfk` FOREIGN KEY (`shard_id`) REFERENCES `afe_shards` (`id`))
mysql> SELECT * from afe_hosts where shard_id=207;
| id   | hostname                   | locked | synch_id | status        | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
| 4256 | chromeos4-row8-rack4-host8 |      0 |     NULL | Repair Failed |       1 |          0 |         NULL | NULL      |     1 |      1 |      207 |             |

All other hosts on this shard db have the correct id. I'll just update this host to match.
Have to do the same for afe_jobs...
mysql> select COUNT(*) from afe_jobs where shard_id=207;
| COUNT(*) |
|    15581 |
1 row in set (0.00 sec)

mysql> select COUNT(*) from afe_jobs where shard_id=208;
| COUNT(*) |
|      120 |
1 row in set (0.00 sec)


afe_jobs is pretty messed up: most of the jobs are associated with the incorrect shard id.
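
For context, the manual repair attempted above (re-point every row that references the stale shard id, then delete the stale row) would look roughly like the sketch below. It is illustrative only: sqlite3 stands in for MySQL, the tables carry just the columns needed, the hostname is a placeholder, and repoint_and_delete is a hypothetical helper. It is not what was actually run; the board ended up being moved to another shard instead.

```python
import sqlite3

# Tables that carry a shard_id column referencing afe_shards in this thread.
CHILD_TABLES = ("afe_hosts", "afe_jobs")


def repoint_and_delete(conn, old_id, new_id):
    """Move child rows from old_id to new_id, then delete the old shard row.

    Runs in one transaction so a failure leaves the tables unchanged.
    """
    with conn:  # commits on success, rolls back on exception
        for table in CHILD_TABLES:
            conn.execute(
                f"UPDATE {table} SET shard_id = ? WHERE shard_id = ?",
                (new_id, old_id),
            )
        conn.execute("DELETE FROM afe_shards WHERE id = ?", (old_id,))


def make_demo_db():
    """Toy schema with a foreign key from hosts and jobs to shards."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # sqlite3 needs this to enforce FKs
    conn.execute("CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
    conn.execute(
        "CREATE TABLE afe_hosts ("
        "id INTEGER PRIMARY KEY, shard_id INTEGER REFERENCES afe_shards(id))"
    )
    conn.execute(
        "CREATE TABLE afe_jobs ("
        "id INTEGER PRIMARY KEY, shard_id INTEGER REFERENCES afe_shards(id))"
    )
    conn.executemany(
        "INSERT INTO afe_shards VALUES (?, ?)",
        [(207, "shard.example"), (208, "shard.example")],
    )
    conn.execute("INSERT INTO afe_hosts VALUES (4256, 207)")
    conn.executemany(
        "INSERT INTO afe_jobs VALUES (?, ?)", [(1, 207), (2, 207), (3, 208)]
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    repoint_and_delete(db, old_id=207, new_id=208)
    print(db.execute("SELECT id FROM afe_shards").fetchall())
```

Deleting the parent row first fails with a foreign key error, as in the MySQL session above, which is why the child rows are updated before the DELETE.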
Summary: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage (was: shard_client crashlooping on chromeos-skunk-1)
OK, aggressive fix time. I'm going to move board:auron_paine from chromeos-skunk-1 to chromeos-skunk-2.

However, since I'm worried about misbehavior from chromeos-skunk-1, I'm first going to remove it from serverdb and the shard table. I may even wipe it.
auron_paine moved to chromeos-skunk-2. The new shard is ticking, I expect it should work correctly.

The old shard is exception-looping. I'm going to wipe its tables to be safe...
(note: some job history may be lost, for jobs that ran on that shard)
akeshet@akeshet:~/chromiumos/chromeos-admin$ ./bin/run_server_task ShardCleanupTask --host_server

^ did a lot of cleanup but also seemed to die in puppet due to Issue 770903
Labels: Chase-Pending
Owner: ----
Status: Available (was: Untriaged)
I believe the production issue is solved. Chase-Pending follow up questions:

1) How did the shard's local afe_shards table get inconsistent with master db?
   (theory: this shard was added and removed and re-added to production under a new id)
2) Should db_sentinel notice this inconsistency?
3) Should db_sentinel heal such inconsistency?
Summary: shard db's afe_shards table was inconsistent with master, causing shard_client crashloop (was: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage)
Labels: -Chase-Pending Chase
sentinel should enforce that every shard's afe_shards table is a subset of the master's
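
A rough sketch of that subset check, for illustration only: it assumes each afe_shards row can be compared as an (id, hostname) pair, and find_stale_shards plus the placeholder hostnames are hypothetical. The real sentinel change is in the CL and may compare rows differently.

```python
def find_stale_shards(master_shards, local_shards):
    """Return local (id, hostname) rows that are missing from master, sorted.

    An empty result means the local afe_shards table is a subset of master's.
    """
    return sorted(set(local_shards) - set(master_shards))


if __name__ == "__main__":
    # Placeholder rows: master knows the shard as id 208 only, while the
    # shard's local table still holds a leftover row under id 207.
    master = [(136, "other.example"), (208, "shard.example")]
    local = [(207, "shard.example"), (208, "shard.example")]
    for shard_id, hostname in find_stale_shards(master, local):
        print("Shard %d (%s) does not exist in master DB" % (shard_id, hostname))
```

Comparing full (id, hostname) pairs rather than hostnames alone is what catches a shard that was re-added under a new id, the theory floated in question 1 above.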
CL under review.
pprabhu to review
Project Member

Comment 22 by, Oct 24 2017

The following revision refers to this bug:

commit 74496e17ebd0a2ad8bbcb4932c58de069cf3e2c6
Author: Shuqian Zhao <>
Date: Tue Oct 24 05:39:05 2017

Status: Fixed (was: Available)
Status: Assigned (was: Fixed)
Either the logging is confusing, or something is not behaving correctly. I see the following message in the sentinel logs:

2017-11-15 12:28:47,727 ERRO| Shard 136 ( does not exist in master DB
2017-11-15 12:28:47,753 INFO| Done.

However, if I examine atest shard list I see:

136   board:guado, board:guado_moblab

I see the same message on many other shards too, in sentinel logs.
Status: Started (was: Assigned)
Blocking: 780741
Project Member

Comment 27 by, Nov 16 2017

The following revision refers to this bug:

commit d8075d2d36f02209266f72a27104b6cd596ec1cc
Author: Shuqian Zhao <>
Date: Thu Nov 16 07:52:12 2017

Status: Fixed (was: Started)
