Issue metadata

Status: Fixed
Closed: Nov 2017
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking: issue 780741


Issue 770865: shard db's afe_shards table was inconsistent with master, causing shard_client crashloop

Reported by, Oct 2 2017 Project Member

Issue description

10/02 14:01:09.009 ERROR|     email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/shard/", line 431, in main
  File "/usr/local/autotest/scheduler/shard/", line 458, in main_without_exception_handling
  File "/usr/local/autotest/scheduler/shard/", line 373, in loop
  File "/usr/local/autotest/scheduler/shard/", line 366, in tick
  File "/usr/local/autotest/site-packages/chromite/lib/", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/", line 359, in do_heartbeat
    self._mark_jobs_as_uploaded([job['id'] for job in packet['jobs']])
  File "/usr/local/autotest/scheduler/shard/", line 251, in _mark_jobs_as_uploaded
  File "/usr/local/autotest/scheduler/shard/", line 223, in shard
    self._shard = models.Shard.smart_get(self.hostname)
  File "/usr/local/autotest/frontend/afe/", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/", line 393, in get
    (self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one Shard -- it returned 2! Lookup parameters were {'hostname': ''}

Comment 1 by, Oct 2 2017

This is a production shard.

akeshet@akeshet:~/chromiumos/src$ atest shard list | grep chromeos-skunk-1
208    board:auron_paine
akeshet@akeshet:~/chromiumos/src$ atest server list | grep chromeos-skunk-1
Hostname     :

Comment 2 by, Oct 2 2017


Comment 3 by, Oct 2 2017

(Was there an alert about this shard? I didn't see one.)

Comment 4 by, Oct 2 2017

I'm perplexed by the failure. There is only one chromeos-skunk-1 entry in the database.

There is also a chromeos-skunk1. Are we somehow emitting a wildcard query that matches both?

Comment 5 by, Oct 2 2017

Also, why is shard_client emitting a django query at all? Which db is it querying?

I thought shard_client is only supposed to use RPCs to interact with master.

Comment 6 by, Oct 2 2017

shard_client is hitting the local database, where there are indeed two entries in shard_table.

I don't know why we are even hitting the local shard_table; it makes no sense. I will fix this incident by deleting the errant entry.

akeshet@akeshet:~/chromiumos/src/third_party/autotest/files$ autotest-db
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3761212
Server version: 5.5.57-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from afe_shards;
| id  | hostname                             |
| 207 | |
| 208 | |
2 rows in set (0.00 sec)
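
For anyone reading this later: the traceback in the issue description is what a get-exactly-one lookup does when a key assumed to be unique matches two rows. Below is a minimal sketch of that behavior using an in-memory SQLite table, not Django or the real MySQL schema. The `get_one` helper, the `MultipleRowsReturned` exception and the `shard.example` hostname are hypothetical stand-ins for `smart_get`, Django's exception and the elided hostname above.

```python
import sqlite3


class MultipleRowsReturned(Exception):
    """Hypothetical stand-in for Django's MultipleObjectsReturned."""


def get_one(conn, hostname):
    """Return the single afe_shards row for hostname, or raise."""
    rows = conn.execute(
        "SELECT id, hostname FROM afe_shards WHERE hostname = ?",
        (hostname,),
    ).fetchall()
    if not rows:
        raise LookupError("no shard with hostname %r" % hostname)
    if len(rows) > 1:
        raise MultipleRowsReturned(
            "expected one shard for %r, found %d" % (hostname, len(rows)))
    return rows[0]


def make_demo_db():
    """Build a toy afe_shards table with a duplicated hostname."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
    # Assumption: both ids carry the same (placeholder) hostname, which is
    # what the MultipleObjectsReturned error in this issue implies.
    conn.executemany(
        "INSERT INTO afe_shards VALUES (?, ?)",
        [(207, "shard.example"), (208, "shard.example")],
    )
    return conn
```

With both rows present, `get_one(conn, "shard.example")` raises; once the stale row is gone the lookup returns the remaining row.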

Comment 7 by, Oct 2 2017

Entry #207 is the wrong one, deleting it.

Comment 8 by, Oct 2 2017

Innnteresting... one of the DUTs in the shard-db is still stuck on id=207.

mysql> DELETE FROM afe_shards where id=207;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`chromeos_autotest_db`.`afe_hosts`, CONSTRAINT `hosts_to_shard_ibfk` FOREIGN KEY (`shard_id`) REFERENCES `afe_shards` (`id`))
mysql> SELECT * from afe_hosts where shard_id=207;
| id   | hostname                   | locked | synch_id | status        | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
| 4256 | chromeos4-row8-rack4-host8 |      0 |     NULL | Repair Failed |       1 |          0 |         NULL | NULL      |     1 |      1 |      207 |             |

Comment 9 by, Oct 2 2017

All other hosts on this shard db have the correct id. I'll just update this host to match.

Comment 10 by, Oct 2 2017

Have to do the same for afe_jobs...
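
A rough sketch of the re-point-then-delete sequence described in comments 8 to 10, in case it helps the next person. It runs against an in-memory SQLite database with a toy schema, not the production MySQL db; only the table names, the `shard_id` column and the ids 207/208 come from this issue. The `repoint_and_delete` helper and the sample rows are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Toy schema: hosts and jobs both reference afe_shards.id."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT);
        CREATE TABLE afe_hosts (
            id INTEGER PRIMARY KEY,
            shard_id INTEGER REFERENCES afe_shards (id));
        CREATE TABLE afe_jobs (
            id INTEGER PRIMARY KEY,
            shard_id INTEGER REFERENCES afe_shards (id));
        INSERT INTO afe_shards VALUES (207, 'shard.example');
        INSERT INTO afe_shards VALUES (208, 'shard.example');
        INSERT INTO afe_hosts VALUES (4256, 207);
        INSERT INTO afe_jobs VALUES (1, 207);
        INSERT INTO afe_jobs VALUES (2, 207);
        INSERT INTO afe_jobs VALUES (3, 208);
    """)
    return conn


def repoint_and_delete(conn, old_id, new_id):
    """Move child rows from old_id to new_id, then drop the old shard row.

    Everything runs in one transaction so a failure leaves the db unchanged.
    """
    with conn:
        for table in ("afe_hosts", "afe_jobs"):
            conn.execute(
                "UPDATE %s SET shard_id = ? WHERE shard_id = ?" % table,
                (new_id, old_id),
            )
        conn.execute("DELETE FROM afe_shards WHERE id = ?", (old_id,))
```

Deleting id 207 directly fails with an integrity error while child rows still point at it, mirroring ERROR 1451 above; after `repoint_and_delete(conn, 207, 208)` the delete goes through.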

Comment 11 by, Oct 2 2017

mysql> select COUNT(*) from afe_jobs where shard_id=207;
| COUNT(*) |
|    15581 |
1 row in set (0.00 sec)

mysql> select COUNT(*) from afe_jobs where shard_id=208;
| COUNT(*) |
|      120 |
1 row in set (0.00 sec)


afe_jobs is pretty messed up, most of the jobs are associated with the incorrect shard id...

Comment 12 by, Oct 2 2017

Summary: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage (was: shard_client crashlooping on chromeos-skunk-1)
OK, aggressive fix time. I'm going to move board:auron_paine from chromeos-skunk-1 to chromeos-skunk-2.

However, since I'm worried about misbehavior from chromeos-skunk-1, I'm first going to remove it from serverdb and the shard table. I may even wipe it.

Comment 13 by, Oct 2 2017

auron_paine moved to chromeos-skunk-2. The new shard is ticking, I expect it should work correctly.

The old shard is exception-looping. I'm going to wipe its tables to be safe...

Comment 14 by, Oct 2 2017

(note: some job history may be lost, for jobs that ran on that shard)

Comment 15 by, Oct 2 2017

akeshet@akeshet:~/chromiumos/chromeos-admin$ ./bin/run_server_task ShardCleanupTask --host_server

^ did a lot of cleanup but also seemed to die in puppet due to Issue 770903

Comment 16 by, Oct 2 2017


Comment 17 by, Oct 2 2017

Labels: Chase-Pending
Owner: ----
Status: Available (was: Untriaged)
I believe the production issue is solved. Chase-Pending follow up questions:

1) How did the shard's local afe_shards table get inconsistent with master db?
   (theory: this shard was added and removed and re-added to production under a new id)
2) Should db_sentinel notice this inconsistency?
3) Should db_sentinel heal such inconsistency?

Comment 18 by, Oct 3 2017

Summary: shard db's afe_shards table was inconsistent with master, causing shard_client crashloop (was: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage)

Comment 19 by, Oct 9 2017

Labels: -Chase-Pending Chase
sentinel should enforce that every shard's afe_shards table is a subset of the master's
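
To make the invariant concrete, here is a minimal sketch of the subset check. It is not the db_sentinel change itself. It assumes each table has already been read into a dict mapping shard id to hostname; the function name and the sample hostnames are hypothetical.

```python
def find_rows_missing_from_master(master_shards, local_shards):
    """Return local (id, hostname) rows that have no identical row in master.

    Both arguments map shard id -> hostname. An empty result means the
    shard's afe_shards table is a subset of the master's.
    """
    master_rows = set(master_shards.items())
    return sorted(
        row for row in local_shards.items() if row not in master_rows
    )


# Example with placeholder hostnames: id 207 exists only on the shard.
master = {136: "shard-b.example", 208: "shard-a.example"}
local = {207: "shard-a.example", 208: "shard-a.example"}
stale = find_rows_missing_from_master(master, local)
```

Here `stale` contains only the row for id 207. Comparing whole (id, hostname) rows rather than ids alone also flags a row whose id exists in master under a different hostname.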

Comment 20 by, Oct 16 2017

CL under review.

Comment 21 by, Oct 23 2017

pprabhu to review

Comment 22 by, Oct 24 2017

Project Member

Comment 23 by, Oct 30 2017

Status: Fixed (was: Available)

Comment 24 by, Nov 15 2017

Status: Assigned (was: Fixed)
Either the logging is confusing, or something is not behaving correctly. I see the following message in the sentinel logs:

2017-11-15 12:28:47,727 ERRO| Shard 136 ( does not exist in master DB
2017-11-15 12:28:47,753 INFO| Done.

However, if I examine atest shard list I see:

136   board:guado, board:guado_moblab

I see the same message on many other shards too, in sentinel logs.

Comment 25 by, Nov 15 2017

Status: Started (was: Assigned)

Comment 26 by, Nov 16 2017

Blocking: 780741

Comment 27 by, Nov 16 2017

Project Member

Comment 28 by, Nov 20 2017

Status: Fixed (was: Started)
