shard db's afe_shards table was inconsistent with master, causing shard_client crashloop
Issue description
10/02 14:01:09.009 ERROR| email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 431, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 458, in main_without_exception_handling
    _heartbeat_client.loop()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 373, in loop
    self.tick()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 366, in tick
    self.do_heartbeat()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 359, in do_heartbeat
    self._mark_jobs_as_uploaded([job['id'] for job in packet['jobs']])
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 251, in _mark_jobs_as_uploaded
    models.Job.objects.filter(pk__in=job_ids).update(shard=self.shard)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 223, in shard
    self._shard = models.Shard.smart_get(self.hostname)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 393, in get
    (self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one Shard -- it returned 2! Lookup parameters were {'hostname': 'chromeos-skunk-1.mtv.corp.google.com'}
,
Oct 2 2017
,
Oct 2 2017
Was there an alert about this shard? I didn't see one at https://viceroy.corp.google.com/chromeos/deputy-view#_VG_lnuPnWCa
,
Oct 2 2017
I'm perplexed by the failure. There is only 1 chromeos-skunk-1 entry in the database. There is also a chromeos-skunk1. Are we somehow emitting a wildcard query that matches both?
,
Oct 2 2017
Also, why is shard_client emitting a django query at all? Which db is it querying? I thought shard_client is only supposed to use RPCs to interact with master.
,
Oct 2 2017
shard_client is hitting the local database, where there are indeed two entries in shard_table; I don't know why we are even hitting local shard_table, makes no sense, but I will fix this incident by deleting the errant entry.
akeshet@akeshet:~/chromiumos/src/third_party/autotest/files$ autotest-db chromeos-skunk-1.mtv.corp.google.com
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3761212
Server version: 5.5.57-0ubuntu0.14.04.1 (Ubuntu)
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> select * from afe_shards;
+-----+--------------------------------------+
| id  | hostname                             |
+-----+--------------------------------------+
| 207 | chromeos-skunk-1.mtv.corp.google.com |
| 208 | chromeos-skunk-1.mtv.corp.google.com |
+-----+--------------------------------------+
2 rows in set (0.00 sec)
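For reference, a minimal sketch of how duplicate hostnames like the pair above could be spotted before a `get()`-style lookup trips over them. It uses an in-memory SQLite table as a stand-in for the shard's MySQL afe_shards table; `find_duplicate_shard_hostnames` and `build_example_db` are hypothetical names for illustration and are not part of autotest.

```python
import sqlite3


def find_duplicate_shard_hostnames(conn):
    """Return {hostname: [ids]} for every hostname with more than one row."""
    rows = conn.execute(
        "SELECT hostname, id FROM afe_shards "
        "WHERE hostname IN ("
        "  SELECT hostname FROM afe_shards "
        "  GROUP BY hostname HAVING COUNT(*) > 1"
        ") ORDER BY hostname, id"
    ).fetchall()
    duplicates = {}
    for hostname, shard_id in rows:
        duplicates.setdefault(hostname, []).append(shard_id)
    return duplicates


def build_example_db():
    """Build an in-memory stand-in holding the two rows seen in this bug."""
    host = "chromeos-skunk-1.mtv.corp.google.com"
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
    conn.executemany(
        "INSERT INTO afe_shards VALUES (?, ?)", [(207, host), (208, host)])
    return conn
```

Any non-empty result means a lookup by hostname that expects exactly one row would fail the same way smart_get did in the traceback.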
,
Oct 2 2017
Entry #207 is the wrong one, deleting it.
,
Oct 2 2017
Interesting... one of the DUTs in the shard-db is still stuck on id=207.
mysql> DELETE FROM afe_shards where id=207;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`chromeos_autotest_db`.`afe_hosts`, CONSTRAINT `hosts_to_shard_ibfk` FOREIGN KEY (`shard_id`) REFERENCES `afe_shards` (`id`))
mysql> SELECT * from afe_hosts where shard_id=207;
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                   | locked | synch_id | status        | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 4256 | chromeos4-row8-rack4-host8 |      0 |     NULL | Repair Failed |       1 |          0 |         NULL | NULL      |     1 |      1 |      207 |             |
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
,
Oct 2 2017
All other hosts on this shard db have the correct id. I'll just update this host to match.
,
Oct 2 2017
Have to do the same for afe_jobs...
,
Oct 2 2017
mysql> select COUNT(*) from afe_jobs where shard_id=207;
+----------+
| COUNT(*) |
+----------+
|    15581 |
+----------+
1 row in set (0.00 sec)
mysql> select COUNT(*) from afe_jobs where shard_id=208;
+----------+
| COUNT(*) |
+----------+
|      120 |
+----------+
1 row in set (0.00 sec)
mysql>
afe_jobs is pretty messed up, most of the jobs are associated with the incorrect shard id...
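For illustration only, a sketch of what the manual fix attempted above amounts to: the parent row in afe_shards cannot be deleted while afe_hosts or afe_jobs rows still reference it, so the children have to be re-pointed first. This runs against an in-memory SQLite stand-in, not the real MySQL schema; the job ids are made up, and `try_delete_shard` and `repoint_children` are hypothetical helpers, not autotest code.

```python
import sqlite3

# Toy schema: only the columns relevant to the foreign key are modeled.
# Job ids 1-3 are invented for the example.
SCHEMA = """
CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT);
CREATE TABLE afe_hosts (
    id INTEGER PRIMARY KEY,
    hostname TEXT,
    shard_id INTEGER REFERENCES afe_shards(id)
);
CREATE TABLE afe_jobs (
    id INTEGER PRIMARY KEY,
    shard_id INTEGER REFERENCES afe_shards(id)
);
INSERT INTO afe_shards VALUES (207, 'chromeos-skunk-1.mtv.corp.google.com');
INSERT INTO afe_shards VALUES (208, 'chromeos-skunk-1.mtv.corp.google.com');
INSERT INTO afe_hosts VALUES (4256, 'chromeos4-row8-rack4-host8', 207);
INSERT INTO afe_jobs VALUES (1, 207);
INSERT INTO afe_jobs VALUES (2, 207);
INSERT INTO afe_jobs VALUES (3, 208);
"""


def build_example_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_delete_shard(conn, shard_id):
    """Delete the shard row; return False if a foreign key blocks it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM afe_shards WHERE id = ?", (shard_id,))
    except sqlite3.IntegrityError:
        return False
    return True


def repoint_children(conn, old_id, new_id):
    """Move every host and job from old_id to new_id in one transaction."""
    with conn:
        for table in ("afe_hosts", "afe_jobs"):
            conn.execute(
                "UPDATE %s SET shard_id = ? WHERE shard_id = ?" % table,
                (new_id, old_id))
```

In the toy database, deleting shard 207 is refused until `repoint_children(conn, 207, 208)` has run. This only shows the mechanics; it does not mean re-pointing is the right repair for 15581 real jobs.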
,
Oct 2 2017
Ok, aggressive fix time. I'm going to move this board:auron_paine from chromeos-skunk-1 to chromeos-skunk-2. However, since I'm worried about misbehavior from chromeos-skunk-1, I'm first going to remove it from serverdb and shard table. I may even wipe it.
,
Oct 2 2017
auron_paine moved to chromeos-skunk-2. The new shard is ticking; I expect it should work correctly. The old shard is exception-looping. I'm going to wipe its tables to be safe...
,
Oct 2 2017
(note: some job history may be lost, for jobs that ran on that shard)
,
Oct 2 2017
akeshet@akeshet:~/chromiumos/chromeos-admin$ ./bin/run_server_task ShardCleanupTask --host_server chromeos-skunk-1.mtv.corp.google.com
^ did a lot of cleanup but also seemed to die in puppet due to Issue 770903
,
Oct 2 2017
,
Oct 2 2017
I believe the production issue is solved. Chase-Pending follow up questions:
1) How did the shard's local afe_shards table get inconsistent with master db? (theory: this shard was added and removed and re-added to production under a new id)
2) Should db_sentinel notice this inconsistency?
3) Should db_sentinel heal such inconsistency?
,
Oct 3 2017
,
Oct 9 2017
sentinel should enforce that every shard's afe_shards table is a subset of the master's
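A rough sketch of the subset check described here, assuming each table is read as a list of (id, hostname) rows. This is not the actual db_sentinel code; `find_shards_missing_from_master` is a hypothetical name, and comparing on the (id, hostname) pair rather than on id alone is an assumption.

```python
def find_shards_missing_from_master(shard_rows, master_rows):
    """Return the shard-local afe_shards rows that the master does not have.

    Both arguments are iterables of (id, hostname) tuples. An empty result
    means the shard's table is a subset of the master's.
    """
    master = set(master_rows)
    return sorted(row for row in set(shard_rows) if row not in master)


# Example using the rows from this bug: the master only knows id 208.
HOST = "chromeos-skunk-1.mtv.corp.google.com"
MASTER_ROWS = [(208, HOST)]
SHARD_ROWS = [(207, HOST), (208, HOST)]
STALE_ROWS = find_shards_missing_from_master(SHARD_ROWS, MASTER_ROWS)
```

With these inputs, `STALE_ROWS` contains only the (207, ...) row, which is the entry that had to be cleaned up by hand in this incident.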
,
Oct 16 2017
CL under review.
,
Oct 23 2017
pprabhu to review
,
Oct 24 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/74496e17ebd0a2ad8bbcb4932c58de069cf3e2c6
commit 74496e17ebd0a2ad8bbcb4932c58de069cf3e2c6
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Tue Oct 24 05:39:05 2017
,
Oct 30 2017
,
Nov 15 2017
Either the logging is confusing, or something is not behaving correctly. I see the following message in the sentinel logs:
2017-11-15 12:28:47,727 ERRO| Shard 136 (chromeos-server36.cbf.corp.google.com) does not exist in master DB
2017-11-15 12:28:47,753 INFO| chromeos-server36.cbf.corp.google.com: Done.
However, if I examine atest shard list I see:
136 chromeos-server36.cbf.corp.google.com board:guado, board:guado_moblab
I see the same message on many other shards too, in sentinel logs.
,
Nov 15 2017
,
Nov 16 2017
,
Nov 16 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d8075d2d36f02209266f72a27104b6cd596ec1cc
commit d8075d2d36f02209266f72a27104b6cd596ec1cc
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Thu Nov 16 07:52:12 2017
,
Nov 20 2017
Comment 1 by akes...@chromium.org
, Oct 2 2017