Status: Available
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



shard db's afe_shards table was inconsistent with master, causing shard_client crashloop
Project Member Reported by akes...@chromium.org, Oct 2
10/02 14:01:09.009 ERROR|     email_manager:0082| Uncaught exception. Terminating shard_client.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 431, in main
    main_without_exception_handling()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 458, in main_without_exception_handling
    _heartbeat_client.loop()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 373, in loop
    self.tick()
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 366, in tick
    self.do_heartbeat()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 482, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 359, in do_heartbeat
    self._mark_jobs_as_uploaded([job['id'] for job in packet['jobs']])
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 251, in _mark_jobs_as_uploaded
    models.Job.objects.filter(pk__in=job_ids).update(shard=self.shard)
  File "/usr/local/autotest/scheduler/shard/shard_client.py", line 223, in shard
    self._shard = models.Shard.smart_get(self.hostname)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 835, in smart_get
    return manager.get(**{cls.name_field : id_or_name})
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 393, in get
    (self.model._meta.object_name, num, kwargs))
MultipleObjectsReturned: get() returned more than one Shard -- it returned 2! Lookup parameters were {'hostname': 'chromeos-skunk-1.mtv.corp.google.com'}
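
The last frames show a get()-style lookup rejecting a hostname that matched two rows. A minimal sketch of that behavior follows, assuming a toy sqlite3 table in place of the real Django model; `get_shard`, `make_db`, and the exception class here are hypothetical stand-ins, not autotest or Django code.

```python
import sqlite3


class MultipleObjectsReturned(Exception):
    """Stand-in for Django's exception of the same name (not the real class)."""


def get_shard(conn, hostname):
    """Hypothetical get()-style lookup that expects exactly one matching row."""
    rows = conn.execute(
        "SELECT id, hostname FROM afe_shards WHERE hostname = ?", (hostname,)
    ).fetchall()
    if not rows:
        raise LookupError("no Shard with hostname %r" % hostname)
    if len(rows) > 1:
        raise MultipleObjectsReturned(
            "get() returned more than one Shard -- it returned %d!" % len(rows)
        )
    return rows[0]


def make_db(rows):
    """Build an in-memory table with only the two columns seen in this report."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
    conn.executemany("INSERT INTO afe_shards VALUES (?, ?)", rows)
    return conn
```

With a single row for the hostname the lookup returns it; with two rows sharing the hostname it raises, which is the condition the crashloop kept hitting on every tick.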

 
This is a production shard.

akeshet@akeshet:~/chromiumos/src$ atest shard list | grep chromeos-skunk-1
208  chromeos-skunk-1.mtv.corp.google.com    board:auron_paine
akeshet@akeshet:~/chromiumos/src$ atest server list | grep chromeos-skunk-1
Hostname     : chromeos-skunk-1.mtv.corp.google.com

Owner: pho...@chromium.org
(Was there an alert about this shard? I didn't see one.)

https://viceroy.corp.google.com/chromeos/deputy-view#_VG_lnuPnWCa
I'm perplexed by the failure. There is only 1 chromeos-skunk-1 entry in the database.

There is also a chromeos-skunk1. Are we somehow emitting a wildcard query that matches both?
Cc: dshi@chromium.org
Also, why is shard_client emitting a django query at all? Which db is it querying?

I thought shard_client is only supposed to use RPCs to interact with master.
shard_client is hitting the local database, where there are indeed two entries in the afe_shards table.

I don't know why we are even querying the local afe_shards table (it makes no sense), but I will fix this incident by deleting the errant entry.

akeshet@akeshet:~/chromiumos/src/third_party/autotest/files$ autotest-db chromeos-skunk-1.mtv.corp.google.com
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3761212
Server version: 5.5.57-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from afe_shards;
+-----+--------------------------------------+
| id  | hostname                             |
+-----+--------------------------------------+
| 207 | chromeos-skunk-1.mtv.corp.google.com |
| 208 | chromeos-skunk-1.mtv.corp.google.com |
+-----+--------------------------------------+
2 rows in set (0.00 sec)
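
The same duplicate could be surfaced with a GROUP BY query instead of eyeballing the table. A small sketch, assuming an in-memory sqlite3 copy of the two columns shown above; `duplicate_hostnames` is a hypothetical helper, not part of autotest.

```python
import sqlite3


def duplicate_hostnames(conn):
    """Return {hostname: [ids]} for hostnames that appear more than once."""
    dupes = {}
    query = (
        "SELECT hostname FROM afe_shards "
        "GROUP BY hostname HAVING COUNT(*) > 1"
    )
    for (hostname,) in conn.execute(query).fetchall():
        ids = conn.execute(
            "SELECT id FROM afe_shards WHERE hostname = ? ORDER BY id",
            (hostname,),
        ).fetchall()
        dupes[hostname] = [row[0] for row in ids]
    return dupes


# Toy data mirroring the two rows in the query output above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT)")
conn.executemany(
    "INSERT INTO afe_shards VALUES (?, ?)",
    [
        (207, "chromeos-skunk-1.mtv.corp.google.com"),
        (208, "chromeos-skunk-1.mtv.corp.google.com"),
    ],
)
print(duplicate_hostnames(conn))
# → {'chromeos-skunk-1.mtv.corp.google.com': [207, 208]}
```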

Cc: nxia@chromium.org
Entry #207 is the wrong one, deleting it.
Innnteresting... one of the DUTs in the shard-db is still stuck on id=207.

mysql> DELETE FROM afe_shards where id=207;
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`chromeos_autotest_db`.`afe_hosts`, CONSTRAINT `hosts_to_shard_ibfk` FOREIGN KEY (`shard_id`) REFERENCES `afe_shards` (`id`))
mysql> SELECT * from afe_hosts where shard_id=207;
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| id   | hostname                   | locked | synch_id | status        | invalid | protection | locked_by_id | lock_time | dirty | leased | shard_id | lock_reason |
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+
| 4256 | chromeos4-row8-rack4-host8 |      0 |     NULL | Repair Failed |       1 |          0 |         NULL | NULL      |     1 |      1 |      207 |             |
+------+----------------------------+--------+----------+---------------+---------+------------+--------------+-----------+-------+--------+----------+-------------+



All other hosts on this shard db have the correct id. I'll just update this host to match.
Have to do the same for afe_jobs...
mysql> select COUNT(*) from afe_jobs where shard_id=207;
+----------+
| COUNT(*) |
+----------+
|    15581 |
+----------+
1 row in set (0.00 sec)

mysql> select COUNT(*) from afe_jobs where shard_id=208;
+----------+
| COUNT(*) |
+----------+
|      120 |
+----------+
1 row in set (0.00 sec)

mysql> 


afe_jobs is pretty messed up, most of the jobs are associated with the incorrect shard id...
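
For reference, the "update the child rows, then delete the parent" idea can be sketched as a single transaction. This is only an illustration under stated assumptions: the schema below is a toy sqlite3 reduction (the foreign key on afe_jobs is assumed, only the afe_hosts constraint appears in the error above), `merge_shard_ids` is a hypothetical helper, and which id should survive is a judgment call.

```python
import sqlite3


def merge_shard_ids(conn, old_id, new_id):
    """Re-point child rows from old_id to new_id, then delete the old shard row."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "UPDATE afe_hosts SET shard_id = ? WHERE shard_id = ?", (new_id, old_id)
        )
        conn.execute(
            "UPDATE afe_jobs SET shard_id = ? WHERE shard_id = ?", (new_id, old_id)
        )
        conn.execute("DELETE FROM afe_shards WHERE id = ?", (old_id,))


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite3 leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE afe_shards (id INTEGER PRIMARY KEY, hostname TEXT);
    CREATE TABLE afe_hosts (id INTEGER PRIMARY KEY, hostname TEXT,
                            shard_id INTEGER REFERENCES afe_shards(id));
    CREATE TABLE afe_jobs (id INTEGER PRIMARY KEY,
                           shard_id INTEGER REFERENCES afe_shards(id));
    INSERT INTO afe_shards VALUES (207, 'chromeos-skunk-1.mtv.corp.google.com');
    INSERT INTO afe_shards VALUES (208, 'chromeos-skunk-1.mtv.corp.google.com');
    INSERT INTO afe_hosts VALUES (4256, 'chromeos4-row8-rack4-host8', 207);
    INSERT INTO afe_jobs VALUES (1, 207);
    INSERT INTO afe_jobs VALUES (2, 208);
""")

merge_shard_ids(conn, old_id=207, new_id=208)
```

Doing the updates and the delete in one transaction means a failure partway through leaves the tables as they were, rather than half re-pointed.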
Summary: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage (was: shard_client crashlooping on chromeos-skunk-1)
Ok, aggressive fix time. I'm going to move board:auron_paine from chromeos-skunk-1 to chromeos-skunk-2.

However, since I'm worried about misbehavior from chromeos-skunk-1 I'm first going to remove it from serverdb and shard table. I may even wipe it.
auron_paine moved to chromeos-skunk-2. The new shard is ticking, I expect it should work correctly.

The old shard is exception-looping. I'm going to wipe its tables to be safe...
(note: some job history may be lost, for jobs that ran on that shard)
akeshet@akeshet:~/chromiumos/chromeos-admin$ ./bin/run_server_task ShardCleanupTask --host_server chromeos-skunk-1.mtv.corp.google.com

^ did a lot of cleanup but also seemed to die in puppet due to Issue 770903
Cc: ntang@chromium.org
Labels: Chase-Pending
Owner: ----
Status: Available
I believe the production issue is solved. Chase-Pending follow up questions:

1) How did the shard's local afe_shards table get inconsistent with the master db?
   (theory: this shard was added and removed and re-added to production under a new id)
2) Should db_sentinel notice this inconsistency?
3) Should db_sentinel heal such inconsistency?
Summary: shard db's afe_shards table was inconsistent with master, causing shard_client crashloop (was: shard_client crashlooping on chromeos-skunk-1 | board:auron_paine outage)
Labels: -Chase-Pending Chase
Owner: shuqianz@chromium.org
sentinel should enforce that every shard's afe_shards table is a subset of the master's
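
The subset invariant proposed here could be checked roughly as follows. This is a sketch only: it assumes each table has already been read into a list of (id, hostname) tuples, and `shard_rows_not_in_master` is a hypothetical name, not an existing db_sentinel function.

```python
def shard_rows_not_in_master(shard_rows, master_rows):
    """Return the sorted (id, hostname) rows present on the shard but not on master.

    An empty result means the shard's afe_shards table is a subset of master's.
    """
    return sorted(set(shard_rows) - set(master_rows))


# Toy data based on this incident: master knows the shard as id 208,
# while the shard's local table also carried a stale row with id 207.
master = [(208, "chromeos-skunk-1.mtv.corp.google.com")]
shard = [
    (207, "chromeos-skunk-1.mtv.corp.google.com"),
    (208, "chromeos-skunk-1.mtv.corp.google.com"),
]
print(shard_rows_not_in_master(shard, master))
# → [(207, 'chromeos-skunk-1.mtv.corp.google.com')]
```

Comparing whole (id, hostname) pairs rather than hostnames alone is what catches a shard re-added under a new id.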
Comment 20 by akes...@chromium.org, Oct 16 (6 days ago)
CL under review.