server102 failing with UnallowedRecordsSentToMaster error
Reported by
jrbarnette@chromium.org,
Apr 14 2017
|
||||
Issue description
Shard chromeos-server102 is failing with this exception:
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 120, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 160, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1955, in shard_heartbeat
rpc_utils.persist_records_sent_from_shard(shard_obj, jobs, hqes)
File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1015, in persist_records_sent_from_shard
shard, jobs, models.Job)
File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 986, in _persist_records_with_type_sent_from_shard
shard, serialized_record, *args, **kwargs)
File "/usr/local/autotest/frontend/afe/models.py", line 1288, in sanity_check_update_from_shard
shard.id))
UnallowedRecordsSentToMaster: Job id=112182619 ...
The problem has been going on since late morning, 4/12.
The shard serves terra, veyron_mickey, and veyron_tiger, so all
tests for those boards are failing.
,
Apr 15 2017
http://cautotest/afe/#tab_id=view_job&object_id=112182619 cyan-release/R59-9451.0.0/faft_bios_au_1/firmware_CorruptBothFwSigAB I hear you guys moved cyan off 102 to 97 early in the morning on 04/12?
,
Apr 15 2017
Surgery performed on the shard (after chasing down some key constraints). mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id = 112515612; Query OK, 1 row affected (0.00 sec) mysql> DELETE from afe_host_queue_entries where job_id = 112182619; Query OK, 1 row affected (0.03 sec) mysql> DELETE from afe_jobs_dependency_labels where job_id=112182619; Query OK, 5 rows affected (0.01 sec) mysql> DELETE from afe_jobs where id=112182619; Query OK, 1 row affected (0.02 sec)
,
Apr 15 2017
mysql> select * from afe_host_queue_entries where job_id=112182620; +-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+ | id | job_id | host_id | status | meta_host | active | complete | deleted | execution_subdir | atomic_group_id | aborted | started_on | finished_on | +-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+ | 112515613 | 112182620 | NULL | Aborted | 246844 | 0 | 1 | 0 | | NULL | 1 | NULL | NULL | +-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+ 1 row in set (0.00 sec) mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id = 112515613; Query OK, 1 row affected (0.01 sec) mysql> DELETE from afe_host_queue_entries WHERE job_id = 112182620; Query OK, 1 row affected (0.04 sec) mysql> DELETE from afe_job_dependency_labels WHERE job_id = 112182620; ERROR 1146 (42S02): Table 'chromeos_autotest_db.afe_job_dependency_labels' doesn't exist mysql> DELETE from afe_jobs_dependency_labels WHERE job_id = 112182620; Query OK, 5 rows affected (0.00 sec) mysql> DELETE from afe_jobs WHERE id=112182620; Query OK, 1 row affected (0.01 sec) mysql>
,
Apr 15 2017
I think we should change this to an on-delete-cascade constraint, because this is getting pretty hairy otherwise.
,
Apr 15 2017
> I hear you guys moved cyan off 102 to 97 early in the morning on 04/12? Yes. There was a bug in the code responsible for the transition. That's caused database inconsistencies, which we're trying to clean up.
,
Apr 15 2017
Richard identified an id range of these invalid jobs. I attempted to trick shard_client into giving up on them, by doing what it does after it marks them as uploaded: mysql> select id, shard_id from afe_jobs WHERE id >= 112182621 AND id <= 112182631; +-----------+----------+ | id | shard_id | +-----------+----------+ | 112182621 | NULL | | 112182622 | NULL | | 112182623 | NULL | | 112182624 | NULL | | 112182625 | NULL | | 112182626 | NULL | | 112182627 | NULL | | 112182628 | NULL | | 112182629 | NULL | | 112182630 | NULL | | 112182631 | NULL | +-----------+----------+ 11 rows in set (0.00 sec) mysql> UPDATE afe_jobs SET shard_id=149 WHERE id >= 112182621 AND id <= 112182631; Query OK, 11 rows affected (0.01 sec) Rows matched: 11 Changed: 11 Warnings: 0 mysql>
,
Apr 15 2017
mysql> UPDATE afe_jobs SET shard_id=149 WHERE id >= 112182621 AND id <= 112182652; Query OK, 20 rows affected (0.01 sec) Rows matched: 32 Changed: 20 Warnings: 0
,
Apr 15 2017
mysql> select created_on, shard_id FROM afe_jobs WHERE shard_id IS NULL AND created_on < "2017-04-12 15:40:30"; <snip> . . . | 2017-04-12 10:34:53 | NULL | | 2017-04-12 10:45:18 | NULL | | 2017-04-12 10:50:52 | NULL | | 2017-04-12 10:56:55 | NULL | | 2017-04-12 10:59:07 | NULL | | 2017-04-12 11:07:54 | NULL | | 2017-04-12 11:08:11 | NULL | +---------------------+----------+ 1499 rows in set (0.00 sec) These look like the ones I want to squelch. mysql> UPDATE afe_jobs SET shard_id = 149 WHERE shard_id IS NULL AND created_on < "2017-04-12 15:40:30"; Query OK, 1499 rows affected (0.34 sec) Rows matched: 1499 Changed: 1499 Warnings: 0
,
Apr 15 2017
At this point, the shard is back to processing _something_. It remains to be seen whether it's prepared to do useful work, but we're looking to see if things go back to normal, or if we uncover some new problem.
,
Apr 15 2017
As of this moment, the shard can be seen to be scheduling work, at least to the extent that it can run Verify tasks on DUTs. Downgrading priority. If things haven't blown up after the weekend, we'll declare victory on the emergency. We still need to fix the underlying issue that caused this problem.
,
Apr 15 2017
Follow-up: We should have had an alert: https://bugs.chromium.org/p/chromium/issues/detail?id=711850 Removing a board shouldn't cause this timebomb: https://bugs.chromium.org/p/chromium/issues/detail?id=711852
,
Apr 17 2017
|
||||
►
Sign in to add a comment |
||||
Comment 1 by ihf@chromium.org
, Apr 15 2017