
Issue 711838

Starred by 2 users

Issue metadata

Status: Fixed
Owner: ----
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




server102 failing with UnallowedRecordsSentToMaster error

Reported by jrbarnette@chromium.org, Apr 14 2017

Issue description

Shard chromeos-server102 is failing with this exception:

Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 120, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 160, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 1955, in shard_heartbeat
    rpc_utils.persist_records_sent_from_shard(shard_obj, jobs, hqes)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1015, in persist_records_sent_from_shard
    shard, jobs, models.Job)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 986, in _persist_records_with_type_sent_from_shard
    shard, serialized_record, *args, **kwargs)
  File "/usr/local/autotest/frontend/afe/models.py", line 1288, in sanity_check_update_from_shard
    shard.id))
UnallowedRecordsSentToMaster: Job id=112182619 ...

The problem has been going on since late morning, 4/12.

The shard serves terra, veyron_mickey, and veyron_tiger, so all
tests for those boards are failing.
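
For context, the exception comes from a sanity check on the master: during a shard heartbeat, the master refuses any uploaded record it hasn't assigned to the reporting shard. A minimal sketch of that check (the real code is in frontend/afe/models.py; field and message details beyond the traceback are assumed):

class UnallowedRecordsSentToMaster(Exception):
    """Raised when a shard uploads a record the master didn't assign to it."""


def sanity_check_update_from_shard(shard, job):
    # Hypothetical simplification: the master only accepts updates for jobs
    # whose shard_id matches the shard sending the heartbeat. Jobs with
    # shard_id=NULL (as in the SELECT further down) fail every heartbeat.
    if job.shard_id != shard.id:
        raise UnallowedRecordsSentToMaster(
            'Job id=%s sent to master but not registered to shard %s'
            % (job.id, shard.id))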

 

Comment 1 by ihf@chromium.org, Apr 15 2017

Cc: marc...@chromium.org ihf@chromium.org

Comment 2 by ihf@chromium.org, Apr 15 2017

Cc: akes...@chromium.org
http://cautotest/afe/#tab_id=view_job&object_id=112182619
cyan-release/R59-9451.0.0/faft_bios_au_1/firmware_CorruptBothFwSigAB

I hear you guys moved cyan off 102 to 97 early in the morning on 04/12?
Surgery performed on the shard (after chasing down some key constraints).

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id = 112515612;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_host_queue_entries where job_id = 112182619;
Query OK, 1 row affected (0.03 sec)

mysql> DELETE from afe_jobs_dependency_labels where job_id=112182619;
Query OK, 5 rows affected (0.01 sec)

mysql> DELETE from afe_jobs where id=112182619;
Query OK, 1 row affected (0.02 sec)

mysql> select * from afe_host_queue_entries where job_id=112182620;
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| id        | job_id    | host_id | status  | meta_host | active | complete | deleted | execution_subdir | atomic_group_id | aborted | started_on | finished_on |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| 112515613 | 112182620 |    NULL | Aborted |    246844 |      0 |        1 |       0 |                  |            NULL |       1 | NULL       | NULL        |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
1 row in set (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id = 112515613;
Query OK, 1 row affected (0.01 sec)

mysql> DELETE from afe_host_queue_entries WHERE job_id = 112182620;
Query OK, 1 row affected (0.04 sec)

mysql> DELETE from afe_job_dependency_labels WHERE job_id = 112182620;
ERROR 1146 (42S02): Table 'chromeos_autotest_db.afe_job_dependency_labels' doesn't exist
mysql> DELETE from afe_jobs_dependency_labels WHERE job_id = 112182620;
Query OK, 5 rows affected (0.00 sec)

mysql> DELETE from afe_jobs WHERE id=112182620;
Query OK, 1 row affected (0.01 sec)

mysql> 
I think we should change this to an on-delete-cascade constraint, because this is getting pretty hairy otherwise.
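
For illustration, here's roughly what such a schema change could look like. This is only a sketch: it assumes Django's migration framework and MySQL DDL, the constraint names are made up, and the matching changes for afe_aborted_host_queue_entries and afe_jobs_dependency_labels are omitted. Note that Django's own on_delete=CASCADE is enforced in the ORM, so raw mysql> DELETEs like the ones above need a database-level constraint.

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [('afe', '0001_initial')]  # assumed

    operations = [
        # Any existing non-cascading FK on job_id would have to be dropped
        # first (its name is installation-specific and assumed unknown here).
        migrations.RunSQL(
            "ALTER TABLE afe_host_queue_entries "
            "ADD CONSTRAINT hqe_job_id_cascade_fk "  # name assumed
            "FOREIGN KEY (job_id) REFERENCES afe_jobs (id) "
            "ON DELETE CASCADE",
            reverse_sql=(
                "ALTER TABLE afe_host_queue_entries "
                "DROP FOREIGN KEY hqe_job_id_cascade_fk"
            ),
        ),
    ]

With constraints like that in place, the single DELETE on afe_jobs would take the dependent rows with it.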
> I hear you guys moved cyan off 102 to 97 early in the morning on 04/12?

Yes.  There was a bug in the code responsible for the transition.
That's caused database inconsistencies, which we're trying to clean
up.

Richard identified an id range of these invalid jobs. I attempted to trick shard_client into giving up on them by doing what it does after it marks them as uploaded:

mysql> select id, shard_id from afe_jobs WHERE id >= 112182621 AND id <= 112182631; 
+-----------+----------+
| id        | shard_id |
+-----------+----------+
| 112182621 |     NULL |
| 112182622 |     NULL |
| 112182623 |     NULL |
| 112182624 |     NULL |
| 112182625 |     NULL |
| 112182626 |     NULL |
| 112182627 |     NULL |
| 112182628 |     NULL |
| 112182629 |     NULL |
| 112182630 |     NULL |
| 112182631 |     NULL |
+-----------+----------+
11 rows in set (0.00 sec)

mysql> UPDATE afe_jobs SET shard_id=149 WHERE id >= 112182621 AND id <= 112182631; 
Query OK, 11 rows affected (0.01 sec)
Rows matched: 11  Changed: 11  Warnings: 0

mysql> 

mysql> UPDATE afe_jobs SET shard_id=149 WHERE id >= 112182621 AND id <= 112182652;
Query OK, 20 rows affected (0.01 sec)
Rows matched: 32  Changed: 20  Warnings: 0


mysql>  select created_on, shard_id FROM afe_jobs WHERE shard_id IS NULL AND created_on < "2017-04-12 15:40:30";

<snip>
.
.
.
| 2017-04-12 10:34:53 |     NULL |
| 2017-04-12 10:45:18 |     NULL |
| 2017-04-12 10:50:52 |     NULL |
| 2017-04-12 10:56:55 |     NULL |
| 2017-04-12 10:59:07 |     NULL |
| 2017-04-12 11:07:54 |     NULL |
| 2017-04-12 11:08:11 |     NULL |
+---------------------+----------+
1499 rows in set (0.00 sec)


These look like the ones I want to squelch.


mysql> UPDATE afe_jobs SET shard_id = 149 WHERE shard_id IS NULL AND created_on < "2017-04-12 15:40:30";
Query OK, 1499 rows affected (0.34 sec)
Rows matched: 1499  Changed: 1499  Warnings: 0


At this point, the shard is back to processing _something_.

It remains to be seen whether it's prepared to do useful work,
but we're looking to see if things go back to normal, or if we
uncover some new problem.

Labels: -Pri-0 Pri-1
As of this moment, the shard can be seen to be scheduling work,
at least to the extent that it can run Verify tasks on DUTs.

Downgrading priority.  If things haven't blown up after the
weekend, we'll declare victory on the emergency.

We still need to fix the underlying issue that caused this
problem.

Follow-up:

We should have had an alert: https://bugs.chromium.org/p/chromium/issues/detail?id=711850

Removing a board shouldn't cause this timebomb: https://bugs.chromium.org/p/chromium/issues/detail?id=711852
Status: Fixed (was: Available)
