master scheduler crashlooping "due to no host_id exists"
Issue description
07/05 13:13:23.835 WARNI| utils:0099| /usr/local/autotest/scheduler/scheduler_models.py:186: DeprecationWarning: object() takes no parameters
return super(DBObject, cls).__new__(cls, id=id, **kwargs)
07/05 13:13:28.829 DEBUG| agent_task:0181| No host is found for host_queue_entry_id: 127373277L
07/05 13:13:28.830 ERROR| monitor_db:0183| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 166, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 323, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 478, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 492, in _create_recovery_agent_tasks
    return (self._get_queue_entry_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 530, in _get_queue_entry_agent_tasks
    agent_task = self._get_agent_task_for_queue_entry(entry)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 563, in _get_agent_task_for_queue_entry
    return QueueTask(queue_entries=task_entries)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1272, in __init__
    self._set_ids(queue_entries=queue_entries)
  File "/usr/local/autotest/scheduler/agent_task.py", line 185, in _set_ids
    'exists.' % entry.id)
NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373277L due to no host_id exists.
Jul 5 2017
mysql> select * from afe_host_queue_entries WHERE id=127373277;
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
| id        | job_id    | host_id | status   | meta_host | active | complete | deleted | execution_subdir            | atomic_group_id | aborted | started_on | finished_on |
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
| 127373277 | 127035194 | NULL    | Starting | 87565     | 0      | 1        | 0       | chromeos4-row10-rack9-host1 | NULL            | 1       | NULL       | NULL        |
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
1 row in set (0.00 sec)
Jul 5 2017
mysql> select id, name, shard_id from afe_jobs WHERE id=127035194;
+-----------+--------------------------------------------------------+----------+
| id        | name                                                   | shard_id |
+-----------+--------------------------------------------------------+----------+
| 127035194 | quawks-release/R60-9592.49.0/bvt-cq/generic_RebootTest | 84       |
+-----------+--------------------------------------------------------+----------+
1 row in set (0.00 sec)
Jul 5 2017
This is indeed a quawks job, which is one of the boards that I shuffled around in Issue 737852. I'm going to simply try aborting the job.
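If the abort takes, it should be visible in the db. A sanity-check sketch (assuming the abort path sets aborted=1 on the HQE and records a row in afe_aborted_host_queue_entries):

-- Sketch: did the abort land? Expect aborted=1 and a matching
-- afe_aborted_host_queue_entries row (assumption about the abort path).
SELECT hqe.aborted, ahqe.queue_entry_id
  FROM afe_host_queue_entries hqe
  LEFT JOIN afe_aborted_host_queue_entries ahqe
         ON ahqe.queue_entry_id = hqe.id
 WHERE hqe.id = 127373277;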
Jul 5 2017
Didn't fix things. Re-pasting #3 after the abort, and better formatted, because I believe this is where the problem lies.
mysql> select * from afe_host_queue_entries WHERE id=127373277 \G;
*************************** 1. row ***************************
id: 127373277
job_id: 127035194
host_id: NULL
status: Starting
meta_host: 87565
active: 0
complete: 1
deleted: 0
execution_subdir: chromeos4-row10-rack9-host1
atomic_group_id: NULL
aborted: 1
started_on: NULL
finished_on: NULL
1 row in set (0.00 sec)
Jul 5 2017
In particular, is the erroneous state the combination status: Starting + host_id: NULL?
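If so, a quick count would show how widespread it is; a sketch (assuming Starting is the only affected status):

-- Sketch: how many HQEs share the suspect Starting-with-no-host state?
SELECT COUNT(*)
  FROM afe_host_queue_entries
 WHERE status = 'Starting'
   AND host_id IS NULL;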
Jul 5 2017
My barbaric surgery idea now is to mark it as deleted=1, to try to recover things. Then come back later to understand how we got here and mitigate.
Jul 5 2017
mysql> UPDATE afe_host_queue_entries SET deleted=1 WHERE id=127373277;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Let's see what happens...
Jul 5 2017
That didn't work. Next round of barbaric surgery... delete rows.
Jul 5 2017
mysql> DELETE FROM afe_aborted_host_queue_entries WHERE queue_entry_id=127373277;
Query OK, 1 row affected (0.02 sec)

mysql> DELETE FROM afe_special_tasks WHERE queue_entry_id=127373277;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373277;
Query OK, 1 row affected (0.00 sec)
Jul 5 2017
That worked, but other jobs in that state remain: NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373270L due to no host_id exists.
Jul 5 2017
Manually deleted job in #12 as well. However, if I see any more of these, I'll need to come up with a better query.
Jul 5 2017
Next one: NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373274L due to no host_id exists. Ok, need a proper query. Prathmesh is constructing one...
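In the meantime, a sketch of the kind of query we need (the status filter is an assumption; so far every stuck entry has been in Starting):

-- Sketch: enumerate HQEs likely to crash recovery (no host assigned while
-- the entry still looks in-flight). Status filter is an assumption.
SELECT id, job_id, status, meta_host, active, complete, aborted
  FROM afe_host_queue_entries
 WHERE host_id IS NULL
   AND status = 'Starting';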
Jul 5 2017
An idea from Prathmesh: set the special task's is_active to false on ^.
mysql> UPDATE afe_special_tasks SET is_active = 0 where queue_entry_id=127373274;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Jul 5 2017
^ worked; we are on to a different failing job: NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373273L due to no host_id exists.
Jul 5 2017
Per convo, concerned that simply deleting these might leave around zombie autoserv processes. New approach:
mysql> UPDATE afe_special_tasks SET is_aborted=1 where queue_entry_id=127373273;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Jul 5 2017
^ didn't unstick that job.
Jul 5 2017
These are the currently active special tasks that have SOME HQE associated with them. I haven't yet checked whether all of these are actually stuck.
mysql> select id, queue_entry_id from afe_special_tasks where is_active=True and queue_entry_id is not NULL;
+----------+----------------+
| id       | queue_entry_id |
+----------+----------------+
| 60938235 | 127389186      |
| 60938281 | 127287560      |
| 60938282 | 127287561      |
| 60938283 | 127287562      |
| 60938284 | 127287564      |
| 60938308 | 127441322      |
| 60938310 | 127441328      |
| 60938316 | 127388924      |
| 60938317 | 127388925      |
| 60938318 | 127293904      |
| 60938319 | 127293905      |
| 60938320 | 127293906      |
| 60938324 | 127373273      |
| 60938338 | 127438636      |
| 60938340 | 127438638      |
| 60938344 | 127438643      |
| 60938345 | 127438644      |
| 60938347 | 127438646      |
| 60938348 | 127441446      |
| 60938355 | 127393147      |
| 60938368 | 127442564      |
| 60938371 | 127442314      |
| 60938373 | 127442317      |
| 60938374 | 127442318      |
| 60938376 | 127442322      |
| 60938377 | 127442324      |
| 60938393 | 127373281      |
| 60938394 | 127442668      |
| 60938395 | 127442670      |
| 60938862 | 127433741      |
| 60938870 | 127319667      |
| 60938871 | 127346583      |
| 60939123 | 127378926      |
| 60939158 | 127206783      |
| 60939159 | 127378929      |
| 60939160 | 127378931      |
+----------+----------------+
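Rather than pasting those IDs by hand, the same set should come out of a join; a sketch:

-- Sketch: HQEs referenced by an active special task but with no host
-- assigned, i.e. the combination that crashes recovery.
SELECT hqe.*
  FROM afe_host_queue_entries hqe
  JOIN afe_special_tasks st ON st.queue_entry_id = hqe.id
 WHERE st.is_active = 1
   AND hqe.host_id IS NULL;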
Jul 5 2017
^ deleted, unrelated. (That's a swarming timeout, which is a separate bug also under investigation.)
Jul 5 2017
There are only three more of those with a NULL host_id:
mysql> select * from afe_host_queue_entries where id in ('127389186' ,'127287560'
-> ,'127287561' ,'127287562' ,'127287564' ,'127441322' ,'127441328' ,'127388924'
-> ,'127388925' ,'127293904' ,'127293905' ,'127293906' ,'127373273' ,'127438636'
-> ,'127438638' ,'127438643' ,'127438644' ,'127438646' ,'127441446' ,'127393147'
-> ,'127442564' ,'127442314' ,'127442317' ,'127442318' ,'127442322' ,'127442324'
-> ,'127373281' ,'127442668' ,'127442670' ,'127433741' ,'127319667' ,'127346583'
-> ,'127378926' ,'127206783' ,'127378929' ,'127378931') AND host_id is NULL;
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| id | job_id | host_id | status | meta_host | active | complete | deleted | execution_subdir | atomic_group_id | aborted | started_on | finished_on |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| 127373273 | 127035190 | NULL | Aborted | 87565 | 0 | 1 | 0 | | NULL | 1 | NULL | NULL |
| 127373281 | 127035198 | NULL | Aborted | 87565 | 0 | 1 | 0 | | NULL | 1 | NULL | NULL |
| 127393147 | 127055063 | NULL | Aborted | 315354 | 0 | 1 | 0 | | NULL | 1 | NULL | NULL |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
3 rows in set (0.09 sec)
Jul 5 2017
Done.
These will now become zombie special tasks.
And their autoservs are now undead.
mysql> UPDATE afe_special_tasks SET is_active=0 where queue_entry_id IN ('127373273', '127373281', '127393147');
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0
Jul 5 2017
^ We can come back tomorrow and properly kill any that are still shambling around.
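For tomorrow's pass, a sketch to re-surface what we just orphaned (columns limited to ones already seen in this thread):

-- Sketch: re-list the special tasks we just deactivated so any autoservs
-- still shambling around can be found and killed.
SELECT id, queue_entry_id, is_active, is_aborted
  FROM afe_special_tasks
 WHERE queue_entry_id IN (127373273, 127373281, 127393147);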
Jul 5 2017
Failed in a different place when trying to restart the special task.
Aborted it now:
mysql> UPDATE afe_special_tasks SET is_aborted=1 where queue_entry_id IN ('127373273', '127373281', '127393147');
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3 Changed: 2 Warnings: 0
Jul 5 2017
mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127393147;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127373281;
Query OK, 1 row affected (0.01 sec)

mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127373273;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127373273;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127373281;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127393147;
Query OK, 0 rows affected (0.01 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127393147;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373281;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373273;
Query OK, 1 row affected (0.00 sec)
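For next time, the same cleanup batched into three statements (a sketch; identical rows, just using IN lists):

-- Sketch: equivalent batched cleanup for the three stuck HQEs.
DELETE FROM afe_special_tasks
 WHERE queue_entry_id IN (127373273, 127373281, 127393147);
DELETE FROM afe_aborted_host_queue_entries
 WHERE queue_entry_id IN (127373273, 127373281, 127393147);
DELETE FROM afe_host_queue_entries
 WHERE id IN (127373273, 127373281, 127393147);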
Jul 5 2017
Scheduler has passed that issue; demoting to P1.
Jul 5 2017
We're back. NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373274L due to no host_id exists.
Jul 5 2017
Fixed again by deleting the HQE from the db.
Jul 14 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0c93b86dc6a834857e1a18af7feb3945c4dd3d67

commit 0c93b86dc6a834857e1a18af7feb3945c4dd3d67
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Jul 14 02:46:10 2017

autotest: skip hqes that suffer from missing host_id malformation

Instead of crashing the scheduler, this CL will log the existence of the
malformed hqe and skip handling it (while incrementing a metric).

Follow-up work: determine a way to safely discard these hqes rather than
simply skipping them.

BUG=chromium:739486
TEST=None

Change-Id: I0aa4fe0822244b4fb8832bf88c3387ab6bc5b5cd
Reviewed-on: https://chromium-review.googlesource.com/570823
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/agent_task.py
[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/monitor_db.py
[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/scheduler_lib.py