Issue 739486

Starred by 1 user

Issue metadata

Status: Fixed
Owner: ----
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




master scheduler crashlooping "due to no host_id exists"

Project Member Reported by akes...@chromium.org, Jul 5 2017

Issue description

07/05 13:13:23.835 WARNI|             utils:0099| /usr/local/autotest/scheduler/scheduler_models.py:186: DeprecationWarning: object() takes no parameters
  return super(DBObject, cls).__new__(cls, id=id, **kwargs)

07/05 13:13:28.829 DEBUG|        agent_task:0181| No host is found for host_queue_entry_id: 127373277L
07/05 13:13:28.830 ERROR|        monitor_db:0183| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 166, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 323, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 478, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 492, in _create_recovery_agent_tasks
    return (self._get_queue_entry_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 530, in _get_queue_entry_agent_tasks
    agent_task = self._get_agent_task_for_queue_entry(entry)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 563, in _get_agent_task_for_queue_entry
    return QueueTask(queue_entries=task_entries)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1272, in __init__
    self._set_ids(queue_entries=queue_entries)
  File "/usr/local/autotest/scheduler/agent_task.py", line 185, in _set_ids
    'exists.' % entry.id)
NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373277L due to no host_id exists.

 
Possibly fallout from Issue 737852?
Summary: master scheduler crashlooping "due to no host_id exists" (was: master scheduler crashlooping)
mysql> select * from afe_host_queue_entries WHERE id=127373277;
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
| id        | job_id    | host_id | status   | meta_host | active | complete | deleted | execution_subdir            | atomic_group_id | aborted | started_on | finished_on |
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
| 127373277 | 127035194 |    NULL | Starting |     87565 |      0 |        1 |       0 | chromeos4-row10-rack9-host1 |            NULL |       1 | NULL       | NULL        |
+-----------+-----------+---------+----------+-----------+--------+----------+---------+-----------------------------+-----------------+---------+------------+-------------+
1 row in set (0.00 sec)


mysql> select id, name, shard_id from afe_jobs WHERE id=127035194;
+-----------+--------------------------------------------------------+----------+
| id        | name                                                   | shard_id |
+-----------+--------------------------------------------------------+----------+
| 127035194 | quawks-release/R60-9592.49.0/bvt-cq/generic_RebootTest |       84 |
+-----------+--------------------------------------------------------+----------+
1 row in set (0.00 sec)

This is indeed a quawks job, which is one of the boards that I shuffled around in Issue 737852.

I'm going to simply try aborting the job.
Didn't fix things. Re-pasting #3 after the abort, and better formatted, because I believe this is where the problem lies.

mysql> select * from afe_host_queue_entries WHERE id=127373277 \G;
*************************** 1. row ***************************
              id: 127373277
          job_id: 127035194
         host_id: NULL
          status: Starting
       meta_host: 87565
          active: 0
        complete: 1
         deleted: 0
execution_subdir: chromeos4-row10-rack9-host1
 atomic_group_id: NULL
         aborted: 1
      started_on: NULL
     finished_on: NULL
1 row in set (0.00 sec)

In particular, is the erroneous state the combination status=Starting + host_id=NULL?
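
If so, a query along these lines (untested sketch, using only the columns shown above) should enumerate any other HQEs sitting in the same suspect state:

-- Untested sketch: find other HQEs with status 'Starting' but no host assigned.
SELECT id, job_id, status, meta_host, active, complete, aborted
FROM afe_host_queue_entries
WHERE host_id IS NULL AND status = 'Starting';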
Cc: pprabhu@chromium.org
My barbaric surgery idea now is to mark it as deleted=1, to try to recover things. Then come back later to understand how we got here and mitigate.
mysql> UPDATE afe_host_queue_entries SET deleted=1 WHERE id=127373277;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Let's see what happens...
That didn't work. Next round of barbaric surgery... delete rows.
mysql> DELETE FROM afe_aborted_host_queue_entries WHERE queue_entry_id=127373277;                                                                                                                                                                                                         
Query OK, 1 row affected (0.02 sec)

mysql> DELETE FROM afe_special_tasks WHERE queue_entry_id=127373277;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373277;
Query OK, 1 row affected (0.00 sec)


That worked, but other jobs in that state remain:

NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373270L due to no host_id exists
Manually deleted job in #12 as well.

However, if I see any more of these, I'll need to come up with a better query.
Next one: NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373274L due to no host_id exists.


Ok, need a proper query. Prathmesh is constructing one...
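
For reference, one plausible shape for such a query (untested sketch; not necessarily the query Prathmesh ends up with) is to join the active special tasks to their HQEs and keep only the ones missing a host_id:

-- Untested sketch: active special tasks whose HQE has no host_id.
SELECT st.id AS special_task_id, hqe.id AS hqe_id, hqe.status
FROM afe_special_tasks st
JOIN afe_host_queue_entries hqe ON hqe.id = st.queue_entry_id
WHERE st.is_active = 1
  AND hqe.host_id IS NULL;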
An idea from Prathmesh: set is_active to false on the special task for that HQE (127373274).

mysql> UPDATE afe_special_tasks SET is_active = 0 where queue_entry_id=127373274;                                                                                                                                                                                                         
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

^ worked; we are on to a different failing job:

NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373273L due to no host_id exists.
Per convo, concerned that simply deleting these might leave zombie autoserv processes lying around.

New approach:
mysql> UPDATE afe_special_tasks SET is_aborted=1 where queue_entry_id=127373273;                                                                                                                                                                                                          
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

^ didn't unstick that job.
These are the currently active special tasks that have SOME HQE associated with them.
I haven't checked yet whether all of these are actually stuck.

mysql> select id, queue_entry_id from afe_special_tasks where is_active=True and queue_entry_id is not NULL;
+----------+----------------+
| id       | queue_entry_id |
+----------+----------------+
| 60938235 |      127389186 |
| 60938281 |      127287560 |
| 60938282 |      127287561 |
| 60938283 |      127287562 |
| 60938284 |      127287564 |
| 60938308 |      127441322 |
| 60938310 |      127441328 |
| 60938316 |      127388924 |
| 60938317 |      127388925 |
| 60938318 |      127293904 |
| 60938319 |      127293905 |
| 60938320 |      127293906 |
| 60938324 |      127373273 |
| 60938338 |      127438636 |
| 60938340 |      127438638 |
| 60938344 |      127438643 |
| 60938345 |      127438644 |
| 60938347 |      127438646 |
| 60938348 |      127441446 |
| 60938355 |      127393147 |
| 60938368 |      127442564 |
| 60938371 |      127442314 |
| 60938373 |      127442317 |
| 60938374 |      127442318 |
| 60938376 |      127442322 |
| 60938377 |      127442324 |
| 60938393 |      127373281 |
| 60938394 |      127442668 |
| 60938395 |      127442670 |
| 60938862 |      127433741 |
| 60938870 |      127319667 |
| 60938871 |      127346583 |
| 60939123 |      127378926 |
| 60939158 |      127206783 |
| 60939159 |      127378929 |
| 60939160 |      127378931 |
+----------+----------------+

Comment 20 Deleted

^ deleted, unrelated. (That was a swarming timeout, which is a separate bug also under investigation.)


There are only three more of those with NULL host_id:


mysql> select * from afe_host_queue_entries where id in ('127389186' ,'127287560'
    -> ,'127287561' ,'127287562' ,'127287564' ,'127441322' ,'127441328' ,'127388924'
    -> ,'127388925' ,'127293904' ,'127293905' ,'127293906' ,'127373273' ,'127438636'
    -> ,'127438638' ,'127438643' ,'127438644' ,'127438646' ,'127441446' ,'127393147'
    -> ,'127442564' ,'127442314' ,'127442317' ,'127442318' ,'127442322' ,'127442324'
    -> ,'127373281' ,'127442668' ,'127442670' ,'127433741' ,'127319667' ,'127346583' 
    -> ,'127378926' ,'127206783' ,'127378929' ,'127378931') AND host_id is NULL; 
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| id        | job_id    | host_id | status  | meta_host | active | complete | deleted | execution_subdir | atomic_group_id | aborted | started_on | finished_on |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
| 127373273 | 127035190 |    NULL | Aborted |     87565 |      0 |        1 |       0 |                  |            NULL |       1 | NULL       | NULL        |
| 127373281 | 127035198 |    NULL | Aborted |     87565 |      0 |        1 |       0 |                  |            NULL |       1 | NULL       | NULL        |
| 127393147 | 127055063 |    NULL | Aborted |    315354 |      0 |        1 |       0 |                  |            NULL |       1 | NULL       | NULL        |
+-----------+-----------+---------+---------+-----------+--------+----------+---------+------------------+-----------------+---------+------------+-------------+
3 rows in set (0.09 sec)

Done.
These will now become zombie special tasks.
And their autoservs are now undead.

mysql> UPDATE afe_special_tasks SET is_active=0 where queue_entry_id IN ('127373273', '127373281', '127393147');
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3  Changed: 3  Warnings: 0

^ We can come back tomorrow and properly kill any that are still shambling around.
Failed in a different place when trying to restart the special task.

Aborted it now:
mysql> UPDATE afe_special_tasks SET is_aborted=1 where queue_entry_id IN ('127373273', '127373281', '127393147');
Query OK, 2 rows affected (0.00 sec)
Rows matched: 3  Changed: 2  Warnings: 0

mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127393147;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127373281;
Query OK, 1 row affected (0.01 sec)

mysql> DELETE from afe_special_tasks WHERE queue_entry_id=127373273;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127373273;                                                                                                                                                                                                         
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127373281;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE from afe_aborted_host_queue_entries WHERE queue_entry_id=127393147;
Query OK, 0 rows affected (0.01 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127393147;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373281;
Query OK, 1 row affected (0.00 sec)

mysql> DELETE FROM afe_host_queue_entries WHERE id=127373273;
Query OK, 1 row affected (0.00 sec)
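
If this happens again, the per-row surgery above could presumably be collapsed into one pass (untested sketch, scoped like the earlier query to HQEs that still have an active special task but no host_id, and with the same caveat about possibly orphaning autoserv processes):

-- Untested sketch: collect the stuck HQE ids once, then clean up all three tables.
CREATE TEMPORARY TABLE stuck_hqes AS
  SELECT hqe.id
  FROM afe_host_queue_entries hqe
  JOIN afe_special_tasks st ON st.queue_entry_id = hqe.id
  WHERE st.is_active = 1 AND hqe.host_id IS NULL;

DELETE FROM afe_special_tasks
  WHERE queue_entry_id IN (SELECT id FROM stuck_hqes);
DELETE FROM afe_aborted_host_queue_entries
  WHERE queue_entry_id IN (SELECT id FROM stuck_hqes);
DELETE FROM afe_host_queue_entries
  WHERE id IN (SELECT id FROM stuck_hqes);

DROP TEMPORARY TABLE stuck_hqes;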


Labels: -Pri-0 Pri-1
The scheduler has gotten past that issue; demoting to P1.
Labels: -Pri-1 Pri-0
We're back.
NonHostExistError: Failed to schedule a job whose host_queue_entry_id=127373274L due to no host_id exists.

Labels: -Pri-0 Pri-1
Fixed again by deleting the HQE from the DB.
Status: Fixed (was: Untriaged)
Follow-up work forked to Issue 739530.
Project Member Comment 31 by bugdroid1@chromium.org, Jul 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0c93b86dc6a834857e1a18af7feb3945c4dd3d67

commit 0c93b86dc6a834857e1a18af7feb3945c4dd3d67
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Jul 14 02:46:10 2017

autotest: skip hqes that suffer from missing host_id malformation

Instead of crashing the scheduler, this CL will log the existence of the
malformed hqe and skip handling it (while incrementing a metric).

Follow-up work: determine a way to safely discard these hqes rather than
simply skipping them.

BUG=chromium:739486
TEST=None

Change-Id: I0aa4fe0822244b4fb8832bf88c3387ab6bc5b5cd
Reviewed-on: https://chromium-review.googlesource.com/570823
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/agent_task.py
[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/monitor_db.py
[modify] https://crrev.com/0c93b86dc6a834857e1a18af7feb3945c4dd3d67/scheduler/scheduler_lib.py
