
Issue 756128


Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




master scheduler crashlooping due to no host_id

Project Member Reported by akes...@chromium.org, Aug 16 2017

Issue description

tick rate dropped to 0

08/16 11:42:31.797 DEBUG|        agent_task:0180| No host is found for host_queue_entry_id: 135592551L
08/16 11:42:31.824 ERROR|        monitor_db:0183| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 166, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 323, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 487, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 502, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 565, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/monitor_db.py", line 639, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 368, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 184, in _set_ids
    % entry.id)
NoHostIdError: Failed to schedule a job whose host_queue_entry_id=135592551L due to no host_id.

 
This is suspiciously similar to Issue 739486, which was thought to be fixed.
mysql>  select name from afe_jobs where id =135241986;
+------------------------------------------------------------------------------------------+
| name                                                                                     |
+------------------------------------------------------------------------------------------+
| nyan_blaze-release/R62-9843.0.0/wifi_matfunc/network_WiFi_DisconnectReason.deauth_client |
+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I recently moved nyan_blaze from chromeos_server42.cbf to chromeos_server33.cbf; that probably caused this. I will take a look.

The fix in that issue was to skip HQEs that were malformed. However, what is malformed here is a special task (?), and it goes through a different stack, so that fix doesn't cover it.

The relevant behavior was to catch MalformedRecordError and silently skip it. https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/570823/3/scheduler/monitor_db.py

We should add a similar check to wherever this is crashing too. I can monkey patch it onto production...

Re #2: yes, that probably triggered this issue. We need to either manually clean up the breaking entries or monkey patch a fix. I'll see if I can get a fix working quickly...
Project Member

Comment 6 by bugdroid1@chromium.org, Aug 16 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a829401beff6b5d483090758fa007485d64860f0

commit a829401beff6b5d483090758fa007485d64860f0
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Aug 16 19:02:24 2017

autotest: scheduler: skip malformed special tasks

BUG= chromium:756128 
TEST=None

Change-Id: I95b06dd368491880cf3a4ff1f6d36a39e9aafa6d
Reviewed-on: https://chromium-review.googlesource.com/617584
Reviewed-by: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/a829401beff6b5d483090758fa007485d64860f0/scheduler/monitor_db.py

Comment 7 by nxia@chromium.org, Aug 16 2017

The fix has been pushed to the master, and the master scheduler has recovered.

Comment 8 by nxia@chromium.org, Aug 16 2017

Labels: -Pri-0 Pri-1

Comment 9 by nxia@chromium.org, Aug 18 2017

Status: Fixed (was: Untriaged)
