master scheduler crashlooping due to no host_id
Issue description
tick rate dropped to 0
08/16 11:42:31.797 DEBUG| agent_task:0180| No host is found for host_queue_entry_id: 135592551L
08/16 11:42:31.824 ERROR| monitor_db:0183| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 166, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 323, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 487, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 502, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 565, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/monitor_db.py", line 639, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 368, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 184, in _set_ids
    % entry.id)
NoHostIdError: Failed to schedule a job whose host_queue_entry_id=135592551L due to no host_id.
,
Aug 16 2017
mysql> select name from afe_jobs where id =135241986;
+------------------------------------------------------------------------------------------+
| name                                                                                     |
+------------------------------------------------------------------------------------------+
| nyan_blaze-release/R62-9843.0.0/wifi_matfunc/network_WiFi_DisconnectReason.deauth_client |
+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

I recently moved nyan_blaze from chromeos_server42.cbf to chromeos_server33.cbf; this is probably caused by that. I will take a look.
,
Aug 16 2017
The fix in that issue was to skip malformed HQEs. However, this is a malformed special task (?), which isn't covered by the same stack. The relevant behavior was to catch MalformedRecordError and silently skip the record: https://chromium-review.googlesource.com/#/c/chromiumos/third_party/autotest/+/570823/3/scheduler/monitor_db.py We should add a similar check wherever this is crashing too. I can monkey patch it onto production...
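A minimal sketch of the "catch and skip" idea, for illustration only. It is not the code from the linked change: `MalformedRecordError` here is a local stand-in class, `build_agent_tasks` and `make_agent_task` are hypothetical names, and tasks are plain dicts rather than scheduler models. Whether the `NoHostIdError` in the traceback is covered by `MalformedRecordError` in the real scheduler is an assumption that would need checking.

```python
import logging


class MalformedRecordError(Exception):
    """Local stand-in for the scheduler's 'unusable database record' error."""


def build_agent_tasks(special_tasks, make_agent_task):
    """Build one agent task per special task, skipping malformed records."""
    agent_tasks = []
    for task in special_tasks:
        try:
            agent_tasks.append(make_agent_task(task))
        except MalformedRecordError as e:
            # Log and skip, so one bad row does not abort recovery.
            logging.warning('Skipping malformed special task %r: %s', task, e)
    return agent_tasks


def make_agent_task(task):
    """Hypothetical factory: a task without a host_id cannot be scheduled."""
    if task.get('host_id') is None:
        raise MalformedRecordError('no host_id for task %s' % task['id'])
    return ('agent_task', task['id'])


tasks = [
    {'id': 1, 'host_id': 10},
    {'id': 2, 'host_id': None},  # malformed: no host
    {'id': 3, 'host_id': 12},
]
result = build_agent_tasks(tasks, make_agent_task)
```

The tradeoff is that skipped tasks stay in the database, so the bad entries still need to be cleaned up separately; the skip only keeps the scheduler from crashlooping on them.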
,
Aug 16 2017
#2 yes, that probably triggered this issue. We need to either manually clean the breaking entries, or monkey patch a fix. I'll see if I can get a fix working quickly...
,
Aug 16 2017
,
Aug 16 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a829401beff6b5d483090758fa007485d64860f0

commit a829401beff6b5d483090758fa007485d64860f0
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Aug 16 19:02:24 2017

autotest: scheduler: skip malformed special tasks

BUG= chromium:756128
TEST=None

Change-Id: I95b06dd368491880cf3a4ff1f6d36a39e9aafa6d
Reviewed-on: https://chromium-review.googlesource.com/617584
Reviewed-by: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/a829401beff6b5d483090758fa007485d64860f0/scheduler/monitor_db.py
,
Aug 16 2017
The fix has been pushed to the master, and the master scheduler has recovered.
,
Aug 16 2017
,
Aug 18 2017
Comment 1 by akes...@chromium.org
, Aug 16 2017