
Issue 714732

Starred by 1 user

Issue metadata

Status: Archived
Owner: ----
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




don't crashloop master scheduler when facing invalid host_queue_entries

Project Member Reported by akes...@chromium.org, Apr 24 2017

Issue description

Follow-up to  crbug.com/714571 

EXCEPTION: Uncaught exception; terminating monitor_db
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 179, in main_without_exception_handling
    dispatcher.initialize(recover_hosts=options.recover_hosts)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 336, in initialize
    self._recover_processes()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 491, in _recover_processes
    agent_tasks = self._create_recovery_agent_tasks()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 506, in _create_recovery_agent_tasks
    + self._get_special_task_agent_tasks(is_active=True))
  File "/usr/local/autotest/scheduler/monitor_db.py", line 560, in _get_special_task_agent_tasks
    for task in special_tasks]
  File "/usr/local/autotest/scheduler/monitor_db.py", line 637, in _get_agent_task_for_special_task
    return agent_task_class(task=special_task)
  File "/usr/local/autotest/scheduler/prejob_task.py", line 368, in __init__
    self._set_ids(host=self.host, queue_entries=[self.queue_entry])
  File "/usr/local/autotest/scheduler/agent_task.py", line 166, in _set_ids
    self.host_ids = [entry.host.id for entry in queue_entries]
AttributeError: 'NoneType' object has no attribute 'id'



When faced with invalid entries such as this, we should probably raise a purpose-specific exception and log it rather than crashlooping.
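A minimal sketch of what a purpose-specific exception could look like, assuming the failure is always a queue entry whose host is None (as the traceback suggests). InvalidQueueEntryError and collect_host_ids are hypothetical names for illustration, not existing autotest code, and the SimpleNamespace objects stand in for real database rows.

```python
from types import SimpleNamespace


class InvalidQueueEntryError(Exception):
    """Hypothetical error: a queue entry has no host attached."""


def collect_host_ids(queue_entries):
    """Return each entry's host id, rejecting entries whose host is None."""
    host_ids = []
    for entry in queue_entries:
        if entry.host is None:
            # Fail with a descriptive error instead of an AttributeError.
            raise InvalidQueueEntryError(
                'queue entry %s has no host' % entry.id)
        host_ids.append(entry.host.id)
    return host_ids


# Stand-in objects; real entries would come from the scheduler database.
good = SimpleNamespace(id=1, host=SimpleNamespace(id=42))
bad = SimpleNamespace(id=2, host=None)
```

With this shape, the caller can tell a known database inconsistency apart from any other AttributeError and decide how to handle it.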

Possible approaches proposed:
 1. Catch and log the exception in the tick.
 2. Crash when seeing this error, but modify the scheduler start-up db_cleanup phase to fix these problems.
 3. Get the sentinel service to fix these problems.

I suggest #1: it is probably the easiest, and it would also have the lowest impact if we encounter this DB inconsistency again in the future.
 

Comment 1 by dshi@chromium.org, Apr 24 2017

One risk with #1 is the scope of the try-except. We don't want to handle every exception in every code path, as that might hide critical failures. On the other hand, handling only one particular code path, _set_ids in this case, might not be enough, as there could be other failures. Also, ignoring the exception in some cases might have ripple effects that lead to unpredictable behavior.
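To illustrate the scoping concern, here is a rough sketch of a narrowly scoped handler. It assumes invalid entries surface as a dedicated exception type; InvalidQueueEntryError, build_agent_tasks, and make_agent_task are hypothetical names, not the scheduler's actual API. Only the known inconsistency is logged and skipped; everything else still propagates.

```python
import logging


class InvalidQueueEntryError(Exception):
    """Hypothetical error for a special task whose queue entry has no host."""


def build_agent_tasks(special_tasks, make_agent_task):
    """Build agent tasks, skipping only the ones known to be invalid.

    Returns (agent_tasks, skipped). Any exception other than
    InvalidQueueEntryError propagates, so unrelated failures stay visible.
    """
    agent_tasks, skipped = [], []
    for task in special_tasks:
        try:
            agent_tasks.append(make_agent_task(task))
        except InvalidQueueEntryError:
            # Log with traceback and continue with the remaining tasks.
            logging.exception('Skipping invalid special task %r', task)
            skipped.append(task)
    return agent_tasks, skipped
```

Returning the skipped tasks lets the caller mark them failed or report them, which may limit the ripple effects of silently dropping them. Whether that is sufficient for the real scheduler is an open question.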

Comment 2 by aut...@google.com, Apr 25 2017

Labels: -current-issue Hotlist-Fixit
Status: Archived (was: Untriaged)
This bug is very old, is Untriaged, and has no owner.  If it is still relevant, reopen as Untriaged or open a new bug
