Issue 810547

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



bare assert in drone_manager can cause scheduler crash

Project Member Reported by akes...@chromium.org, Feb 8 2018

Issue description

This is currently crashing the scheduler, as fallout from the shard migration. It is unclear whether the crash is self-resolving or a crashloop.

Scheduler should not crash on jobs in a bad state.

02/08 14:19:13.175 ERROR|        monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 976, in _send_to_lucifer
    self._send_gathering_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1003, in _send_gathering_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 68, in spawn_gathering_job_handler
    drone = manager.get_drone_for_pidfile(pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 171, in get_drone_for_pidfile
    return _wrap_drone(self._manager.get_drone_for_pidfile_id(pidfile_id))
  File "/usr/local/autotest/scheduler/drone_manager.py", line 361, in get_drone_for_pidfile_id
    return self._get_drone_for_pidfile_id(pidfile_id)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 352, in _get_drone_for_pidfile_id
    assert pidfile_contents.process is not None
AssertionError
 
Doesn't appear to be a tight crashloop, judging by the metrics.
Mergedinto: 802909
Status: Duplicate (was: Untriaged)
Status: Untriaged (was: Duplicate)
The root cause is completely unrelated; keep this separate.
Labels: Chase-Pending
Labels: -Chase-Pending Chase
Owner: shuqianz@chromium.org
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
Project Member

Comment 7 by bugdroid1@chromium.org, Feb 22 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/502a6ac5c328e8b664f4dca300f045b0a744ec4a

commit 502a6ac5c328e8b664f4dca300f045b0a744ec4a
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Thu Feb 22 01:45:03 2018

autotest: raise exception when the pidfile is empty

BUG= chromium:810547 
TEST=unittest

Change-Id: Ie529231ffd045b4f20eae8ffbf68f89a1a530c45
Reviewed-on: https://chromium-review.googlesource.com/929845
Commit-Ready: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/502a6ac5c328e8b664f4dca300f045b0a744ec4a/scheduler/drone_manager.py
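
For readers without access to the diff, here is a minimal sketch of the general idea behind this change: replacing a bare assert with an explicit, descriptive exception. It is not the code from the commit. `EmptyPidfileError`, `PidfileContents`, and `get_drone_for_pidfile` are hypothetical names made up for illustration, and the real drone_manager logic is more involved.

```python
class EmptyPidfileError(Exception):
    """Hypothetical error type: the pidfile has no process recorded."""


class PidfileContents(object):
    """Minimal stand-in; `process` is None when the pidfile is empty."""

    def __init__(self, process=None):
        self.process = process


def get_drone_for_pidfile(pidfile_contents, drones_by_process):
    """Look up the drone that owns the process named in the pidfile."""
    # A bare `assert pidfile_contents.process is not None` would raise an
    # AssertionError with no message, and is stripped under `python -O`.
    # An explicit exception carries context and can be caught by type.
    if pidfile_contents.process is None:
        raise EmptyPidfileError('pidfile is empty; cannot find its drone')
    return drones_by_process[pidfile_contents.process]
```

A caller can then tell this expected bad-state case apart from a genuine programming error, which a bare AssertionError does not allow.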

Need to ensure that the raised exception does not crash the scheduler.
Project Member

Comment 9 by bugdroid1@chromium.org, Feb 27 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0446fc6401f8da0e9a07f657a982517a148d9da2

commit 0446fc6401f8da0e9a07f657a982517a148d9da2
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Tue Feb 27 19:43:11 2018

autotest: mute the exception when fail to get a drone due to empty pidfile

BUG= chromium:810547 
TEST=unittest

Change-Id: I85bb32322b91ef06ff2c646ca450869000b9c8eb
Reviewed-on: https://chromium-review.googlesource.com/938662
Commit-Ready: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/0446fc6401f8da0e9a07f657a982517a148d9da2/scheduler/monitor_db.py
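
As a rough illustration of what "muting" the exception in the caller can look like (again a sketch under assumptions, not the actual monitor_db.py change): the loop below catches only the expected error, logs it, and moves on to the next job, so one job in a bad state does not take down the whole tick. `EmptyPidfileError`, `spawn_handler`, and `send_jobs` are hypothetical names.

```python
import logging


class EmptyPidfileError(Exception):
    """Hypothetical error type: the pidfile has no process recorded."""


def spawn_handler(job_id, pidfiles):
    """Hypothetical stand-in for handing a single job off to a handler."""
    if not pidfiles.get(job_id):
        raise EmptyPidfileError('empty pidfile for job %s' % job_id)
    return 'handler-for-%s' % job_id


def send_jobs(job_ids, pidfiles):
    """Hand off every job, skipping (and logging) those in a bad state."""
    spawned = []
    for job_id in job_ids:
        try:
            spawned.append(spawn_handler(job_id, pidfiles))
        except EmptyPidfileError:
            # Catch only the expected error so unrelated bugs still surface.
            logging.exception('Skipping job %s', job_id)
    return spawned
```

Catching the narrow exception type rather than a blanket `Exception` keeps the scheduler alive for this known case without hiding other failures.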

Status: Fixed (was: Started)
