bare assert in drone_manager can cause scheduler crash
Issue description
This is currently crashing the scheduler, as fallout from the shard migration. It is unclear whether the crash resolves itself or is a crash loop.
The scheduler should not crash on jobs in a bad state.
02/08 14:19:13.175 ERROR| monitor_db:0201| Uncaught exception, terminating monitor_db.
Traceback (most recent call last):
  File "/usr/local/autotest/scheduler/monitor_db.py", line 190, in main_without_exception_handling
    dispatcher.tick()
  File "/usr/local/autotest/site-packages/chromite/lib/metrics.py", line 483, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 395, in tick
    self._send_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 303, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/autotest/scheduler/monitor_db.py", line 976, in _send_to_lucifer
    self._send_gathering_to_lucifer()
  File "/usr/local/autotest/scheduler/monitor_db.py", line 1003, in _send_gathering_to_lucifer
    pidfile_id=pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 68, in spawn_gathering_job_handler
    drone = manager.get_drone_for_pidfile(pidfile_id)
  File "/usr/local/autotest/scheduler/luciferlib.py", line 171, in get_drone_for_pidfile
    return _wrap_drone(self._manager.get_drone_for_pidfile_id(pidfile_id))
  File "/usr/local/autotest/scheduler/drone_manager.py", line 361, in get_drone_for_pidfile_id
    return self._get_drone_for_pidfile_id(pidfile_id)
  File "/usr/local/autotest/scheduler/drone_manager.py", line 352, in _get_drone_for_pidfile_id
    assert pidfile_contents.process is not None
AssertionError
Feb 8 2018
The root cause is completely unrelated; keep this separate.
Feb 22 2018
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/502a6ac5c328e8b664f4dca300f045b0a744ec4a

commit 502a6ac5c328e8b664f4dca300f045b0a744ec4a
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Thu Feb 22 01:45:03 2018

autotest: raise exception when the pidfile is empty

BUG= chromium:810547
TEST=unittest

Change-Id: Ie529231ffd045b4f20eae8ffbf68f89a1a530c45
Reviewed-on: https://chromium-review.googlesource.com/929845
Commit-Ready: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/502a6ac5c328e8b664f4dca300f045b0a744ec4a/scheduler/drone_manager.py
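For readers following along, here is a rough sketch of the kind of change that commit title describes: replacing the bare `assert` with an explicit, descriptive exception. This is not the actual diff. `EmptyPidfileError`, `PidfileContents`, `get_drone_for_pidfile_contents`, and the hostname-keyed dict are all hypothetical names made up for illustration.

```python
class EmptyPidfileError(Exception):
    """Hypothetical error type; the real autotest class may be named differently."""


class PidfileContents(object):
    """Stand-in for parsed pidfile contents (assumption for this sketch).

    `process` is None when the pidfile is empty. Here it is simply a
    hostname string when present.
    """

    def __init__(self, process=None):
        self.process = process


def get_drone_for_pidfile_contents(pidfile_contents, drones_by_hostname):
    """Return the drone that owns the process recorded in the pidfile."""
    # A bare `assert pidfile_contents.process is not None` raises an
    # AssertionError with no message. An explicit exception type carries
    # a message and can be caught selectively by callers.
    if pidfile_contents.process is None:
        raise EmptyPidfileError('cannot find a drone: pidfile is empty')
    return drones_by_hostname[pidfile_contents.process]
```

Raising a dedicated type only changes what propagates; a caller still has to catch it, otherwise the scheduler terminates exactly as it did with the `AssertionError`.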
Feb 26 2018
We still need to ensure that the newly raised exception does not crash the scheduler.
Feb 27 2018
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/0446fc6401f8da0e9a07f657a982517a148d9da2

commit 0446fc6401f8da0e9a07f657a982517a148d9da2
Author: Shuqian Zhao <shuqianz@chromium.org>
Date: Tue Feb 27 19:43:11 2018

autotest: mute the exception when fail to get a drone due to empty pidfile

BUG= chromium:810547
TEST=unittest

Change-Id: I85bb32322b91ef06ff2c646ca450869000b9c8eb
Reviewed-on: https://chromium-review.googlesource.com/938662
Commit-Ready: Shuqian Zhao <shuqianz@chromium.org>
Tested-by: Shuqian Zhao <shuqianz@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/0446fc6401f8da0e9a07f657a982517a148d9da2/scheduler/monitor_db.py
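As an illustration of what "muting" the exception on the scheduler side could look like, here is a minimal sketch that catches the lookup failure per job, logs it, and moves on so one bad job cannot take down the whole tick. It is an assumption about the general shape, not the actual monitor_db.py change. `EmptyPidfileError`, `send_gathering_jobs`, `get_drone`, and `spawn_handler` are hypothetical names.

```python
import logging


class EmptyPidfileError(Exception):
    """Hypothetical error raised when a pidfile has no process recorded."""


def send_gathering_jobs(jobs, get_drone, spawn_handler):
    """Spawn a handler for each job, skipping jobs whose drone lookup fails.

    `get_drone` and `spawn_handler` are caller-supplied callables (assumed
    for this sketch). Returns the list of jobs that were skipped.
    """
    skipped = []
    for job in jobs:
        try:
            drone = get_drone(job)
        except EmptyPidfileError as e:
            # Catch only the expected failure; anything else still
            # propagates so genuine bugs are not hidden.
            logging.warning('Skipping job %r: %s', job, e)
            skipped.append(job)
            continue
        spawn_handler(job, drone)
    return skipped
```

Catching the narrow exception type rather than a blanket `Exception` keeps unrelated failures visible while letting the remaining jobs in the tick proceed.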
Comment 1 by akes...@chromium.org, Feb 8 2018