Drop dangerous call from scheduler to host_scheduler.schedule_host_job |
||||||
Issue descriptionhosts may be assigned to jobs via two independent paths when inline_host_acquisition is False: Normal path: Via host_scheduler process. This goes via HostScheculer.tick --> HostScheduler->_schedule_jobs --> (HostScheduler.find_hosts_for_jobs, HostScheduler._reocrd_host_assignments, HostScheduler.schedule_host_jobs) [http://shortn/_mY1kPHrRI6] But, there is a second path. If BaseDispatcher finds jobs waiting for hosts, it can try to assign hosts to jobs via: BaseDispatcher._schedule_new_jobs -> (HostScheduler.find_hosts_for_jobs, HostScheduler.schedule_host_job) The second path is taken despite _inline_host_acquisition being False if we find an non matched host_jobs [http://shortn/_6iyaG1T1SB] There are two problems with this: (1) HostScheduler is not written to be thread safe. Both the host_scheduler and monitor_db processes could try to assign hosts to the same job and step over each other's feet. (2) The path from monitor_db doesn't exactly reflect the one from HostScheduler.tick. So, it could confuse future assignment of hosts from HostScheduler. Now, we do have metrics about how often we assign hosts via each path, and it looks like we _never_ take the monitor_db path:http://shortn/_ZUAqxl89mc There is also supposed to be an email alert whenever we take that path, and there are 0 instances of this alert being raised in either of the email aliases: http://shortn/_ZdDdBqBMLu http://shortn/_v9JEf2egTI So, I propose we nix this code path.
,
Dec 16 2016
+dshi, +jrbarnette to catch me before I cause disaster, if I'm reading this incorrectly.
,
Dec 16 2016
,
Dec 16 2016
Just curious, under what condition does the scheduler do the host assignment bypassing host scheduler? If that's the condition we don't need to handle, that sounds like dead code we can remove. If there is some edge case that host scheduler won't handle, we might want to fix that.
,
Dec 20 2016
,
Dec 20 2016
Continuing from a question I asked the list: http://shortn/_bgviCRsD4W test_that doesn't (directly) depend on _inline_host_acquisition=True either. It sets up an in memory DB and then walks through the job life cycle itself. It doesn't use the scheduler at all, so should not be affected by this. So, there is no good reason to keep supporting _inline_host_acquisition that I know of at this point. I'll try to delete it and see what happens. ;)
,
Dec 20 2016
moblab is a blocker. It uses inline host acquisition. No separate host_scheduler process is launched: https://cs.corp.google.com/chromeos_public/src/overlays/project-moblab/chromeos-base/chromeos-bsp-moblab/files/init/ Again, this is pretty easy to change. It'll let us not have to support both.
,
Dec 22 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3c6a3bfd883d9770af66f9e14fa595271774c903 commit 3c6a3bfd883d9770af66f9e14fa595271774c903 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Tue Dec 20 19:29:02 2016 scheduler: Don't schedule host_jobs when inline acquisition is false. host_scheduler, when run as an independent service assumes that it is the sole service scheduilng host_jobs. A fallback path in scheduler would try to schedule these jobs. This situation was never possible in the normal case, but the fallback was dangerous -- it could leave the DB corrupted. Instead, report errors in the fallback path and refuse to schedule the host jobs. BUG= chromium:675048 TEST=unittests, local jobs. Change-Id: I60bb56ded41c34484f6e281fdd12ace5601e739d Reviewed-on: https://chromium-review.googlesource.com/422535 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/3c6a3bfd883d9770af66f9e14fa595271774c903/scheduler/monitor_db.py
,
Jan 30 2017
Turns out this requires pesky fixes for moblab. Like everything else, it's hard to do this for moblab. Not Fixit size.
,
Jun 28 2017
I don't care about inline_host_acquisition=False. I don't. |
||||||
►
Sign in to add a comment |
||||||
Comment 1 by pprabhu@chromium.org
, Dec 16 2016