Add instrumentation to host scheduler
Issue description
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/scheduler/host_scheduler.py is the script in question. Finer-grained instrumentation that tells us how much time each stage of the tick() routine is taking will help us recognize the hotspots, and will also help us understand whether any improvements we make are having the desired effect consistently. I've added David Riley to the CC list, as he may have information about how to add instrumentation and create dashboards that observe said metrics.
Apr 26 2017
David, could you kindly point me to some canonical example or doc for adding metrics (and how to create dashboards for them) for such scripts that run on the test servers? Thanks!
Apr 26 2017
Adding phobbs for the best examples of metrics -- he may have written documentation as well. I'd recommend writing up the details in this bug and getting some combination of akeshet/phobbs/myself to review, detailing the type of metric and the fields you're planning on adding. Changing things after the fact is somewhat problematic, so it makes sense to get things right in the first place.

For the Viceroy dashboards, I'd form your queries in Panopticon first; the dashboard code lives in google3/monitoring/viceroy/dashboards/chromeos. I'm not sure how much you know about Mash/Monarch, but there should be plenty of documentation at go/monarch-userguide and go/mash-reference.
Apr 27 2017
I've written up some notes on first steps, as well as open questions. I'm still unclear on the big picture here (where and how host_scheduler fits into the test infrastructure): https://docs.google.com/a/google.com/document/d/13UFXCgJnCBVBm3WMKIRE_yN-7AQbJUyKf-kYtvUctGs/edit?usp=sharing I would greatly appreciate some pointers as comments.
Apr 28 2017
We have a tick rate counter for the host scheduler; you can see an example of its output here: https://viceroy.corp.google.com/chromeos/deputy-view#_VG__BnnYtbi The big swings to/from zero that you see are an artifact of a bug in the metric that has been fixed but not yet pushed to prod (https://chromium-review.googlesource.com/c/486249/).

I think we have a worse problem with the scheduler tick than with the host-scheduler tick. The slowest host scheduler looks to be around 12 ticks per minute, which means we're paying a one-time 5-second cost for each job. The scheduler tick is a much bigger lever, because we pay that cost for each transition in the job lifecycle, of which there are several.
Apr 28 2017
Yep, please use the RuntimeBreakdownCounter for these metrics (like the scheduler). Any dashboarding needs will be taken care of uniformly if you do that.
Apr 28 2017
Here's the example client code: http://shortn/_lrdxwYacuH
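For readers who cannot follow the short link, here is a minimal sketch of per-stage timing for tick(). It uses the plain SecondsTimer context manager from chromite.lib.metrics as a stand-in rather than the RuntimeBreakdownCounter recommended above, and the metric paths, stage names, and scheduler methods are made-up illustrations, not the contents of the linked client code.

# Sketch only: the metric paths, stage names, and scheduler methods below
# are illustrative assumptions, not the contents of the linked example.
from chromite.lib import metrics

_PREFIX = 'chromeos/autotest/host_scheduler/tick_stage_durations/'


def instrumented_tick(scheduler):
    """Run one tick, reporting each stage's duration as its own metric."""
    with metrics.SecondsTimer(_PREFIX + 'find_hqes'):
        hqes = scheduler.find_hqes_needing_hosts()
    with metrics.SecondsTimer(_PREFIX + 'acquire_hosts'):
        hosts = scheduler.acquire_free_hosts(hqes)
    with metrics.SecondsTimer(_PREFIX + 'schedule'):
        scheduler.match_and_schedule(hqes, hosts)

Timing each stage separately makes hotspots visible as individual timeseries instead of one aggregate tick duration.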
Apr 28 2017
Where would the scheduler tick code be located?
Apr 28 2017
Re #10: Linked to in #9. host_scheduler.py should do something very similar to that. I can help review.
May 5 2017
So I am looking at the scheduler code alongside an execution strace of it. A lot of the time is spent servicing small requests like this one:

18:16:25 write(3, "*\0\0\0\3SELECT * FROM afe_jobs WHERE ID=113762073", 46) = 46 <0.000121>

This happens because we are creating a DBObject for each such lookup; DBObject is the base class of the Job object, and we are creating a lot of Job objects. So we need to instrument whatever is creating that many Job objects.
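One possible way to get that instrumentation is sketched below: wrap the row-fetch path and tally which call sites trigger it. The _fetch_row_from_db name and its (self, row_id) signature are taken from the later discussion in this bug; everything else (function names, the global counter) is made up for illustration.

import collections
import sys

# Illustrative global tally of which call sites end up fetching a row;
# a real version would reset this at the start of every tick and log or
# export the top entries at the end.
QUERY_COUNT = collections.Counter()


def count_row_fetches(db_object_cls):
    """Wrap the class's _fetch_row_from_db to tally its callers."""
    original = db_object_cls._fetch_row_from_db

    def wrapper(self, row_id):
        caller = sys._getframe(1)  # the frame that requested the row
        key = '%s:%d' % (caller.f_code.co_filename, caller.f_lineno)
        QUERY_COUNT[key] += 1
        return original(self, row_id)

    db_object_cls._fetch_row_from_db = wrapper

Calling count_row_fetches() once on the DBObject class at startup and dumping QUERY_COUNT.most_common() after each tick should show which code paths are responsible for the flood of single-row SELECTs.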
May 9 2017
BTW: https://viceroy.corp.google.com/chromeos/tick_breakdown was added.
May 24 2017
Re: Comment #13. Thanks David. Is it possible to add a dashboard with metrics pertaining to the host_scheduler.py script? There are decorators that suggest these metrics are getting exported, but I don't know whether they're available by default in Panopticon: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/scheduler/host_scheduler.py?rcl=7c0bccc268fc4a87fa3b586387a6414ee508299b&l=324
Jun 1 2017
Forking the discussion back from https://chromium-review.googlesource.com/c/517249/

Caching job_ids in the host_scheduler is unlikely to work. OTOH, host_scheduler can definitely be a _lot_ smarter about how it obtains job information. Roughly, a single host scheduler tick does the following:
- look up HQEs that need host matching
- look up free hosts
- match hosts to HQEs
- update HQEs

To reduce the number of DB hits, host_scheduler should run batch (Django) queries instead of using Django models directly (because Django models don't do batching very well, in my experience).

But, taking a step back: is the goal here to _instrument_ host_scheduler to find out where improvements are needed? Are we convinced that host_scheduler is a good place to put effort into? IOW, my experience so far is that host_scheduler is not the bottleneck in the lab (master / shards). It _is_ creating DB traffic that we can do without, so that's worth addressing. If we want to do just that, we should first instrument the Django layer to tell us how many distinct DB queries are being made per tick.
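For that last point, a small sketch of counting queries per tick via Django's built-in query log; this only records anything when settings.DEBUG is True, so it is a local debugging aid rather than production instrumentation, and the function name is made up.

import collections

from django.db import connection, reset_queries


def report_queries_for_tick():
    """Print how many queries (total and distinct) ran since the last reset.

    Django only records queries when settings.DEBUG is True, so this is a
    local debugging aid, not production instrumentation.
    """
    counts = collections.Counter(q['sql'] for q in connection.queries)
    print('%d queries, %d distinct' % (len(connection.queries), len(counts)))
    for sql, count in counts.most_common(10):
        print('%5d  %s' % (count, sql[:120]))
    reset_queries()  # start a fresh log for the next tick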
Jun 1 2017
Nothing here is RVG.
Jun 1 2017
host_scheduler is single threaded and usually uses close to a full core on a 32 core shard, hence it is a bottleneck in the lab.
Jun 1 2017
Re #17: host_scheduler is a bottleneck on the master scheduler, but not on a shard. A shard manages around 100 DUTs and schedules far fewer jobs than the master. Also, since we've sharded most DUTs, the host scheduler should run much faster on the master too. Do we know the host scheduler tick time on the master, so we can conclude it's actually a bottleneck?
Jun 1 2017
A shard manages 150-200 DUTs and will have more when we assign more boards. The shard scheduler is a bottleneck.
Jun 7 2017
RE: Comment #15: Yes, the original intent of this bug was to instrument host_scheduler. Some stracing and print-debugging has revealed that the SQL querying seems sub-optimal, and if that work is significant, we can fork a separate bug.

Can we consider having a simple daemon through which all RDB querying is performed, and have the write-through caching and batching done there? While it would be a single process, we can make it multi-threaded if that helps performance. There's probably some work involved in updating all services that make RDB queries to use the new model, but caching does seem like it would save several RDB calls. (I'd be happy to help write this daemon and convert existing calls to use it :) )

I'm going to conduct some rudimentary analysis of the time taken for scheduler_tick when using in-memory caching vs. the current production method (while the new implementation can't be used in prod, I seem to be able to run it locally for long enough to get some measurements). That might give some insight into how much savings caching actually provides.
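To make the caching idea concrete, a rough sketch of the kind of write-through row cache being proposed follows; the class and method names are invented for illustration, invalidation is deliberately ignored, and in the daemon version these calls would become IPC rather than local dict lookups.

class WriteThroughRowCache(object):
    """Illustrative write-through cache for (table, row_id) -> row mappings.

    This is a thought-experiment sketch, not the proposed daemon: in the
    daemon version this dict would live in its own process and get/put
    would become IPC calls.
    """

    def __init__(self, fetch_fn):
        # fetch_fn stands in for the existing DB fetch routine,
        # e.g. something with the shape of _fetch_row_from_db.
        self._fetch_fn = fetch_fn
        self._rows = {}

    def get(self, table, row_id):
        key = (table, row_id)
        if key not in self._rows:
            self._rows[key] = self._fetch_fn(table, row_id)  # cache miss
        return self._rows[key]

    def put(self, table, row_id, row):
        # Write-through: a real implementation would issue the UPDATE to
        # the DB here before (or atomically with) refreshing the cache.
        self._rows[(table, row_id)] = row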
Jun 8 2017
Re #20: A separate bug for the fixing isn't necessary (we can rename this one ;) ). IMO, https://chromium-review.googlesource.com/c/521687/ is a great example of discussion on this bug leading to improvements that are future-proof.

Specific to your idea in #20, my first reaction is that it is at odds with the direction autotest infra code is going. We are trying to:
- short term: make monitor_db multiprocess so that each job roughly has its own process. All these processes will interact with the database independently.
- medium term: have these processes not interact with the database at all, but with a different synchronization service elsewhere.

If interested in the details, please see go/monitor_db_per_job_refactor. I am concerned about adding another process hop here that doesn't fit into the future architecture. OTOH, batching of DB queries, like the CL mentioned above, will give us similar benefits and will continue to do so in the future.
Jun 9 2017
Thanks for the comments, Prathmesh. Re: Comment #21, WRT the short-term optimizations you mentioned, I'm wondering what latencies are involved. I did a rough measurement (using time.time()) of each SQL call in _fetch_row_from_db(), and the time taken varies between 0.350 ms and 0.550 ms. In your experience, what would you expect IPC times to be if we were to house an in-memory cache of the DB in a separate process? I'm interested in knowing whether it makes sense latency-wise, as a thought experiment.
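For reference, the measurement was along these lines; the decorator below is a reconstruction of the approach described (wrap the call and log elapsed wall time), not the exact code that was run.

import logging
import time


def timed(fn):
    """Log the wall-clock duration of each call to fn, in milliseconds."""
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.time() - start) * 1000.0
            logging.info('%s took %.3f ms', fn.__name__, elapsed_ms)
    return wrapper

# Applied by hand to the method under investigation, e.g.:
#   DBObject._fetch_row_from_db = timed(DBObject._fetch_row_from_db)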
Jun 9 2017
(To clarify, Comment #22 asks about the IPC latencies from the point of view of having a table cache daemon)
Jun 10 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/085c1c5b39365553cda5fcafa935b1a18680d077

commit 085c1c5b39365553cda5fcafa935b1a18680d077
Author: Po-Hsien Wang <pwang@chromium.org>
Date: Sat Jun 10 04:44:02 2017

scheduler_models.py: group multiple afe_jobs queries

Override the fetch method in HostQueueEntry to prevent HostQueueEntry from fetching its own jobs from the DB individually. Use
'SELECT * FROM afe_jobs WHERE ID IN (jid1, ...) ORDER BY FIELD (ID, jid1, jid2)'
to fetch all the job details from afe_jobs at once.

BUG=chromium:715463
TEST=[1] unittests
TEST=[2] python host_scheduler.py on guado_moblab (after setting inline_host_acquisition: False in moblab_config.ini)

BEFORE (`strace | grep afe_jobs` per tick)
write("SELECT afe_host_queue_entries.* FROM afe_host_queue_entries...") = 371
write("SELECT * FROM afe_jobs WHERE ID=667") = 40
read ("def\24chromeos_autotest_db\10afe_jobs\...") = 3305
write("SELECT * FROM afe_jobs WHERE ID=669") = 40
read ("def\24chromeos_autotest_db\10afe_jobs\...") = 3278
...
write("SELECT afe_jobs.id, afe_acl_groups_users.aclgroup_id...") = 616
read ("def\24chromeos_autotest_db\10afe_jobs\...") = 997
write("SELECT job_id, label_id FROM ...") = 432
read ("def\24chromeos_autotest_db\...") = 2313

AFTER (`strace | grep afe_jobs` per tick)
write("SELECT afe_host_queue_entries.* FROM afe_host_queue_entries...") = 371
write("SELECT * FROM afe_jobs WHERE ID IN (667...") = 279
read ("def\24chromeos_autotest_db\10afe_jobs\...") = 8192
write("SELECT afe_jobs.id, afe_acl_groups_users.aclgroup_id...") = 384
read ("def\24chromeos_autotest_db\10afe_jobs\...") = 417
write("SELECT job_id, label_id FROM ...") = 200
read ("def\24chromeos_autotest_db\...") = 896

Change-Id: I657dbf29e679046452710de7627364f26fc1a81d
Reviewed-on: https://chromium-review.googlesource.com/521687
Commit-Ready: Po-Hsien Wang <pwang@chromium.org>
Tested-by: Po-Hsien Wang <pwang@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/085c1c5b39365553cda5fcafa935b1a18680d077/scheduler/scheduler_models_unittest.py
[modify] https://crrev.com/085c1c5b39365553cda5fcafa935b1a18680d077/scheduler/scheduler_models.py
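For readers skimming this bug, the core of the change is the batched IN (...) query with FIELD() to preserve the requested order. Below is a condensed, illustrative sketch of that query construction; the real CL does this inside HostQueueEntry's fetch override in scheduler_models.py and goes through the scheduler's DB layer rather than raw string formatting.

def batched_job_query(job_ids):
    """Build one SELECT for many afe_jobs rows, preserving the input order.

    Illustrative only; see the CL above for the actual implementation.
    """
    if not job_ids:
        return None
    id_list = ', '.join(str(int(job_id)) for job_id in job_ids)
    return ('SELECT * FROM afe_jobs WHERE ID IN (%s) '
            'ORDER BY FIELD (ID, %s)' % (id_list, id_list))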
Jun 19 2017
Aside from the afe_jobs queries, after I added more DUTs to my local moblab I found several afe_hosts queries: one afe_hosts query for each grouped afe_jobs query. Though there is already a cache mechanism in place, the cache hit ratio seems to be only about 20-50% for each host. https://pcon.corp.google.com/p#chrome-infra/queryplayground?yAxisMin=0&yAxisMin2=0&oldHeatmap=false&query=CAEY-AOKAXAKZAgPemAKXjJcChltb25hcmNoLmFjcXVpc2l0aW9ucy5UYXNrEj8KPS9jaHJvbWUvaW5mcmEvY2hyb21lb3MvYXV0b3Rlc3Qvc2NoZWR1bGVyL3JkYi9jYWNoZS9oaXRfcmF0aW8qAiAAMICkp9oGkgEXCgYIARigjQYSABoJaG9zdF9uYW1lYAA&names=Query%201
Jun 19 2017
It would be interesting to know exactly what is being cached. Without that context, it is difficult to reason about whether that hit ratio is reasonable.
Jun 19 2017
The code which manages the afe_hosts cache: http://cs/chromeos_public/src/third_party/autotest/files/scheduler/rdb_cache_manager.py I'm not sure whether we could improve it by caching a more complete Hosts/Labels/ACLs data structure in memory, managed by ourselves.
Oct 26 2017
I'm not actively working on this issue, so I'm going to go ahead and close it.