instrument HQE lifecycle (autotest: add job lifecycle metrics)
Issue description

We currently have a few ways to measure the throughput of jobs through our system.
- scheduler tick tells us how fast we're dealing with micro tasks.
- # jobs scheduled / completed tells us the actual throughput wrt jobs.

What's missing is a way to measure the latency of each job going through the system. Arguably, this is even more important to the end-user of our system. The questions to answer are:
- Once a job is created, how long does it take to finish?
- How is this time split between the various steps?

idea: track the time spent in each stage of the HQE life-cycle. Something along the lines of apache_log_metrics.

Maintain an audit log at a well-known location where the scheduler appends an entry (hqe_id, state, timestamp) every time the HQE state changes:
https://cs.corp.google.com/search/?q=set_status.*HostQueueEntry.Status&m=100&sq=package:chromeos+file:src/third_party/autotest/files&type=cs

A separate metrics reporting daemon consumes this log and reports the amount of time spent between each state transition for each job separately.
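To make the proposal concrete, here is a minimal sketch of what the reporting side could compute from such an audit log. It assumes the log has already been parsed into (hqe_id, state, timestamp) tuples with numeric timestamps; the function name `stage_durations` and the sample state names are illustrative only and are not taken from the autotest code.

```python
from collections import defaultdict


def stage_durations(entries):
    """Return, per HQE, the time spent between consecutive state changes.

    `entries` is an iterable of (hqe_id, state, timestamp) tuples, in any
    order. This tuple layout mirrors the audit log entry proposed above;
    the on-disk format itself is left open.
    """
    by_hqe = defaultdict(list)
    for hqe_id, state, timestamp in entries:
        by_hqe[hqe_id].append((timestamp, state))

    durations = {}
    for hqe_id, events in by_hqe.items():
        # Sort by timestamp only, so entries with equal timestamps keep
        # the order in which they were appended to the log.
        events.sort(key=lambda event: event[0])
        durations[hqe_id] = [
            (state, next_state, next_ts - ts)
            for (ts, state), (next_ts, next_state) in zip(events, events[1:])
        ]
    return durations


# Hypothetical log contents, for illustration only.
sample_log = [
    (1, "Queued", 100.0),
    (2, "Queued", 110.0),
    (1, "Starting", 130.0),
    (1, "Running", 135.0),
    (2, "Running", 200.0),
    (1, "Completed", 435.0),
]

for hqe_id, transitions in sorted(stage_durations(sample_log).items()):
    for state, next_state, seconds in transitions:
        print(hqe_id, state, "->", next_state, seconds)
```

Summing the per-transition durations for one HQE gives the end-to-end latency asked about in the first question; the individual entries answer the second.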
,
Jun 2 2017
Another request here would be one extra step (outside of the HQE status changes): the point at which the HQE obtains the DUT. It's important to know how long the HQE was sitting around waiting for DUT assignment.
,
Jun 5 2017
One motivation behind this request is the (repeated) discussion around host_scheduler tick. From my analysis so far, I believe that host_scheduler tick is _not_ a bottleneck for the job lifecycle. (It may be eating up too much CPU because it's written inefficiently, e.g., one of the CLs on that bug fixes the large number of DB queries it was generating, and it may have other problems.) I want a decisive metric for this: how is our job lifecycle doing, and what are the bottlenecks?
,
Jun 5 2017
The bug mentioned in #3: https://bugs.chromium.org/p/chromium/issues/detail?id=715463#c18
,
Jun 5 2017
+ some iptaskforce people. Do we currently have such a metric available from another source? (Perhaps suite_report?) Can someone point out the nearest approximation that exists today? I'd also like your thoughts on the usefulness of such a metric.
,
Jun 5 2017
We might be able to correlate AFE job and HQE, but we don't have the HQE broken up much beyond special tasks (Provision/Reset) and the actual test invocation. I'd love to have more information there.
,
Jun 7 2017
,
Jun 12 2017
,
Jul 24 2017
This is Paul's P1 OKR.
,
Aug 1 2017
,
Aug 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f
commit 68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f
Author: Paul Hobbs <phobbs@google.com>
Date: Sat Aug 26 02:54:07 2017
[autotest] Refactor record_autoserv a bit
Pull out job status code for reuse.
BUG= chromium:729199
TEST=None
Change-Id: I5b8b2f35b800e58eaeaebecabf6cef5e0de8f409
Reviewed-on: https://chromium-review.googlesource.com/625678
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f/server/autoserv
,
Aug 26 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e9fd557506c95a6fcca8c2e3e11f3556463287ac
commit e9fd557506c95a6fcca8c2e3e11f3556463287ac
Author: Paul Hobbs <phobbs@google.com>
Date: Sat Aug 26 02:54:08 2017
[autotest] Record to cloud trace in autoserv
BUG= chromium:729199
TEST=None
Change-Id: I8cdc955ac11b9ed69ac84abb45b8c985ede67848
Reviewed-on: https://chromium-review.googlesource.com/625680
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/e9fd557506c95a6fcca8c2e3e11f3556463287ac/server/autoserv_parser.py
[modify] https://crrev.com/e9fd557506c95a6fcca8c2e3e11f3556463287ac/server/autoserv
,
Sep 14 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f508f8d3e984ea2e0b033b84f65ebb76b7212de0
commit f508f8d3e984ea2e0b033b84f65ebb76b7212de0
Author: Paul Hobbs <phobbs@google.com>
Date: Thu Sep 14 18:36:34 2017
[autotest] scheduler: Add trace for HQE lifespan
At the end of an HQE's lifespan, create a trace object recording the duration of the HQE. By convention, the topmost span in a trace (which this is) should have an id of 0. We also use by convention a hash of the HQE's id as the trace id - this allows agents associated with this HQE to know what the traceId and parentSpanId values are without relying on global state, or having to be within a chromite.lib.cloud_trace.SpanStack contextmanager.
BUG= chromium:729199
TEST=None
Change-Id: I2ceba1f5bef87b292f01faf85009599284865fe0
Reviewed-on: https://chromium-review.googlesource.com/655798
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
[modify] https://crrev.com/f508f8d3e984ea2e0b033b84f65ebb76b7212de0/scheduler/scheduler_models.py
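The convention in that commit message (root span id 0, trace id derived from the HQE id) can be sketched roughly as follows. This is not the code from scheduler_models.py and does not use chromite.lib.cloud_trace: the choice of MD5 as the hash, the helper names, and the span dictionary fields are all assumptions made for illustration.

```python
import hashlib

# By the convention described in the commit message, the topmost span
# of a trace has an id of 0.
ROOT_SPAN_ID = 0


def hqe_trace_id(hqe_id):
    """Derive a deterministic trace id from an HQE id.

    Assumption: MD5 is used here only because it yields a 32-character
    hex string; the commit message does not say which hash is used.
    """
    return hashlib.md5(str(hqe_id).encode("utf-8")).hexdigest()


def hqe_root_span(hqe_id, start_time, end_time):
    """Build a (hypothetical) root span covering the HQE's lifespan."""
    return {
        "traceId": hqe_trace_id(hqe_id),
        "spanId": ROOT_SPAN_ID,
        "name": "HQE",
        "startTime": start_time,
        "endTime": end_time,
    }


def child_span(hqe_id, span_id, name, start_time, end_time):
    """Build a span for an agent working on behalf of the HQE.

    The agent only needs the HQE id to find its trace and parent span,
    with no shared global state.
    """
    return {
        "traceId": hqe_trace_id(hqe_id),
        "spanId": span_id,
        "parentSpanId": ROOT_SPAN_ID,
        "name": name,
        "startTime": start_time,
        "endTime": end_time,
    }
```

The point of the deterministic id is that a separate process, such as autoserv, can attach its spans to the same trace as long as it knows the HQE id.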
,
Sep 29 2017
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/da1627a005f0a1e0c4a05c34f057fca786d442bb
commit da1627a005f0a1e0c4a05c34f057fca786d442bb
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Sep 29 20:11:01 2017
,
Feb 16 2018
What work is left here?
,
Feb 16 2018
Blocking this bug on issue 799633 to bring cloudtrace to lucifer as an FYI -- this will break in the new (lucifer) world, until that bug is fixed.
,
Mar 29 2018
Any further work with cloud trace / events from scheduler should be owned by Allen.
,
Mar 30 2018
,
Jun 12 2018
Cloud trace done; the swarming UI also exposes other parts of the job lifecycle (e.g. swarming task pending).
Comment 1 by pprabhu@chromium.org
, Jun 2 2017