
Issue 729199

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature
okr

Blocked on:
issue 751231
issue 799633




instrument HQE lifecycle (autotest: add job lifecycle metrics)

Project Member Reported by pprabhu@chromium.org, Jun 2 2017

Issue description

We currently have a few ways to measure the throughput of jobs through our system.
- scheduler tick tells us how fast we're dealing with micro tasks.
- # jobs scheduled / completed tells us the actual throughput wrt jobs.

What's missing is a way to measure the latency of each job going through the system. Arguably, this is even more important to the end-user of our system. The questions to answer are:

- Once a job is created, how long does it take to finish?
- How is this time split between the various steps?

idea: track the time spent in each stage of the HQE life-cycle.

Something along the lines of apache_log_metrics. Maintain an audit log at a well-known location where the scheduler appends an entry (hqe_id, state, timestamp) every time the HQE state changes.
  https://cs.corp.google.com/search/?q=set_status.*HostQueueEntry.Status&m=100&sq=package:chromeos+file:src/third_party/autotest/files&type=cs

A separate metrics reporting daemon consumes this log and reports the amount of time spent between each state transition for each job separately.
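
A minimal sketch of what the consuming side could compute, not the actual implementation. It assumes each audit log line is a comma-separated `hqe_id,state,timestamp` record with the timestamp in seconds; that line format, the function names `parse_line` and `stage_durations`, and the state names in the usage note are illustrative assumptions, not taken from the scheduler.

```python
from collections import defaultdict


def parse_line(line):
    """Parse one hypothetical audit log line: 'hqe_id,state,timestamp'."""
    hqe_id, state, timestamp = line.strip().split(",")
    return int(hqe_id), state, float(timestamp)


def stage_durations(entries):
    """Compute the time each HQE spent in each state.

    entries: iterable of (hqe_id, state, timestamp) tuples, in any order.
    Returns {hqe_id: [(state, seconds_spent_in_state), ...]}, ordered by
    time. The last recorded state of an HQE has no following transition,
    so it gets no duration.
    """
    by_hqe = defaultdict(list)
    for hqe_id, state, timestamp in entries:
        by_hqe[hqe_id].append((timestamp, state))

    durations = {}
    for hqe_id, events in by_hqe.items():
        # Order the transitions of this HQE by timestamp.
        events.sort(key=lambda event: event[0])
        durations[hqe_id] = [
            (state, next_ts - ts)
            for (ts, state), (next_ts, _) in zip(events, events[1:])
        ]
    return durations
```

For example, entries `(1, "Queued", 100.0)`, `(1, "Starting", 130.0)`, `(1, "Running", 135.0)` would yield 30 seconds in "Queued" and 5 seconds in "Starting" for HQE 1. A reporting daemon would then emit these per-state durations as metrics.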
 
Labels: -Type-Bug Type-Feature
Another request here would be one extra step (outside of the HQE status changes): When the HQE obtains the DUT. It's important to know how long the HQE was sitting around waiting for DUT assignment.
Status: Available (was: Untriaged)
One motivation behind this request is the (repeated) discussion around host_scheduler tick. From my analysis so far, I believe that host_scheduler tick is _not_ a bottleneck for the job lifecycle. (It may be eating up too much CPU because it's written inefficiently, e.g., one of the CLs on that bug fixes the large number of DB queries it was generating, and it may have other problems.)

I want a decisive metric for this: how is our job lifecycle doing, and what are the bottlenecks?
Cc: jrbarnette@chromium.org davidri...@chromium.org pho...@chromium.org
+ some iptaskforce people.
Do we currently have such a metric available from another source? (Perhaps suite_report?) Can someone point out the nearest approximation that exists today?

Also, I'd like your thoughts on the usefulness of such a metric.
We might be able to correlate AFE jobs with HQEs, but we don't have the HQE broken up much beyond special tasks (Provision/Reset) and the actual test invocation.

I'd love to have more information there.
Labels: infra-overhead

Comment 8 by aut...@google.com, Jun 12 2017

Labels: OKR
Labels: -Pri-3 Pri-1
Owner: pho...@chromium.org
Status: Assigned (was: Available)
Summary: instrument HQE lifecycle (autotest: add job lifecycle metrics) (was: autotest: add job lifecycle metrics)
This is Paul's P1 OKR.
Blockedon: 751231
Project Member

Comment 11 by bugdroid1@chromium.org, Aug 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f

commit 68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f
Author: Paul Hobbs <phobbs@google.com>
Date: Sat Aug 26 02:54:07 2017

[autotest] Refactor record_autoserv a bit

Pull out job status code for reuse.

BUG= chromium:729199 
TEST=None

Change-Id: I5b8b2f35b800e58eaeaebecabf6cef5e0de8f409
Reviewed-on: https://chromium-review.googlesource.com/625678
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/68d98595a03c0c7a0c95e284a5a3bc7c3ba0477f/server/autoserv

Project Member

Comment 12 by bugdroid1@chromium.org, Aug 26 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/e9fd557506c95a6fcca8c2e3e11f3556463287ac

commit e9fd557506c95a6fcca8c2e3e11f3556463287ac
Author: Paul Hobbs <phobbs@google.com>
Date: Sat Aug 26 02:54:08 2017

[autotest] Record to cloud trace in autoserv

BUG= chromium:729199 
TEST=None

Change-Id: I8cdc955ac11b9ed69ac84abb45b8c985ede67848
Reviewed-on: https://chromium-review.googlesource.com/625680
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/e9fd557506c95a6fcca8c2e3e11f3556463287ac/server/autoserv_parser.py
[modify] https://crrev.com/e9fd557506c95a6fcca8c2e3e11f3556463287ac/server/autoserv

Project Member

Comment 13 by bugdroid1@chromium.org, Sep 14 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/f508f8d3e984ea2e0b033b84f65ebb76b7212de0

commit f508f8d3e984ea2e0b033b84f65ebb76b7212de0
Author: Paul Hobbs <phobbs@google.com>
Date: Thu Sep 14 18:36:34 2017

[autotest] scheduler: Add trace for HQE lifespan

At the end of an HQE's lifespan, create a trace object recording the
duration of the HQE. By convention, the topmost span in a trace (which
this is) should have an id of 0. We also use by convention a hash of the
HQE's id as the trace id - this allows agents associated with this HQE
to know what the traceId and parentSpanId values are without relying on
global state, or having to be within a
chromite.lib.cloud_trace.SpanStack contextmanager.

BUG= chromium:729199 
TEST=None

Change-Id: I2ceba1f5bef87b292f01faf85009599284865fe0
Reviewed-on: https://chromium-review.googlesource.com/655798
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/f508f8d3e984ea2e0b033b84f65ebb76b7212de0/scheduler/scheduler_models.py
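
The convention described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: the commit message does not say which hash is used, so MD5 is an assumption here (chosen only because its hex digest is 32 characters long), and the helper names and dictionary layout are hypothetical. The point is that any agent that knows the HQE id can recompute the trace id and the parent span id without shared state.

```python
import hashlib

# By the convention in the commit message, the topmost span has id 0.
ROOT_SPAN_ID = 0


def hqe_trace_id(hqe_id):
    """Derive a deterministic trace id from an HQE id.

    Assumption: MD5 is used only as an example of a stable hash; the
    actual hash used by the scheduler is not stated in the commit message.
    """
    return hashlib.md5(str(hqe_id).encode("utf-8")).hexdigest()


def hqe_root_span(hqe_id, start_time, end_time):
    """Build a (hypothetical) record for the span covering the HQE lifespan."""
    return {
        "traceId": hqe_trace_id(hqe_id),
        "spanId": ROOT_SPAN_ID,
        "name": "HQE",
        "startTime": start_time,
        "endTime": end_time,
    }


def child_span(hqe_id, span_id, name):
    """Build a (hypothetical) child span for an agent working on this HQE.

    The agent needs only the HQE id to find its trace and parent span.
    """
    return {
        "traceId": hqe_trace_id(hqe_id),
        "spanId": span_id,
        "parentSpanId": ROOT_SPAN_ID,
        "name": name,
    }
```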

Project Member

Comment 14 by bugdroid1@chromium.org, Sep 29 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/da1627a005f0a1e0c4a05c34f057fca786d442bb

commit da1627a005f0a1e0c4a05c34f057fca786d442bb
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Sep 29 20:11:01 2017

Labels: -OKR okr
What work is left here?
Blockedon: 799633
As an FYI, blocking this bug on issue 799633 (bring cloudtrace to lucifer): this will break in the new (lucifer) world until that bug is fixed.
Owner: ayatane@chromium.org
Any further work with cloud trace / events from scheduler should be owned by Allen.
Labels: Hotlist-Lucifer
Status: Fixed (was: Assigned)
Cloud trace is done; the swarming UI also exposes other parts of the job lifecycle (e.g. swarming task pending).
