New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 708728 link

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: May 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

add a "slowest CQ slave" counter metric

Project Member Reported by akes...@chromium.org, Apr 5 2017

Issue description

Our weekly email has a count of "slowest slave per CQ run".

Would be good to have a monarch metric about this too. If we emit it from the CQ master, then we can also have it be aware of CQ self-destruct logic so that we increment only the longest pole for that run (regardless of if there might be other slaves still running that will pass or fail later).

So, suggestion: Completion stage on the master determines what is the final slave that it received any result from. So for instance if foo-paladin fails and CQ master is ready to early-abort, then the long pole is foo-paladin (even if bar-paladin is still running). Completion stage then increments the long_pole counter with build_config='foo-paladin'

Q: what to do in case of tie, if multiple slaves returned in the same completion loop iteration? probably increment for all the configs that were part of the tie.
 

Comment 1 by aut...@google.com, Apr 18 2017

Labels: -current-issue
Owner: pho...@chromium.org
+ phobbs for metrics
Owner: nxia@chromium.org
Status: Assigned (was: Untriaged)

Comment 3 by nxia@chromium.org, May 11 2017

If foo-paladin is relevant to all CLs, when it fails, CQ master will trigger self-destruction. foo-paladin isn't actually the slowest slave. maybe we should treat the running slaves as the slow ones? 

Comment 4 by nxia@chromium.org, May 11 2017

Cc: akes...@chromium.org
#3 in that scenario, the run time for foo-paladin was still the limiting factor for the run time of that CQ run. We would still gain if foo-paladin failed faster. That's why I like this metric, it tells us about all the other "long poles" that we don't suspect just by looking at the slowest overall slaves.
Owner: akes...@chromium.org
Status: Started (was: Assigned)
I have a CL in progress for this anyway...
c#5 I'm not sure if I agree.  That ends up being very CL dependent and potentially adds noise.  If we do include self-destructed, can we have a metric indicating if that run was destructed or not so we can exclude it if we want to later on?
#7 sure can take that as a follow up. my current CL is already in the CQ.

counterpoint: if we are self destructing that often on a different random terminal slave each time, then that tells us that there is no particular slave we can focus on to get a speed increase.
Well I mainly want to just get rid of the bimodality caused by self-destruction.

If I understand it right, you're potentially causing the fastest paladin run to show up as the slowest (since it failed first).  It's just adding noise to the data.

Even if your CL is in the CQ, can we get the metric right from the beginning instead of having some with and without the field.
I don't see that as noise. If the data tells us that 20% of the time, foo-paladin was the longest waited-for slave (even if it is always the fastest to complete) then that tells us that we can speed up 20% of CQ runs by making foo-paladin faster. In fact, this very discussion is suggesting to me that we might want to intentionally add fast-failing-broad-but-shallow-testing slaves for this very reason.
That sounds fair, but let's keep it marked separately to keep the data clean.  And along those lines, lets get our data right from the beginning instead of having a mishmash of fields which seems to cause us issues with Monarch.
https://chromium-review.googlesource.com/c/502747/

I'll take a look at the point of metric generation and see if that scope can easily determine whether we are self destructing.
Project Member

Comment 13 by bugdroid1@chromium.org, May 17 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/435496ac02aabf487d88831c5423bcfc2d34689a

commit 435496ac02aabf487d88831c5423bcfc2d34689a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed May 17 21:27:46 2017

build_status: add a last-slave-to-complete metric

BUG= chromium:708728 
TEST=new unit test

Change-Id: I645ba99f38ab3535d5dda4ff6f39eada4fe40c7c
Reviewed-on: https://chromium-review.googlesource.com/502747
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>
Commit-Queue: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/lib/constants.py
[modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/cbuildbot/build_status.py

Status: Fixed (was: Started)
Issue 721497 has been merged into this issue.
Labels: VerifyIn-61

Comment 17 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Sign in to add a comment