add a "slowest CQ slave" counter metric |
|||||||
Issue descriptionOur weekly email has a count of "slowest slave per CQ run". Would be good to have a monarch metric about this too. If we emit it from the CQ master, then we can also have it be aware of CQ self-destruct logic so that we increment only the longest pole for that run (regardless of if there might be other slaves still running that will pass or fail later). So, suggestion: Completion stage on the master determines what is the final slave that it received any result from. So for instance if foo-paladin fails and CQ master is ready to early-abort, then the long pole is foo-paladin (even if bar-paladin is still running). Completion stage then increments the long_pole counter with build_config='foo-paladin' Q: what to do in case of tie, if multiple slaves returned in the same completion loop iteration? probably increment for all the configs that were part of the tie.
,
Apr 18 2017
,
May 11 2017
If foo-paladin is relevant to all CLs, when it fails, CQ master will trigger self-destruction. foo-paladin isn't actually the slowest slave. maybe we should treat the running slaves as the slow ones?
,
May 11 2017
,
May 11 2017
#3 in that scenario, the run time for foo-paladin was still the limiting factor for the run time of that CQ run. We would still gain if foo-paladin failed faster. That's why I like this metric, it tells us about all the other "long poles" that we don't suspect just by looking at the slowest overall slaves.
,
May 11 2017
I have a CL in progress for this anyway...
,
May 11 2017
c#5 I'm not sure if I agree. That ends up being very CL dependent and potentially adds noise. If we do include self-destructed, can we have a metric indicating if that run was destructed or not so we can exclude it if we want to later on?
,
May 11 2017
#7 sure can take that as a follow up. my current CL is already in the CQ. counterpoint: if we are self destructing that often on a different random terminal slave each time, then that tells us that there is no particular slave we can focus on to get a speed increase.
,
May 11 2017
Well I mainly want to just get rid of the bimodality caused by self-destruction. If I understand it right, you're potentially causing the fastest paladin run to show up as the slowest (since it failed first). It's just adding noise to the data. Even if your CL is in the CQ, can we get the metric right from the beginning instead of having some with and without the field.
,
May 11 2017
I don't see that as noise. If the data tells us that 20% of the time, foo-paladin was the longest waited-for slave (even if it is always the fastest to complete) then that tells us that we can speed up 20% of CQ runs by making foo-paladin faster. In fact, this very discussion is suggesting to me that we might want to intentionally add fast-failing-broad-but-shallow-testing slaves for this very reason.
,
May 11 2017
That sounds fair, but let's keep it marked separately to keep the data clean. And along those lines, lets get our data right from the beginning instead of having a mishmash of fields which seems to cause us issues with Monarch.
,
May 11 2017
https://chromium-review.googlesource.com/c/502747/ I'll take a look at the point of metric generation and see if that scope can easily determine whether we are self destructing.
,
May 17 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/435496ac02aabf487d88831c5423bcfc2d34689a commit 435496ac02aabf487d88831c5423bcfc2d34689a Author: Aviv Keshet <akeshet@chromium.org> Date: Wed May 17 21:27:46 2017 build_status: add a last-slave-to-complete metric BUG= chromium:708728 TEST=new unit test Change-Id: I645ba99f38ab3535d5dda4ff6f39eada4fe40c7c Reviewed-on: https://chromium-review.googlesource.com/502747 Tested-by: Aviv Keshet <akeshet@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> Commit-Queue: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/cbuildbot/build_status_unittest.py [modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/lib/constants.py [modify] https://crrev.com/435496ac02aabf487d88831c5423bcfc2d34689a/cbuildbot/build_status.py
,
May 24 2017
,
Jun 12 2017
Issue 721497 has been merged into this issue.
,
Aug 1 2017
,
Jan 22 2018
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by aut...@google.com
, Apr 18 2017Owner: pho...@chromium.org