Split CL submission stage on paladin master (for better metrics around time taken in submission) |
|||||||||
Issue descriptionCQ slave run time has been creeping up, especially in the tail. http://shortn/_6L20Qr0aq1 We are quite sensitive to the tail, because the master (generally) needs to wait for the slowest slave to complete in any batch. Not surprisingly, builds that are failing tend to be much longer. http://shortn/_vndIz8HiwG 99% of passing slaves took < 3 hours, even with recent slowdowns. I'd be nervous about cutting timeout below (or too close to) the 99th passing percentile, because we have ~100 bots per run. But a timeout around 4 hours sounds safe enough, and will kill a lot more failing-anyway builds than passing ones.
,
May 11 2017
> [ ... ] the duration of the CQ master is actually higher on passing runs [ ... ]
There's also the recent results suggesting that the time it takes
to commit large batches of CLs is significant.
So, the naive model (good enough to explain at least some builds)
looks like this:
T(success) = max(T(all slaves)) + T(commit)
T(failure) = min(T(failing slaves))
A model like that would probably predict some inversion. But
the formula for failure time doesn't capture the possibility
that some CLs only depend on certain slaves to be evaluated.
Put another way: If some the outcome of some CL depends on
the outcome of a subset of slaves, then
T(failure) = min(T(the subset of slaves))
If the failure for the subsets includes a long run time (e.g.
it fails with a timeout), then the inversion phenomenon won't
occur in that case.
So, lowering the timeout should help sometimes. As long as it
doesn't hurt (too much), it's worth doing.
,
May 12 2017
Actually, there's already a 4.5 (default) timeout configured for all the paladins. My question is why it isn't being respected... http://shortn/_ZAbfTGPMTi (slave distribution with the 4.5 hour threshold marked) http://shortn/_2Z1Qle1js3 (master cqcompletion stage with 4.5 hour threshold marked) +pprabhu who worked on the timeout logic. Any idea why the timeout isn't being respected? +nxia also
,
May 12 2017
Improved graph for the slaves (excluding the master) : http://shortn/_xLPFS8VZlH Apparently some of that >4.5 stuff in the tails was actually master runs being longer. Maybe the timeout is being (mostly) respected.
,
May 12 2017
The >4.5 times on master cqcompletion are probably because of long runs that also spend a long time in cqcompletions submitting stuff. (that stat would be easier to understand if we had a separate stage for "wait for slaves" vs. "submit+reject changes") Ok, I'm convinced now that the timeout is probably being respected.
,
May 12 2017
I don't see an AI here wrt timeout logic. It would help to split out the CL submission stage to get better stats.
,
Jun 2 2017
,
Jun 5 2017
Updating subject to match the current status of the bug. At this point, this is a low priority RFC. Assigning to nxia@ to triage / commit / tell us we're wrong.
,
Jun 7 2017
I looked into how to split wait_for_completion & submit_CLs before, and found it not an easy change. It involves refactor on HandleFailure & HandleSuccess for all completion stages. If we want to make the change, we can have a separate stage like HandleChangesStage, in which stage the master build will fetch the statuses of all slave builds and generates BuilderStatues again to triage failures and submit CLs.
,
Oct 16 2017
Last week, it took us ~36 hours to detect a GoB submission failure on the builders becuase slave failures were masking it.
,
Oct 16 2017
,
Oct 25 2017
,
Oct 30 2017
CLs under review.
,
Nov 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/6c381658e592d0f5abbbe022fba53368dc610b05 commit 6c381658e592d0f5abbbe022fba53368dc610b05 Author: Ningning Xia <nxia@chromium.org> Date: Wed Nov 01 23:47:14 2017 builder_status_lib: create BuilderStatusesFetcher to fetch BuilderStatus First step of spliting the completion_stage into wait_for_slaves_to_complete_stage and handle_changes_stage. Create BuilderStatusesFetcher so it can be reused in different stages. BuildSpecsManager should only wait for slaves to complete or timeout and then return, shouldn't host the logic to the fetch BuilderStatus of the slaves. BUG= chromium:721224 TEST=unit_tests Change-Id: Ieaae8ab7fc34bd3ed1ee673e75a993ed1533fcc9 Reviewed-on: https://chromium-review.googlesource.com/737519 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/scripts/cbuildbot.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/manifest_version_unittest.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/stages/completion_stages.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/manifest_version.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/stages/completion_stages_unittest.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/build_status.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/lib/builder_status_lib.py [modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/lib/builder_status_lib_unittest.py
,
Nov 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/f3c8e7fb80be88413061268ca6afc19ffacd8062 commit f3c8e7fb80be88413061268ca6afc19ffacd8062 Author: Ningning Xia <nxia@chromium.org> Date: Wed Nov 01 23:47:15 2017 separate the handling_changes logic from CommitQueueCompletionStage Separate the logic of handling (submitting/forgiving/rejecting) changes (CLs) from CommitQueueCompletionStage to CommitQueueHandlingChangesStage. BUG= chromium:721224 TEST=unit_tests Change-Id: Icd6005e8fa5976a17c2676ff2c5d582044807c2e Reviewed-on: https://chromium-review.googlesource.com/740912 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Paul Hobbs <phobbs@google.com> [add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages_unittest [modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/completion_stages.py [modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/builders/simple_builders_unittest.py [modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/completion_stages_unittest.py [add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages.py [modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/builders/simple_builders.py [modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/lib/builder_status_lib.py [add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages_unittest.py
,
Nov 6 2017
,
Jan 22 2018
,
Jan 23 2018
|
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by akes...@chromium.org
, May 11 2017