New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 721224 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 778414



Sign in to add a comment

Split CL submission stage on paladin master (for better metrics around time taken in submission)

Project Member Reported by akes...@chromium.org, May 11 2017

Issue description

CQ slave run time has been creeping up, especially in the tail. http://shortn/_6L20Qr0aq1

We are quite sensitive to the tail, because the master (generally) needs to wait for the slowest slave to complete in any batch.

Not surprisingly, builds that are failing tend to be much longer.
http://shortn/_vndIz8HiwG

99% of passing slaves took < 3 hours, even with recent slowdowns. I'd be nervous about cutting timeout below (or too close to) the 99th passing percentile, because we have ~100 bots per run. But a timeout around 4 hours sounds safe enough, and will kill a lot more failing-anyway builds than passing ones.
 
possible counterpoint: there's a strange inversion in which the duration of the CQ master is actually higher on passing runs than failing runs: http://shortn/_CYsupHQlhv

I'm not entirely sure what this data is trying to say, but it is probably related to the CQ self destruct, which is allowing doomed CQ master runs to fail fast.
> [ ... ] the duration of the CQ master is actually higher on passing runs [ ... ]

There's also the recent results suggesting that the time it takes
to commit large batches of CLs is significant.

So, the naive model (good enough to explain at least some builds)
looks like this:
   T(success) = max(T(all slaves)) + T(commit)
   T(failure) = min(T(failing slaves))

A model like that would probably predict some inversion.  But
the formula for failure time doesn't capture the possibility
that some CLs only depend on certain slaves to be evaluated.
Put another way:  If some the outcome of some CL depends on
the outcome of a subset of slaves, then
    T(failure) = min(T(the subset of slaves))

If the failure for the subsets includes a long run time (e.g.
it fails with a timeout), then the inversion phenomenon won't
occur in that case.

So, lowering the timeout should help sometimes.  As long as it
doesn't hurt (too much), it's worth doing.

Cc: nxia@chromium.org pprabhu@chromium.org
Actually, there's already a 4.5 (default) timeout configured for all the paladins. My question is why it isn't being respected...

http://shortn/_ZAbfTGPMTi (slave distribution with the 4.5 hour threshold marked)
http://shortn/_2Z1Qle1js3 (master cqcompletion stage with 4.5 hour threshold marked)


+pprabhu who worked on the timeout logic. Any idea why the timeout isn't being respected?

+nxia also
Improved graph for the slaves (excluding the master) : http://shortn/_xLPFS8VZlH

Apparently some of that >4.5 stuff in the tails was actually master runs being longer.

Maybe the timeout is being (mostly) respected.
The >4.5 times on master cqcompletion are probably because of long runs that also spend a long time in cqcompletions submitting stuff. (that stat would be easier to understand if we had a separate stage for "wait for slaves" vs. "submit+reject changes")

Ok, I'm convinced now that the timeout is probably being respected.


I don't see an AI here wrt timeout logic.
It would help to split out the CL submission stage to get better stats.

Comment 7 by aut...@google.com, Jun 2 2017

Labels: -current-issue
Owner: nxia@chromium.org
Summary: Split CL submission stage on paladin master (for better metrics around time taken in submission) (was: Shorten CQ slave timeout to ~4 hours)
Updating subject to match the current status of the bug.

At this point, this is a low priority RFC.
Assigning to nxia@ to triage / commit / tell us we're wrong.

Comment 9 by nxia@chromium.org, Jun 7 2017

I looked into how to split wait_for_completion & submit_CLs before, and found it not an easy change. It involves refactor on HandleFailure & HandleSuccess for all completion stages. If we want to make the change, we can have a separate stage like HandleChangesStage, in which stage the master build will fetch the statuses of all slave builds and generates BuilderStatues again to triage failures and submit CLs. 
Labels: -Pri-2 Chase-Pending Pri-1
Owner: ----
Last week, it took us ~36 hours to detect a GoB submission failure on the builders becuase slave failures were masking it.
Labels: -Chase-Pending Chase
Owner: nxia@chromium.org
Status: Assigned (was: Untriaged)

Comment 12 by nxia@chromium.org, Oct 25 2017

Blockedon: 778414
tryjob testing is blocked on  crbug.com/778414 
CLs under review.
Project Member

Comment 14 by bugdroid1@chromium.org, Nov 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/6c381658e592d0f5abbbe022fba53368dc610b05

commit 6c381658e592d0f5abbbe022fba53368dc610b05
Author: Ningning Xia <nxia@chromium.org>
Date: Wed Nov 01 23:47:14 2017

builder_status_lib: create BuilderStatusesFetcher to fetch BuilderStatus

First step of spliting the completion_stage into
wait_for_slaves_to_complete_stage and handle_changes_stage. Create
BuilderStatusesFetcher so it can be reused in different stages.

BuildSpecsManager should only wait for slaves to complete or timeout and
then return, shouldn't host the logic to the fetch BuilderStatus of
the slaves.

BUG= chromium:721224 
TEST=unit_tests

Change-Id: Ieaae8ab7fc34bd3ed1ee673e75a993ed1533fcc9
Reviewed-on: https://chromium-review.googlesource.com/737519
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/scripts/cbuildbot.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/manifest_version_unittest.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/manifest_version.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/stages/completion_stages_unittest.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/cbuildbot/build_status.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/lib/builder_status_lib.py
[modify] https://crrev.com/6c381658e592d0f5abbbe022fba53368dc610b05/lib/builder_status_lib_unittest.py

Project Member

Comment 15 by bugdroid1@chromium.org, Nov 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/f3c8e7fb80be88413061268ca6afc19ffacd8062

commit f3c8e7fb80be88413061268ca6afc19ffacd8062
Author: Ningning Xia <nxia@chromium.org>
Date: Wed Nov 01 23:47:15 2017

separate the handling_changes logic from CommitQueueCompletionStage

Separate the logic of handling (submitting/forgiving/rejecting) changes
(CLs) from CommitQueueCompletionStage to
CommitQueueHandlingChangesStage.

BUG= chromium:721224 
TEST=unit_tests

Change-Id: Icd6005e8fa5976a17c2676ff2c5d582044807c2e
Reviewed-on: https://chromium-review.googlesource.com/740912
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages_unittest
[modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/builders/simple_builders_unittest.py
[modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/completion_stages_unittest.py
[add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages.py
[modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/builders/simple_builders.py
[modify] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/lib/builder_status_lib.py
[add] https://crrev.com/f3c8e7fb80be88413061268ca6afc19ffacd8062/cbuildbot/stages/handle_changes_stages_unittest.py

Status: Fixed (was: Assigned)

Comment 17 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)

Comment 18 by dchan@chromium.org, Jan 23 2018

Status: Fixed (was: Archived)

Sign in to add a comment