
Issue 909895


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: 4
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Hotlists containing this issue:
CrOSParallelCQ



Cancel builds of past patchsets

Project Member Reported by no...@chromium.org, Nov 28

Issue description

CQ currently does not cancel tryjobs when a new patchset is uploaded.
This is a well-known problem and not a big deal.

From what I understand, it becomes critical for CrOS, where builds are very expensive.
 
Labels: CrOSParallelCQ
EstimatedDays: 4
The biggest complexity is communicating this to chromium-devs who may rely on current CQ behavior.
Actually, there is another complexity, which is primarily due to current CQ design: when does cancelling happen?

Because CQ has no reliable local state (like Cloud Datastore), one is tempted to cancel builds when CQ realizes that the current CQ attempt must be stopped.

However, if a user edits the CL description (for example, to add a CQ directive like "No-Tree-Checks: True"), then, depending on Gerrit config, this will require a new CQ attempt. The old tryjobs would then get cancelled, which isn't desirable.

So, instead we need some kind of cron job doing sweeps:
for build in buildbucket.getall(status=(SCHEDULED, STARTED), triggered_by=CQ):
  if gerrit.has_non_trivial_patches_after(build.cl, build.patchset):
    build.cancel()

Now that I wrote this, it doesn't seem so difficult after all.
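
For illustration only, the sweep above could be fleshed out roughly as follows. This is a minimal in-memory sketch, not CQ or Buildbucket code: `Build`, `Status`, `TRIVIAL_KINDS`, `has_non_trivial_patchset_after` and `sweep` are all hypothetical names, the patchset "kind" strings are modeled on Gerrit's change kinds, and which kinds count as trivial is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Status(Enum):
    SCHEDULED = "SCHEDULED"
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"


@dataclass
class Build:
    """Hypothetical stand-in for a build record."""
    build_id: int
    cl: str
    patchset: int
    status: Status
    triggered_by: str = "CQ"


# Assumption: patchsets of these kinds do not invalidate earlier tryjobs,
# e.g. a patchset that only changes the CL description.
TRIVIAL_KINDS = {"NO_CODE_CHANGE", "NO_CHANGE"}


def has_non_trivial_patchset_after(kinds: Dict[int, str], patchset: int) -> bool:
    """True if any later patchset of the CL is not of a trivial kind."""
    return any(ps > patchset and kind not in TRIVIAL_KINDS
               for ps, kind in kinds.items())


def sweep(builds: List[Build],
          kinds_by_cl: Dict[str, Dict[int, str]]) -> List[int]:
    """Cancel pending CQ builds made obsolete by a newer non-trivial patchset.

    Returns the ids of the builds that were cancelled.
    """
    canceled = []
    for build in builds:
        if build.triggered_by != "CQ":
            continue  # only touch builds that CQ itself triggered
        if build.status not in (Status.SCHEDULED, Status.STARTED):
            continue  # finished builds cost nothing more
        if has_non_trivial_patchset_after(kinds_by_cl.get(build.cl, {}),
                                          build.patchset):
            build.status = Status.CANCELED
            canceled.append(build.build_id)
    return canceled
```

Note that the decision depends only on the builds and on Gerrit's patchset history, not on whether CQ believes an attempt is still running.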
It sounds like canceling builds where only the CL description changes is undesirable because CQ would like to reuse them. This means CQ must know which patchsets it would and wouldn't reuse. So why can't CQ cancel builds of those patchsets that it will never reuse?
>  So why can’t CQ cancel builds of those patchsets that it will never reuse?

It can, that's what the cron job is for, and it would be inside CQ, not outside.
My point was that given today's design, CQ doesn't always know whether an attempt is really cancelled or it was a Gerrit fluke. Nor does CQ guarantee that it formally cancels an attempt.


===  If you care about more detail ===

Due to Gerrit flukes and eventual consistency (e.g., an accidental 404, stale revision or label data due to a stale index or replica, delayed copying of the CQ score when inserting a new patchset), CQ stops all work on an attempt, but without posting any message to Gerrit or doing anything else.

Why? Because if it was a Gerrit fluke, the user expects CQ not to start a new attempt, but to continue the existing one. So, after a few minutes, CQ may end up just continuing the prior attempt as if the fluke didn't even happen.

Similarly, if the CQ process gets restarted (e.g., a new CQ is deployed), it sometimes takes 2 minutes to re-create the internal state of CQ. If the user creates a new patchset during this time, the prior attempt will never be formally "cancelled" by CQ.
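
A toy model of the silent pause described above, as a sketch under stated assumptions rather than actual CQ logic. `Attempt`, `AttemptState` and `on_poll` are hypothetical names, and `observed_patchset=None` stands in for a failed or stale Gerrit read such as an accidental 404.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AttemptState(Enum):
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"  # work stopped silently; nothing posted to Gerrit


@dataclass
class Attempt:
    """Hypothetical record of one CQ attempt on a CL patchset."""
    cl: str
    patchset: int
    state: AttemptState = AttemptState.RUNNING


def on_poll(attempt: Attempt,
            observed_patchset: Optional[int],
            cq_label_set: bool) -> Attempt:
    """Update the attempt from one (possibly unreliable) Gerrit observation."""
    consistent = observed_patchset == attempt.patchset and cq_label_set
    if consistent:
        # Either nothing went wrong, or an earlier fluke has cleared:
        # continue the prior attempt as if nothing happened.
        attempt.state = AttemptState.RUNNING
    else:
        # Could be a fluke or a real new patchset; the two are not
        # distinguished here, so the attempt is only paused, never
        # formally cancelled.
        attempt.state = AttemptState.PAUSED
    return attempt
```

In this model an attempt superseded by a real new patchset stays `PAUSED` indefinitely and its builds are never cancelled by the attempt logic itself, which is the gap a separate sweep would cover.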
makes sense
