New issue
Advanced search Search tips

Issue 800577 link

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature



Sign in to add a comment

CL changes during a CQ run should not fail the CQ run.

Reported by jrbarnette@chromium.org, Jan 10 2018

Issue description

If a CL changes after the CQ picks it up for testing, and before
the CL is finally merged, the CommitQueueHandleChanges stage will
detect this, and refuse to submit the CL.  That produces an error
message like this:
    Could not submit ejcaruso:857590:899e18dd, error: CL:857590 was modified while the CQ was in the middle of testing it. Patch set 2 was uploaded.

This will turn the stage red, which in turn makes the CQ run be
red.  However, the event is _not_ a CQ failure, and if the CQ run
was otherwise successful, any unchanged CLs will be committed.

Since all red CQ runs normally must be annotated, the fact that
this event turns the stage red is a nuisance, and source of
confusion to the uninitiated (which means, almost any dev stuck
with Sheriff duty).

It would be helpful if the CQ/system behavior were different, e.g.
  * Don't turn the CQ red for the event (a potential problem, since
    these events aren't entirely harmless, since they block us from
    creating prebuilts).
  * Change go/som so that these events don't show up there.
  * Some other change so that either a) sheriffs intuitively know
    what to do, or b) sheriffs never see this, and so don't _need_
    to know.

 
What's the proposed behaviour?  I guess this applies to any case with a partial commit (since missing one change here, could actually eliminate all the changes based on dependencies).

For low frequency events where there is actually some potential negative side effects where we can annotate in SOM (which we should be able to do from classifying the logs) and have sheriffs easily ignore, I'd probably suggest that because it brings awareness to the fact that there is something exceptional going on, and that it's not serious, so that if it keeps happening, it can be raised in priority.
> What's the proposed behaviour?

I'm not married to any specific behavior.  The high-level requirement
is as stated:  Either a) make it obvious for sheriffs what they should
do, or b) hide the event from sheriffs well enough that they don't
ordinarily encounter it.

Status: Unconfirmed (was: Untriaged)
This keeps coming up from time to time, and I keep switching sides :D

There is a reason the CQ is red in this case -- since what the CQ tested (all the CLs in the run) -- is not what was actually submitted, the CQ will not publish prebuilts (it'll update the prebuilts, but then not update the binhost, so developers will be using stale binhosts).

This is not an exceptional condition in itself, but if you get this behavior for too long, developers will be left with really stale binhosts.

This situation now happens more often due to self abort of the CQ, and it is important to distinguish it from a green run, where the prebuilts are published.
Summary: CL changes during a CQ run should be less confusing. (was: CL changes during a CQ run shouldn't turn the CQ red)
The thoughts in c#3, kinda resonate with me.  

If everything was submitted then I think we're all in agreement as it being green (pass).  Similarly, if nothing gets submitted, then it's clear to be a red (fail) build.

When we have something in between, should it be:
- red*
- green*
- yellow
- something else

I personally think it should be red*.  The build wasn't totally successful, but there's some annotation/notes to indicate that there was at least a partial success and the sheriffs investigations should be directed and/or limited.  The reason I vote for red* is the reason I gave in c#1, is because if it keeps happening, I'd like someone to be aware of that fact, even if they choose to ignore it.  And by someone, I mean sheriffs, since it's their explicit job to be on top of such things.
> And by someone, I mean sheriffs

In this particular case, the sheriffs are the wrong people:  It's not
realistic to require them to have the expertise to know what to do
with "prebuilts haven't been published".  That's the responsibility of
Cros Infra, who _are_ required to understand, because we have to maintain
the code that makes prebuilts happen.

Additionally, requiring annotation in this specific case is a fool's errand:
the system knows and reports exactly what went wrong, which is that specific
CLs weren't committed.  The report from the system is accurate enough that
we shouldn't require any human being to go and enter the very same
information in the annotation tool.

There's a case to be made that the absence of prebuilts is a problem that
should get surfaced and, when necessary, dealt with.  *This bug is not for
that discussion*.  This bug is about solving a more limited problem:
  * Don't confuse the sheriffs with stuff they don't need to know or
    understand.
  * Don't ask anyone to engage in make-work annotation.

So, actions (2 separate bugs?):

 - Don't emit s-o-m tickets for runs that only failed due to change-was-modified
 - Have master-paladin insert the cidb annotation on its own for change-was-modified sparing the need for deputy to do so 
> So, actions (2 separate bugs?):

Potentially that's two separate bugs.  However, it's not clear
that we should do both; depending on our specific choices, one
fix might obviate the other.  So, my preference would be to hold
this open until someone takes it with a commitment to doing
_something_.  Then, once we see what we did, we can decide whether
to file feature requests for other work, or just close this as
"now it's good enough."

Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
Summary: CL changes during a CQ run should not fail the CQ run. (was: CL changes during a CQ run should be less confusing.)
It should be safe to let the CQ master pass, even if some CLs were updated during the CQ run, and thus can't be submitted.

I've pushed back against that in the past because of binary prebuilt handling. However, I was taught a bit more about prebuilts and have changed my opinion.

Status: Untriaged (was: Unconfirmed)
I just had a conversation with dgarrett@, and there's some
enhancements to make regarding his statement:
  * As implemented, failing the CQ causes pre-builts not
    to be uploaded.  Apparently, failing the CQ is the only
    blessed way to stop the prebuilts from being uploaded.
  * Once upon a time, in a universe that no longer exists,
    it was necessary to block uploading prebuilts if the
    code that built them wasn't committed to the tree.
  * So long as we lived in that other universe, it was
    therefore necessary that the CQ fail when any CL failed
    to be committed.

dgarrett@ has learned that the universe we once lived in ceased
to exist some time ago.  It is now safe to upload prebuilts even
if some code changes don't get committed.

So, the proposed fix is:  Don't turn the CQ red because one or more
CLs fail to be committed.

Sign in to add a comment