Issue 845941

Starred by 3 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 12
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature

Kernel CLs flagged as relevant when they shouldn't be

Project Member Reported by diand...@chromium.org, May 23 2018

Issue description

Forked from  bug #845643 

Basic gist / example is that <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1056074/> originally wasn't submitted because "edgar-paladin" failed.  ...but "edgar-paladin" doesn't use kernel 4.14, so there was really no reason not to submit the CL.

This happens a lot.  Any way to make this work better?
 
> "edgar-paladin" failed.  ...but "edgar-paladin" doesn't use kernel 4.14

The CQ already has logic that says (in essence) "if a CL doesn't contribute
to a particular builder, failures on that builder won't cause the CL to be
rejected."  I don't know why the cited CL was rejected, but if it was a
kernel change for a kernel that's not edgar's kernel, then the edgar failure
shouldn't have been the cause of rejection.
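
For illustration only, a minimal sketch of that exoneration idea; the function and data below are hypothetical, and the real logic lives in chromite's triage code:

    # Hypothetical sketch: a failing builder should only reject the CLs that
    # are relevant to that builder.
    def cls_rejected_by(failed_builder, relevant_cls_by_builder):
      return set(relevant_cls_by_builder.get(failed_builder, ()))

    # Example: if the 4.14 kernel CL were (correctly) not listed as relevant
    # to edgar-paladin, an edgar-paladin failure would not reject it.
    relevance = {'edgar-paladin': {'some-unrelated-cl'}}
    print('kernel-4.14-cl' in cls_rejected_by('edgar-paladin', relevance))  # False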

Labels: -Type-Feature -Pri-3 Pri-1 Type-Bug
If I'm reading it aright, here's one of the relevant edgar-paladin failures:
    https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3036

That run shows that CL 1056074 is actually listed under "DetectRelevantChanges".
So, the CQ believed that the CL contributed to the edgar build.  If edgar
doesn't use the 4.14 kernel, that would appear to be a bug.

Comment 3 by groeck@chromium.org, May 23 2018

The situation happens all the time. Non-scientifically, 90%+ of all kernel commits into chromeos-4.14 are rejected because of failures on devices not using a 4.14 kernel.

To double-confirm, I checked the build packages for edgar and found:

[ebuild  N     ] sys-kernel/chromeos-kernel-3_18-3.18-r2124

...so that confirms it's not using kernel 4.14.

---

Yeah, I thought it was a feature but as per  bug #845643  we thought maybe I was remembering wrong.
Summary: Kernel CLs flagged as relevant when they shouldn't be (was: Feature request: if the only failing CQ builders don't use a given kernel we should still be able to submit changes to it)
There is flat-out a bug here:  I also cross-checked (by running
'uname -a' on an edgar) and found that the kernel is 3.18.

The CL is plainly for the chromeos-4.14 kernel branch, and edgar is
just as plainly built with the 3.18 kernel.  The CL (in fact, the
entire large stack of CLs) shouldn't have appeared in the list of
relevant CLs for the build.

I'm going to guess that the problem is somehow specific to kernel
CLs, but someone who has the time to debug and diagnose needs to
be the one to determine that.

+shuqianz@ who knows more about how the code determines relevance
than I do.

Cc: jrbarnette@chromium.org
Owner: shuqianz@chromium.org
Status: Assigned (was: Untriaged)
Actually +shuqianz@ this time.

This continues to impede progress as exemplified in crosreview.com/1070468.  Hopefully we can get some status updates this week on it.

Comment 7 by groeck@chromium.org, May 29 2018

#6: Yes, I'm starting to contemplate chumping this merge; I am seriously getting behind upstream releases. Why bother with the CQ if it is so badly broken? I don't really want to chump it, but this is getting really annoying.
On top of that, any _real_ problems are getting drowned in the noise.

The logic to detect whether a CL is relevant or not was devised by yjhong@, who has already left the team. From my understanding, relevance is determined at the Portage level.

    untriaged_changes = set(changes)
    irrelevant_changes = set()

    # Changes that modify projects used in building are always relevant.
    untriaged_changes -= cls.GetChangesToBuildTools(changes, manifest)

    if packages_under_test is not None:
      # Strip the version of the package in packages_under_test.
      cpv_list = [portage_util.SplitCPV(x) for x in packages_under_test]
      packages_under_test = ['%s/%s' % (x.category, x.package) for x in
                             cpv_list]

    # Handles overlay changes.
    # ClassifyOverlayChanges only handles overlays visible to this
    # build. For example, an external build may not be able to view
    # the internal overlays. However, in that case, the internal changes
    # have already been filtered out in CommitQueueSyncStage, and are
    # not included in |changes|.
    overlay_changes, irrelevant_overlay_changes = cls.ClassifyOverlayChanges(
        untriaged_changes, config, build_root, manifest, packages_under_test)
    untriaged_changes -= overlay_changes
    irrelevant_changes |= irrelevant_overlay_changes
Cc: shuqianz@chromium.org
Owner: athilenius@chromium.org
This needs to be triaged and assigned by the CI team; the test
team has expertise, but no longer has responsibility.

I don't see the bug, at first glance.

Why is this coming up now? This code hasn't changed since 2014.

> I don't see the bug, at first glance.

The issue is this:  Each CQ slave goes through the list of CLs, and
checks whether portage claims that the CL was included in the image
built for that slave.  The CLs found relevant are reported in CIDB.
If a CL is marked relevant when it shouldn't be, it has two related
bad effects:
 1) The main reason that the CQ determines relevance is so that if a
    CL isn't relevant to certain builders, failures in those builders
    won't cause the CL to be rejected.  If the list is wrong, CLs that
    could be committed will instead be rejected.
 2) Human beings need to be able to see the relevant CLs, because it's
    a more accurate blamelist.  That makes it easier to identify a bad
    CL by taking intersecting blamelists.
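
To illustrate item 2, a tiny example with hypothetical data: intersecting the relevant-CL lists (blamelists) of two failing builds narrows the suspects to the CLs both builds actually used.

    # Hypothetical data: each failing build reports the CLs relevant to it.
    blamelist_build_a = {'cl-kernel-merge', 'cl-chrome-uprev', 'cl-bad'}
    blamelist_build_b = {'cl-bad', 'cl-login-fix'}
    print(blamelist_build_a & blamelist_build_b)  # {'cl-bad'}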


> Why is this coming up now? This code hasn't changed since 2014.

I don't know; I think the best way to answer that would be to take a stab
at getting to a root cause.

> The issue is this:  Each CQ slave goes through the list of CLs, and
> checks whether portage claims that the CL was included in the image
> built for that slave.  The CLs found relevant are reported in CIDB.
> If a CL is marked relevant when it shouldn't be, it has two related
> bad effects:
>  1) The main reason that the CQ determines relevance is so that if a
>     CL isn't relevant to certain builders, failures in those builders
>     won't cause the CL to be rejected.  If the list is wrong, CLs that
>     could be committed will instead be rejected.
>  2) Human beings need to be able to see the relevant CLs, because it's
>     a more accurate blamelist.  That makes it easier to identify a bad
>     CL by taking intersecting blamelists.

I know that. I don't see the bug *in the code*.

> > Why is this coming up now? This code hasn't changed since 2014.
> 
> I don't know; I think the best way to answer that would be to take a stab
> at getting to a root cause.

That's easier said than done. Everything can't be a P1; I need to know why we think a 4 year old bug is important enough to drop other P1's to work on it.

> I know that. I don't see the bug *in the code*.

Ah.  That's clearer, then.

> That's easier said than done. Everything can't be a P1;
> I need to know why we think a 4 year old bug is important
> enough to drop other P1's to work on it.

The code is 4 years old, but that's not the same as the problem being
4 years old.  But I don't think that the age is relevant.
What's relevant is the impact right now.  That impact is real
enough:  Anyone trying to submit a CL for the 4.14 kernel is going
to run into false rejections that could slow them down.  Depending
on the scope of the problem, other kinds of CLs could also be
impacted; we may not know that without debug, though.

As for triage against other P1 problems:  I'm not positioned to know
how many devs are impacted (but it's at least three), or how much
they feel the pain.  I also don't have the information I'd want to
offer advice on trading this off against other problems, and I'm not
sure I really want to give such advice if asked.

Speaking for myself:  Long kernel stacks like the one mentioned in
comment #7 tend to make it hard to pin blame on a CL when there's
a CQ failure and the bad CL has to be found.  That problem will be
harder if the blamelists aren't trustworthy.  This doesn't happen
frequently, so for me, it's a low-frequency, high-impact event.

Labels: -Pri-1 Pri-2
Owner: dhanyaganesh@chromium.org
Dhanya, this is a nice follow-on to the change you just did. Please take a look and we can chat if this looks confusing.
I'm curious about the downgrade to Pri-2.  This is definitely impacting developer productivity and the ability for things to get through the CQ.  Why not Pri-1?
Because no one answered Comment 10.
> Because no one answered Comment 10.

Wait.  I answered comment 10.  And, I followed up to answer
questions about the response.

To be clear:  I'm not heavily impacted by this, but I _am_ impacted,
and anyone doing sheriff/deputy/bobby duty can be impacted.  Any time
there's a complicated failure in the CQ, and the blamelist is long, this
bug makes it harder to identify the real cause, because of the possibility
of false-positives in the blamelist.

Labels: -Pri-2 Pri-1
Per comment 17, raising priority back to Pri-1.
Labels: -Pri-1 Pri-2
Re #17, I don't see an answer in your verbose responses. I see speculation and philosophizing. Neither are helpful.

I am looking for the folks impacted to tell me what recently changed versus the past 4 years that has suddenly made this urgent. Besides helping set priority, it would also perhaps help me with root causing the problem with Dhanya.

> [ ... ] tell me what recently changed versus the past 4 years [ ... ]

I did answer that, but I'll try to restate:  I don't know what's changed,
other than we noticed.  But the fact is, now we know.  The problem is
happening, and folks are impacted.  Whether the problem is new or old
doesn't change that.
If that's all we've got, it's a P2.
FWIW, I am heavily impacted; lately I spend several hours per week tracking down why my merge attempts or other CLs were rejected this time.
As for why this is coming up now - maybe because we have many more commits, and/or maybe because the system is much more unstable than it used to be, and/or maybe because people spend more time trying to understand why a CL was rejected, and/or maybe because we have more devices and kernel versions than we ever used to have, and the problem has become more obvious because of that.
However, those are just wild guesses and thus worthless, as stated in #19 "I see speculation and philosophizing. Neither are helpful". Which is exactly why I declined to reply initially (besides taking the question as rhetorical to start with).

The logic that we're talking about here only triggers when the CQ fails to pass in its entirety, so indeed a few weeks of continuous red CQ runs will result in this logic not opportunistically submitting your CLs. It's been a bad last two weeks. If we fix that, your experience will get a lot better along with everyone else's.
Jason, could you please also provide the alternative view - i.e. what does it take to fix this when CQ runs are failing?  Waiting for the CQ to be green all the time might take a while, so it wouldn't be a bad idea to do the worst-case analysis and figure out the cost of fixing this bug sooner.  Would you also consider bumping the priority back up?
Dhanya is currently looking at it. We don't know yet. Once we do, he will continue on to work toward fixing it.

Bug SLO's are being actively discussed across all of the Infra teams. The consensus we seem to have come to as it relates to this bug is:

P1:
Current partial outage or degradation of a Pre-CQ/CQ, Chrome PFQ, Android PFQ, CTS Testing.

P2:
Current minor degradation of a Pre-CQ/CQ, Chrome PFQ, Android PFQ, CTS Testing.

Yes, I'm being more hard-nosed about this stuff than has been the case in the past: pulling people off to work on the most recent thing that has gotten attention will not allow us to achieve our major objective of fixing the CQ once and for all. We can't continue to do that. We're already behind on our OKRs toward these goals. And I've already pulled one person off to address a specific request of the kernel team (email notifications).

If anyone wants to debate this further, I'm going to give in and take over the responsibility for fixing this, because any more time spent debating would be better spent looking at it myself. Does anyone want to continue debating this?

Labels: -Type-Bug Type-Feature
Status: Started (was: Assigned)
Dhanya and I looked at this this morning. It's safe to say that there has never been any logic to try to specially handle kernel changes: all kernel changes (like all changes across CrOS) are added to the relevant list because the logic doesn't understand very well how to map a CL to a particular Portage package. The two exceptions are overlay and cros_workon packages; those get filtered from the relevant list correctly.

That said, it seems like a good idea to add this feature, so Dhanya is going to keep working on implementing it this week.
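
To restate the behaviour described above as a hypothetical sketch (not the real chromite code): only overlay and cros_workon changes can be proven irrelevant; everything else falls through to "relevant".

    # Hypothetical sketch of the classification described above.
    def is_relevant(is_overlay_change, is_cros_workon_change,
                    touches_package_under_test):
      if is_overlay_change or is_cros_workon_change:
        # These two change types can be mapped to Portage packages and
        # filtered against the packages under test.
        return touches_package_under_test
      # Anything else cannot be mapped to a package, so it is conservatively
      # treated as relevant to every builder.
      return True

    # A kernel CL the logic cannot map to a package stays "relevant"
    # everywhere, even on a board that builds a different kernel.
    print(is_relevant(False, False, False))  # True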

> [ ... ] there has never been any logic to try to specially handle
> kernel changes [ ... ]

I believe this to be true, but usually, kernel changes specific to only
certain boards do get flagged as relevant only to those boards.  Do we
know what makes edgar+v4.14 kernel different?

No, we don't see any evidence that that has ever worked for any board.
#28: same here. In my case, this is from practical experience, not from looking into any code/scripts.


Hrm, interesting. In that example, it's kernel 4.4 with a single change versus the example in Comment 2, which is kernel 4.14 with a merge commit. I wonder if the merge commit is an issue and answers the question I asked in Comment 10: "Why is this coming up now?". The merge commit itself contains no files, which could confuse things. And support for merge commits is only a few months old.

The code defaults to treating CLs as relevant when it cannot infer anything from them.
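
If the merge-commit theory is right, the fall-through would look roughly like this (hypothetical sketch, not the actual chromite code):

    # With nothing to match (e.g. a merge commit whose own diff lists no
    # files), the code cannot infer anything and defaults to "relevant".
    def infer_relevance(changed_files, files_used_by_board):
      if not changed_files:
        return True
      return bool(set(changed_files) & set(files_used_by_board))

    print(infer_relevance([], ['drivers/foo.c']))  # True (defaults to relevant)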

Sorry, I got behind on email.  I see above you said:

> The two exceptions are overlay and cros_workon packages

...but the kernel is a cros_workon package, isn't it?  ...or am I misunderstanding?

---

Also, I'll point out in response to @31 that the initial example in this bug was not a merge commit.

Comment 33 by bugdroid1@chromium.org, Jul 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/a3230c43892ec837213dc2625633abf47be3ffaf

commit a3230c43892ec837213dc2625633abf47be3ffaf
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Sun Jul 01 00:21:43 2018

DetectRelevantChanges: Additional exception handling and
logging statements

This CL tries to mitigate some relevance misses in
DetectRelevantChanges with a try block for each change.
It also adds additional info into log statements to help
similar debug efforts in the future.

BUG= chromium:845941 
TEST=tryjob

Change-Id: I6085a204aa3ab5f018b5531e61315b5fd5c67369
Reviewed-on: https://chromium-review.googlesource.com/1110264
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dhanya Ganesh <dhanyaganesh@google.com>
Reviewed-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/a3230c43892ec837213dc2625633abf47be3ffaf/lib/triage_lib.py
[modify] https://crrev.com/a3230c43892ec837213dc2625633abf47be3ffaf/cbuildbot/stages/report_stages.py

Comment 34 by bugdroid1@chromium.org, Jul 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f55b180ef31be8d397a42310d60a0dfe7732acc2

commit f55b180ef31be8d397a42310d60a0dfe7732acc2
Author: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Sun Jul 01 05:18:39 2018

Roll src/third_party/chromite dc425041e324..a3230c43892e (1 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/dc425041e324..a3230c43892e


git log dc425041e324..a3230c43892e --date=short --no-merges --format='%ad %ae %s'
2018-07-01 dhanyaganesh@chromium.org DetectRelevantChanges: Additional exception handling and


Created with:
  gclient setdep -r src/third_party/chromite@a3230c43892e

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.



BUG= chromium:845941 
TBR=chrome-os-gardeners@chromium.org

Change-Id: I0d46cbc6a9ecc3e31c21c441daac4e2ce341af9b
Reviewed-on: https://chromium-review.googlesource.com/1121499
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#571791}
[modify] https://crrev.com/f55b180ef31be8d397a42310d60a0dfe7732acc2/DEPS

Does anybody still see this problem after the CL? The bug will be closed in a week.
CL:1124677

The workon package dev-util/turbostat has a dependency on (practically) all kernel paths. This package is marked 'under test' for nami and cyan. Hence, the CL was marked relevant.
From dev-util/turbostat/turbostat-9999.ebuild, I see this:
    CROS_WORKON_PROJECT="chromiumos/third_party/kernel"
    CROS_WORKON_LOCALNAME="kernel/v4.14"
    CROS_WORKON_INCREMENTAL_BUILD=1
    CROS_WORKON_OUTOFTREE_BUILD=1

So, the package explicitly hard-codes a dependency on the 4.14 kernel
tree.  If the kernel dependency in this package is somehow making
4.14 kernel changes look relevant to non-4.14 kernels, that would
largely explain the problem we're seeing.  Most especially, AFAICT,
the problems complained of here are specific to 4.14 kernel CLs, and
no others.

So, if that package is the source of the problem, then something in
the ebuild is wrong.  We need to figure out what that is, and fix it.

I've opened a specific bug for that:  issue 863077 .
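
Given the CROS_WORKON_PROJECT line above, a rough sketch (hypothetical helper names) of why any CL in the kernel repository ends up flagged as relevant on boards that build dev-util/turbostat:

    # Hypothetical sketch: turbostat's workon project is the kernel repo, so
    # a project-level match sweeps in every kernel CL for boards that have
    # turbostat in their packages under test.
    workon_project = {'dev-util/turbostat': 'chromiumos/third_party/kernel'}
    packages_under_test = {'nami': {'dev-util/turbostat'}}

    def cl_is_relevant(cl_project, board):
      return any(workon_project.get(pkg) == cl_project
                 for pkg in packages_under_test.get(board, ()))

    # A chromeos-4.14 kernel CL is flagged relevant for nami regardless of
    # which kernel version nami actually builds.
    print(cl_is_relevant('chromiumos/third_party/kernel', 'nami'))  # True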

Calling the CI infrastructure side fixed since we now have the ability to identify these kinds of issues. Nice work, Dhanya.

Status: Fixed (was: Started)
