Kernel CLs flagged as relevant when they shouldn't be
Issue description: Forked from bug #845643. The basic gist / example is that <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1056074/> originally wasn't submitted because "edgar-paladin" failed ...but "edgar-paladin" doesn't use kernel 4.14, so there was really no reason not to submit the CL. This happens a lot. Any way to make this work better?
May 23 2018
If I'm reading it aright, here's one of the relevant edgar-paladin failures:
https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/3036
That run shows that CL 1056074 is actually listed under "DetectRelevantChanges".
So, the CQ believed that the CL contributed to the edgar build. If edgar
doesn't use the 4.14 kernel, that would appear to be a bug.
May 23 2018
The situation happens all the time. Non-scientifically, 90%+ of all kernel commits into chromeos-4.14 are rejected because of failures on devices not using a 4.14 kernel.
May 23 2018
To double-confirm, I checked the build packages for edgar and found:
[ebuild N ] sys-kernel/chromeos-kernel-3_18-3.18-r2124
...so that confirms it's not using kernel 4.14.
---
Yeah, I thought it was a feature, but as per bug #845643 we thought maybe I was remembering wrong.
May 25 2018
There is flat-out a bug here: I cross-checked (by running 'uname -a' on an edgar) and also found that the kernel is 3.18. The CL is plainly for the chromeos-4.14 kernel branch, and edgar is just as plainly built with the 3.18 kernel. The CL (in fact, the entire large stack of CLs) shouldn't have appeared in the list of relevant CLs for the build.

I'm going to guess that the problem is somehow specific to kernel CLs, but someone who has the time to debug and diagnose needs to be the one to determine that. +shuqianz@, who knows more about how the code determines relevance than I do.
May 29 2018
Actually +shuqianz@ this time. This continues to impede progress, as exemplified in crosreview.com/1070468. Hopefully we can get some status updates on it this week.
May 29 2018
#6: Yes, I am starting to contemplate chumping this merge; I am seriously getting behind upstream releases. Why bother with the CQ if it is so badly broken? I don't really want to chump it, but this is getting really annoying. On top of that, any _real_ problems are getting drowned in the noise.
May 29 2018
The logic to detect whether a CL is relevant or not was written by yjhong@, who has already left the team. From my understanding, relevance is determined at the Portage level:
untriaged_changes = set(changes)
irrelevant_changes = set()
# Changes that modify projects used in building are always relevant.
untriaged_changes -= cls.GetChangesToBuildTools(changes, manifest)
if packages_under_test is not None:
  # Strip the version of the package in packages_under_test.
  cpv_list = [portage_util.SplitCPV(x) for x in packages_under_test]
  packages_under_test = ['%s/%s' % (x.category, x.package) for x in
                         cpv_list]
# Handles overlay changes.
# ClassifyOverlayChanges only handles overlays visible to this
# build. For example, an external build may not be able to view
# the internal overlays. However, in that case, the internal changes
# have already been filtered out in CommitQueueSyncStage, and are
# not included in |changes|.
overlay_changes, irrelevant_overlay_changes = cls.ClassifyOverlayChanges(
    untriaged_changes, config, build_root, manifest, packages_under_test)
untriaged_changes -= overlay_changes
irrelevant_changes |= irrelevant_overlay_changes
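For anyone reading along later, here is a minimal, self-contained sketch of the partitioning idea the excerpt above implements. The helper names and structure are invented for illustration; this is not the chromite implementation:

def partition_changes(changes, classifiers):
  """Split CLs into (relevant, irrelevant) sets.

  Each classifier looks at the still-untriaged CLs and returns
  (handled, handled_irrelevant). Anything no classifier can decide
  stays untriaged and is conservatively treated as relevant.
  """
  untriaged = set(changes)
  irrelevant = set()
  for classify in classifiers:
    handled, handled_irrelevant = classify(untriaged)
    untriaged -= handled
    irrelevant |= handled_irrelevant
  # The final relevant set is everything not proven irrelevant, so a CL
  # that no classifier can map to a package ends up flagged as relevant.
  relevant = set(changes) - irrelevant
  return relevant, irrelevant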
May 29 2018
This needs to be triaged and assigned by the CI team; the test team has expertise, but no longer has responsibility.
May 29 2018
I don't see the bug, at first glance. Why is this coming up now? This code hasn't changed since 2014.
May 29 2018
> I don't see the bug, at first glance.
The issue is this: Each CQ slave goes through the list of CLs, and
checks whether portage claims that the CL was included in the image
built for that slave. The CLs found relevant are reported in CIDB.
If a CL is marked relevant when it shouldn't be, it has two related
bad effects:
1) The main reason that the CQ determines relevance is so that if a
CL isn't relevant to certain builders, failures in those builders
won't cause the CL to be rejected. If the list is wrong, CLs that
could be committed will instead be rejected.
2) Human beings need to be able to see the relevant CLs, because it's
a more accurate blamelist. That makes it easier to identify a bad
CL by intersecting blamelists.
> Why is this coming up now? This code hasn't changed since 2014.
I don't know; I think the best way to answer that would be to take a stab
at getting to a root cause.
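To make those two effects concrete, here is a small self-contained sketch; the data shapes and function names are invented for illustration and are not the actual CQ code:

def should_reject(cl, builder_passed, relevance):
  """Reject a CL only if some builder that considered it relevant failed.

  builder_passed: builder name -> bool (did the build pass).
  relevance: builder name -> set of CLs reported relevant for that build.
  An inflated relevance set turns unrelated failures into rejections (effect 1).
  """
  return any(not passed and cl in relevance[builder]
             for builder, passed in builder_passed.items())

def blamelist_suspects(failing_builders, relevance):
  """Narrow the blamelist to CLs relevant to every failing builder (effect 2)."""
  per_builder = [relevance[b] for b in failing_builders]
  return set.intersection(*per_builder) if per_builder else set()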
May 29 2018
> The issue is this: Each CQ slave goes through the list of CLs, and
> checks whether portage claims that the CL was included in the image
> built for that slave. The CLs found relevant are reported in CIDB.
> If a CL is marked relevant when it shouldn't be, it has two related
> bad effects:
> 1) The main reason that the CQ determines relevance is so that if a
> CL isn't relevant to certain builders, failures in those builders
> won't cause the CL to be rejected. If the list is wrong, CLs that
> could be committed will instead be rejected.
> 2) Human beings need to be able to see the relevant CLs, because it's
> a more accurate blamelist. That makes it easier to identify a bad
> CL by intersecting blamelists.

I know that. I don't see the bug *in the code*.

> > Why is this coming up now? This code hasn't changed since 2014.
>
> I don't know; I think the best way to answer that would be to take a stab
> at getting to a root cause.

That's easier said than done. Everything can't be a P1; I need to know why we think a 4-year-old bug is important enough to drop other P1s to work on it.
May 29 2018
> I know that. I don't see the bug *in the code*.

Ah. That's clearer, then.

> That's easier said than done. Everything can't be a P1;
> I need to know why we think a 4-year-old bug is important
> enough to drop other P1s to work on it.

The code is 4 years old, but that's not the same as the problem being 4 years old. But I don't think the age is relevant. What's relevant is the impact right now.

That impact is real enough: Anyone trying to submit a CL for the 4.14 kernel is going to run into false rejections that could slow them down. Depending on the scope of the problem, other kinds of CLs could also be impacted; we may not know that without debugging.

As for triage against other P1 problems: I'm not positioned to know how many devs are impacted (but it's at least three), or how much they feel the pain. I also don't have the information I'd want to offer advice on trading this off against other problems, and I'm not sure I really want to give such advice if asked.

Speaking for myself: Long kernel stacks like the one mentioned in comment #7 tend to make it hard to pin blame on a CL when there's a CQ failure and the bad CL has to be found. That problem will be harder if the blamelists aren't trustworthy. This doesn't happen frequently, so for me, it's a low-frequency, high-impact event.
May 30 2018
Dhanya, this is a nice follow-on to the change you just did. Please take a look and we can chat if this looks confusing.
May 31 2018
I'm curious about the downgrade to Pri-2. This is definitely impacting developer productivity and the ability to get things through the CQ. Why not Pri-1?
May 31 2018
Because no one answered Comment 10.
May 31 2018
> Because no one answered Comment 10.

Wait. I answered comment 10. And I followed up to answer questions about the response.

To be clear: I'm not heavily impacted by this, but I _am_ impacted, and anyone doing sheriff/deputy/bobby duty can be impacted. Any time there's a complicated failure in the CQ and the blamelist is long, this bug makes it harder to identify the real cause, because of the possibility of false positives in the blamelist.
May 31 2018
Per comment 17, raising priority back to Pri-1.
May 31 2018
Re #17, I don't see an answer in your verbose responses. I see speculation and philosophizing. Neither is helpful. I am looking for the folks impacted to tell me what recently changed versus the past 4 years that has suddenly made this urgent. Besides helping set priority, it would also perhaps help me with root-causing the problem with Dhanya.
May 31 2018
> [ ... ] tell me what recently changed versus the past 4 years [ ... ]

I did answer that, but I'll try to restate: I don't know what's changed, other than that we noticed. But the fact is, now we know. The problem is happening, and folks are impacted. Whether the problem is new or old doesn't change that.
May 31 2018
If that's all we've got, it's a P2.
May 31 2018
FWIW, I am heavily impacted; lately I spend several hours per week tracking down why my merge attempts or other CLs were rejected this time.

As for why this is coming up now - maybe because we have many more commits, and/or maybe because the system is much more unstable than it used to be, and/or maybe because people spend more time trying to understand why a CL was rejected, and/or maybe because we have more devices and kernel versions than we ever used to have, and the problem has become more obvious because of that.

However, those are just wild guesses and thus worthless, as stated in #19: "I see speculation and philosophizing. Neither is helpful". Which is exactly why I declined to reply initially (besides taking the question as rhetorical to start with).
Jun 1 2018
The logic that we're talking about here only triggers when the CQ run fails to pass in its entirety, so indeed a few weeks of continuously red CQ runs will result in this logic not opportunistically submitting your CLs. It's been a bad last two weeks. If we fix that, your experience will get a lot better, along with everyone else's.
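A rough sketch of what that means for a single run, under the behavior described above; the names are invented for illustration and this is not the actual CQ code:

def cls_to_submit(all_cls, builder_passed, relevance):
  """builder_passed: builder -> bool; relevance: builder -> set of CLs."""
  if all(builder_passed.values()):
    # Fully green run: the whole batch is submitted; relevance never matters.
    return set(all_cls)
  failing = [b for b, passed in builder_passed.items() if not passed]
  # Red run: a CL can still be submitted opportunistically if no failing
  # builder listed it as relevant -- which is exactly why an over-broad
  # relevance list keeps innocent CLs out during bad weeks.
  return {cl for cl in all_cls
          if not any(cl in relevance[b] for b in failing)}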
Jun 1 2018
Jason, could you please also provide the alternative view - i.e., what does it take to fix this when CQ runs are failing? Waiting for the CQ to be green all the time might take a while, so it wouldn't be a bad idea to do the worst-case analysis and figure out the cost of fixing this bug sooner. Would you also consider bumping the priority back up?
Jun 1 2018
Dhanya is currently looking at it. We don't know yet. Once we do, he will continue on to work toward fixing it.

Bug SLOs are being actively discussed across all of the Infra teams. The consensus we seem to have come to as it relates to this bug is:

P1: Current partial outage or degradation of a Pre-CQ/CQ, Chrome PFQ, Android PFQ, CTS Testing.
P2: Current minor degradation of a Pre-CQ/CQ, Chrome PFQ, Android PFQ, CTS Testing.

Yes, I'm being more hard-nosed about this stuff than has been the case in the past: pulling people off to work on the most recent thing that has gotten attention will not allow us to obtain our major objective of fixing the CQ once and for all. We can't continue to do that. We're already behind on our OKRs toward these goals. And I've already pulled one person off to address a specific request of the kernel team (email notifications).

If anyone wants to debate this further, I'm going to give in and take over the responsibility for fixing this, because any more time spent debating would actually be better spent looking at it myself. Does anyone want to continue debating this?
Jun 4 2018
Dhanya and I looked at this this morning. It's safe to say that there has never been any logic to try to specially handle kernel changes: all kernel changes (like all changes across CrOS) are added to the relevant list, because the logic doesn't understand very well how to map a CL to a particular Portage package. The two exceptions are overlay and cros_workon packages; those get filtered out of the relevant list correctly. That said, it seems like a good idea to add this feature, so Dhanya is going to keep working on implementing it this week.
Jun 4 2018
> [ ... ] there has never been any logic to try to specially handle
> kernel changes [ ... ]

I believe this to be true, but usually, kernel changes specific to only certain boards do get flagged as relevant only to those boards. Do we know what makes edgar+v4.14 kernel different?
Jun 4 2018
No, we don't see any evidence that that has ever worked for any board.
Jun 4 2018
#28: same here. In my case, this is from practical experience, not from looking into any code/scripts.
Jun 4 2018
> No, we don't see any evidence that that has ever worked for any board.
The code definitely excludes some kernel CLs from some builds. For the
most recent example I could find, there's this CL:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1077913
That CL was present in this CQ run:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/18769
The CL was marked relevant for the following slave runs:
https://uberchromegw.corp.google.com/i/chromeos/builders/winky-paladin/builds/5296
https://uberchromegw.corp.google.com/i/chromeos/builders/kip-paladin/builds/5287
https://uberchromegw.corp.google.com/i/chromeos/builders/quawks-paladin/builds/3118
https://uberchromegw.corp.google.com/i/chromeos/builders/bob-paladin/builds/3242
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-paladin/builds/4757
https://uberchromegw.corp.google.com/i/chromeos/builders/reef-paladin/builds/5985
https://uberchromegw.corp.google.com/i/chromeos/builders/coral-paladin/builds/3475
https://uberchromegw.corp.google.com/i/chromeos/builders/eve-paladin/builds/3311
https://uberchromegw.corp.google.com/i/chromeos/builders/kevin-arcnext-paladin/builds/618
The CL was deemed irrelevant to the other builders. AFAICT, relevance
detection worked exactly right in this case.
Jun 4 2018
Hrm, interesting. In that example, it's kernel 4.4 with a single change, versus the example in Comment 2, which is kernel 4.14 with a merge commit. I wonder if the merge commit is the issue and answers the question I asked in Comment 10: "Why is this coming up now?". The merge commit itself contains no files, which could be confusing things. And support for merge commits is only a few months old. The code defaults to treating CLs as relevant when it cannot infer anything from them.
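If that guess is right, the fallback would look roughly like this. This is a hedged sketch of the assumed behavior, not the chromite code:

def is_relevant_to_build(cl_files, package_source_paths):
  """Return True unless the CL's files provably miss every package under test."""
  if not cl_files:
    # A merge commit may carry no file diff of its own; with nothing to
    # inspect, the conservative default is "relevant everywhere".
    return True
  return any(f.startswith(path)
             for f in cl_files
             for path in package_source_paths)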
Jun 8 2018
Sorry, I got behind on email. I see above you said:

> The two exceptions are overlay and cros_workon packages

...but the kernel is a cros_workon package, isn't it? ...or am I misunderstanding?
---
Also, I'll point out in response to #31 that the initial example in this bug was not a merge commit.
Jul 1
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/a3230c43892ec837213dc2625633abf47be3ffaf

commit a3230c43892ec837213dc2625633abf47be3ffaf
Author: Dhanya Ganesh <dhanyaganesh@chromium.org>
Date: Sun Jul 01 00:21:43 2018

DetectRelevantChanges: Additional exception handling and logging statements

This CL tries to mitigate some relevance misses in DetectRelevantChanges with a try block for each change. It also adds additional info into log statements to help similar debug efforts in the future.

BUG=chromium:845941
TEST=tryjob

Change-Id: I6085a204aa3ab5f018b5531e61315b5fd5c67369
Reviewed-on: https://chromium-review.googlesource.com/1110264
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Dhanya Ganesh <dhanyaganesh@google.com>
Reviewed-by: Jason Clinton <jclinton@chromium.org>

[modify] https://crrev.com/a3230c43892ec837213dc2625633abf47be3ffaf/lib/triage_lib.py
[modify] https://crrev.com/a3230c43892ec837213dc2625633abf47be3ffaf/cbuildbot/stages/report_stages.py
Jul 1
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/f55b180ef31be8d397a42310d60a0dfe7732acc2

commit f55b180ef31be8d397a42310d60a0dfe7732acc2
Author: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Sun Jul 01 05:18:39 2018

Roll src/third_party/chromite dc425041e324..a3230c43892e (1 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/dc425041e324..a3230c43892e

git log dc425041e324..a3230c43892e --date=short --no-merges --format='%ad %ae %s'
2018-07-01 dhanyaganesh@chromium.org DetectRelevantChanges: Additional exception handling and

Created with: gclient setdep -r src/third_party/chromite@a3230c43892e

The AutoRoll server is located here: https://chromite-chromium-roll.skia.org
Documentation for the AutoRoller is here: https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should be CC'd on the roll, and stop the roller if necessary.

BUG=chromium:845941
TBR=chrome-os-gardeners@chromium.org

Change-Id: I0d46cbc6a9ecc3e31c21c441daac4e2ce341af9b
Reviewed-on: https://chromium-review.googlesource.com/1121499
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#571791}

[modify] https://crrev.com/f55b180ef31be8d397a42310d60a0dfe7732acc2/DEPS
Jul 11
Does anybody still see this problem after the CL? The bug will be closed in a week.
Jul 11
CL:1124677
Jul 12
The workon package dev-util/turbostat has a dependency on (practically) all kernel paths. This package is marked 'under test' for nami and cyan. Hence, the CL was marked relevant.
Jul 12
From dev-util/turbostat/turbostat-9999.ebuild, I see this:
CROS_WORKON_PROJECT="chromiumos/third_party/kernel"
CROS_WORKON_LOCALNAME="kernel/v4.14"
CROS_WORKON_INCREMENTAL_BUILD=1
CROS_WORKON_OUTOFTREE_BUILD=1
So, the package explicitly hard-codes a dependency on the 4.14 kernel
tree. If the kernel dependency in this package is somehow making
4.14 kernel changes look relevant to non-4.14 kernels, that would
largely explain the problem we're seeing. Most especially, AFAICT,
the problems complained of here are specific to 4.14 kernel CLs, and
no others.
So, if that package is the source of the problem, then something in
the ebuild is wrong. We need to figure out what that is, and fix it.
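To spell out the suspected mechanism: a CL gets flagged for a board if any package under test on that board workons the tree the CL touches, so a package like turbostat that workons kernel/v4.14 drags 4.14 kernel CLs into the relevant list even on boards whose own kernel is 3.18. A hedged sketch of that idea follows; the board, package data, and function names are made up for illustration and are not taken from the chromite sources:

def cl_relevant_to_board(cl_workon_localname, packages_under_test):
  """packages_under_test: package name -> CROS_WORKON_LOCALNAME it builds from."""
  return any(localname == cl_workon_localname
             for localname in packages_under_test.values())

# Hypothetical board that builds a 3.18 kernel but also tests turbostat.
board_packages = {
    'sys-kernel/chromeos-kernel-3_18': 'kernel/v3.18',
    'dev-util/turbostat': 'kernel/v4.14',  # hard-coded in the ebuild above
}
print(cl_relevant_to_board('kernel/v4.14', board_packages))  # True: flagged as relevant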
Jul 12
I've opened a specific bug for that: issue 863077. Calling the CI infrastructure side fixed, since we now have the ability to identify these kinds of issues. Nice work, Dhanya.
Jul 12