CHROMEOS_LKGM roll should not fallback to older chrome-sdk version |
|||||||
Issue descriptionCurrently LKGM roll can succeeed even if the package isn't available. It simply fallback to older sdk. It should fail if it failed to download the given version. +jclinton@ who asked why we suddenly started seeing ths issue. It was because the bot was testing using wrong version. It started failing recently because it exceeds max fallback versions now.
,
Dec 11
Comment in the CL (which introduces --exact-version to disable the fallback mechanism) with a concrete example: --exact-version disables the failover mechanism that searches backwards for an SDK when the current version doesn't have an SDK. This failover is useful for devs in cases where a particular board may be missing an SDK for a particular version, but it's bad on the bots, especially when they are trying to qualify a LKGM bumping CL. For example, the current LKGM is 11361.0.0, and the CL that bumped it is this: https://chromium-review.googlesource.com/1367484 Except 11361.0.0 does not have a valid SDK for amd640-generic or daisy - the failover found a prior LKGM that worked. In fact, the daisy-full builder has been broken since Dec 2, and we've been merrily bumping LKGMs based on old SDKs until this weekend, when the window of old versions went out of range. --exact-version is intended to be used on these chromium infra bots, which need to be strict about using the LKGM version in the CHROMEOS_LKGM file (which is not passed as a command line argument).
,
Dec 11
Currently --version supports an exact version for a complete version, e.g. R73-11361.0.0. Can that be used on the builders where this ins an issue?
,
Dec 11
The chromeos-amd64-generic-rel trybot uses whatever is in the CHROMEOS_LKGM file. Unless you're proposing that the bot logic should read the LKGM file and pass the contents as --version, which seems like the wrong place to do this kind of thing.
,
Dec 11
Digging into the 'succeeding' CL: https://chromium-review.googlesource.com/c/chromium/src/+/1367484/ The daisy bot run: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/chromeos-daisy-rel/145002 The log file: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8927809603450425536/+/steps/gclient_runhooks__with_patch_/0/stdout The chrome-sdk command used: src/third_party/chromite/bin/cros chrome-sdk --nogoma --use-external-config --nogn-gen --board=daisy --cache-dir=src/build/cros_cache/ --log-level=error It *should* be: src/third_party/chromite/bin/cros chrome-sdk --nogoma --use-external-config --nogn-gen --board=daisy --cache-dir=src/build/cros_cache/ --log-level=debug --exact-version
,
Dec 11
I see. I was thinking this was being done within a chromite script. Yea, I guess a boolean arg makes sense, as long as we use something that doesn't look like yet-another-version :) I proposed--require-exact-version in the CL.
,
Dec 11
Ben/Cindy - do you have any thoughts about this? The idea is that we eliminate fallbacks on the chromeos trybots, and thus require the exact LKGM version to be available for amd64-generic and daisy.
,
Dec 11
This is probably the correct thing to do, but it effectively prevents us from upreving the PFQ if the amd64-generic or daisy "full" builders failed to generate a LATEST-12345.0.0 file for any reason (since that would then immediately break the chromium CQ). I don't recall whether the LATEST file is dependent on the builder's tests passing, or just on BuildPackages succeeding. If it's the latter we are probably OK since we don't really want to uprev chrome in that case. If test failures prevent the LATEST file from getting created, then this might be too fragile. +dgarrett@, +vapier@
,
Dec 11
> thus require the exact LKGM version to be available for amd64-generic and daisy. That seems fine to me. I always assumed we did the decrement fallback logic because we didn't entirely trust the builders to have uploaded images for that exact version, and it was better to have something close rather than nothing. But I think it does make sense to fail fast on the bots. Especially when we're vetting a new lkgm bump.
,
Dec 11
And I wonder if it's worth revisiting how we treat lkgm bump failures. IIUC, the PFQ currently spits out the bump CLs then quits shortly thereafter without waiting to see if it gets submitted. I wonder if we want to change that? ie: make the PFQ wait for the CL to land and turn red or throw up some flag if it fails. That would get more visibility into the status of the CLs. As is, I think it's mostly achuith and me monitoring them via watchlists :( Though that may significantly increase the PFQs cycle time since it'd be waiting for an additional chromium CQ run.
,
Dec 11
Yup, I was talking about this with Steven - in the current setup, the gardener sees a green PFQ while the LKGM hasn't upreved all week. I'll file a bug - I think we should do this, though the CQ turnaround time is unfortunate.
,
Dec 11
,
Dec 12
I'm a bit confused. > it effectively prevents us from upreving the PFQ if the amd64-generic or daisy "full" builders failed to generate a LATEST-12345.0.0 file for any reason (since that would then immediately break the chromium CQ). How so? They're currently failing and LKGM wasn't updated since Dec 2, but uprev has been successful afaik.
,
Dec 12
What specifically is "currently failing"? What do you mean exactly by LKGM?
,
Dec 12
Also: I thought the CHROMEOS_LKGM emails went to the gardeners, but apparently not. (I get those emails also, as do a few others.) Maybe we should fix that so that the gardner list gets them? It's only ~1 per day and it would alert more folks when there were problems. The biggest problem with waiting for the CHROMEOS_LKGM commit to succeed would be failing the PFQ run if the uprev fails. Currently we uprev chrome for CrOS, even if we fail to update CrOS for Chrome. Making that symmetrical would certainly be more robust, but also potentially more fragile. Tangent: The primary purpose of the fallback code is that rarely does every board pass on every CrOS uprev version, so without the fallback developers working on boards that are not tested in the PFQ (i.e. > 80% of boards) would potentially be broken, especially newer boards that are especially flaky.
,
Dec 12
Steven, I think we should still uprev chrome for CrOS even if we fail to update CrOS for chrome. I don't think this needs to be symmetrical. It's more urgent to have a recent chrome for CrOS than the other way around. All I'm proposing is that after we uprev chrome as we do today, we additionally wait for the chrome CL to land, and mark the run as red if it doesn't. It's purely a signalling issue so gardeners don't ignore it.
,
Dec 12
Won't that be confusing if Chrome is upreved but the PFQ fails?
,
Dec 12
* chromeos full builders: (This produces chrome-sdk for extenral config.) https://cros-goldeneye.corp.google.com/chromeos/legoland/builderSummary?builderGroups=full&limit=100&buildBranch=master It's been failing since Dec 2 due to crbug.com/913317. The last successful build that could produce image was 11327.0.0 https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/daisy-full?prefix=LATEST-113&pli=1 https://pantheon.corp.google.com/storage/browser/chromeos-image-archive/amd64-generic-full?prefix=LATEST-113&pli=1 Note: According to chromeos sheriffs, full builders aren't maintained anymore because they thoguth it's not really used and didn't know its used to produce chrome-sdk for external config. The bot started failing badly from 11/16, but this issue didn't surface because it passed several times by luck. * LKGM roll builds on chromium waterfall. (chromeos-amd64-generic-rel/chromeos-daisy-rel) Example LKGM roll CL that is failing due to missing package: https://chromium-review.googlesource.com/c/chromium/src/+/1373434 Log: ERROR: Cannot find SDK for 'daisy' with version 11390.0.0 It was passing last week for a while due to this (913744) bug. It started failing this week because it exceeds max fallback range. By the way, we're using external version to roll LKGM, which is used by both internal and external, but passing these bots doesn't mean that there will be external version as well It may not be a big issue, but it can cause inconsistency in theory. * PFQ bots has been mostly green. https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=master-chromium-pfq&buildBranch=master
,
Dec 12
Steven - don't we have cases where the PFQ has a green run but chrome doesn't uprev? Unless you prefer the situation where we don't uprev chrome, but there's already too many random variables (infra flakiness being a big one) arrayed against a successful uprev that I wouldn't want to add one more. Oshima-san: I think your comment should probably go into crbug.com/913317?
,
Dec 12
Just addressing the immediate issues and comment #18 for now to reduce confusion. 1. I just checked and there are in fact no LATEST files for amd64-generic-full since 12/2, so Simple Chrome is falling back to LATEST-11327.0.0. The current latest is 11361.0.0, so that is going to start impacting developers and the CQ very soon (at 11369 = 11327 + 14*3). I added a note to issue 913317. 2. By 'LKGM' it looks like you mean the CHROMEOS_LKGM file. (There are others 'Last Known Good Manifest references in Chrome OS, so I like to be explicit). And correct, if we fall too far behind then the CQ will break. There is an underlying assumption that if one of the builders we rely on is breaking it is an urgent issue. Clearly that hasn't been correct with the 'full' builders and we need to address that. I'll talk to jclinton@ about that.
,
Dec 12
#19, no it was to answer question in #14 by stevenjb@ > 2. By 'LKGM' it looks like you mean the CHROMEOS_LKGM file. (There are others 'Last Known Good Manifest references in Chrome OS, so I like to be explicit). Sorry I wasn't aware of it. This bug is about CHROMEOS_LKGM. Updated the title to be explicit. >And correct, if we fall too far behind then the CQ will break. The issue is that it will roll even if the chrome-sdk for the specified version does not exist. How about my original question? Using exact version to roll CHROMEOS_LKGM can really cause break Chrome PFQ?
,
Dec 12
I agree that we should ideally test against the exact SDK version on amd64-generic and daisy-generic before updating CHROMEOS_LKGM. My concern is that if we require an exact version in the chromium CQ, it will make those builders more fragile for everything else (IIUC, we use the same CQ builders for the CHROMEOS_LKGM uprev). i.e. we would *never* be able to uprev CHROMEOS_LKGM if amd64-generic-full or daisy-full are failing, because that would immediately break the CQ. IIRC, there has been at least one occasion where we chumped the CHROMEOS_LKGM roll because: a) Failures on amd64-generic- or daisy-full were external specific (with an open bug under investigation) b) Simple Chrome was broken for developers using internal builds targeting a certain, important, board. That scenario should be pretty rare, but without the fallback mechanism we would not be able to unblock developers until the external builders were fixed. That said, I don't necessarily think that we shouldn't change the CQ builders, we just need to be cautious about it and make sure to update what documentation we have.
,
Dec 12
> There is an underlying assumption that if one of the builders we rely on is breaking it is an urgent issue. Clearly that hasn't been correct with the 'full' builders and we need to address that. I'll talk to jclinton@ about that. Agreed. Do keep myself and dpranke@ informed. It's likely that chromium CQ's dependence on these builders wasn't obvious or well-advertised, in which case their lack of repair/gardening/attention may be justified. I'd certainly be interested & willing to help in any effort to remediate that.
,
Dec 12
I may not familiar with how CHROMEOS_LKGM roll works. These bots are green for normal, other chromiun changes. It fails only when trying to build with new CHROMEOS_LKGM. https://ci.chromium.org/p/chromium/builders/luci.chromium.try/chromeos-daisy-rel My understanding is (was) that "cros chrome-sdk" uses CHROMEOS_LKGM, so amd64-generic-full/daisy-full in CQ should not fail as long as CHROMEOS_LGKM was landed with good chrome-sdk version for other CLs. Am I wrong?
,
Dec 12
Ah, right, because updating CHROMEOS_LKGM at this point exceeds 11369 (I didn't realize CrOS is up to 11390.0.0).
So that is definitely a downside to the current system - Chrome has been getting upreved on CrOS, but CrOS has not been getting upreved for Simple Chrome.
Short term we need to get issue 913317 fixed (I raised it to a P0) and increase awareness of the importance of the amd64-generic/daisy-full builders (I have spoken to some stakeholders, I believe that is in order).
Long term... probably we should:
a) Go ahead and disable the Simple Chrome version fallback code in chromeos-{amd64-generic|daisly}-rel so that the uprev starts failing sooner.
b) Add the gardener email list to the watchlist for CHROMEOS_LKGM.
Ideally we would also fail the PFQ if the CHROMEOS_LKGM uprev fails (instead of 'b'), since we can always do a manual uprev, but that seems challenging at best.
Then we just have to be sure to communicate that we can never chump the CHROMEOS_LKGM uprev if either of those builders fail.
,
Dec 13
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/e4e44f0a66da0010ea01427d38d9c0d6d63e9461 commit e4e44f0a66da0010ea01427d38d9c0d6d63e9461 Author: Achuith Bhandarkar <achuith@chromium.org> Date: Thu Dec 13 01:04:46 2018 cros_chrome_sdk: Support for --require-exact-version --require-exact-version disables the failback mechanism that tries prior versions when the SDK is not found. This flag will be used on the chromium bots, which currently pass LKGM versions that don't exist. BUG= chromium:913744 TEST=VersionTest.testExactVersion Change-Id: Ic7b4cc8e1733436a72bcce6dfe2217696e8a0d20 Reviewed-on: https://chromium-review.googlesource.com/1370563 Commit-Ready: Achuith Bhandarkar <achuith@chromium.org> Tested-by: Achuith Bhandarkar <achuith@chromium.org> Reviewed-by: Steven Bennetts <stevenjb@chromium.org> Reviewed-by: Ben Pastene <bpastene@chromium.org> [modify] https://crrev.com/e4e44f0a66da0010ea01427d38d9c0d6d63e9461/cli/cros/cros_chrome_sdk.py [modify] https://crrev.com/e4e44f0a66da0010ea01427d38d9c0d6d63e9461/cli/cros/cros_chrome_sdk_unittest.py
,
Dec 13
#25, sgtm We need to make sure that the target version is available when we do a), but I think we just need to enable & update CHROMEOS_LGKM (that we know exists) in one CL.
,
Dec 13
With achuith's addition of "--require-exact-version", we can now stop falling back to older versions on chromium's bots. Though we can't quite use it yet since the bots are currently running with versions that ~30 behind the current CHROMEOS_LKGM. Once we've got all the full builders back to green (bug 913317) we can turn that on and close this out.
,
Jan 8
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/70e3890029153b9571de1a9f035adcf465d44710 commit 70e3890029153b9571de1a9f035adcf465d44710 Author: Achuith Bhandarkar <achuith@chromium.org> Date: Tue Jan 08 03:05:51 2019 Use exact version of lkgm SDK on bots. --require-exact-version disallows simple chrome from using older SDKs. BUG= 913744 TEST=bots. Change-Id: I70e9d89d8a2378e654270adde282b22f38559ba3 Reviewed-on: https://chromium-review.googlesource.com/c/1399561 Reviewed-by: Ben Pastene <bpastene@chromium.org> Commit-Queue: Achuith Bhandarkar <achuith@chromium.org> Cr-Commit-Position: refs/heads/master@{#620594} [modify] https://crrev.com/70e3890029153b9571de1a9f035adcf465d44710/DEPS
,
Jan 8
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by achuith@chromium.org
, Dec 11Status: Started (was: Assigned)