Update libva across ABI incompatible versions |
||||||||||
Issue descriptionWe are trying to update the libva version used on Chrome OS on this CL: https://chromium-review.googlesource.com/c/557361 The new version (1.8.3) is ABI-incompatible with the old version (1.7.0). Chrome detects this case upon startup by comparing the ABI version it was built against to the ABI version of the runtime library. If the versions don't match, video decode acceleration is disabled. The problem is that Paladin builds seem to use a pre-compiled version of Chrome, which means a version of Chrome compiled against the old libva at the time the CL is tested. So Paladin trybots will run with video acceleration disabled, which will make related tests fail. On way to mitigate this issue would be to have a way to force Paladin to recompile Chrome after applying the CL (maybe through a special flag in the CL's commit message), but we are open to any way of solving this issue.
,
Aug 16 2017
This would also work. Since we will need to update this library every ~3 months, it would be nice is the process could be made simple and easy though.
,
Aug 16 2017
A sanity check question first. Are ebuild dependencies setup such that incrementing the API version without also incrementing the chrome ebuild causes a dependency failure? Heavily rephrased: If we were to chump in your change today, and then regenerate binary prebuilts, would it still break developers chroots and/or incremental builders?
,
Aug 16 2017
while we could (should?) update the chrome ebuild so that its dep on libva uses a subslot (and the subslot in the libva packages correctly reflect their ABI/SONAME settings), that wouldn't help. our build specifically says to always use a binpkg for chrome which means portage won't validate its deps. in this scenario, i think the easiest way forward is to temporarily disable the relevant tests in the CQ, and then re-enable once the Chrome PFQ passes and publishes new binaries.
,
Aug 16 2017
Ack. Making paladins recompile chrome is sort of a non-starter. I think it'll take long enough to push them over the timeouts anyway. And we may be assuming that chrome is unchanged in many places. I couldn't find any reference to us having to do this in the past. Is this the first ABI incompatible update of libva we're facing? In the meantime, back to acourbot@ to do the disable-tests;land;revert-disable dance.
,
Aug 16 2017
this has come up in the past when we try to update system libs that Chrome links against and they change their ABI with a bit more hostility -- they change their SONAME so `chrome` fails to even launch. in those scenarios, we would stage the upgrade either by temporarily disabling the functionality in Chrome (if possible) across PFQ rolls, or we'd have the package preserve both ABI libs temporarily until the PFQ rolled. there's also the DBUS IPC ABI that is enshrined in things like system_api ... devs have to add transitional bridges. either way, changing system ABIs that Chrome relies upon is a messy affair currently and we don't have a good/general solution.
,
Aug 16 2017
nothing in here is secret, so dropping the restriction
,
Aug 16 2017
The toolchain team uses this process to force binary prebuilts to be regenerated. https://sites.google.com/a/google.com/chromeos-toolchain-team-home2/home/toolchain-info/post-steps-for-toolchain-roll I've never fully understood why it works, and doesn't cause issues for developers/incremental builders while it's in progress. It does depend on getting green CQ and PFQ runs before all new prebuilts are available. I believe that following the same process but also disabling the additional tests will work for you, but I don't know if it will distrupt things while it's in progress.
,
Aug 16 2017
the toolchain steps don't include a Chrome rebuild :). they still rely on Chrome PFQ to produce new binpkgs, and if they need to revert, they rely on being able to pin older versions of Chrome binpkgs (like we do when a new Chrome version is bad). as for not being an issue for devs, it is ... we just don't notice because the toolchain rarely breaks such that devs need to reinstall everything. on the rare occasions it does, we look at PSAs and chroot upgrade hooks to purge the bad content and force an update/refetch of binpkgs.
,
Aug 17 2017
Thanks for the comments everyone. I guess the next step for me is to figure out how to temporarily disable the offending tests on Paladin. I am not familiar at all with the procedure, but could find https://chromium-review.googlesource.com/c/63393 that seems to do something similar. Can I take it as a model for the issue at hand?
,
Aug 18 2017
#10: We do not want to disable all tests, we just want to disable the video ones. This should be a matter of removing suite:<suites> in src/third_party/autotest/files/client/site_tests/<tests>/control.<subtest> , where <suites> and <tests/subtest> are the minimum required set. I'm not sure which suites/tests are going to break exactly, but last we tried to push the CL, only setzer-paladin was a problem, failing video_ChromeRTCHWDecodeUsed.mjpeg in bvt-cq (so tests/subtest=video_ChromeRTCHWDecodeUsed.mjpeg and suites=bvt-cq would be a good place to start). See https://chromium-review.googlesource.com/c/557361#message-3d9606034790dbc275145de877fbc5f0d2384942 .
,
Aug 18 2017
Sounds like we (roughly) have a plan to do this once. I'd like to have a cleaner plan for how to solve this in future. I suspect that if we had a clean way to invalidate binary prebuilts (including Chrome), this would become much more straight forward, even if it results in a really, really slow CQ run or two.
,
Aug 18 2017
i think we have a way for packages to signal such changes require rebuilds: subslots. we just today choose not to support building Chrome from source in the CQ.
,
Aug 18 2017
Then, I think what I'm suggesting is a way to allow the CQ build/upload prebuilts for Chrome in limited cases. Or... perhaps enable it generally, but emit explicit Sheriff/Deputy alerts whenever it happens. If we enable Goma in the CQ this will be less painful than it used to be (I believe it'll still add 30-40 minutes). As long as the prebuilts are unique for different sub-slots, and the CQ doesn't uprev the Chrome ebuild, this should be safe to have in conjunction with the Chrome PFQ. Is my understanding of how this works correct? Also, could slots also be used to make the toolchain process better?
,
Aug 21 2017
> As long as the prebuilts are unique for different sub-slots if you mean the lib packages, then yes, diff sub-slots will normally have diff $PV and thus have unique binpkgs. for chrome itself, obviously that won't be true. > Also, could slots also be used to make the toolchain process better? not currently. there are no deps on the toolchain packages. we've taken the approach in general of omitting those as they don't add value. it'd be like adding a dep on glibc to (almost) everything because everything needs a C lib at some point.
,
Aug 21 2017
> if you mean the lib packages, then yes, diff sub-slots will normally have diff > $PV and thus have unique binpkgs. for chrome itself, obviously that won't be > true. Can we make it true for Chrome? If we do, then an easy bump process for this API becomes "enable CQ building and prebuilt uploading for Chrome, with alerting". It would be really nice to do that for the toolchain as well, but I see the problem. Hum... related question. Is it feasible to add a new value of some kind that's included in $PV (or any other part of the file name)? Then any time we need to invalidate prebuilts, we bump that value. I'm picturing it as a constant that's checked in somewhere. Usable for the toolchain, corruption recovery, etc.
,
Aug 22 2017
> Can we make it true for Chrome? as it stands today, no. we'd need to be able to revbump the package in the CQ too. which, if we're going to rebuild from source, shouldn't be a problem. and would actually make existing workflows much easier (like changing USE flags on the chrome ebuild). the -r# *is* the thing that tracks this :).
,
Aug 22 2017
I'm pretty sure we'll have race conditions if we allow more than a single builder to revbump any given ebuild. IE: We can't have both the CQ and PFQ bumping the chrome ebuild at the same time. I was only proposing that we build Chrome in the CQ for unusual cases, not for normal runs.
,
Aug 22 2017
> I'm pretty sure we'll have race conditions if we allow more than a single builder to revbump any given ebuild. IE: We can't have both the CQ and PFQ bumping the chrome ebuild at the same time. if we changed the Chrome PFQ to never revbump, but only update the $PV, i think we'd be OK. atm, we seem to use _rc, but never _rc#. so if the Chrome PFQ rebuilt the same based $PV (62.0.3193.0), it'd change 62.0.3193.0_rc1-r* to 62.0.3193.0_rc2-r1. i don't think Chrome itself uses the rc# itself ? then the CQ would stick to -r# like other cros-workon packages. lets say we get into a situation where they're racing: - tree has 62.0.3193.0_rc1-r1 - Chrome PFQ runs and starts building 62.0.3193.0_rc2-r1 - someone posts a new change to the Chrome 9999 ebuild - CQ runs and starts building 62.0.3193.0_rc1-r2 (Chrome PFQ wins race) - Chrome PFQ finishes first and posts 62.0.3193.0_rc2-r1 - CQ finishes - if the CQ failed already, then the Chrome PFQ behavior is irrelevant - if the CQ passed, during the stage to finalize uprevs & push results, it detects 62.0.3193.0_rc1-r1 no longer exists - if the Chrome ebuild change was isolated (i.e. no CQ-DEPEND on any other change), temp reject that CL (so it can rebuild it in the next run), but merge the rest - if the Chrome ebuild change depended on system changes, then now what ? we could temp reject/retry all of them, or we could temp reject the entire run. start with the latter and see how well it behaves ? - next CQ run kicks off and takes care of bumping 62.0.3193.0_rc2-r1 to 62.0.3193.0_rc2-r2 (CQ wins race) - Chrome PFQ detects 62.0.3193.0_rc1-r1 no longer exists - Chrome PFQ fails run and waits for the next one to pick up - not a big deal as Chrome PFQ is already "eventually consistent" ? the overall assumption is that the # of changes to Chrome coming in via the CQ is low so won't frequently impact a Chrome PFQ run. as for locking, git already provides that for us for free -- we can't push non-fast forward commits which means we have to refetch to master, check the state, put our commit on top of that, and then push again. > I was only proposing that we build Chrome in the CQ for unusual cases, not for normal runs. i'm not seeing a diff here. either Chrome needs to rebuild, or it doesn't. either our infra is able to handle the load, or it isn't. how can the CQ differentiate ?
,
Aug 22 2017
> > I was only proposing that we build Chrome in the CQ for unusual cases, not for normal runs. > i'm not seeing a diff here. either Chrome needs to rebuild, or it doesn't. either our infra is able to handle the load, or it isn't. how can the CQ differentiate ? I didn't explain well. I would the CQ to emerge Chrome every time, but we'd expect it to use binary prebuilts for the majority of cases. And, it's important that we raise an alert (via SoM or Viceroy metrics) if the CQ does ever compile Chrome to catch build misconfigurations and other problems. We currently error out when compiling Chrome in the CQ as a form of alert that something is misconfigured.
,
Aug 23 2017
So, after discussing with drinkcat@ I have worked on figuring out which tests need to be disabled by submitting tryjobs that are known to fail until they get green. But first I wanted to reproduce a failure using a Paladin tryjob:
Comment #11 mentioned that setzer-paladin was failing, so I submitted it again:
$ cros tryjob -g 557361 setzer-paladin
And to my surprise it succeeded!
https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3592
I need to be able to consistently reproduce the issue in order to narrow down the set of tests we need to disable. Obviously I am doing something wrong here, but what?
,
Aug 23 2017
#21: This is not running HWTest stage, try passing --hwtest parameter to cros tryjob.
,
Aug 24 2017
Thanks @drinkcat, I am now able to reproduce the failure myself! And to get a clear list of the failing tests. I should be able to follow-up with a patch now.
,
Aug 25 2017
The following three patches result in a successful trybot with picked together: https://chrome-internal-review.googlesource.com/c/chromeos/autotest-cheets/+/439153 https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/631860 https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/557361 Trybot: https://uberchromegw.corp.google.com/i/chromiumos.tryserver/builders/paladin/builds/3609 If I got the CQ-DEPEND tag properly, they should land together. Then I will revert the first two once we confirm that the Chromium binary is linked against libva 1.8.3. Reviews from people familiar with the testing infrastructure appreciated!
,
Aug 25 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/autotest-cheets/+/9613461b8c7d4e50de05e8128e1325686ef5f420 commit 9613461b8c7d4e50de05e8128e1325686ef5f420 Author: Alexandre Courbot <acourbot@chromium.org> Date: Fri Aug 25 19:08:57 2017
,
Aug 25 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7406c66888685a85dbdeac30aaed540d5cca8f43 commit 7406c66888685a85dbdeac30aaed540d5cca8f43 Author: Alexandre Courbot <acourbot@chromium.org> Date: Fri Aug 25 19:08:57 2017 autotest: temporarily disable HW video tests on bvt-cq Gerrit CL 557361 updates libva to version 1.8.3. This will make the Chrome binary used by the CQ temporarily incompatible with the system library, resulting in HW acceleration being disabled on platforms using libva until the Chrome binary is rebuilt. Temporarily disable these tests to avoid breaking the CQ. This patch will be reverted as soon as the Chrome binary can make use of the new libva library. CQ-DEPEND=CL:*439153,CL:557361 BUG= chromium:755881 TEST=ran a trybot with hwtests on libva 1.8.3 and confirmed it passed. Change-Id: I04889d1bc11e24edd1ac55a432750b03c4804b2d Reviewed-on: https://chromium-review.googlesource.com/631860 Commit-Ready: Alexandre Courbot <acourbot@chromium.org> Tested-by: Alexandre Courbot <acourbot@chromium.org> Reviewed-by: Nicolas Boichat <drinkcat@chromium.org> [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeRTCHWEncodeUsed/control [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.h264 [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.h264.mse [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.vp8 [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.vp9 [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.vp9.mse [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeRTCHWDecodeUsed/control.y4m [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeRTCHWDecodeUsed/control.mjpeg [modify] https://crrev.com/7406c66888685a85dbdeac30aaed540d5cca8f43/client/site_tests/video_ChromeHWDecodeUsed/control.vp8.mse
,
Aug 28 2017
,
Aug 28 2017
,
Sep 19 2017
We got a promise from Intel to not break the ABI ever again, so hopefully the patches listed above won't be necessary. Closing this issue for now but keeping them around just in case... |
||||||||||
►
Sign in to add a comment |
||||||||||
Comment 1 by drinkcat@chromium.org
, Aug 16 2017