BoringSSL Android bots are a little flaky
Issue description

It seems about 1/4 of the recent runs have gone purple.
https://ci.chromium.org/p/boringssl/g/main/console

Samples:
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_aarch64/b8931053818862218944
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_arm/b8931053818383685264
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_arm/b8931208952901614624
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_arm/b8931308378343084400

Infra folks: any ideas what might be the cause here, or how to debug it?
Nov 5
FYI - the canonical way to file trooper bugs is go/bug-a-trooper - or g.co/bugatrooper (public link).
Nov 5
The failures are all due to unexplained swarming bot deaths; handing off to the foundation trooper.
Nov 5
Some of these failures are marked as BOT_DIED while another is marked TIMED_OUT, but I believe these are all timeouts: the execution timeout is set to 30m and these runs all took about 32m.
Nov 5
Oh hrm. I assume these are hitting an overall timeout? Poking around, those bots are on the slower end; I don't think I'd noticed that before. The SSL tests take about 16 minutes, which is half the budget. Maybe it's time for us to toy with swarming and the like.

I also notice a lot of variance in bot_update. This one took 9 minutes:
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_arm/b8931208952901614624

At a glance, it's because the rather hefty android_ndk repo takes a while to download. Other runs have very fast bot_updates:
https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_aarch64/b8931764302397220736

Is this an issue of some cache not having been populated yet? I remember some talk of using CIPD for the NDK. Would that solve this? I assume that repo is huge because it contains binaries for lots of NDK versions in its history.
Nov 5
Chromium's bots still fetch the NDK via git. (The SDK is all fetched via CIPD.) I'm not aware of any active efforts to move it to CIPD.

Additionally, the longer bot_update durations could be due to the fact that (IIUC) a swarming bot clobbers all its caches when a bot_death occurs (+MA for confirmation). I'd suggest increasing the timeout to see if that both avoids bot_deaths and increases cache persistence (which should reduce bot_update durations).

These builders aren't owned by troopers, so over to tandrii, who's in the OWNERS for those bots:
https://boringssl.googlesource.com/boringssl/+/infra/config/OWNERS
Nov 5
Yes, BOT_DIED kills the cache. It's a safety measure, in case something messed up hard.
,
Nov 5
davidben@: the right solution is to increase the timeout in your repo's cr-buildbucket.cfg file from 30 minutes to, say, 40 minutes. And if you want this to run faster, then yes, isolating the tests and potentially running them in parallel is also a good idea.
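To illustrate, here is a hypothetical sketch of the kind of change meant, assuming the builder entries set the standard execution_timeout_secs field directly (the real cr-buildbucket.cfg may group this under builder_defaults or a mixin instead, and the bucket/builder names are copied from the build links above):

    buckets {
      name: "luci.boringssl.ci"
      swarming {
        builders {
          name: "android_arm"
          # Raise the per-build execution timeout from 30 to 40 minutes.
          execution_timeout_secs: 2400  # was 1800
        }
        # ... same change for android_aarch64 and any other slow Android builders.
      }
    }

(2400 seconds is 40 minutes, which leaves some headroom over the ~32-minute runs seen above.)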
Nov 5
The following revision refers to this bug:
https://boringssl.googlesource.com/boringssl/+/219759563fb6a837041458b0ae79554cd6bda872

commit 219759563fb6a837041458b0ae79554cd6bda872
Author: David Benjamin <davidben@google.com>
Date: Mon Nov 05 22:27:24 2018

    Increase the timeout for Android bots. Those take a while to run.

    Bug: chromium:900953
    Change-Id: I52530d36588bd28cff7f974a9e16faef5c93e5e6
    Reviewed-on: https://boringssl-review.googlesource.com/c/32847
    Reviewed-by: Adam Langley <agl@google.com>
    Commit-Queue: Adam Langley <agl@google.com>
    CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>

[modify] https://crrev.com/219759563fb6a837041458b0ae79554cd6bda872/cr-buildbucket.cfg
Nov 5
Actually, something else is weird here: it appears "gclient sync" takes ~3 minutes to check out files from the local cache. And importantly, by this time no other git op was running on the machine. Note also that the updated refs are printed shortly after the start, but the "Checked out" message comes 2.5 minutes later. So I think, at least for this machine, local disk IO is very slow.

From https://logs.chromium.org/logs/boringssl/buildbucket/cr-buildbucket.appspot.com/8931208952901614624/+/steps/bot_update/0/stdout

boringssl/util/bot/android_tools/ndk (Elapsed: 0:02:34)
----------------------------------------
[0:05:47] Started.
_____ boringssl/util/bot/android_tools/ndk at e951c37287c7d8cd915bf8d4149fd4a06d808b55
[0:05:48] running "git cat-file -e e951c37287c7d8cd915bf8d4149fd4a06d808b55^{commit}" in "/b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk"
skipping mirror update, it has rev=e951c37287c7d8cd915bf8d4149fd4a06d808b55 already
________ running 'git -c core.deltaBaseCacheLimit=2g clone --no-checkout --progress --shared --verbose /b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk /b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/_gclient_ndk_cIuJVI' in '/b/swarming/w/ir/kitchen-workdir'
[0:05:50] Cloning into '/b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/_gclient_ndk_cIuJVI'...
[0:05:51] done.
________ running 'git -c core.deltaBaseCacheLimit=2g fetch origin --prune --verbose' in '/b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/ndk'
[0:05:51] From /b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk
[0:05:51] = [up to date] master -> origin/master
[0:05:51] = [up to date] agrieve-inline-tweaks -> origin/agrieve-inline-tweaks
[0:05:51] = [up to date] agrieve-inline-tweaks-string -> origin/agrieve-inline-tweaks-string
[0:05:51] = [up to date] next -> origin/next
[0:05:51] = [up to date] r16 -> origin/r16
[0:05:51] = [up to date] unmodified -> origin/unmodified
[0:08:22] Checked out e951c37287c7d8cd915bf8d4149fd4a06d808b55 to a detached HEAD. Before making any commits in this repo, you should use 'git checkout <branch>' to switch to an existing branch or use 'git checkout origin -b <branch>' to create a new branch for your work.
[0:08:22] Finished.
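If someone has shell access to that machine, a quick (hypothetical) way to confirm this is raw disk IO rather than git doing extra work would be to repeat just the steps from the log against the local mirror and time the working-tree checkout, e.g.:

    # Clone from the local cache mirror (no network involved), then time the checkout.
    git clone --no-checkout --shared /b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk /tmp/ndk-io-test
    cd /tmp/ndk-io-test
    time git checkout e951c37287c7d8cd915bf8d4149fd4a06d808b55

(Paths and revision are copied from the log above; the exact cache location and a suitable scratch directory will differ per bot.)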
Nov 5
bpastene@, do you have an idea whether this slow IO is actually expected? I've assumed so far that this checkout is done on the server machine, not the attached phone. Should this bug actually be sent to Labs to look into why the disk is so slow?
Nov 5
> bpastene@ do you have an idea about whether this slow IO is actually expected?

Likely due to the fact that the builder is using multiple swarming bots on the same machine (ie: android docker). I'd check the task history of all 7 bots on build6-b9 to see if there were other builds running at the same time (ie: simultaneous bot_updates competing for IO).

The android-docker setup does not scale well with high host-side resource usage (disk/cpu/mem) since its purpose is to drive tests on the devices themselves. Using the bots for build/compile is not exactly its intended use-case, so issues like these don't surprise me.