
Issue 900953


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 3
Type: Bug




BoringSSL Android bots are a little flaky

Project Member Reported by davidben@chromium.org, Nov 1

Issue description

Labels: Infra-Troopers
(It occurs to me I don't know when things should go to Infra>Labs vs Infra-Troopers, so I may have filed this wrong.)
Components: -Infra>Labs Infra
FYI - the canonical way to file trooper bugs is go/bug-a-trooper - or g.co/bugatrooper (public link).
Labels: -Infra-Troopers Foundation-Troopers
The failures are all due to unexplained swarming bot deaths, handing off to foundation trooper.
Cc: bpastene@chromium.org
Labels: -Foundation-Troopers
Owner: gbeaty@chromium.org
Status: Started (was: Untriaged)
Some of these failures are marked BOT_DIED while another is marked TIMED_OUT, but I believe they are all timeouts: the execution timeout is set to 30m and these builds all take 32m.
Oh hrm. I assume these are an overall timeout?

Poking around, those bots are on the slower end. I don't think I'd noticed that before. The SSL tests take about 16 minutes, which is half the budget. Maybe it's time for us to toy with swarming and the like. I also notice a lot of variance in bot_update. This one took 9 minutes.

https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_arm/b8931208952901614624

At a glance, it's because the rather hefty android_ndk repo takes a while to download. Other runs have very fast bot_updates:

https://ci.chromium.org/p/boringssl/builders/luci.boringssl.ci/android_aarch64/b8931764302397220736

Is this an issue of some cache not having been populated yet?

I remember some talk of using CIPD for the NDK. Would that solve this? I assume that repo is huge because it contains binaries for lots of NDK versions in the history.
Cc: mar...@chromium.org
Owner: tandrii@chromium.org
Chromium's bots still fetch the NDK via git. (The SDK is all fetched via CIPD.) I'm not aware of any active efforts to move it to CIPD.

Additionally, the longer bot_update durations could be due to the fact that (IIUC) a swarming bot clobbers all its caches when a bot_death occurs (+MA for confirmation). I'd suggest increasing the timeout to see if that both avoids bot_deaths and increases cache persistence (which should reduce bot_update durations).

These builders aren't owned by troopers, so over to tandrii who's in the OWNERS for those bots:
https://boringssl.googlesource.com/boringssl/+/infra/config/OWNERS
Yes, BOT_DIED kills the cache. It's a safety measure, in case something messed up hard.
Owner: davidben@chromium.org
Status: Assigned (was: Started)
davidben@ the right solution is to increase the timeout in your repo's cr-buildbucket.cfg file from 30 minutes to, say, 40 minutes.

And if you want this to run faster, yeah, isolating and potentially running tests in parallel is also a good idea.
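For illustration, here is a minimal sketch of the headroom arithmetic behind that suggestion. It uses only the numbers quoted in this thread (30m current timeout, 32m observed build time, 40m proposed timeout); the helper names are hypothetical, and the conversion to seconds assumes the config expresses the timeout in seconds.

```python
# Numbers quoted in this thread; the helper names are made up for illustration.
OLD_TIMEOUT_MIN = 30      # current execution timeout
NEW_TIMEOUT_MIN = 40      # proposed execution timeout
OBSERVED_BUILD_MIN = 32   # how long the failing builds took


def headroom_minutes(timeout_min, observed_min):
    """Minutes left over (negative means the build would time out)."""
    return timeout_min - observed_min


def to_seconds(minutes):
    """Convert minutes to seconds (assumes the config wants seconds)."""
    return minutes * 60


for timeout in (OLD_TIMEOUT_MIN, NEW_TIMEOUT_MIN):
    slack = headroom_minutes(timeout, OBSERVED_BUILD_MIN)
    verdict = "fits" if slack >= 0 else "times out"
    print(f"{timeout}m ({to_seconds(timeout)}s): {verdict}, headroom {slack}m")
```

With these inputs the 30m budget is 2 minutes short, while 40m leaves 8 minutes of slack for slow bot_update runs.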
Cc: tandrii@chromium.org
Components: -Infra Infra>Client
Project Member

Comment 11 by bugdroid1@chromium.org, Nov 5

Labels: merge-merged-config
The following revision refers to this bug:
  https://boringssl.googlesource.com/boringssl/+/219759563fb6a837041458b0ae79554cd6bda872

commit 219759563fb6a837041458b0ae79554cd6bda872
Author: David Benjamin <davidben@google.com>
Date: Mon Nov 05 22:27:24 2018

Increase the timeout for Android bots.

Those take a while to run.

Bug: chromium:900953
Change-Id: I52530d36588bd28cff7f974a9e16faef5c93e5e6
Reviewed-on: https://boringssl-review.googlesource.com/c/32847
Reviewed-by: Adam Langley <agl@google.com>
Commit-Queue: Adam Langley <agl@google.com>
CQ-Verified: CQ bot account: commit-bot@chromium.org <commit-bot@chromium.org>

[modify] https://crrev.com/219759563fb6a837041458b0ae79554cd6bda872/cr-buildbucket.cfg

Actually, something else is weird here: it appears "gclient sync" takes ~3 minutes to check out files from the local cache.
And importantly, at that time no other git op was running on the machine.

Note also that the updated refs are printed shortly after the start, but the "Checked out" message comes 2.5 minutes later.
So, I think at least for this machine, local disk IO is very slow.


From https://logs.chromium.org/logs/boringssl/buildbucket/cr-buildbucket.appspot.com/8931208952901614624/+/steps/bot_update/0/stdout

boringssl/util/bot/android_tools/ndk (Elapsed: 0:02:34)
----------------------------------------
[0:05:47] Started.
_____ boringssl/util/bot/android_tools/ndk at e951c37287c7d8cd915bf8d4149fd4a06d808b55
[0:05:48] running "git cat-file -e e951c37287c7d8cd915bf8d4149fd4a06d808b55^{commit}" in "/b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk"
skipping mirror update, it has rev=e951c37287c7d8cd915bf8d4149fd4a06d808b55 already

________ running 'git -c core.deltaBaseCacheLimit=2g clone --no-checkout --progress --shared --verbose /b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk /b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/_gclient_ndk_cIuJVI' in '/b/swarming/w/ir/kitchen-workdir'
[0:05:50] Cloning into '/b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/_gclient_ndk_cIuJVI'...
[0:05:51] done.

________ running 'git -c core.deltaBaseCacheLimit=2g fetch origin --prune --verbose' in '/b/swarming/w/ir/kitchen-workdir/boringssl/util/bot/android_tools/ndk'
[0:05:51] From /b/swarming/w/ir/cache/git/chromium.googlesource.com-android_ndk
[0:05:51]  = [up to date]        master                -> origin/master
[0:05:51]  = [up to date]        agrieve-inline-tweaks -> origin/agrieve-inline-tweaks
[0:05:51]  = [up to date]        agrieve-inline-tweaks-string -> origin/agrieve-inline-tweaks-string
[0:05:51]  = [up to date]        next                  -> origin/next
[0:05:51]  = [up to date]        r16                   -> origin/r16
[0:05:51]  = [up to date]        unmodified            -> origin/unmodified
[0:08:22] Checked out e951c37287c7d8cd915bf8d4149fd4a06d808b55 to a detached HEAD. Before making any commits
in this repo, you should use 'git checkout <branch>' to switch to
an existing branch or use 'git checkout origin -b <branch>' to
create a new branch for your work.
[0:08:22] Finished.
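To make the gap in the log above easier to spot, here is a rough sketch that parses the `[H:MM:SS]` prefixes and reports the largest jump between consecutive stamped lines. The function names are hypothetical and the sample is a handful of lines copied from the log above, not the full output.

```python
import re
from datetime import timedelta

# Matches the elapsed-time prefix gclient prints, e.g. "[0:05:51]".
STAMP = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\]")


def parse_stamps(log_text):
    """Return (elapsed, line) pairs for lines that start with [H:MM:SS]."""
    stamped = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            hours, minutes, seconds = (int(g) for g in match.groups())
            elapsed = timedelta(hours=hours, minutes=minutes, seconds=seconds)
            stamped.append((elapsed, line))
    return stamped


def largest_gap(stamped):
    """Return (gap, earlier_line, later_line) for the biggest jump."""
    best = None
    for (t0, line0), (t1, line1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, line0, line1)
    return best


# A few lines copied from the bot_update log above.
SAMPLE = """\
[0:05:47] Started.
[0:05:51] done.
[0:05:51]  = [up to date]        master                -> origin/master
[0:08:22] Checked out e951c37287c7d8cd915bf8d4149fd4a06d808b55 to a detached HEAD. Before making any commits
[0:08:22] Finished.
"""

gap, before, after = largest_gap(parse_stamps(SAMPLE))
print(gap)  # 0:02:31
```

On this sample the largest gap sits between the last fetch line and the "Checked out" line, matching the roughly 2.5 minutes noted above.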
bpastene@ do you have an idea about whether this slow IO is actually expected?
I assumed so far that this checkout is done on the server machine, not the attached phone.

Should this bug actually be sent to Labs to look into why the disk is so slow?
> bpastene@ do you have an idea about whether this slow IO is actually expected?

Likely due to the fact that the builder is using multiple swarming bots on the same machine (i.e. android docker). I'd check the task history of all 7 bots on build6-b9 to see whether other builds ran at the same time (i.e. simultaneous bot_updates competing for IO).

The android-docker setup does not scale well with high host-side resource usage (disk/cpu/mem) since its purpose is to drive tests on the devices themselves. Using the bots for build/compile is not exactly its intended use-case, so issues like these don't surprise me.
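As a rough sketch of the task-history check suggested above: given start and end times for tasks on bots that share a host, list the pairs whose runs overlapped. The `Task` record, the bot names, and the times are all made up for illustration; real data would have to come from the swarming task history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    bot: str         # hypothetical bot name
    start_min: int   # start time, minutes from an arbitrary origin
    end_min: int     # end time, same scale


def overlapping_pairs(tasks):
    """Return (bot_a, bot_b) for tasks on different bots that overlap in time."""
    pairs = []
    ordered = sorted(tasks, key=lambda t: t.start_min)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start_min >= first.end_min:
                break  # sorted by start, so nothing later overlaps `first`
            if first.bot != second.bot:
                pairs.append((first.bot, second.bot))
    return pairs


# Made-up sample data, not taken from any real bot.
SAMPLE_TASKS = [
    Task("bot-1", 0, 9),
    Task("bot-2", 5, 14),
    Task("bot-3", 20, 25),
]

print(overlapping_pairs(SAMPLE_TASKS))  # [('bot-1', 'bot-2')]
```

Any pair reported this way is only a candidate for IO contention; it shows the runs coincided, not that they actually competed for disk.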
