Infra failure on perf.fyi webview bots
Issue description
The go webview bot has been constantly failing with Infra Failure: https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/android-go_webview-perf
The Pixel 2 one has recently started doing the same, e.g.: https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/android-pixel2_webview-perf/778
Any idea why that is? It looks like it may be a timeout: the (huge) logs for the performance_webview_test_suite step show a "Waiting for results from the following shards" message close to the end.
,
Nov 26
Those webview bots have plenty of shards, so it's different. The problem here is that the swarming shard timeout exceeds the LUCI build timeout (7hr iirc). For this bug, I suggest we reduce the swarming shard timeout of the webview bots to 3-4 hours.
,
Nov 26
,
Nov 26
#2: I'm not following -- how is this different? In both cases, the perf tests take too long and wind up timing out the entire build. I'm also not following how your suggestion would help...
,
Nov 26
In the other case, there are very few Windows machines, which suggests the shard timeout is due to a lack of hardware. In this case, iirc, there is no lack of hardware. If the 7hr mark is crossed, it's more likely because some shards have tests that are stuck. If the shard timeout limit is reduced so that the build step itself doesn't time out, we can easily find out which shards exceed the timeout limit and look into them further to debug.
,
Nov 26
Issue 908126 has been merged into this issue.
,
Nov 26
908126 seems related to this. Merged it into this.
,
Nov 28
It appears #c2 and #c5 are accurate. The recipe quickly triggers a bunch of shards (which don't seem to be pending much - so the hardware is indeed not an issue), and then times out waiting for the shards. The individual shard timeout is set to 7h, just as the build, so the recipe has no chance to collect the timed out shards. I'll take this as the current trooper and see if I can update the shard timeout tomorrow.
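To illustrate the point above, here is a minimal sketch of a check that flags any shard whose timeout is not strictly below the build timeout. It assumes a spec layout with an `isolated_scripts` list per builder and a `swarming.hard_timeout` value in seconds; those key names, the helper name, and the example values are assumptions made for illustration, not copied from the real chromium.perf.fyi.json.

```python
# Sketch only: the key names ("isolated_scripts", "swarming", "hard_timeout")
# and the example values below are assumptions for illustration.

BUILD_TIMEOUT_S = 7 * 60 * 60  # the 7h build timeout mentioned in this thread


def shards_exceeding_build_timeout(spec, build_timeout_s=BUILD_TIMEOUT_S):
    """Return (builder, test name, timeout) for every test whose shard
    timeout is not strictly below the build timeout."""
    offenders = []
    for builder, config in spec.items():
        if not isinstance(config, dict):
            continue  # skip any non-builder entries
        for test in config.get("isolated_scripts", []):
            timeout = test.get("swarming", {}).get("hard_timeout")
            if timeout is not None and timeout >= build_timeout_s:
                offenders.append((builder, test.get("name"), timeout))
    return offenders


# Hypothetical spec: one shard timeout equal to the build timeout, one below it.
example_spec = {
    "android-go_webview-perf": {
        "isolated_scripts": [
            {"name": "performance_webview_test_suite",
             "swarming": {"hard_timeout": 25200}},
        ],
    },
    "android-pixel2_webview-perf": {
        "isolated_scripts": [
            {"name": "performance_webview_test_suite",
             "swarming": {"hard_timeout": 14400}},
        ],
    },
}

offenders = shards_exceeding_build_timeout(example_spec)
```

With a shard timeout equal to the build timeout, the build is killed at the same moment the shard would be, so the recipe never gets to report which shard was slow. That is why the comparison uses `>=` rather than `>`.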
,
Nov 30
Ping - this is now the only remaining blocker of issue 817842 :)
,
Dec 5
Sergey, did you have a chance to look at this?
,
Dec 11
Ping
,
Dec 11
Sorry, didn't have a chance to get to it. I'm a trooper again today, will take a look very soon.
,
Dec 11
Found a place where the timeout is configured: https://cs.chromium.org/chromium/src/testing/buildbot/chromium.perf.fyi.json?l=86&rcl=f4728fa7e63e608a59684c6a3cadf43138f961e8 The scary bit is that most of these *.json files are autogenerated from *.pyl specs, but this file apparently isn't.
,
Dec 11
,
Dec 11
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/27287febd7142c4100044813a057b653850e3a8b

commit 27287febd7142c4100044813a057b653850e3a8b
Author: Sergey Berezin <sergeyberezin@google.com>
Date: Tue Dec 11 23:13:40 2018

[chromium.perf.fyi] Update swarming shards timeout to <7h

The main build always runs with a timeout of 7h. Make sure the individual shards always complete sooner, so any failed or timed out shards are correctly indicated on the build UI.

Bug: 907852
Change-Id: I1580ee0a92f371536f1435e56596ee3d8aeb861b
Reviewed-on: https://chromium-review.googlesource.com/c/1372533
Reviewed-by: Ned Nguyen <nednguyen@google.com>
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Sergey Berezin <sergeyberezin@chromium.org>
Cr-Commit-Position: refs/heads/master@{#615714}

[modify] https://crrev.com/27287febd7142c4100044813a057b653850e3a8b/testing/buildbot/chromium.perf.fyi.json
,
Dec 13
Thanks! I think this looks better now.
,
Dec 13
The fix seems to work: https://ci.chromium.org/p/chrome/builders/luci.chrome.ci/android-go_webview-perf/2736 has a shard that timed out and was correctly reported as such in a build step.
,
Dec 21
Sergey, unfortunately your change https://chromium-review.googlesource.com/c/chromium/src/+/1372533 doesn't work because https://cs.chromium.org/chromium/src/testing/buildbot/chromium.perf.fyi.json is actually autogenerated by the src/tools/perf/generate_perf_data script, so the next time that script was run it removed your changes. I committed the removal of that change in https://chromium-review.googlesource.com/c/chromium/src/+/1388663 along with some code that will help prevent similar things from happening in the future. Sorry about this! I'm not sure what should be done now. The timeout tuning can be done in src/tools/perf/core/perf_data_generator.py, and then you can re-run src/tools/perf/generate_perf_data to apply it. But I'm not sure the tuning is as fine-grained as you want.
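For readers unfamiliar with this kind of guard, the sketch below shows the general idea of detecting a hand-edited generated file: render the generator's data in memory and compare it with the checked-in text. It is not the code from the change linked above; `render`, `is_stale`, and the example data are hypothetical names and values chosen for illustration.

```python
import json

# Sketch only: render() and is_stale() are hypothetical helpers, and the
# data below is made up; this is not the real perf data generator.


def render(data):
    """Serialize generator output deterministically, as it would be written to disk."""
    return json.dumps(data, indent=2, sort_keys=True) + "\n"


def is_stale(checked_in_text, generated_data):
    """Return True if the checked-in file differs from what the generator would
    produce, e.g. because someone edited the generated file by hand."""
    return checked_in_text != render(generated_data)


# Hypothetical generator output and a hand-edited copy of the generated file.
generated = {"android-go_webview-perf": {"hard_timeout": 25200}}
hand_edited = render(generated).replace("25200", "21600")

fresh_ok = not is_stale(render(generated), generated)
edit_detected = is_stale(hand_edited, generated)
```

A check of this shape fails as soon as the generated file and its generator disagree, which points the author at the generator as the right place to make the change.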
,
Jan 7
Thanks for finding this out and adding a presubmit check! (I missed it originally because the generator was in an unusual place). I'll look into it later to see what can be done with the timeouts.
,
Jan 7
Comment 1 by jbudorick@chromium.org
, Nov 26