Issue 856268

Issue metadata

Status: Fixed
Owner: kbr@chromium.org
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug

Blocked on:
issue 853307




Mac FYI Experimental Retina Release (NVIDIA) on chromium.gpu.fyi infra failure.

Project Member Reported by khushals...@chromium.org, Jun 25 2018

Issue description

The failure started at build #797, with tests failing to run due to "not enough capacity" errors. Starting with build #802, some shards appear to get allocated, but the run still eventually fails with the same error and is not clearly marked as an infra failure. Here is an example of a failing swarming task: https://chromium-swarm.appspot.com/task?id=3e41bf0198940310&refresh=10&show_raw=1.
 

Comment 1 by kbr@chromium.org, Jun 25 2018

Blockedon: 853307
Components: -Infra Internals>GPU>Testing Infra>Client>Chrome
Labels: GPU-NVidia OS-Mac
Owner: khushals...@chromium.org
Status: Assigned (was: Available)
This is the bot in question:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Retina%20Release%20(NVIDIA)

Some issues were seen with this bot last week in Issue 853307.

There's only one bot in the Swarming pool backing this configuration right now, as we have barely enough NVIDIA MacBook Pros (and can't get any more). We should upgrade the Mac bots to 10.13 at this point and allocate two (for redundancy) to the "experimental" configuration, which should be 10.14.

It's not clear whether there was a code change associated with these failures or whether the bot is spontaneously going offline during one of the shards of webgl2_conformance_tests.

Khushal, can you please monitor the current build:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Retina%20Release%20%28NVIDIA%29/818

and ping me if it fails?

Build #818 above has also failed, for the same shard #19. This is consistent across all failures so far.

Comment 3 by kbr@chromium.org, Jun 26 2018

Cc: jbudorick@chromium.org no...@chromium.org mar...@chromium.org
Hmm. Build 818 failed:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Retina%20Release%20%28NVIDIA%29/818

but build 819 passed with no significant (I assume) code changes:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Retina%20Release%20%28NVIDIA%29/819

The max pending time per shard is suspiciously close to 3 hours in both cases. Here's build 818's:
Max pending time: 2:52:56.604581 (shard #18)

and build 819's:
Max pending time: 2:59:17.252003 (shard #19)

Nodir, M-A, John, does the "execution_timeout_secs" timeout in src/infra/config/global/cr-buildbucket.cfg cover the maximum pending time as well as the potential per-shard execution time? Could it be the case that we've been adjusting the wrong timeout all along?

As for the max pending time: we may not be reporting it correctly in cases where a shard expired. Note that shard 19 on build 818 had a pending time of just over 3 hours: https://chromium-swarm.appspot.com/task?id=3e50831274358810
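
To make the "suspiciously close to 3 hours" observation concrete, here is a minimal Python sketch that parses the two reported max pending times and computes how much headroom each had. The 3-hour expiration is an assumption inferred from this discussion, not a value read from the bot's config, and `parse_pending` and `headroom` are hypothetical helper names.

```python
from datetime import timedelta

# Assumption for illustration only: shards expire after 3 hours of pending.
ASSUMED_EXPIRATION = timedelta(hours=3)

# Max pending times as reported in the two build summaries quoted above.
REPORTED = {
    "build 818 (shard #18)": "2:52:56.604581",
    "build 819 (shard #19)": "2:59:17.252003",
}


def parse_pending(text: str) -> timedelta:
    """Parse an 'H:MM:SS.ffffff' duration string into a timedelta."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def headroom(text: str, expiration: timedelta = ASSUMED_EXPIRATION) -> timedelta:
    """Return how much time was left before the assumed expiration."""
    return expiration - parse_pending(text)


if __name__ == "__main__":
    for label, pending in REPORTED.items():
        print(f"{label}: pending {pending}, headroom {headroom(pending)}")
```

Under that assumption, both reported shards started only minutes or seconds before they would have expired, which fits a shard that pends slightly longer never being scheduled at all.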

Comment 6 by kbr@chromium.org, Jun 26 2018

Cc: khushals...@chromium.org
Owner: kbr@chromium.org
Status: Started (was: Assigned)
#4: aha, thanks. Fix incoming.

Comment 7 by bugdroid1@chromium.org, Jun 26 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/a8c3e9c80c478d32a43f0bbe4423ffd43359152f

commit a8c3e9c80c478d32a43f0bbe4423ffd43359152f
Author: Kenneth Russell <kbr@chromium.org>
Date: Tue Jun 26 03:42:26 2018

Increase expiration time on Mac FYI Exp Release (NVIDIA).

webgl2_conformance_tests' shards are taking at least 3 hours to
schedule, so increase the timeout to 6 hours.

Bug: 856268
Tbr: jbudorick@chromium.org
Change-Id: Idf610388955f4b6cb73b81f5075a810c13c2fda0
Reviewed-on: https://chromium-review.googlesource.com/1114356
Reviewed-by: Kenneth Russell <kbr@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#570322}
[modify] https://crrev.com/a8c3e9c80c478d32a43f0bbe4423ffd43359152f/testing/buildbot/chromium.gpu.fyi.json
[modify] https://crrev.com/a8c3e9c80c478d32a43f0bbe4423ffd43359152f/testing/buildbot/waterfalls.pyl
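
For readers unfamiliar with this kind of change, the following is a rough Python sketch of what raising a per-test expiration to 6 hours amounts to. The dict layout, the `expiration` key, the shard count, and the `with_expiration` helper are all assumptions made for illustration; the actual diff is at the links above.

```python
# 6 hours expressed in seconds, matching the timeout named in the commit message.
SIX_HOURS_SECS = 6 * 60 * 60


def with_expiration(test_config: dict, expiration_secs: int) -> dict:
    """Return a copy of test_config with its swarming expiration set.

    The "swarming"/"expiration" key names are assumed for this sketch.
    """
    updated = dict(test_config)
    swarming = dict(updated.get("swarming", {}))
    swarming["expiration"] = expiration_secs
    updated["swarming"] = swarming
    return updated


# Hypothetical test entry; the shard count here is illustrative only.
example = {"swarming": {"shards": 20}}

if __name__ == "__main__":
    print(with_expiration(example, SIX_HOURS_SECS))
```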

Comment 8 by kbr@chromium.org, Jun 26 2018

Status: Fixed (was: Started)
I think this will be reliably fixed by the above change. Please reopen if not.

Status: Assigned (was: Fixed)
Failures have started again on this bot in the last 3 builds. Here is an example:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Mac%20FYI%20Experimental%20Retina%20Release%20%28NVIDIA%29/833

All tests fail with the following error:

Missing results from the following shard(s): 0
This can happen in following cases:
  * Test failed to start (missing *.dll/*.so dependency for example)
  * Test crashed or hung
  * Task expired because there are not enough bots available and are all used
  * Swarming service experienced problems
Please examine logs to figure out what happened.
Looks like the machine got upgraded to 10.13.5, and the test configs need to be retargeted.

Comment 11 by kbr@chromium.org, Jun 28 2018

Cc: ynovikov@chromium.org
Sorry about breaking these bots while upgrading to 10.13.5, but that was fixed in Issue 857527 by ynovikov.

Comment 12 by kbr@chromium.org, Jun 28 2018

Status: Fixed (was: Assigned)
