
Issue 897424 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug




Flaky GPU test failures to start on Windows Intel bots

Project Member Reported by jmad...@chromium.org, Oct 20

Issue description

This seems to be happening pretty regularly. In a typical flake, 3 of 4 shards pass and one shard times out as the tests are starting. Examples:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3670
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2447
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3631
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3634

The failure mode is the same. Each test shard that fails will output only:

ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_OPENGLES because it is not available.
ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_1_OPENGLES because it is not available.

This could be a hang when ANGLE's tests are checking for Vulkan compatibility on startup. I'll try adding some logging. Putting on the Chromium issue tracker for higher visibility in case anyone else has any ideas.
 
Not much to go on here. I agree that we need more logging; we should probably log each config as it's checked.
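The per-config logging suggested above could follow a pattern like the sketch below. This is a Python stand-in (ANGLE's test harness is C++), and every name in it is hypothetical: each config probe runs on a worker thread with a timeout and logs before and after, so a startup hang pinpoints exactly which config check stalled.

```python
import logging
import threading

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def check_config(name):
    # Hypothetical stand-in for the real availability probe; here we just
    # pretend the OpenGL ES configs are unavailable, as on the Intel bots.
    return name not in ("ES3_OPENGLES", "ES3_1_OPENGLES")

def probe_with_timeout(name, timeout_s=30):
    """Run one config check on a worker thread, logging before and after,
    so a hang identifies the stalled probe instead of producing no output."""
    result = {}

    def worker():
        result["ok"] = check_config(name)

    t = threading.Thread(target=worker, daemon=True)
    logging.info("checking configuration %s ...", name)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        logging.error("config check for %s timed out after %ds", name, timeout_s)
        return False
    logging.info("configuration %s available: %s", name, result["ok"])
    return result["ok"]

for cfg in ("ES2_D3D11", "ES3_OPENGLES", "ES3_1_OPENGLES", "ES2_VULKAN"):
    probe_with_timeout(cfg)
```

With logging like this, a shard that hangs during startup would at least show the last "checking configuration ..." line before going silent.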
Labels: -Pri-1 Pri-2
Status: Untriaged (was: Available)
Yeah, after this happened about 5 times in a row, I was unable to reproduce it in a test CL in about 6-7 tries. Let's wait a bit and see if it happens again. It might be something that only happens when there's more load.
Status: Available (was: Untriaged)
Cc: kbr@chromium.org
Components: -Internals>GPU>ANGLE Internals>GPU>Testing
Labels: -Pri-2 Pri-1
Summary: Flaky GPU test failures on Windows Intel bots (was: angle_end2end_tests timing out flakily on Windows Intel GPU bots)
Looks like this isn't just end2end tests.
Look at this WebGL failure, for example:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2450

No tests were run; the browser failed to start 3 times.

Maybe the new driver / OS version we've upgraded to has problems?
Summary: Flaky GPU test failures to start on Windows Intel bots (was: Flaky GPU test failures on Windows Intel bots)
Components: Tests>Telemetry Internals>Core
Owner: ccameron@chromium.org
CC'ing more people.

Unless there's a deeper problem with the browser starting reliably on Windows (which is possible), my first guess would be that it's a machine misconfiguration, like the IP KVM dongle becoming detached. In this case though:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2450

for the failing shard:

https://chromium-swarm.appspot.com/task?id=40a424330838e910&refresh=10&show_raw=1

I don't see a widespread problem on the affected bot:

https://chromium-swarm.appspot.com/bot?id=build144-a9&sort_stats=total%3Adesc

but maybe problems like this are being auto-detected and fixed by the Labs team?

It's unclear whether the failure mode of angle_end2end_tests is the same. More logging inside that test harness would help.

Assigning to this week's pixel wrangler to keep an eye on this. jmadill@, as this week's ANGLE wrangler, your help reporting any more flakes on the ANGLE CQ is appreciated.

Status: Assigned (was: Available)
Labels: -Pri-1 Pri-2
Found one more case:
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2345

2 cases in 200 builds is not so bad, actually.
OK. That shard's from October 10 (12 days ago) and the bot's healthy at this point:

https://chromium-swarm.appspot.com/bot?id=build133-a9&sort_stats=total%3Adesc

Owner: riajiang@chromium.org
Thanks for the report. Here's the failing shard:
https://chromium-swarm.appspot.com/task?id=40cc50abc3cf9a10&refresh=10&show_raw=1

It doesn't look like anything's obviously wrong with that bot:
https://chromium-swarm.appspot.com/bot?id=build158-a9&sort_stats=total%3Adesc

although I didn't dig back far enough in its log to find the run that failed, or to see whether the auto-reboot that happens after a failure cleared things up.

Ria: as current pixel wrangler could you please reach out to jmadill@ and see whether you can figure out whether this is a transient failure that was fixed by the bot's rebooting? Try clicking "Show More Tasks" repeatedly on build158-a9's Swarming page. Thanks.

Owner: weiliangc@chromium.org
Passing this to this week's pixel wrangler, Wei.
Owner: jmad...@chromium.org
Jamie, could you try to add more logging to ANGLE's tests to help diagnose this? The general pixel wranglers aren't having any success with this problem.

Comment 16 by benhenry@google.com, Jan 16 (6 days ago)

Components: Test>Telemetry

Comment 17 by benhenry@google.com, Jan 16 (6 days ago)

Components: -Tests>Telemetry

Comment 18 by senorblanco@chromium.org, Today (16 hours ago)

I didn't see this last week, but perhaps I wasn't looking hard enough. Jamie, is this still a problem?

Comment 19 by jmadill@google.com, Today (16 hours ago)

There's a different failure mode now but it happens all the time.

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel/124
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel/122

A bunch more at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel?limit=200 . TBH I'm not sure whether this has the same root cause, or whether it's the same as issue 923198.

Comment 20 by ynovikov@chromium.org, Today (15 hours ago)

Components: Infra>Platform>Swarming
Owner: mar...@chromium.org
Doesn't look like issue 923198 to me.
No capacity problems, bots were assigned to tasks.

Looking at https://chromium-swarm.appspot.com/task?id=426b4cae91846b10&refresh=10&show_raw=1, what I see is no output, but:
Created	1/15/2019, 12:07:09 PM (Eastern Standard Time)
Started	1/15/2019, 12:25:38 PM (Eastern Standard Time)
Expires	1/15/2019, 1:07:09 PM (Eastern Standard Time)
Completed	1/15/2019, 12:46:57 PM (Eastern Standard Time)
Pending Time	18m 29s
Running Time	21m 17s

So, the task was doing something for 21 minutes and finished before expiry. However, we don't see any output.
maruel@, are there more logs somewhere that could shed more light on what's going on?
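The timing arithmetic above can be checked from the timestamps on the task page. A short Python sketch (timestamps copied from the task; Swarming's own "Running Time" of 21m 17s is slightly shorter than completed minus started, presumably because it excludes some setup/teardown overhead):

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"
created = datetime.strptime("1/15/2019 12:07:09 PM", FMT)
started = datetime.strptime("1/15/2019 12:25:38 PM", FMT)
completed = datetime.strptime("1/15/2019 12:46:57 PM", FMT)

pending = started - created    # time waiting for a bot to pick up the task
elapsed = completed - started  # wall-clock time on the bot

print("pending:", pending)   # 0:18:29, matching the reported 18m 29s
print("on bot:", elapsed)    # 0:21:19, close to the reported 21m 17s
```

So the bot was indeed assigned and ran for about 21 minutes, finishing well before the 1:07:09 PM expiry, yet produced no output.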

Comment 21 by maruel@google.com, Today (15 hours ago)

Cc: -ynovikov@chromium.org mar...@chromium.org
Owner: ynovikov@chromium.org
Have you tried running the task locally? See the instructions on the task's page.

Comment 22 by maruel@google.com, Today (15 hours ago)

Components: -Infra>Platform>Swarming Infra>Platform>Swarming>Admin

Comment 23 by ynovikov@chromium.org, Today (14 hours ago)

Owner: jmad...@chromium.org
I don't have a machine with the same config readily available, but retrying the same task on Swarming succeeds without problems.
https://chromium-swarm.appspot.com/task?id=428f64aecc344e10&refresh=10&show_raw=1

Are there any logs of communication between the swarmed bot and MILO (or whatever is the thing that launches the swarmed tasks)?

Comment 24 by maruel@google.com, Today (14 hours ago)

I recommend a combination of 'Debug task' and http://go/swarming-ssh. You may want to try to pinpoint if this only happens to a subset of the bots, albeit both look fine:
https://chromium-swarm.appspot.com/bot?id=build118-a9
https://chromium-swarm.appspot.com/bot?id=build159-a9
