Flaky GPU test failures to start on Windows Intel bots
Issue description

This seems to be happening pretty regularly. Example flakes will pass 3/4 shards and one shard will time out as the tests are starting. Examples:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3670
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2447
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3631
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3634

The failure mode is the same. Each test shard that fails will output only:
ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_OPENGLES because it is not available.
ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_1_OPENGLES because it is not available.

This could be a hang when ANGLE's tests are checking for Vulkan compatibility on startup. I'll try adding some logging. Putting this on the Chromium issue tracker for higher visibility in case anyone else has any ideas.
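For anyone triaging more of these, the signature described above (a shard whose output contains nothing but the ANGLE initialize errors and the "Skipping tests" notices) could be checked mechanically. The following is only a sketch: `looks_like_startup_hang` is a hypothetical helper, not part of ANGLE or the test harness, and it assumes the shard's raw output has already been fetched as a string.

```python
# Hypothetical helper: flags shard output that contains ONLY the startup
# messages quoted in the issue description and nothing else.
SIGNATURE_PREFIXES = (
    "ERR: initialize(",
    "Skipping tests using configuration",
)


def looks_like_startup_hang(output: str) -> bool:
    """Return True if every non-blank line is one of the known startup messages."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if not lines:
        # Completely empty output is a different symptom; do not flag it here.
        return False
    return all(ln.startswith(SIGNATURE_PREFIXES) for ln in lines)


sample = """\
ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_OPENGLES because it is not available.
ERR: initialize(485): ANGLE Display::initialize error 12289: Intel OpenGL ES drivers are not supported.
Skipping tests using configuration ES3_1_OPENGLES because it is not available.
"""
matches = looks_like_startup_hang(sample)
```

A shard that printed any other line (for example, a test starting) would not match, so this only separates "timed out before the first test" from other timeouts.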
Oct 22
Yeah, after this happened about 5 times in a row, I was unable to reproduce it in a test CL in 6-7 tries. Let's wait for a bit and see if it happens again. It might be something that only happens when there's more load.
Oct 22
Looks like this isn't just end2end tests. Look at this WebGL failure, for example: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2450 No tests were run, browser failed to start 3 times. Maybe the new driver / OS version we've upgraded to has problems?
Oct 22
CC'ing more people. Unless there's a deeper problem with the browser starting reliably on Windows (which is possible), my first guess would be that it's a machine misconfiguration, like the IP KVM dongle becoming detached. In this case though: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2450 for the failing shard: https://chromium-swarm.appspot.com/task?id=40a424330838e910&refresh=10&show_raw=1 I don't see a widespread problem on the affected bot: https://chromium-swarm.appspot.com/bot?id=build144-a9&sort_stats=total%3Adesc but maybe problems like this are being auto-detected and fixed by the Labs team? It's unclear whether the failure mode of angle_end2end_tests is the same. More logging inside that test harness would help. Assigning to this week's pixel wrangler to keep an eye on this. jmadill@, as this week's ANGLE wrangler, your help reporting any more flakes on the ANGLE CQ is appreciated.
Oct 22
Found one more case: https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20Release%20%28Intel%20HD%20630%29/2345 2 cases in 200 builds is not so bad, actually.
Oct 22
OK. That shard's from October 10 (12 days ago) and the bot's healthy at this point: https://chromium-swarm.appspot.com/bot?id=build133-a9&sort_stats=total%3Adesc
Oct 27
Popped up again here: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/3856
Oct 29
Thanks for the report. Here's the failing shard: https://chromium-swarm.appspot.com/task?id=40cc50abc3cf9a10&refresh=10&show_raw=1 It doesn't look like anything's obviously wrong with that bot: https://chromium-swarm.appspot.com/bot?id=build158-a9&sort_stats=total%3Adesc although I didn't dig back far enough in its log to see the run that failed, and whether the auto-reboot that happens after it failed cleared up the failures. Ria: as current pixel wrangler could you please reach out to jmadill@ and see whether you can figure out whether this is a transient failure that was fixed by the bot's rebooting? Try clicking "Show More Tasks" repeatedly on build158-a9's Swarming page. Thanks.
Nov 5
Passing this to this week's pixel wrangler, Wei.
Nov 26
Happened again here: https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/4274
Nov 27
Jamie, could you try to add more logging to ANGLE's tests to help diagnose this? The general pixel wranglers aren't having any success with this problem.
Today
(16 hours ago)
I didn't see this last week, but perhaps I wasn't looking hard enough. Jamie, is this still a problem?
Today
(16 hours ago)
There's a different failure mode now, but it happens all the time:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel/124
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel/122
A bunch more at https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win-angle-rel?limit=200. TBH I'm not sure if this has the same root cause as this issue or is the same as issue 923198.
Today
(15 hours ago)
Doesn't look like issue 923198 to me. There were no capacity problems; bots were assigned to tasks. Looking at https://chromium-swarm.appspot.com/task?id=426b4cae91846b10&refresh=10&show_raw=1, what I see is no output, but:
Created: 1/15/2019, 12:07:09 PM (Eastern Standard Time)
Started: 1/15/2019, 12:25:38 PM (Eastern Standard Time)
Expires: 1/15/2019, 1:07:09 PM (Eastern Standard Time)
Completed: 1/15/2019, 12:46:57 PM (Eastern Standard Time)
Pending Time: 18m 29s
Running Time: 21m 17s
So, the task was doing something for 21 minutes and finished before expiry. However, we don't see any output. maruel@, are there more logs somewhere that could shed more light on what's going on?
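As a quick sanity check on the timings quoted above, the intervals can be recomputed from the timestamps themselves. This is a minimal sketch using only the values from the task page; the format string is an assumption about how those timestamps are laid out, and the wall-clock difference between Started and Completed need not match the reported Running Time to the second.

```python
from datetime import datetime

# Assumed layout of the timestamps as shown on the task page.
FMT = "%m/%d/%Y, %I:%M:%S %p"


def parse(ts: str) -> datetime:
    """Parse a task-page timestamp (time zone suffix already stripped)."""
    return datetime.strptime(ts, FMT)


created = parse("1/15/2019, 12:07:09 PM")
started = parse("1/15/2019, 12:25:38 PM")
expires = parse("1/15/2019, 1:07:09 PM")
completed = parse("1/15/2019, 12:46:57 PM")

pending = started - created    # time spent waiting for a bot
wall = completed - started     # wall-clock time between start and completion
finished_before_expiry = completed < expires

print("pending:", pending)
print("wall-clock run:", wall)
print("finished before expiry:", finished_before_expiry)
```

The pending interval agrees with the reported 18m 29s, and the task did complete before the Expires timestamp, which supports the reading that the bot was busy for roughly 21 minutes without producing output.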
Today
(15 hours ago)
Have you tried running the task locally? See the instructions on the task's page.
Today
(14 hours ago)
I don't have a machine with the same config readily available, but retrying the same task on Swarming succeeds without problems: https://chromium-swarm.appspot.com/task?id=428f64aecc344e10&refresh=10&show_raw=1 Are there any logs of the communication between the swarmed bot and MILO (or whatever launches the swarmed tasks)?
Today
(14 hours ago)
I recommend a combination of 'Debug task' and http://go/swarming-ssh. You may want to try to pinpoint if this only happens to a subset of the bots, albeit both look fine: https://chromium-swarm.appspot.com/bot?id=build118-a9 https://chromium-swarm.appspot.com/bot?id=build159-a9
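One way to check whether the failures cluster on a subset of bots is to tally failing shards per bot ID. The sketch below uses only the bot IDs mentioned in this thread, on the assumption that each one corresponds to a single failing shard; a real check would pull a larger sample from Swarming.

```python
from collections import Counter

# Bot IDs named in this thread, assumed to be one failing shard each.
failing_bots = [
    "build144-a9",
    "build133-a9",
    "build158-a9",
    "build118-a9",
    "build159-a9",
]

counts = Counter(failing_bots)

# Bots that show up more than once would be candidates for a bad machine.
repeat_offenders = [bot for bot, n in counts.items() if n > 1]

for bot, n in counts.most_common():
    print(bot, n)
print("repeat offenders:", repeat_offenders)
```

With this small sample no bot appears more than once, which is consistent with the failures not being tied to one misconfigured machine, though five data points are far too few to rule it out.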
Comment 1 by geoffl...@chromium.org, Oct 22