
Issue 788031 link

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 882097




Crashes or assertion failures in ANGLE gtest based tests cause test steps to time out

Project Member Reported by kainino@chromium.org, Nov 23 2017

Issue description

In a recent dry run, I introduced a change which caused many (hundreds or thousands) of test crashes, in the form of ASSERT failures, across many test suites. In this single dry run, jobs timed out across most of the try bots, indirectly causing an overload of a Windows AMD bot.
https://chromium-review.googlesource.com/c/angle/angle/+/776278

This seems to be because the test execution time becomes much longer when a crash occurs - apparently because the test executable has to start up many more times (some of these test executables have quite long startup times).

There should be a limit on the number of crashes that can occur in a test suite before it gives up and stops executing.
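As a rough illustration of the kind of limit proposed above, here is a minimal Python sketch. It is not the launcher's actual code: `run_with_crash_budget`, `run_one`, and the default budget of 20 are hypothetical names and values chosen only for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed budget for illustration only; not a value taken from any real harness.
MAX_CRASHES = 20


def run_with_crash_budget(
    tests: Iterable[str],
    run_one: Callable[[str], str],
    max_crashes: int = MAX_CRASHES,
) -> Tuple[Dict[str, str], List[str]]:
    """Run tests in order and give up once max_crashes crashes were seen.

    run_one(test) is a hypothetical callback returning "pass", "fail"
    or "crash". Returns (results so far, tests that were never run).
    """
    pending = list(tests)
    results: Dict[str, str] = {}
    crashes = 0
    for index, test in enumerate(pending):
        outcome = run_one(test)
        results[test] = outcome
        if outcome == "crash":
            crashes += 1
            if crashes >= max_crashes:
                # Budget exhausted: report the rest as skipped instead of
                # paying a process restart for every remaining test.
                return results, pending[index + 1:]
    return results, []
```

With a budget like this, a change that crashes every test would fail the step after a bounded number of restarts rather than running until the step times out.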

Excessive numbers of failures were seen on these bots, under these tests:

https://ci.chromium.org/buildbot/tryserver.chromium.angle/linux_angle_deqp_rel_ng/1588
* angle_deqp_gles2_gl_tests on NVIDIA GPU on Linux (with patch) on Ubuntu
* angle_deqp_gles3_gl_tests on NVIDIA GPU on Linux (with patch) on Ubuntu
* angle_deqp_gles31_gl_tests on NVIDIA GPU on Linux (with patch) on Ubuntu

https://ci.chromium.org/buildbot/tryserver.chromium.angle/win_angle_deqp_rel_ng/1564
* angle_deqp_egl_tests on NVIDIA GPU on Windows (with patch) on Windows-2008ServerR2-SP1
* angle_deqp_gles2_d3d11_tests on NVIDIA GPU on Windows (with patch) on Windows-2008ServerR2-SP1
* angle_deqp_gles31_d3d11_tests on NVIDIA GPU on Windows (with patch) on Windows-2008ServerR2-SP1
* probably others but there was a recipe engine bug and some logs were not captured - maybe due to too much output?

https://ci.chromium.org/buildbot/tryserver.chromium.angle/linux_angle_rel_ng/7521
* webgl2_conformance_gl_passthrough_tests on NVIDIA GPU on Linux (with patch) on Ubuntu
* webgl2_conformance_tests on NVIDIA GPU on Linux (with patch) on Ubuntu
and to some extent:
* context_lost_tests on NVIDIA GPU on Linux (with patch) on Ubuntu
* pixel_test on NVIDIA GPU on Linux (with patch) on Ubuntu

https://ci.chromium.org/buildbot/tryserver.chromium.angle/android_angle_rel_ng/5723
* webgl_conformance_tests (with patch) on Android

https://ci.chromium.org/buildbot/tryserver.chromium.angle/mac_angle_rel_ng/7624
* webgl2_conformance_tests on Intel GPU on Mac (with patch) on Mac-10.12.6

(If these links expire, it should be trivially reproducible by adding an ASSERT(false) somewhere in ANGLE's shader translator. Be careful, though!)
 
P.S. jmadill: I am pretty sure that the issue is with the harness, and not to do with how ANGLE implements assertion failures. Digging through the code, it seems like Chrome should probably be bottoming out into asm("int3") in this case, which I think is similar in function to what ANGLE does.
Components: Internals>GPU>ANGLE
Labels: -Pri-3 Pri-2
Kai, sorry, one question - you are aware that ANGLE does not use asm("int3") to trigger ASSERT? We use an invalid instruction (*nullptr) or __builtin_trap. Chrome uses a variety of methods, but the ones that are most interesting are __debugbreak on Windows and asm("int3") on posix, which is more like debug break than an invalid instruction.

Also to confirm, you understand the asserts were firing in ANGLE?

I'd like to move to what Chromium does. We've also had complaints from a developer who manages crash reports that ANGLE ASSERTs don't work as nicely as Chrome's DCHECK does in DCHECK-enabled Chrome.

I think rather than changing the test harness we should probably just fix ANGLE (probably a 10 line CL).
For reference: https://github.com/scottt/debugbreak
I understand those things, and that the two assert mechanisms are functionally different, but I'm pretty sure that switching it will not fix the issue that we saw. Maybe I'm wrong, but it looks like the timeouts were just due to the huge number of failures, and huge number of harness restarts.
Kai, a single ASSERT failure will trigger the end2end_tests to time out with ANGLE ASSERTs. Would you be interested in doing an experiment of fixing the ASSERTs (which we also need for better crash reporting) at least on Windows, injecting an ASSERT failure that only triggers on angle_end2end_tests, then seeing if their behaviour is any better? You could be right, but I am worried about changing the test harness when it could be an easy ANGLE fix.

One way to trigger the ASSERT might be in D3D11's line loop emulation, only with a certain type of index buffer.
Sorry, that wasn't clear. A single ASSERT failure in angle_end2end_tests will cause the whole test suite to time out (take an hour+) because of some strange test harness behaviour. It would seem odd if that also happened in Chromium test suites.

Comment 7 by kbr@chromium.org, Nov 23 2017

It's not clear that a single ANGLE assertion failure affecting a single test in angle_end2end_tests would cause the entire suite to time out on Linux.

It seemed from Kai's tryjob that it was causing every test to crash.

The main question in my mind is why any of the harnesses -- the gtest-based ones for angle_end2end_tests, etc., as well as the Telemetry based tests -- allowed more than ~20 tests to run before bailing out with "too many failures". If that was working as expected then the tests should have failed, but not timed out.

In Kai's test, the ASSERT was failing everywhere. But any single ASSERT anywhere in ANGLE will trigger the same behaviour - his was especially problematic because it affected every platform. I'll do the tests myself with the Line Loop test, we can replicate it on Linux if you still are skeptical.

Comment 9 by kbr@chromium.org, Nov 23 2017

Appreciate your confirming this, in particular on Linux – I am skeptical.

Cc: lucferron@chromium.org ynovikov@chromium.org
Labels: -Pri-2 Pri-1
Owner: kbr@chromium.org
Status: Assigned (was: Available)
Summary: Crashs or assertion failures in ANGLE gtest based tests cause test steps to time out (was: Make GPU bots robust to huge numbers of test failures)
Examples of this are happening all over the place continually. I think we need some help from the right people in getting this fixed. See for instance:

https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/Win10%20FYI%20dEQP%20Release%20%28NVIDIA%29/2829

https://chromium-swarm.appspot.com/task?id=3cf5a8585aa3b010&refresh=10&show_raw=1

Here the job is timing out because of a few crashes (unclear how many)

Also see:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/win_angle_rel_ng/373

Here the angle_end2end_tests on ATI GPU step was timing out because of one or two crashes:

https://chromium-swarm.appspot.com/user/task/3ceec458848e4f10

It seems that after a single crash, the test harness switches into a mode where it starts a new instance for every single subsequent test, which massively slows down the test step (from 5-10 minutes to 1hr+).
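To make the slowdown concrete, here is a back-of-the-envelope cost model in Python. It is only a sketch: `estimated_step_ms` is a hypothetical helper, and the numbers in the usage note are made up for illustration, not measured on any bot. The single assumption is that each process launch pays a fixed startup cost.

```python
def estimated_step_ms(num_tests: int, startup_ms: int,
                      per_test_ms: int, batch_size: int) -> int:
    """Estimate step time when tests run in batches of batch_size.

    Each batch is one process launch, so the startup cost is paid once
    per batch; the per-test cost is paid once per test.
    """
    launches = -(-num_tests // batch_size)  # ceiling division
    return launches * startup_ms + num_tests * per_test_ms


# Illustrative inputs only: 5000 tests, 1 s startup, 50 ms per test.
batched = estimated_step_ms(5000, 1000, 50, batch_size=500)
one_at_a_time = estimated_step_ms(5000, 1000, 50, batch_size=1)
```

With these made-up inputs, `batched` is 260000 ms (a little over 4 minutes) and `one_at_a_time` is 5250000 ms (87.5 minutes), so the startup cost dominates once every test gets its own process.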

The exact same behaviour happens on Linux:

https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_angle_rel_ng/506

We could really use some help from the maintainers of the code to fix this. Ken could you help find the right labels/owners to increase visibility?


Comment 11 by kbr@chromium.org, Apr 19 2018

Cc: jbudorick@chromium.org
Owner: dpranke@chromium.org
Dirk, who owns base/test/launcher at this point?

Summary: Crashes or assertion failures in ANGLE gtest based tests cause test steps to time out (was: Crashs or assertion failures in ANGLE gtest based tests cause test steps to time out)
Realistically, no one, not that the code is hard to hack on. Someone in Ops probably should own it, but we don't have anyone at the moment, only open heads.
Cc: dpranke@chromium.org
Owner: ----
Status: Available (was: Assigned)
I'm not the right person to own this bug, as I'm not likely to fix it in the near future.
Owner: kerz@chromium.org
Status: Assigned (was: Available)
This is a p1, and so needs an owner.  Assigning to kerz to triage and find an owner if dpranke can't find one.
Cc: nednguyen@chromium.org estaab@chromium.org
Owner: yihongg@chromium.org
Things have changed a bit since my comments in #c13 and #c14, we have a team in ops now chartered to take this sort of stuff on.

They're still coming up to speed and stuff in //base/test/launcher is lower priority than other things, but on the other hand, this is a really easy change.

yihongg/nednguyen/estaab - one of you want to be (or find) an owner for this?
Cc: -lucferron@chromium.org fjhenigman@chromium.org
Blockedon: 882097
I've done some digging into a particular run.
This run exceeded "the limit of failed EXPECT/ASSERT entries in the xml and JSON outputs per test"   (Quote from --help.)
When this happens the following appears in the xml results file:
   <x-test-result-part type="failure" file="<unknown>" line="0">
That's not valid xml because of the angle brackets in <unknown>.
That means have_test_results == false here:
https://cs.chromium.org/chromium/src/base/test/launcher/unit_test_launcher.cc?rcl=6e2e384b482bd32c5bc0b037184e9e6311796673&l=289
That leads to the loop at line 405 where it puts ALL the tests (not just failures) on a list to run one at a time.
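The well-formedness problem can be reproduced outside the launcher. The following Python sketch uses only the standard library; the closing tag appended to the quoted line and the helper name `is_well_formed` are additions for this example, and the escaping shown is a generic approach, not necessarily what the CL does.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The element as it appeared in the results file (closing tag added here
# so the snippet is a complete document).
bad = ('<x-test-result-part type="failure" file="<unknown>" line="0">'
       '</x-test-result-part>')

# Same element with the attribute value escaped; quoteattr also adds the
# surrounding quotes.
good = ('<x-test-result-part type="failure" file=%s line="0">'
        '</x-test-result-part>' % quoteattr("<unknown>"))

print(is_well_formed(bad))   # a raw '<' is not allowed in an attribute value
print(is_well_formed(good))
print(ET.fromstring(good).get("file"))
```

A parser that rejects the whole results file over one unescaped attribute is consistent with the launcher concluding it has no test results at all.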

I made a CL for the bad xml:
https://chromium-review.googlesource.com/c/chromium/src/+/1214696

It's also possible to work around this by passing the flag
--test-launcher-test-part-results-limit=-1
That disables the EXPECT/ASSERT limit so we never get the bad xml.

Frank, good work on fixing issue 882097. Is there more work to be done here? Do the failures still trigger retries?
Owner: ----
I ran a test with >10 assert fails and it looks good on Windows and Linux.  Nothing about bad xml or retries.  The output contained all the assert fails - I thought it might be limited to 10 but no.  I think that's ok?

Android still does retries - I don't know if the launcher is different or just the flags - Yuly?
Owner: fjhenigman@chromium.org
Some things work differently in Android, but I'm not sure which ones in this case. Maybe ask jbudorick@ or agrieve@
Cc: agrieve@chromium.org
agrieve: to make a long story short, we want to run a test with no retry after failure. This works now on Linux and Windows, but I see stuff like this on Android. Any idea where it comes from? Thanks.

I   81.347s Main  FINISHED TRY #2/3
I   81.347s Main  2 failed tests remain.
I   81.347s Main  STARTING TRY #3/3

That is from this job:
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/android_angle_vk64_rel_ng/1698

I think there's still something bad happening on Windows. It's better since Geoff split angle_end2end_tests into 4 shards, so there's less chance of timeouts, but any test failure will somehow make the test suite start running one test at a time. Sorry, I will try to follow up later with more links, but I didn't want to forget to mention this.
It could be that happens when there's a crash.  I don't think it will happen just because a gtest assert fails.
