Issue 875430

Issue metadata

Status: Duplicate
Merged: issue 834185
Closed: Aug 18
Pri: 3
Type: Bug




SlowTests / [ Slow ] is not a sufficient band-aid for some tests

Project Member Reported by lukasza@chromium.org, Aug 17

Issue description

REPRO:

1. Add the following expectation to SlowTests:

http/tests/fetch/window/thorough/redirect-nocors-base-https-other-https.html [ Slow ]

2. Remove any other expectations for this test
   (e.g. remove the one in FlagExpectations/site-per-process
    that says [ Pass Timeout ]).

3. Schedule some tryjobs
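   For example, assuming the usual depot_tools flow (the builder name is
   the one from the failure described below):

     git cl upload
     git cl try -b linux_chromium_rel_ng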


ACTUAL BEHAVIOR:

On the trybots the test fails flakily: sometimes the site_per_process_webkit_layout_tests step on linux_chromium_rel_ng is green, and sometimes it is red (for an example, see https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/167195).


EXPECTED BEHAVIOR:

If a test doesn't truly time out (i.e. it would pass if given a ginormous timeout), then it should consistently pass, even if it takes a long time.
 
Cc: dpranke@chromium.org

I don't know what the right solution is here:

1. Try not to oversubscribe bots?

2. Increase the timeout boost given by [ Slow ] expectation?

3. Introduce a [ VerySlow ] or [ Slow=7x ] expectation marker? (see the sketch after this list)

4. Fix each and every slow test?
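
For what it's worth, options 2 and 3 boil down to picking a bigger timeout multiplier. A hypothetical sketch of the idea (this is not the actual run_web_tests.py logic; the 6s default timeout and the 5x [ Slow ] boost are assumptions that should be checked against the real runner):

  DEFAULT_TIMEOUT_MS = 6000  # assumed default for release builds
  SLOW_MULTIPLIER = 5        # assumed current boost for [ Slow ]

  def effective_timeout_ms(markers):
      # markers: the expectation tokens for one test, e.g. {'Slow'} today,
      # or a hypothetical {'VerySlow'} / {'Slow=7x'} under option 3.
      multiplier = 1
      for m in markers:
          if m == 'Slow':
              multiplier = max(multiplier, SLOW_MULTIPLIER)
          elif m == 'VerySlow':        # hypothetical marker (option 3)
              multiplier = max(multiplier, 7)
          elif m.startswith('Slow='):  # hypothetical [ Slow=7x ] syntax
              multiplier = max(multiplier, int(m[len('Slow='):].rstrip('x')))
      return DEFAULT_TIMEOUT_MS * multiplier

  # e.g. effective_timeout_ms({'Slow'}) == 30000,
  #      effective_timeout_ms({'Slow=7x'}) == 42000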


Context: I am attempting to drop the [ Timeout Pass ] expectation for tests marked as [ Slow ], unless those tests consistently fail with a timeout. Around 200 expectations in FlagExpectations/site-per-process could be removed this way. Even more context: https://crrev.com/c/1178465 and issue 874695

Cc: estaab@chromium.org robertma@chromium.org nedngu...@google.com
There's a difference between tests that are truly slow, tests that end up being slow because they're run in conjunction with other tests under load, and tests that are buggy.

You should be able to determine whether a test is truly slow simply by running it in isolation (or just looking at it) to see how long it'll take. There's not much excuse for a well-written test that we control to take more than a couple hundred milliseconds to run. So the fix for these is normally to fix (or rewrite) the tests.
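
For example, something like this (a sketch; the flag names are from memory and should be double-checked against run_web_tests.py --help):

  third_party/blink/tools/run_web_tests.py \
      --child-processes=1 --repeat-each=10 \
      http/tests/fetch/window/thorough/redirect-nocors-base-https-other-https.html

Running with --child-processes=1 takes parallelism and machine oversubscription out of the picture, and --repeat-each gives a feel for the variance in the test's runtime.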

We don't really understand the behavior of the system under load; to understand it, someone needs to actually run the tests and watch how the system is performing (i.e., is the CPU or disk maxed out, etc.). It's possible things have gotten slower over time (or we use more memory, or we're not getting the performance we need out of the GCE machines, etc.), and the defaults we used to use (either for the number of tests to run in parallel or the timeout values) are no longer appropriate. Someone would need to spend a bunch of time looking into this aspect of things.
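
A rough starting point (using standard tools, not a vetted methodology) would be to kick off a normal run and watch resource usage alongside it:

  third_party/blink/tools/run_web_tests.py http/tests &
  vmstat 1   # or top / iostat -x 1, to see whether CPU, memory, or disk is saturated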

In the example in your description, the virtual/outofblink-cors-ng tests look like they are truly flaky: often they'll time out, and sometimes they'll complete in ~1s. Between that, and the actual example from your test run (where the shard has a single failure which times out even on retries), that's a buggy test that you need to fix; that's not happening because the bot is oversubscribed or the timeout is too short.



Mergedinto: 834185
Status: Duplicate (was: Untriaged)
Oops - thanks for pointing out that when the test passes, it always passes fast (i.e. the timeouts are not caused by variations in the runtime). I'll try to take a closer look at the test.
