SlowTests / [ Slow ] is not a sufficient band-aid for some tests
Issue description
Comment 1 by lukasza@chromium.org, Aug 17
REPRO:
1. Add the following expectation to SlowTests:
http/tests/fetch/window/thorough/redirect-nocors-base-https-other-https.html [ Slow ]
2. Remove any other expectations for this test
(e.g. remove the one in FlagExpectations/site-per-process
that says [ Pass Timeout ])
3. Schedule some tryjobs
ACTUAL BEHAVIOR:
On the trybots the test fails flakily (sometimes the site_per_process_webkit_layout_tests step on linux_chromium_rel_ng is green and sometimes it is red - for example see https://ci.chromium.org/p/chromium/builders/luci.chromium.try/linux_chromium_rel_ng/167195)
EXPECTED BEHAVIOR:
If a test doesn't truly time out (i.e. it would pass if given a ginormous timeout), then it should consistently pass, even if it takes a long time.
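(A note on step 3 above: instead of scheduling tryjobs, something like the following should reproduce the same configuration locally. The runner path and flags are my assumption for a current Chromium checkout, not taken from this report:

third_party/blink/tools/run_web_tests.py --additional-driver-flag=--site-per-process --repeat-each=10 http/tests/fetch/window/thorough/redirect-nocors-base-https-other-https.html

Since [ Slow ] in SlowTests only relaxes the timeout, repeated local runs should make any residual flakiness visible without a bot.)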
Comment 2, Aug 17
There's a difference between tests that are truly slow, tests that end up being slow because they're run in conjunction with other tests under load, and tests that are buggy.

You should be able to determine if a test is truly slow by simply running it in isolation (or just looking at it) to see how long it'll take. There's not much excuse for a well-written test that we control to take more than a couple hundred milliseconds to run. So, the fix for these is normally to fix (or rewrite) the tests.

We don't really understand the behavior of the system under load; in order to do this, someone needs to actually run the tests and watch how the system is performing (i.e., is the CPU or disk maxed out, etc.). It's possible things have gotten slower over time (or we use more memory, or we're not getting the performance we need out of the GCE machines, etc.), and the defaults we used to use (either for the # of tests to run in parallel or the timeout values to use) are no longer appropriate. Someone would need to spend a bunch of time looking into this aspect of things.

In the example in your description, the virtual/outofblink-cors-ng tests look like they are truly flaky: often they'll time out, and sometimes they'll complete in ~1s. Between that and the actual example from your test run (where the shard has a single failure which times out even on retries), that's a buggy test that you need to fix; it's not happening because the bot is oversubscribed or the timeout is too short.
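(A minimal sketch of the "run it in isolation" check described above; the runner path, flags, and repetition count here are assumptions, not something from this thread:

# Minimal sketch: time N isolated runs of a single web test to see
# whether its runtime is stable. Assumes the working directory is a
# Chromium checkout root; runner path and flags are assumptions.
import subprocess
import time

TEST = ("http/tests/fetch/window/thorough/"
        "redirect-nocors-base-https-other-https.html")
RUNNER = "third_party/blink/tools/run_web_tests.py"

for i in range(10):
    start = time.time()
    # --child-processes=1 keeps the run isolated from parallel load.
    proc = subprocess.run([RUNNER, "--child-processes=1", TEST],
                          capture_output=True)
    print("run %2d: %5.1fs (exit code %d)"
          % (i, time.time() - start, proc.returncode))

A bimodal result, with most runs finishing in about a second and a few hanging until the timeout, points at the buggy-test bucket; uniformly long runs would point at a genuinely slow test.)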
Comment 3, Aug 18
Oops - thanks for pointing out that when the test passes it always passes fast (i.e. the timeouts are not caused by variations in the runtime). I'll try to take a closer look at the test.