From discussion here: https://github.com/w3c/web-platform-tests/pull/4378
Rather than falling back to brittle text comparison, maybe it would be better if we could mark specific sub-tests as expected to fail? Dunno.
Pros:
- May reduce flakiness in cases where the test output text can change while the actual test results don't.
Cons:
- Changing the way we do things would take time and possibly increase complexity.
From domenic: "+1 on not using console output comparison, but instead comparing results from the testharness directly. I've had severe flakiness problems in the past with importing WPTs into Chrome due to that."
How Mozilla does it: https://developer.mozilla.org/en-US/docs/Mozilla/QA/web-platform-tests#Metadata_files
Expectations are kept in a metadata directory, with one expectation file per test, with all different platform-specific expectations in that file.
Given the way that Blink layout test baselines work, it's probably easier for us to follow the layout test convention: expectations are kept in baseline files next to the tests, or in a platform-specific baseline directory.
One possible option would be to change our -expected.txt for testharness tests to -expected.json or similar. Although, I'm not convinced that it's worth it now.
Would it work, as a short-term hack, to post-process the output (even with a simple grep) so that it contains ONLY the FAIL lines for the sub-tests that failed? That way the only changes that would trigger a diff are changes to the names of failing sub-tests, or additions/removals in the set of failing sub-tests. It looks like the Mozilla infrastructure would fail in exactly those cases too.
We could just add a testRunner.setDumpConsoleMessages(false) to testharnessreport.js if the test path matches the imported WPT test directory. We would then need to rebaseline all the -expected.txt files to remove console logging.
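A minimal sketch of what that check in testharnessreport.js might look like (the directory names and the exact check are assumptions on my part; testRunner is only available under content_shell):

// Sketch only: suppress console message dumping for imported WPT tests.
// Assumes the imported suite lives under a path containing 'imported/wpt'
// (or 'external/wpt') and that we are running under content_shell.
if (self.testRunner &&
    /\/(imported|external)\/wpt\//.test(location.pathname)) {
  testRunner.setDumpConsoleMessages(false);
}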
I have this change here, along with an unrelated change that unclamps setTimeout: https://codereview.chromium.org/2587963002/
Just gone on paternity leave so won't be able to commit for a month if anyone wants to take this further.
Disabling console messages for imported wpt tests sounds good to me. I don't think the way of detecting whether a test is an imported WPT test, as implemented in that CL, will work when wptserve is enabled though: at least currently, in that case the server's root is the imported/wpt directory, so the script won't see that directory in its own path.
Does it sound like a bad idea to disable console messages for *all* testharness.js tests?
If this were done, then that would also solve a couple of other little problems:
- Different/more console messages when testharness.js is updated in auto-import (see related bug 685854)
- Flakiness in non-wpt testharness.js tests whose console messages may contain non-deterministic strings (e.g. bug 685778)
The main disadvantage would be that people writing tests could have error messages that they're not aware of; the main advantage would be that it would reduce flakiness and make import simpler.
Sounds like a reasonable default to me. testharness.js tests that explicitly do want to test console output can still enable console messages. I'm not sure how many testharness.js-based tests with -expected.txt files we currently have, and therefore how much effort it would be to go through them to figure out whether they have meaningful console output they're trying to verify.
Made a CL to try it out: https://codereview.chromium.org/2658093003
After making that change and rebaselining, the change involved:
- 173 changed -expected.txt files
- 55 deleted -expected.txt files
- 115 console warning lines removed
- 13157 console error lines removed (many of them very repetitive)
tkent@ brought up a point in that CL:
> Console messages are helpful to find test errors. Actually, testharness.js shows some console messages for wrong usage of the harness, and it's difficult to find such wrong usage without console messages.
My guess is that error messages are probably very helpful when developing tests, but maybe less helpful if nobody is looking at them.
I think the main concern about matching results based on console messages in testharness output was that it may increase flakiness -- it's more "brittle". Maybe the better policy is to still have a default of printing console messages, and if in some particular case we don't want console messages (because they are non-deterministic, for example) then we can disable these messages by adding `testRunner.setDumpConsoleMessages(false)` at the top of the test, with an explanation.
A quick thought: One thing that would solve our problem is if baseline text matching of testharness.js test results was handled differently than other text baselines -- if there was specific logic to look only at PASS/FAIL and test name, and ignore everything else, then we could keep the console error messages, but still allow test results to match even if some text is different.
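To illustrate the idea (a hypothetical sketch, not existing tooling): the comparison step could normalize both the actual output and the baseline down to status plus subtest name before diffing, along these lines:

// Hypothetical normalization for comparing testharness.js results as text.
// Keeps only the status and (approximate) subtest name; drops failure
// messages, console lines and everything else. Real tooling would want the
// structured results rather than this rough text heuristic.
function normalizeTestharnessOutput(text) {
  return text.split('\n')
      .map(line => line.match(/^(PASS|FAIL|TIMEOUT|NOTRUN)\b.*/))
      .filter(m => m !== null)
      .map(m => m[0].split(':')[0])  // crude: cut at the first ':' to drop messages
      .join('\n');
}
// Two results would then match if
// normalizeTestharnessOutput(actual) === normalizeTestharnessOutput(expected).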
The problem here was console messages with unreliable timing, whereas most console messages have reliable timing and are useful for debugging failures and for verifying that deprecation messages in fact are logged, as tkent@ pointed out in https://codereview.chromium.org/2658093003#msg6.
In https://github.com/w3c/web-platform-tests/pull/4378 it seems to me that the problem is that it's not clear whether the promise is resolved or rejected. If that were clear, presumably nobody would object to requiring the expected behavior or at least failing on the unexpected behavior?
Do we suspect we will have cases where the spec is very clear, but the timing is non-deterministic, and we won't want to tinker with the tests to avoid the unhandled promise rejection console message?
In https://codereview.chromium.org/2677763002/ a related issue came up -- how do we triage failing subtests and link them to bugs?
We use TestExpectations to link failures to bugs in other cases, could we perhaps do something like this?
https://crbug.com/1234 external/wpt/a/test.html#Test title [ Failure ]
> Do we suspect we will have cases where the spec is very clear, but the timing is non-deterministic, and we won't want to tinker with the tests to avoid the unhandled promise rejection console message?
In the streams tests, there is a lot of nondeterminism (in a clear spec). The tests are written so that there are no unhandled rejections when they pass, but for a while, we were failing several of them as we lagged the spec. This issue meant that it was impossible to include test expectations for those tests, because the position of the unhandled rejection message kept bouncing around, so no expected.txt that we committed would ever consistently match. Thus we were unable to use the imported WPTs, and had to maintain our own in-tree copy with the failing tests commented out.
In general it just seems very problematic that we're using WPTs but our tooling is verifying things that are not verified in other engines or as part of the official runner. That can only lead to extra, spurious failures.
OK, so clearly https://github.com/w3c/web-platform-tests/pull/4378 was not an isolated case then.
To simply omit console output would probably make it harder to understand why something has failed on the bots, so that's not great. Tracking sub-tests only with something like #12 and having no -expected.txt files at all seems feasible, if in the case of failure one could still see the whole output as we currently see it. qyearsley@, any hunches about how tricky that might be?
My two cents:
Console output is very valuable, and essential in debugging. But it's lousy for detecting pass/fail cases. So we shouldn't touch it while working on pass/fail detection.
If we can extend the approach from TestExpectations, where a single entry tells you the name of the test that's failing and the bug that's filed for getting it fixed, that would seem like the lowest possible overhead route to me.
If it's possible to output console messages (and other text besides subtest names and PASS/FAIL) to stderr, that sounds like it would be a good solution!
The idea of extending TestExpectations to allow for subtest-level-granularity sounds good, although my feeling now is that may not be simple to implement.
I share your suspicion. Because one can only know which sub-tests exist by running the test, removing entries from TestExpectations for sub-tests that no longer exist seems trickier than when we track only files.
That would be pretty straightforward, but it would presumably leave console messages around that no longer exist, so that run-webkit-tests --reset-results seems to cause unrelated changes? Maybe that kind of confusion is better than what we have now, though?
foolip@: You wouldn't need the `-expected.txt` file if we stopped treating console warnings/errors as failing the tests. I guess I don't understand where the "unrelated changes" would come from in that scenario.
For tests that have some failing subtest, a -expected.txt file is still the only concrete proposal for handling that. We could omit console output entirely, which might be fine, but if it is included yet ignored, then presumably we would land changes that change the console output but don't update the -expected.txt files. The next person to touch it would see some unrelated changes.
That bit is probably not too hard, at worst a matter of parsing the existing output, but how to mark a subtest as failing if we don't have the -expected.txt files? I had a suggestion in https://bugs.chromium.org/p/chromium/issues/detail?id=679742#c12 but think it'd be frustrating having to mint the URL by hand when ignoring subtests.
In https://codereview.chromium.org/2678043002, I filtered all of the test output to remove console log messages and push them onto stderr. Currently it only applies to WPT, since there are tests outside WPT that are deliberately looking for console output.
If there are tests outside of WPT that also want to ignore console output we have two choices:
1) Add testRunner.setDumpConsoleMessages(false); at the start of the test.
2) Adapt the console filtering code from my change to detect a string, written to the console by the test, that says not to write console messages to the test output (see the sketch below).
The advantage of (2) is that you still get the console messages for debugging purposes.
A third approach might be to turn on console filtering for all tests and selectively disable it for tests like https://cs.chromium.org/chromium/src/third_party/WebKit/LayoutTests/plugins/window-open.html which need the console output. It might be a lot of effort to determine which tests to label in this way.
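For option (2), the test-side part could be as simple as logging an agreed-upon marker string; the marker text below is made up for illustration, and the filtering code would need to recognize it:

// Hypothetical opt-out for option (2): a test whose console output should
// not be compared against the baseline logs a sentinel string early on.
// The filtering code would then route this test's console messages to
// stderr instead of the text dump.
console.log('TEST-IGNORES-CONSOLE-OUTPUT');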
I'd like to revive this conversation --
I believe the above change (hiding console log messages for imported WPT tests) reduced flakiness for web-platform-tests by removing some text that shouldn't affect whether the test passes or fails, but I think it didn't go far enough yet.
In other contexts where wpt tests are run (e.g. wptrunner and Mozilla's metadata files), neither the message that goes with a failure nor the order of the subtests is an essential part of the result.
Ideally, in order to reduce flakiness, we'd want to ignore failure text completely, but this would make it harder to debug failures. Following the example in the above change, we could print failure messages to stderr, and remove them from stdout.
For example, the result might look like:
PASS a-subtest
FAIL b-subtest
FAIL c-subtest
and the stderr may contain:
FAIL c-subtest: another detailed error message
FAIL b-subtest: detailed error message, potentially containing times or generated strings
CONSOLE LOG: Some logged console message
When debugging, one would check the stderr, but when running tests and comparing against baselines, only the essential information would be checked.
How does that sound?
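For illustration, a rough sketch of how testharnessreport.js could produce the two streams, in terms of testharness.js's completion callback (how the detailed part would actually reach stderr is hand-waved here):

// Rough sketch: split testharness results into essential output (compared
// against the baseline) and detailed output (for debugging only).
add_completion_callback(function(tests, harnessStatus) {
  var statusNames = ['PASS', 'FAIL', 'TIMEOUT', 'NOTRUN'];
  var essential = [];
  var details = [];
  tests.forEach(function(t) {
    var status = statusNames[t.status];
    essential.push(status + ' ' + t.name);
    if (t.status !== 0)  // anything other than PASS gets a detailed line
      details.push(status + ' ' + t.name + ': ' + t.message);
  });
  // essential.join('\n') would become the text matched against -expected.txt;
  // details.join('\n') and console messages would only go to stderr.
});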
That sounds tentatively good, but since it's never affected me personally, do you have specific examples of flaky tests that such a change would help?
I've been pretty happy with things since console output has been removed, BTW, so kudos for that :).
#30 SGTM. I know how to find the stderr when looking at results from the bots, but some instruction about how to see it when running tests locally would also be good. I could never figure out if it's possible except by looking at the HTML report.
If the change affects only external/wpt, then https://chromium.googlesource.com/chromium/src/+/master/docs/testing/web_platform_tests.md would be a good place to explain the behavior and how to debug.
Cc: -qyears...@chromium.org Owner: qyears...@chromium.org Status: Started (was: Available) Summary: Marking WPT sub-tests as expected failing (without failing due to minor message string differences) (was: Consider some mechanism for marking WPT sub-tests as expected failing)
Summary: Mark WPT sub-tests as expected failing, without being sensitive to minor message string diffs (was: Marking WPT sub-tests as expected failing (without failing due to minor message string differences))
So, after considering stripping error messages from testharness output for wpt, I'm not sure if that's the right thing to do since it might make debugging more difficult.
Also, the number of tests that have nondeterministic error message strings is fairly small. In particular, the tests with such problems are currently:
Bug 705490 external/wpt/fetch/api/request/...
Bug 679742 external/wpt/mediacapture-streams/...
Bug 711493 external/wpt/XMLHttpRequest/...
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.
Sorry for the inconvenience if the bug really should have been left as Available.
For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Cc: robertma@chromium.org Status: Available (was: Untriaged)
This would still be useful. As it is, the subtest order and failure messages must be stable, which is an extra constraint that doesn't exist in upstream wpt or in Gecko's infrastructure. This created trouble for me today in issue 888470.