Issue 855255

Starred by 3 users

Issue metadata

Status: Assigned
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocked on:
issue 621126
issue 735176
issue 654500




Simplify webkit_layout_test json results

Project Member Reported by dpranke@chromium.org, Jun 21 2018

Issue description

Currently the webkit_layout_test json results include a bunch of "deprecated" fields that are somewhat confusingly defined.

We should remove all of the old fields and make sure that just the new ones remain and are consistently implemented.
 
Owner: nednguyen@chromium.org
Status: Assigned (was: Untriaged)
I will own this
Here are the fields that we want to remove:
AUDIO
IMAGE
IMAGE+TEXT	
LEAK
MISSING
NEEDSMANUALREBASELINE
NEEDSREBASELINE
REBASELINE
SLOW
TEXT
Here are the fields that were output and uploaded to the test results server recently:
  "num_failures_by_type": {
    "AUDIO": 0,
    "CRASH": 1,
    "FAIL": 0,
    "IMAGE": 1663,
    "IMAGE+TEXT": 56,
    "LEAK": 0,
    "MISSING": 0,
    "NEEDSMANUALREBASELINE": 0,
    "PASS": 66459,
    "REBASELINE": 0,
    "SKIP": 8492,
    "SLOW": 0,
    "TEXT": 618,
    "TIMEOUT": 142,
    "WONTFIX": 1948
  },

https://logs.chromium.org/v/?s=chromium%2Fbuildbucket%2Fcr-buildbucket.appspot.com%2F8942576898457176992%2F%2B%2Fsteps%2Fwebkit_layout_tests_on__none__GPU_on_Mac_on_Mac-10.11.6%2F0%2Flogs%2Fjson.output%2F0
Components: Test
Dirk: that WONTFIX field is not documented in the spec. What should we do about it?
Just to clarify, are we talking about TestExpectations, test results JSON, or both?

For the test results, I'm a bit surprised to see WONTFIX... IIRC WontFix is an expectation (in NeverFixTests), but in the results it should be SKIP.
Just the test results JSON; the test expectations format should be unchanged by this work.

That said, we should update the doc to be accurate.

As robertma@ says, "WONTFIX" shouldn't appear in the results, it should be "SKIP" instead. All of the other layout-test-specific fields like "IMAGE+TEXT" should be "FAIL".
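A minimal sketch of that mapping (illustrative only, not the actual implementation; the thread does not say how the rebaseline-related types would be handled, so they are left out here):

  # Hypothetical collapsing of the layout-test-specific result types when
  # writing num_failures_by_type. The AUDIO/LEAK/MISSING/TEXT -> FAIL entries
  # follow "all of the other layout-test-specific fields should be FAIL".
  SIMPLIFIED_TYPE = {
      'AUDIO': 'FAIL',
      'IMAGE': 'FAIL',
      'IMAGE+TEXT': 'FAIL',
      'LEAK': 'FAIL',
      'MISSING': 'FAIL',
      'TEXT': 'FAIL',
      'WONTFIX': 'SKIP',
  }

  def simplify_counts(num_failures_by_type):
      simplified = {}
      for result_type, count in num_failures_by_type.items():
          new_type = SIMPLIFIED_TYPE.get(result_type, result_type)
          simplified[new_type] = simplified.get(new_type, 0) + count
      return simplified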
Blockedon: 735176
Blockedon: 654500
Blockedon: 621126
Issue 654500 has been merged into this issue.
Dirk: can you clarify whether you want this simplification to be applied to full_results.json only, or do you want it for the HTML results viewer as well?
Definitely the JSON file. I'm not sure if/how this would affect the results viewer, but ideally we'd eventually migrate the results viewer over to relying more on the artifacts format (bit.ly/chromium-test-artifacts) for figuring out which results to display.
Sorry for phrasing the question badly. What I meant is: do we want this simplification for the JSON file only, or do we also want to apply it to the HTML results viewer?

Looking at the code, I find that simplifying the JSON file but not the HTML results viewer would create slightly different models of test expectations in the Python code, adding more complexity to the results system.
Further clarification: Ned is referring to src/third_party/WebKit/LayoutTests/fast/harness/results.html

For that matter, my opinion is that we need to change the viewer whenever we change the JSON.
Cc: atotic@chromium.org
Without looking at the code, I can't say exactly what changes would be needed, but we should try to avoid changing the displayed results (unless of course the changes seem to be improvements).

It may be that we will need to add additional stuff to the .json file in order to do that, but I'd have to see what specifically we need to add.
I am the author of the test viewer, and would love to be kept in the loop on any
test viewer changes.

For full functionality, test viewer needs to know:
- what the expectations were
- what the flag expectations were
- what the results were

I hope we do not remove any of this information from full_results.json.
If it is in the TestExpectations, it should be in full_results.json.

In LayoutNG, we use all of this information to display results not available from Python,
such as "unexpected flakes", "layoutng failures", and "layoutng passes".

> As robertma@ says, "WONTFIX" shouldn't appear in the results, it should be "SKIP" instead.
> All of the other layout-test-specific fields like "IMAGE+TEXT" should be "FAIL".

WONTFIX should be in result.expected because that is what the test suite expects.

The IMAGE+TEXT label is used to display links to image/text errors.

Here is what a current expectation entry looks like:
actual: "IMAGE+TEXT"
bugs: ["crbug.com/591099"]
expected: "FAIL"
flag_expectations: ["FAIL"]
text_mismatch: "general text mismatch"
time: 0.4

I am on vacation 7/18->8/4, and will check in once a day. I would love it if you did not land anything that breaks existing results.html functionality until then.
> WONTFIX should be in result.expected because that is what the test suite expects.

To clarify, I meant we shouldn't have "WONTFIX" in results.actual. i.e. a wontfix test should have expected "WONTFIX" and actual "SKIP".

> The IMAGE+TEXT label is used to display links to image/text errors.

Thanks for the heads up. I wasn't aware of that.

And by "we need to change the viewer" I meant we need to keep the functionality of the viewer intact which may require changing its implementation if necessary as we tweak the JSON format.
Thanks, atotic@, this is the feedback I was looking for / expecting. Now for me to disagree with it :)

> If it is in the TestExpectations, it should be in full_results.json

I don't exactly agree with this, but the devil is in the details, so let me explain further.

I do want us to have all of the information we need to be able to do a full-featured results viewer, but I think we probably need to adjust how we do this so that it's done in a more generic way.

For example, rather than using result codes to tell us what types of results to expect (IMAGE+TEXT), we should just be able to see which types of artifacts *were* produced, and have a single result code of FAILURE. This is where bit.ly/chromium-test-artifacts comes in, and it is something we already more-or-less do today with things like 'has_wdiff' and 'has_repaint_overlay'. Ideally, instead, we'd record the different bits in the 'artifacts' field, and the viewer would just look at what is in the artifacts field to figure out what to display.
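As a purely illustrative sketch (not a final format; the artifact names and paths here are made up), a failing entry could then look something like this, with the viewer deciding what links to show from the 'artifacts' keys instead of from the result code:

  # Hypothetical per-test entry once result codes are collapsed and artifacts
  # are listed explicitly (bit.ly/chromium-test-artifacts). All values are
  # illustrative, not the output of any current tool.
  entry = {
      'actual': 'FAIL',
      'expected': 'PASS',
      'is_unexpected': True,
      'time': 0.4,
      'artifacts': {
          'actual_image': ['foo/bar-actual.png'],
          'expected_image': ['foo/bar-expected.png'],
          'actual_text': ['foo/bar-actual.txt'],
      },
  }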

It's not clear to me if we really need 'WONTFIX' in expected; does the viewer do something different than what it does with expected SKIPs?

We probably also need a better way to indicate how we're handling timeouts / test "sizes". Right now, I think we'll include SLOW in .expected, but it would be nice if we had a different way to indicate a different timeout, so that we didn't have to special-case that value.
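One purely hypothetical shape for that, assuming a per-test field that nothing emits today: record the effective timeout on the entry itself instead of encoding SLOW in 'expected':

  entry = {
      'actual': 'PASS',
      'expected': 'PASS',
      'time': 12.3,     # how long the test actually took
      'timeout': 30.0,  # hypothetical field: the per-test timeout, replacing the SLOW marker
  }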
I see. You are right in that the devil is in the details.

I think having a list of artifacts would be great. Currently, we guess what the artifacts were based on the failure type. If you look for "pathParser.resultLink" inside results.html, you can see how this is done.

Another important part of results.html is showing "unexpected" results.

For this, it needs to know what was expected, and what actually happened.
I'd like to keep the ability to decide "what is unexpected" inside results.html. 
The reason for this is that the idea of "what is unexpected" changes a lot.

It would be nice if all the code that generates results.json were in some well-defined class so it could be tweaked as needed. We add enhancements every once in a while, such as crash log grouping, fancy diffs, etc.

Re "unexpected", I'm inclined to have only one place to decide if a test result is expected, also precisely because the definition may change and conditions might get out of sync in various places (which I think has already happened to some extent).

There's a related issue 837047 about this. I think we should reland https://crrev.com/c/1103611 and change the results viewer to rely on the clarified is_unexpected field. (Dirk, I can help reland that CL if you've been busy recently.)
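A minimal sketch of what relying on that field could look like for any consumer of the JSON (the flat test mapping here is a simplification; the real file nests tests in a trie):

  # Hypothetical filter: trust the per-test is_unexpected flag instead of
  # re-deriving "unexpected" from actual vs. expected in every tool.
  def unexpected_results(tests):
      # 'tests' is assumed to be a flat {test name: result dict} mapping.
      return {name: result for name, result in tests.items()
              if result.get('is_unexpected')}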
> There's a related issue 837047 about this. I think we should reland
> https://crrev.com/c/1103611 and change the results viewer to rely on
> the clarified is_unexpected field. (Dirk, I can help reland that CL
> if you've been busy recently.)

Yup, this. I'll land it soon now that I'm (more) back from vacation.

Reporting the unexpected failures is pretty much the core function of this file, so it's not going away :).
> Re "unexpected", I'm inclined to have only one place to decide if a test result is expected, also precisely because the definition may change and conditions might get out of sync in various places (which I think has already happened to some extent).

I am not sure that a "single definition of unexpected" is desirable. In our viewer, we have the following ways to filter unexpected results:

- unexpected failures
- unexpected passes
- unexpected flakes
- layoutng failures
- layoutng passes

And there might be more unexpected things we'd like to monitor if we find them interesting. Filters can be complex, see:

https://cs.chromium.org/chromium/src/third_party/WebKit/LayoutTests/fast/harness/results.html?q=Results.html&sq=package:chromium&g=0&l=959

I'd prefer to get raw data, with maybe some basic preprocessing: these tests failed, these passed.

Different tools will have different filtering needs.

What I am afraid of is:
- more process to implement new filters or modify existing ones
- if the script that decides what is unexpected is centralized, it will be harder to modify
- results.html will lose the ability to quickly implement the filters we need
- there will be a proliferation of different filters bloating the results

The whole reason for my results.html rewrite was that the existing tools were not filtering correctly. Modifying the Python scripts and my JavaScript concurrently was painful. With filters in Python, I had to run the test suite again every time I wanted to test a filter.

Can't believe you guys are doing this just as I am starting my vacation.

atotic@, I think it's definitely important for us to align with you on the direction of the layout test results system. :-)

I can hold off on this until you come back, and I can find some other project to work on while waiting.
> I can hold off on this until you come back, and I can find some other
> project to work on while waiting.

It would make me feel better; results.html is my pet project.

There is information currently missing from the raw data that would be interesting:
- how each test was tested: as a testharness test, a reftest, or a baseline comparison
- whether it used a flag-specific baseline



How do you implement "layoutng failures"?

As to the three other kinds of unexpected failures, yes, that's why we will have fields to discriminate between them without requiring you to parse the actual result types.

More generally, we need to figure out the right balance between supporting the results viewer and supporting other tools (like FindIt, recipes, Milo, the flakiness dashboard, etc.) that need to parse result types generically. We have way too much variance in how we report results between the different test types and this work is part of reducing that variance.

It feels to me like we can meet both needs relatively easily, so I'm not too worried about this, though.
> How do you implement "layoutng failures"?

Filter.flagFailure() and Filter.flagPass() code at:
https://cs.chromium.org/chromium/src/third_party/WebKit/LayoutTests/fast/harness/results.html?q=results.html&sq=package:chromium&g=0&l=1016

Flag-specific failures are failures whose result differs from base expectations.
Flag-specific passes are passes whose result differs from base expectations.

These filters get a little tricky in dealing with SKIP, MISSING, WONTFIX.

Is that comparing the results of the virtual version of a test with the results of the base version of a test? I was thinking it had something to do with FlagExpectations, which is a different mechanism, right?
> Is that comparing the results of the virtual version of a test with the results of the base version of a test? I was thinking it had something to do with FlagExpectations, which is a different mechanism, right?

The easiest way to describe what happens is to go formal:

We have a set of files F that describes expectations:
F = { 
  TestExpectations, 
  NeverFixTests, 
  W3CImportExpectations, 
  FlagExpectations/enable-blink-features=LayoutNG, 
  ..... 
}

run-webkit-tests computes expectations by merging expectations from all files in F. Let's call these FullExpectations.

run-webkit-tests also computes non-flag expectations by merging expectations from all files in F except the FlagExpectations/* files. Let's call these FlagFreeExpectations.

Unexpected NG results are tests whose result differs from FlagFreeExpectations.

This is different from unexpected results, which are tests whose result differs from FullExpectations.
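A rough sketch of those two definitions in code (the function and variable names are hypothetical; the real merging logic lives in run-webkit-tests / blinkpy):

  FLAG_FILE_PREFIX = 'FlagExpectations/'
  ALL_EXPECTATION_FILES = [
      'TestExpectations',
      'NeverFixTests',
      'W3CImportExpectations',
      'FlagExpectations/enable-blink-features=LayoutNG',
  ]

  def merge_expectations(files):
      # Returns {test name: set of expected result types}. Parsing and the
      # exact override semantics are elided; this is only a sketch.
      merged = {}
      return merged

  full_expectations = merge_expectations(ALL_EXPECTATION_FILES)
  flag_free_expectations = merge_expectations(
      [f for f in ALL_EXPECTATION_FILES if not f.startswith(FLAG_FILE_PREFIX)])

  def differs_from(expectations, test, actual):
      return actual not in expectations.get(test, {'PASS'})

  # Unexpected results:    differs_from(full_expectations, test, actual)
  # Unexpected NG results: differs_from(flag_free_expectations, test, actual)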


Ugh. That makes things more complicated and introduces modes that no other test suite has :(.

It's unfortunate that this functionality was added without wider discussion.

In particular, it's not good that you added fields to the results files that aren't documented in bit.ly/chromium-json-test-results-format, and as a result we've given no thought to how other tools like FindIt might be affected :(.
Cc: xiaoche...@chromium.org
Apologies for making your life harder. If I had known about the docs, I'd have updated them.

The layout-ng team has also added further functionality; this was done by @xiaocheng:
- enhanced text diffs: do files differ only by whitespace, or by NG text?
- crash summary: analyzes crash logs, allowing us to group failures by crash lines.

The addition of these fields has made tracking NG progress much easier:
- NG-specific failures tell us how many tests we need to pass before shipping.
- Text diffs allow us to ignore superficial failures.
- Crash summary allows us to group crashes and see which have the greatest impact.


To be clear: I understand that you want to add functionality that makes your job easier, and the whole point of the test infrastructure is to do that. I am also grateful that you work on this stuff directly.

However, adding things that make one test suite work differently from the others complicates life not just for those who need to build common tooling, but also for devs, sheriffs, etc. who need to understand how lots of different tests work.

So, we need to be careful when making such changes, and it's good to err on the side of communicating widely.

Can you point me at the CLs for the other changes (or the code)? The webkitpy->blinkpy rename seems to have made them hard to find.
I understand. I was not aware that I was operating within a larger framework; I did not know that this code had active owners.

The CL that introduces unexpected NG counts:
https://chromium-review.googlesource.com/c/chromium/src/+/978609

I am unable to locate xiaocheng's commits; he might have better luck.

> I did not know that this code had active owners.

It didn't :) Ned has been looking at taking it on more recently, but wasn't when you were doing the work.  qyearsley@ did at least your code reviews, but hasn't been working on testing-related stuff much lately and would've been out of the loop as well.

Lots of blame to go around, I'm not blaming you specifically (sorry if it sounded like I was).


Owner: nedngu...@google.com
Cc: -nednguyen@chromium.org nedngu...@google.com
Cc: estaab@chromium.org
Owner: ----
It's unlikely that I can take this bug given the little time I have left working on this project.

For documentation purposes: the key challenge in addressing this bug is making sure the test results UI still supports rich reporting of failure reasons while simplifying the test results format.
