
Issue 893358

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Experiment to understand stability of separate Swarming tasks

Reported by dtu@chromium.org (project member), Oct 8

Issue description

Background
Roberto's (robertocn@) experiment results [1] show that separate Swarming tasks could reproduce a portion of low flakiness cases, but not all. We want to expand the scope of the experiment so we have data on high flakiness cases. We also want to test additional revisions so we know if the results are stable across revisions. To compensate for the increased cost, we can run fewer repeats at each revision.

Questions we want to answer
* Can separate Swarming tasks reproduce all high-flake cases?
* What proportion of existing flake analyses are statistically independent (that is, the variation between revisions can be explained by random chance)?
* Are separate Swarming tasks more statistically independent than existing flake analyses?

Experimental design
Take a simple random sample of 20 FindIt Flake analyses where a culprit was identified and the culprit wasn't adding a new test. Also take a simple random sample of 20 FindIt Flake analyses from Roberto's experiment results [2]. For each analysis:
1. Pull the isolate hash for every revision and run the test in 20 separate Swarming tasks with --gtest_repeat=1.
2. Compare the results of the first and final revisions to see if the regression reproduced, using Fisher's exact test (see the sketch after this list).
3. Plot the results overlaid onto the original analysis results to visually see if there's any difference in stability.
4. Manually divide the revisions into two groups, "before culprit" and "after culprit". For each group, determine if the results are statistically independent using a chi-square test for independence. Do this for both the FindIt results and the experimental results, and compare the resulting p-values.
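
To make steps 2 and 4 concrete, here is a minimal sketch of the two statistical comparisons, assuming scipy is available and that per-revision pass/fail counts have already been extracted from the Swarming results; the counts below are made-up placeholders, not real data.

# Minimal sketch (assuming scipy) of the comparisons in steps 2 and 4.
from scipy.stats import chi2_contingency, fisher_exact

# Step 2: did the regression reproduce between the first and final revisions?
# 2x2 table: rows = revision, columns = (passes, failures) out of 20 tasks.
first_rev = (19, 1)
final_rev = (11, 9)
_, p_repro = fisher_exact([first_rev, final_rev])
print("Fisher's exact p-value (first vs. final revision):", p_repro)

# Step 4: within one group ("before culprit" or "after culprit"), are the
# per-revision results consistent with a single underlying failure rate?
# Rows = revisions in the group, columns = (passes, failures).
before_culprit = [
    (20, 0),
    (19, 1),
    (18, 2),
]
chi2, p_group, dof, expected = chi2_contingency(before_culprit)
print("Chi-square p-value (before-culprit group):", p_group)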


[1] https://robertocn.users.x20web.corp.google.com/www/independent_tasks/report.html
[2] https://docs.google.com/spreadsheets/d/1hcI29Q8c12e1njMF0HVF_ERHLzB5xwISJQhoHpbFSfU/edit
 
Another perspective to consider:
* For the correct culprits identified by the current analyses, will separate Swarming tasks still be able to identify them?
* For the wrong culprits identified by the current analyses, will separate Swarming tasks avoid flagging them as culprits, either by 1) ignoring them or 2) identifying the correct culprits instead?
* For flakes where the current analyses identified no culprit, will separate Swarming tasks be able to identify one?

The first two are more important, while the last one is less important for now.
For this experiment, for the 20 past analyses where FindIt found a culprit that wasn't the addition of a new test, do we care whether the result was correct or not? I think it's better to have a portion that are known correct, to ensure the experiment can still identify those, and another portion that are confirmed incorrect, to see whether the experiment would have yielded a different result.
Sounds good. From both comment 1 and comment 2, it sounds like it's more useful to select 20 known-correct and 20 known-incorrect FindIt results. Then we can better answer the question of how separate Swarming tasks perform in each of those cases.
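
A minimal sketch of that selection step, assuming the known-correct and known-incorrect analyses have already been collected into two lists; the list names and contents here are hypothetical placeholders, not a real FindIt API:

import random

# Placeholder inputs; in practice these would be lists of analysis
# identifiers collected from FindIt (hypothetical names).
known_correct_analyses = ['analysis-%d' % i for i in range(200)]
known_incorrect_analyses = ['analysis-%d' % i for i in range(200, 300)]

random.seed(0)  # fixed seed so the selection is reproducible
known_correct_sample = random.sample(known_correct_analyses, 20)
known_incorrect_sample = random.sample(known_incorrect_analyses, 20)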

What's the easiest way to collect known correct and known incorrect FindIt analyses?
Re #3:

This has all the identified culprits with reverts (true_positives tab) and without reverts (false_positives tab):
https://docs.google.com/spreadsheets/d/1OfFeqrxKEUrf0D5T7rvDEe86CG55OziuZxFphGel69Q/edit#gid=31741883

For false positives, there is a culprit_urlsafe_key column. The values are keys for this model: https://cs.chromium.org/chromium/infra/appengine/findit/model/flake/analysis/flake_culprit.py
The FlakeCulprit model has a list of keys to MasterFlakeAnalysis: https://cs.chromium.org/chromium/infra/appengine/findit/model/flake/analysis/master_flake_analysis.py
MasterFlakeAnalysis has all the info about a flake analysis: data points, revisions, etc.

For true positives, I don't have the culprit_urlsafe_key, but if you compare the git hash against the FlakeCulprit model, you will get everything just like the false positives above.
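
A minimal sketch of that lookup, assuming it runs inside the FindIt App Engine environment (for example via the remote API shell); the property names flake_analysis_urlsafe_keys and revision are assumptions about the models, not confirmed from the code:

from google.appengine.ext import ndb

# Module path taken from the code search links above.
from model.flake.analysis.flake_culprit import FlakeCulprit


def analyses_for_culprit(culprit_urlsafe_key):
  """Returns the MasterFlakeAnalysis entities linked to one FlakeCulprit."""
  culprit = ndb.Key(urlsafe=culprit_urlsafe_key).get()
  analysis_keys = [
      # Assumed property name holding the list of analysis keys.
      ndb.Key(urlsafe=k) for k in culprit.flake_analysis_urlsafe_keys]
  return ndb.get_multi(analysis_keys)


def culprits_for_revision(git_hash):
  """For true-positive rows with no urlsafe key: look up by git hash."""
  # Assumed property name for the culprit's git revision.
  return FlakeCulprit.query(FlakeCulprit.revision == git_hash).fetch()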

--------------------
We may want to do a manual sanity check on the true and false positives to make sure the classification is 100% correct.
You should only use analyses from the past 1.5 months: isolates expire in 60 days, so 45 days should leave a safe margin for the experiment.
Is there a way to execute code in FindIt to query the models?

Also, is there newer data? I hesitate to just take the first 20, since some of them may be related analyses (e.g. initiated from the same waterfall build).
