[Findit] Flake Analyzer - 70% confidence set for 100% -> 99.8% passing
Issue description
,
Aug 15
,
Aug 17
This should in no way have been 70% confidence. Findit just detects a transition from 100% stable to anything flaky and labels it 70% confidence, which worked OK when the flakiness threshold was 98%. However, we now use 99.9999% as the flakiness threshold, so the confidence score needs to be redesigned slightly, possibly with some statistical analysis, to avoid false positives like these and the bugs that get logged for them.
,
Aug 17
Can we make a quick change to get rid of the hard-coded 70% for now?
,
Aug 17
Another case: https://chromium-review.googlesource.com/c/chromium/src/+/1176836#message-38768adb6a9b193e7521e43a38f246da4bf1c7d9
,
Aug 17
Auto actions (bug filing, updating bugs, notifying culprits) are temporarily disabled until a better confidence scoring mechanism is in place. Setting pri back to 1.
,
Aug 22
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra/+/2d86ca27796637f369b47a6fc957e280093681fb

commit 2d86ca27796637f369b47a6fc957e280093681fb
Author: Jeffrey Li <lijeffrey@chromium.org>
Date: Wed Aug 22 18:20:22 2018

[Findit] Flake Analyzer - Implementing statistical analysis for confidence score

Low-flakiness cases can cause a lot of false positives, because statistically the "stable" point preceding the low-flaky point can be a fluke. For example, a test with a 0.9975 pass rate still has a 36.7% chance of passing 400 iterations which will yield a lot of false positives.

1. Use the Wilson Score Confidence Interval to identify a range of likely pass rates that a supposedly flaky test can have.
2. Identify the possible ranges of the "stable" point and "flaky" point, using the "flaky" point's pass rate as the input p value, alpha as 0.001 for 99.9% confidence that the true pass rate is indeed within that interval, and the number of iterations the stable point ran to produce its supposed 100% pass rate.
3. If there is any overlap in the 2 ranges, then there is a statistically significant chance that the culprit is a false positive as the stable point is unreliable.
4. Assign a very low "confidence score" of the analysis for such cases, so the calling code can bail out of performing auto actions.

Note here, "confidence score" still refers to Findit's scoring mechanism on what to do with the culprit, and is not yet the same as "confidence" in pure statistics, though that is where we would like to head.

Bug: 874228
Change-Id: I6d2e8b6ee864a68353c9d449adfd86e7e5dd2ac4
Reviewed-on: https://chromium-review.googlesource.com/1182191
Commit-Queue: Jeffrey Li <lijeffrey@chromium.org>
Reviewed-by: David Tu <dtu@chromium.org>

[add] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/dto/float_range.py
[modify] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/flake_failure/confidence_score_util.py
[modify] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/flake_failure/confidence.py
[modify] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/flake_failure/test/pass_rate_util_test.py
[add] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/math_util.py
[add] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/test/math_util_test.py
[add] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/libs/math/test/statistics_test.py
[modify] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/flake_failure/test/confidence_score_util_test.py
[add] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/libs/math/statistics.py
[modify] https://crrev.com/2d86ca27796637f369b47a6fc957e280093681fb/appengine/findit/services/flake_failure/flake_constants.py
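For readers following along, here is a rough sketch of the check the commit message describes. It is not the code from the commit: the function names (prob_all_pass, wilson_interval, ranges_overlap, is_likely_false_positive) are made up for illustration, it assumes Python 3.8+ for statistics.NormalDist, and it computes each point's interval from that point's own observed pass rate and iteration count, which is only one reading of step 2 above. The 400-iteration counts in the usage note are hypothetical as well.

```python
import math
from statistics import NormalDist


def prob_all_pass(pass_rate, iterations):
    """Chance that a test with this true pass rate passes every iteration."""
    return pass_rate ** iterations


def wilson_interval(pass_rate, iterations, alpha=0.001):
    """Wilson score interval (lower, upper) for an observed pass rate."""
    # Two-sided critical value; alpha=0.001 corresponds to 99.9% confidence.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    z2_n = z * z / iterations
    denom = 1 + z2_n
    center = (pass_rate + z2_n / 2) / denom
    spread = pass_rate * (1 - pass_rate) / iterations + z2_n / (4 * iterations)
    half = z * math.sqrt(spread) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def ranges_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_likely_false_positive(stable_rate, stable_iters,
                             flaky_rate, flaky_iters, alpha=0.001):
    """Assumed rule: overlapping intervals mean the stable point is unreliable."""
    stable = wilson_interval(stable_rate, stable_iters, alpha)
    flaky = wilson_interval(flaky_rate, flaky_iters, alpha)
    return ranges_overlap(stable, flaky)
```

With these definitions, prob_all_pass(0.9975, 400) comes out at roughly 0.367, matching the figure quoted in the commit message. A "100% over 400 iterations" stable point next to a "99.8% over 400 iterations" flaky point is reported as a likely false positive, while the same stable point next to a 50% pass rate is not.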
,
Aug 22
Comment 1 by tandrii@chromium.org
, Aug 14