Expose failed tests at recipe side to simplify and speed up the query used to detect flaky tests
Issue description

Link to the query: https://cs.chromium.org/chromium/infra/appengine/findit/services/flake_detection/flaky_tests.cq_false_rejection.sql?l=1

Currently, the query runs once every hour and always pulls the past 24 hours of data from the cq_attempts, cr-buildbucket and test-results tables. A single run takes about 12 seconds and has to process 384.74 GB of data, which is not cheap.

One idea to improve the performance is to implement a resume point that filters out builds that were already processed. Concretely, once we figure out the list of flaky builds within the past 24 hours, we filter out those whose end_time is more than 1 hour ago (they were already covered by previous runs) and only proceed with the remaining ones to calculate flaky tests.
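To make the resume-point idea concrete, here is a minimal sketch in Python rather than SQL. It assumes each flaky build is a dict with an `end_time` timestamp; the names `filter_unprocessed`, `build_id` and the sample data are hypothetical and not taken from the actual query.

```python
from datetime import datetime, timedelta, timezone


def filter_unprocessed(builds, now, window=timedelta(hours=1)):
    """Keep only builds that ended within the last `window`.

    Builds that ended earlier are assumed to have been handled by a
    previous hourly run, so they are dropped before computing flaky tests.
    """
    cutoff = now - window
    return [b for b in builds if b["end_time"] > cutoff]


# Hypothetical sample data: one old build, one recent build.
now = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
flaky_builds = [
    {"build_id": 1, "end_time": now - timedelta(hours=5)},
    {"build_id": 2, "end_time": now - timedelta(minutes=30)},
]

remaining = filter_unprocessed(flaky_builds, now)
print([b["build_id"] for b in remaining])  # prints [2]
```

Note that this only trims the later stages; the list of flaky builds is still computed over the full 24 hours, as described above.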
,
Aug 9
What if we have the chromium_trybot recipe surface the list of new test failures in a machine-readable format? That way, we just need to know that a build is a flaky build, and we can go to Logdog to read its flaky tests. We might then not need to use the test-results table at all.
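As a rough illustration of what consuming such output could look like, the sketch below parses a JSON payload of new failures. The payload shape (`new_failures`, `step_name`, `tests`) and the function name are purely hypothetical; the recipe does not necessarily emit anything like this today.

```python
import json


def flaky_tests_from_build_output(raw):
    """Return sorted (step_name, test_name) pairs from a hypothetical payload."""
    data = json.loads(raw)
    flaky = set()
    for step in data.get("new_failures", []):
        for test_name in step.get("tests", []):
            flaky.add((step["step_name"], test_name))
    return sorted(flaky)


# Hypothetical payload a flaky build might surface.
sample = json.dumps({
    "new_failures": [
        {"step_name": "browser_tests", "tests": ["Foo.Bar", "Foo.Baz"]},
        {"step_name": "unit_tests", "tests": ["Qux.Quux"]},
    ]
})

for step_name, test_name in flaky_tests_from_build_output(sample):
    print(step_name, test_name)
```

With output like this attached to each build, the detection query would only have to identify flaky builds; the test names would come from the build itself.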
,
Aug 9
In addition, we wouldn't even need to worry about test results not being properly uploaded to the test-results app or table. Hopefully that won't make the recipe more fragile. This could make our query much simpler too. The remaining problem is the latency for the data to show up in Logdog. Maybe we could dump it into GCS instead? Anyway, optimization is a P2 task for now.
,
Aug 9
Sounds like a good idea, but the changes might be non-trivial to make. Agreed that it's a P2; I'll open this bug to track it.
,
Aug 9
,
Dec 21
Unassigning myself.
Comment 1 by liaoyuke@chromium.org
, Aug 7