Issue 872042

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 837855



Expose failed tests at recipe side to simplify and speed up the query used to detect flaky tests

Project Member Reported by liaoyuke@chromium.org, Aug 7

Issue description

Link to the query: https://cs.chromium.org/chromium/infra/appengine/findit/services/flake_detection/flaky_tests.cq_false_rejection.sql?l=1

Currently, the query runs once every hour and always pulls the past 24 hours of data from the cq_attempts, cr-buildbucket and test-results tables. A single run takes about 12 seconds and processes about 384.74 GB of data, which is not cheap.

One idea to improve performance is to implement a resume point that filters out builds that were already processed. Basically, the change is: once we figure out the list of flaky builds within the past 24 hours, we filter out those that are older than 1 hour (builds whose end_time is more than one hour ago were already covered by previous runs), and only proceed with the remaining ones to calculate flaky tests.
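
A rough sketch of the resume-point filter (illustrative only; the table and column names below are made up and are not the ones in the actual query):

  -- Keep only flaky builds that finished within the last hour; builds that
  -- ended earlier were already covered by previous hourly runs.
  WITH recent_flaky_builds AS (
    SELECT build_id, end_time
    FROM flaky_builds_in_past_24_hours  -- output of the existing detection step
    WHERE end_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  )
  SELECT *
  FROM recent_flaky_builds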
 
Status: WontFix (was: Assigned)
After some investigation, it turns out that a resume point won't help in this case. Here is why:

Time and data size for executing the query: 12s and 394.67GB.

After filtering down to the new flaky builds, there is a statement in the query that calculates a failed_tests table (https://cs.chromium.org/chromium/infra/appengine/findit/services/flake_detection/flaky_tests.cq_false_rejection.sql?l=227) before joining with the new flaky builds. It turns out that calculating failed_tests alone needs to process 392.8 GB of data, which dominates the total amount of data the query processes and is the actual bottleneck.
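
To illustrate the shape of the problem (made-up table and column names, not the real query text):

  -- failed_tests reads the full 24-hour window of test results before it is
  -- joined against the (now smaller) list of new flaky builds, so a resume
  -- point on the build list does not reduce the ~392.8 GB scanned here.
  WITH failed_tests AS (
    SELECT build_id, test_name
    FROM test_results_past_24_hours  -- the expensive scan
    WHERE NOT passed
  )
  SELECT f.build_id, f.test_name
  FROM failed_tests f
  JOIN new_flaky_builds b
    ON f.build_id = b.build_id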

I tested resume points locally, and it made no difference, which matches the theory.
What if we let the chromium_trybot recipe surface the list of new test failures in a machine-readable format? That way, we would only need to know that a build is flaky, and we could go to Logdog to read the flaky tests.

In this way, we might not need to use the test-results table at all?
Cc: dpranke@chromium.org jbudorick@chromium.org
In addition, we wouldn't even need to worry about test results not being properly uploaded to the test-results app or table. Hopefully that won't make the recipe more fragile.
This could make our query much simpler too.
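
For example, the remaining query might reduce to something shaped roughly like this (the tables, columns and flakiness condition below are all made up for illustration):

  -- With per-test failures coming from the recipe output, the hourly query
  -- might only need to identify flaky builds, without touching test-results.
  -- A build is treated as flaky here if it failed but an equivalent build for
  -- the same patchset and builder later passed (simplified, not the real logic).
  WITH flaky_builds AS (
    SELECT failed.build_id
    FROM buildbucket_builds_past_24_hours AS failed
    JOIN buildbucket_builds_past_24_hours AS passed
      ON passed.patchset_id = failed.patchset_id
     AND passed.builder = failed.builder
    WHERE failed.status = 'FAILURE'
      AND passed.status = 'SUCCESS'
      AND passed.end_time > failed.end_time
  )
  SELECT DISTINCT build_id
  FROM flaky_builds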

The remaining problem is the latency for the data to show up in Logdog. Maybe we could dump it into GCS instead?

Anyway, optimization is a P2 task for now.
Status: Assigned (was: WontFix)
Sounds like a good idea, but the changes might be non-trivial to make. Agreed it's a P2; I'll reopen this bug to track it.
Summary: Expose failed tests at recipe side to simplify and speed up the query used to detect flaky tests (was: Implement resume point in cq_false_rejection flaky_tests sql query to improve performance)
Cc: liaoyuke@chromium.org
Owner: ----
Status: Available (was: Assigned)
Unassigning myself.
