Historical note:
That's a fairly rare circumstance, but it's starting to happen more and more as binaries become deterministic/reproducible across OSes, thanks in large part to our migration to clang/LLVM everywhere.
Scenario:
- Multiple Swarming tasks with the exact same internal TaskSlice hash are triggered within a short period of time.
Actual:
- They all run, because there's no TaskResultSummary with state COMPLETED yet.
Expected:
- The duplicates wait for the first one to complete and are then deduplicated from its result (sketched below).
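To make the desired behavior concrete, here is a minimal sketch of the scheduling-time decision, assuming hypothetical names (State, decide, the existing_* parameters); this is not Swarming's actual API, only an illustration of when a newly triggered task would wait instead of running.

import enum

class State(enum.Enum):
    # Hypothetical subset of task states, for illustration only.
    PENDING = 1
    RUNNING = 2
    COMPLETED = 3

def decide(existing_state, existing_failed):
    """What to do with a newly triggered task whose properties hash matches
    an existing task in existing_state (None if there is no such task)."""
    if existing_state is None:
        return 'enqueue'               # no prior task: run normally
    if existing_state is State.COMPLETED and not existing_failed:
        return 'dedupe'                # current behavior: reuse the result
    if existing_state in (State.PENDING, State.RUNNING):
        return 'wait_for_primary'      # proposed behavior: park the duplicate
    return 'enqueue'                   # prior task failed: run normally

# Example: decide(State.RUNNING, False) == 'wait_for_primary'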
Pro:
- Saves duplicate work in a scenario that can happen relatively often, e.g. a recent commit is still being tested while try jobs run concurrently.
Drawback:
- In the failure case, this increases user-visible latency, since a waiting duplicate only starts running once the primary task has already run and failed.
Implementation:
- This would require either a new state, PENDING_DEDUPE, or a way for the task to have its TaskToRun.queue_number set to None while staying ready to be enqueued once the dedupe_from task is done. This is tricky: when the "primary" task completes, it now needs to do N additional transactions, one for each pending task that was waiting on its result, either marking the task DUPED if the "primary" task succeeded, or setting its TaskToRun.queue_number so it gets enqueued.
This creates a new situation where TaskToRun.queue_number is None yet TaskResultSummary.state is PENDING. This complicates expiration handling and would challenge some assumptions in the code base.
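Below is a rough sketch of that completion fan-out, again with hypothetical names (PENDING_DEDUPE, Task, WAITERS, next_queue_number); it only illustrates the N per-waiter updates, not how TaskToRun/TaskResultSummary are actually stored or how the transactions would be batched.

import enum
from dataclasses import dataclass
from typing import Dict, List, Optional

class State(enum.Enum):
    PENDING = 1
    PENDING_DEDUPE = 2   # proposed: parked behind an identical "primary" task
    RUNNING = 3
    COMPLETED = 4
    DEDUPED = 5

@dataclass
class Task:
    task_id: str
    properties_hash: str
    state: State = State.PENDING
    queue_number: Optional[int] = None   # None while PENDING_DEDUPE
    deduped_from: Optional[str] = None

TASKS: Dict[str, Task] = {}
# properties_hash -> ids of tasks parked in PENDING_DEDUPE behind the primary.
WAITERS: Dict[str, List[str]] = {}

def on_primary_complete(primary, succeeded, next_queue_number):
    # Resolve every task that was waiting on the primary. This is the costly
    # part: conceptually one extra transaction per waiter, either deduping it
    # or re-enqueuing it so it actually runs.
    for task_id in WAITERS.pop(primary.properties_hash, []):
        waiter = TASKS[task_id]
        if succeeded:
            waiter.state = State.DEDUPED
            waiter.deduped_from = primary.task_id
        else:
            # Primary failed: the waiter must run for real, so it gets a
            # queue_number back and becomes an ordinary PENDING task.
            waiter.state = State.PENDING
            waiter.queue_number = next_queue_number()

The else branch is where the drawback above comes from: the waiter only becomes runnable after the primary has already spent its full runtime and failed.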
Comment 1 by thakis@chromium.org, Dec 14