A failure to create a swarming task may be a false positive |
Issue description

Problem: It is possible that a client (buildbucket) creates a non-idempotent Swarming task and never receives a response due to a network error. In that case, the task may have side effects the client does not expect.

Example: In https://chromium-review.googlesource.com/c/474058/ buildbucket tries to return a build number generated for a Swarming task that it failed to create. If the task was actually created and a build uses the build number, and then another build uses the same build number, that may cause collisions.
Apr 19 2017
Apr 19 2017
Technically speaking, it *is* possible to register a task and not enqueue it: the enqueueing step just flips an int value, so that transaction has much lower odds of failing. I like it, at the cost of more APIs and general code complexity. This still doesn't eradicate the problem, but it would make it much less frequent.
Apr 19 2017
As a user, how do I register a task without enqueuing it?
Apr 19 2017
The way you described in comment #1. :)
Apr 19 2017
And in comment #3, I meant it is possible to implement comment #1, but it's not available to the user at the moment.
Apr 19 2017
The proposed design has some shortcomings:
1. Creating the task and activating it requires two transactions over the same entity group that happen back to back (I assume here that the latency of cr-buildbucket <-> Swarming RPCs is <<1 s). There is a high chance that the second one will fail with a "contention error".
2. It introduces a new task state ("not activated"). We'll need to add it to the UI and monitoring, and make sure various crons handle it correctly.
What we need, essentially, is to be able to specify task ID in /new, so we know what to cancel if /new fails. This is not possible with Swarming, since task IDs have internal structure.
But we can use an approximation in a form of "transaction_id" (or any better name):
1. Buildbucket generates a large random string and passes it to the '/new' RPC (along with other task parameters) as "transaction_id".
2. If Swarming has seen this transaction_id before, it returns the task_id associated with it. Otherwise it creates a new task and associates it with the given transaction_id.
3. If the '/new' RPC fails, Buildbucket calls '/cancel', passing the same transaction_id. Swarming cancels the corresponding task (if any).
The overhead for implementing this is 1 cross-group read and 1 cross-group write in the datastore transaction that creates a new task. If that seems like a big deal, we can probably approximate these semantics with memcache. (It will work most of the time, but occasionally we'll get duplicates.)
Apr 19 2017
I am fine with #7. We use something similar in buildbucket, called client_operation_id, and we put it in memcache.
Apr 19 2017
Yes, it's a standard pattern. I named the ID "transaction_id" because we also pass it to '/cancel' (which is, strictly speaking, a different operation, so naming this ID "client_operation_id" would be odd). Oh, I forgot to mention another advantage: it is 1 RPC on the happy path vs 2 RPCs.
Apr 19 2017
ok. maruel?
Apr 20 2017
"1 cross-group read and 1 cross-group write in datastore transaction that creates new task" is a show stopper.

An alternative is:
- Create the task with a unique tag.
- On HTTP 500, query for tasks and search for the tag.
- Find the task back via its unique tag.

No swarming change. wdyt?
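A sketch of this tag-based recovery flow, assuming a hypothetical client: `swarming_new` and `swarming_query` stand in for the real '/new' and task-query RPCs, and the `bb-op:` tag prefix is made up for illustration.

```python
import uuid

def create_with_recovery(swarming_new, swarming_query, params):
    # Attach a unique tag so the task can be found again if the response is lost.
    tag = 'bb-op:' + uuid.uuid4().hex
    tagged = dict(params, tags=params.get('tags', []) + [tag])
    try:
        return swarming_new(tagged)
    except IOError:
        # The task may still have been created server-side; look it up by tag.
        # NOTE: the tag query is eventually consistent, so it can miss a
        # freshly created task -- the objection raised in the next comments.
        found = swarming_query(tag)
        if found:
            return found[0]
        raise
```

The weak point is the query step: right after creation the index may not yet contain the new task.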
Apr 20 2017
The datastore index for the tag may not update in time. After all, the index update latency SLO is 24h.
Apr 20 2017
This SLO is for composite indexes IIRC, not primary indexes. But I don't know what the SLO for primary indexes is. Anyhow, that's a fair point that it is eventually consistent.
Apr 20 2017
Yes, we need strong consistency here. The "transaction_id => task_id" resolution will happen right after task creation (like, in <50 ms), so there's a very high likelihood the datastore indices will be stale. That's why I proposed memcache as an alternative to XG transactions. I think with memcache the overall chance of errors will be lower than with a non-strongly-consistent datastore.
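A sketch of the memcache approximation, assuming only that `add` behaves like App Engine memcache's atomic "set if absent" (the `FakeMemcache` class and `tx:` key prefix here are illustrative, not real code):

```python
class FakeMemcache(object):
    """In-memory stand-in for App Engine memcache (for this sketch only)."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def add(self, key, value):
        # Like memcache.add: atomic set-only-if-absent; False if key exists.
        if key in self._data:
            return False
        self._data[key] = value
        return True

def create_task_best_effort(cache, transaction_id, make_task):
    key = 'tx:' + transaction_id
    existing = cache.get(key)
    if existing is not None:
        return existing  # retry of an earlier '/new': reuse that task
    task_id = make_task()
    if not cache.add(key, task_id):
        # Lost a race with a concurrent request; prefer its task.
        return cache.get(key) or task_id
    return task_id
```

Best effort: if memcache evicts the key between the original call and the retry, a duplicate task gets created, which is the "occasionally we'll get duplicates" caveat above.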
Apr 20 2017
memcache lgtm, albeit when memcache fails, everything falls over, because the DB then quickly fails under the load. It's more work, but I don't mind.
May 1 2017
https://codereview.chromium.org/2856733002/
Dec 1 2017
not a problem in practice
Dec 1 2017
Dec 3
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Dec 3
Comment 1 by no...@chromium.org, Apr 19 2017