New issue
Advanced search Search tips

Issue 713307 link

Starred by 2 users

Issue metadata

Status: Archived
Owner: ----
Closed: Dec 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Feature



Sign in to add a comment

A failure to create a swarming task may be false positive

Project Member Reported by no...@chromium.org, Apr 19 2017

Issue description

Problem. It is possible that a client (buidlbucket) creates a non-idempotent Swarming task and never receives a response back due to network. In that case, the task may do side effects not expected by client. 

Example. In https://chromium-review.googlesource.com/c/474058/ buildbucket tries to return a build number generated for a swarming task that it failed to create. If the task was actually created and a build uses the build number, and then another build uses the same build number, it may cause collisions.
 

Comment 1 by no...@chromium.org, Apr 19 2017

Possible solution:
- add "active" boolean field to swarming_rpcs.NewTaskRequest, true by default. If false, the task is not scheduled.
- add tasks.activate API to activate an inactive task. If it was inactive, it is added to the pool of pending tasks

If we had this, buildbucket would
- create a  new task as inactive
- activate a new task right after creation
- cancel the task if activation request failed (even if the task was actually activated)

Comment 2 by no...@chromium.org, Apr 19 2017

Cc: d...@chromium.org
Labels: -Type-Bug Type-Feature

Comment 3 by mar...@chromium.org, Apr 19 2017

Technically speaking, it *is* possible to register a task and not enqueue it, as the enqueueing process is flipping a int value so this transaction has a much lower odd of failing. I like it, at the cost of more APIs and general code complexity. This still doesn't eradicate the problem but would make it much less frequent.

Comment 4 by no...@chromium.org, Apr 19 2017

as a user, how do I register a task without enqueuing it?

Comment 5 by mar...@chromium.org, Apr 19 2017

The way you described in comment #1. :)

Comment 6 by mar...@chromium.org, Apr 19 2017

And in comment #3, I meant it is possible to implement comment #1, but it's not available to the user at the moment.
The proposed design has some shortcomings:
1. Creating the task and activating it is two transactions over same entity group that happens back to back (I assume here latency of cr-builbucket <-> Swarming RPCs is <<1 s). There is a high chance that the second one will fail with "contention error". 
2. It introduces new task state ("not activated"). We'll need to add it to UI, monitoring and make sure various crons handle it correctly.

What we need, essentially, is to be able to specify task ID in /new, so we know what to cancel if /new fails. This is not possible with Swarming, since task IDs have internal structure.

But we can use an approximation in a form of "transaction_id" (or any better name):
1. Buildbucket generates a large random string and passes it to '/new' RPC (along with other task parameters) as "transaction_id".
2. If Swarming saw this transaction_id before, it returns the task_id associated with it. Otherwise it create a new task and associates it with given transaction_id.
3. If '/new' RPC fails, Buildbucket calls '/cancel', passing same transaction_id. Swarming cancels the corresponding task (if any).

The overhead for implementing this is 1 cross-group read and 1 cross-group write in datastore transaction that creates new task. If it seems like a big deal, we can probably approximate this semantics with memcache. (So it will work most of the time, but occasionally we'll have duplicates).

Comment 8 by no...@chromium.org, Apr 19 2017

i am fine with #7. We use something similar in buildbucket, called client_operation_id and we put it to memcache
Yes, it's standard pattern. I named the ID "transaction_id", because we also pass it to '/cancel' (which is, strictly speaking, different operation, so naming this ID "client_operation_id" will be odd).

Oh, forgot to mention another advantage: it is 1 RPC on happy path vs 2 RPCs.

Comment 10 by no...@chromium.org, Apr 19 2017

ok. maruel?
"1 cross-group read and 1 cross-group write in datastore transaction that creates new task" is a show stopper

An alternative is:
- Create the task with a unique tag
- When HTTP 500, query for tasks and search for the tag
- Find the task back via its unique tag

No swarming change. wdyt?

Comment 12 by no...@chromium.org, Apr 20 2017

Datastore index for the tag may not update in time. After all update latency SLO is 24h
This SLO is for composite indexes IIRC, not primary indexes. But I don't know what is the SLO for primary indexes. Anyhow that's a fair point that it is eventually consistent.
Yes, we need strong consistency here. "transaction_id => task id" resolution will be happening right after the task creation (like, in <50 ms). There's very high likelihood datastore indices will be stale. That's why I proposed memcache as an alternative of XG transactions. I think with memcache the overall chance of errors will be lower than with non-strongly-consistent datastore.
memcache lgtm, albeit when memcache fails, everything falls over because then the DB quickly fail under the load.

It's more work though but I don't mind.
Owner: no...@chromium.org
Status: Started (was: Untriaged)
https://codereview.chromium.org/2856733002/
Labels: -Pri-1 Pri-3
Owner: ----
Status: Available (was: Started)
not a problem in practice
Cc: no...@chromium.org
Project Member

Comment 19 by sheriffbot@chromium.org, Dec 3

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: Archived (was: Untriaged)

Sign in to add a comment