New issue
Advanced search Search tips

Issue 880550 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 9
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Feature

Blocked on:
issue 847602

Blocking:
issue 851152



Sign in to add a comment

Buildbucket: implement generic Swarming task fallback

Project Member Reported by mar...@chromium.org, Sep 4

Issue description

This is a follow up of  issue 847602 .

Some kind of operations works better when a task is running on the bot that ran the previous task's generation.

For example, after running build, build+1 will be much faster if it runs exactly on the same bot. This is normally coupled with build incrementally with named caches, but other kind of task like performance testing benefits from rerunning on the exact same hardware. Still, we want it to fallback to a "cold bot" if the bot is dead or very busy, to keep availability high.

# Implementation

Have buildbucket use a taskslice to lock into the bot first.
This is what is used by perf testing, and requires manual work from the client to talk directly to swarming: query for the bot that ran the previous task, and lock it via a task slice.

On the buildbucket side this could be done by either:
- querying its own database, albeit model.Build doesn't store the bot yet
- a RPC to the Swarming server would be needed at build creation when the field is used

I'd really recommend against the RPC, since it's going to be slow and less efficient than a DB lookup.


# Configuration

buildbucket/proto/config/project_config.proto
Add Builder.bot_affinity_secs, so that this would be specified on per-builder basis.


# Rejected alternative

Expose task id as bot dimensions, and have the task request the task id as a dimension.
The main drawback for this is that swarming task queue is optimizing for fairly steady state dimensions, and I'm not sure how well this would behave. I'd suspect "mostly ok but not awesome". Buildbucket could potentially query its own database to figure out the task id, without any RPC to Swarming. This would require changes in Swarming, changes that would go against issue 786735.

 
Blocking: 851152
when you say "build+1", do you mean within the same builder? what defines the seq?
note that cros won't benefit from this until "dynamic builder" feature is implemented (go/chops-build-api-prd)

> querying its own database, albeit model.Build doesn't store the bot yet
it is done in a non-structured way
https://chromium.googlesource.com/infra/infra/+/75f1613e9690a34f84062f32cb5f543b4398ad8f/appengine/cr-buildbucket/v2/builds.py#82


Thanks. This is really tricky. :/
Labels: LUCI-ChromeInternal LUCI-Blocker-ChromeInternal
Cc: efoo@chromium.org
Summary: Buildbucket: implement generic Swarming task fallback (was: Buildbucket: implement Swarming task bot affinity)
So the current idea is to use:
 "[expiration:]key:value"

when specifying the dimension. This could use the same flow as the way named cache fallback was implemented, so it's mostly the same logic on buildbucket server, which simplifies things.
Cc: dgarr...@chromium.org
Labels: Swarming
Project Member

Comment 7 by bugdroid1@chromium.org, Sep 25

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/5668a8a2d0553c96fa5080edb18b31081b48e2a3

commit 5668a8a2d0553c96fa5080edb18b31081b48e2a3
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 25 18:16:42 2018

[cr-buildbucket] clarify public and private symbols in swarming/

I need to do significant refactoring of the way builder's dimensions are
processed and it's currently hard to know what are internal functions and
what are public symbols. Use "_" prefix to make this clear.

Regroup code in flatten_swarmingcfg but not the other ones as this would have
been more disruptive.

I used:
  grep -R -h -o '\bflatten_swarmingcfg\.[^(]\+(' \
      $(git ls-files "*.py" | grep -v _test) | sort | uniq
to determine which symbols are used in other files.

Bug:  880550 
Change-Id: Ib2347e8edfd2fa7c4aa30102bdc8651113689aa1
Reviewed-on: https://chromium-review.googlesource.com/1243585
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17863}
[modify] https://crrev.com/5668a8a2d0553c96fa5080edb18b31081b48e2a3/appengine/cr-buildbucket/swarming/swarming.py
[modify] https://crrev.com/5668a8a2d0553c96fa5080edb18b31081b48e2a3/appengine/cr-buildbucket/swarming/test/swarmbucket_api_test.py
[modify] https://crrev.com/5668a8a2d0553c96fa5080edb18b31081b48e2a3/appengine/cr-buildbucket/swarming/test/swarming_test.py
[modify] https://crrev.com/5668a8a2d0553c96fa5080edb18b31081b48e2a3/infra/libs/buildbucket/swarming/flatten_swarmingcfg.py
[modify] https://crrev.com/5668a8a2d0553c96fa5080edb18b31081b48e2a3/appengine/cr-buildbucket/swarming/swarmingcfg.py

Project Member

Comment 8 by bugdroid1@chromium.org, Sep 26

Project Member

Comment 9 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/b451db89ef1957155351fff0921e5ab84bff02a3

commit b451db89ef1957155351fff0921e5ab84bff02a3
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Sep 26 18:27:24 2018

[cr-buildbucket] extra recipe code out of _create_task_def_async()

_create_task_def_async() will be getting longer with optional dimensions, so
try to reduce its length first to keep it manageable/readable.

No functional change.

Bug:  880550 
Change-Id: Ibd1a6540756dfc8a172ac11d6759f86d09b04830
Reviewed-on: https://chromium-review.googlesource.com/1246622
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17893}
[modify] https://crrev.com/b451db89ef1957155351fff0921e5ab84bff02a3/appengine/cr-buildbucket/swarming/swarming.py

Project Member

Comment 10 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/5723040ec44e499f92de4b8bb861ed32b7c7d08e

commit 5723040ec44e499f92de4b8bb861ed32b7c7d08e
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Sep 26 19:36:15 2018

[cr-buildbucket] move swarming logic out of _create_task_def_async()

_create_task_def_async() will be getting longer with optional
dimensions, so try to reduce its length first to keep it
manageable/readable.

No functional change.

Bug:  880550 
Change-Id: Ie2d8d2f7bbdf79c0fb972cc0529918ab7d3592cc
Reviewed-on: https://chromium-review.googlesource.com/1246623
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17894}
[modify] https://crrev.com/5723040ec44e499f92de4b8bb861ed32b7c7d08e/appengine/cr-buildbucket/swarming/swarming.py

Project Member

Comment 12 by bugdroid1@chromium.org, Sep 28

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/b6a623870fd6cfa12c113f487a72792c014998ca

commit b6a623870fd6cfa12c113f487a72792c014998ca
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Sep 28 20:53:01 2018

[cr-buildbucket] implement dimension expiration

It automatically merges user and builder provided optional dimensions, and
optional dimensions generated via optional named caches.

Use a bit more of unicode.

Sort the dimensions in the request, so that the order is deterministic.

Limit at 6 the number of fallbacks.

Bug:  880550 
Change-Id: I4139c7fb62e75563ccbcd8c3aafa80ad9c6357a6
Reviewed-on: https://chromium-review.googlesource.com/1246626
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#17956}
[modify] https://crrev.com/b6a623870fd6cfa12c113f487a72792c014998ca/appengine/cr-buildbucket/swarming/swarming.py
[modify] https://crrev.com/b6a623870fd6cfa12c113f487a72792c014998ca/appengine/cr-buildbucket/swarming/test/swarming_test.py
[modify] https://crrev.com/b6a623870fd6cfa12c113f487a72792c014998ca/appengine/cr-buildbucket/swarming/test/swarmingcfg_test.py
[modify] https://crrev.com/b6a623870fd6cfa12c113f487a72792c014998ca/infra/libs/buildbucket/swarming/test/flatten_swarmingcfg_test.py
[modify] https://crrev.com/b6a623870fd6cfa12c113f487a72792c014998ca/appengine/cr-buildbucket/swarming/swarmingcfg.py

How much is left at this point?
Cc: la...@chromium.org
Left: fix a bug in the current impl (probably c12) that is currently preventing buildbucket deployment
Could we issue build requests with the new tags today, and expect them to be quietly ignored?
No, they would be rejected as malformed
I wanted to fix it this afternoon but got "delayed" at the hospital. Tomorrow I fly but I'll try to find the big and fix, probably a stupid logic error. Then it's done.
Thanks! I hope the hospital trip wasn't serious.
Project Member

Comment 20 by bugdroid1@chromium.org, Oct 4

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/0e202c612869cba3a62cbcdcbd2ddda18b3977c8

commit 0e202c612869cba3a62cbcdcbd2ddda18b3977c8
Author: Nodir Turakulov <nodir@google.com>
Date: Thu Oct 04 17:24:46 2018

[buildbucket] Fix swarmingcfg.read_dimensions

https://chromium-review.googlesource.com/c/infra/infra/+/1244319/6/appengine/cr-buildbucket/swarming/swarmingcfg.py#b58
lost a check that a dimension value is set. Fix the bug and add a test.

TBR=maruel@chromium.org

Bug:  880550 
Change-Id: I2c54a83e8d1ff05a503a29922b37a01a667f075d
Reviewed-on: https://chromium-review.googlesource.com/c/1262102
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#18051}
[modify] https://crrev.com/0e202c612869cba3a62cbcdcbd2ddda18b3977c8/appengine/cr-buildbucket/swarming/test/swarming_test.py
[modify] https://crrev.com/0e202c612869cba3a62cbcdcbd2ddda18b3977c8/appengine/cr-buildbucket/swarming/swarmingcfg.py

dgarrett, this feature was deployed. Please try it.
Thanks!
Cc: athilenius@chromium.org
Reading back through the CLs linked in this bug, I don't see the config changes.

Scanning through CLs, they sorta look like they are only pulling from the LUCI Builder configs, not the Buildbucket requests.

If they are pulling from buildbucket requests, then what is the format inside parameters_json?

At a random guess, something like this?

parameters_json = "{
  "builder_name": "foo",
  "dimensions": [
    "240:bot:swarm-cros-123",
    "role:try"
  ] 
}"

Meaning wait 4 minutes for "swarm-cros-123", and then fall back to any builder with the role "try"?
it is documented in go/swarmbucket#Build-level
i believe you already use override_builder_cfg

parameters_json = "{
  "builder_name": "foo",
  "swarming": {
    "override_builder_cfg": {
      "dimensions": [
        "240:bot:swarm-cros-123",
	"role:try"
      ] 
    }
  }
}

We aren't using override_builder_cfg yet, but that's a very reasonable format.

Thanks!

Also, to confirm.

If the buildbucket request specifies:

parameters_json = "{
  "builder_name": "foo",
  "swarming": {
    "override_builder_cfg": {
      "dimensions": [
        "240:bot:swarm-cros-123",
      ] 
    }
  }
}

And "swarm-cros-123" isn't available, will it fall back to the dimension "role:try" specified on the builder config? Or is that dimension totally overridden because of the override_builder_cfg?

I've issued two requests:

"240:bot:swarm-cros-0"
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933561266400768176

"240:bot:swarm-cros-12345"
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933561249977822400


Both of them were immediately assigned to bots in the "try" role. Zero exists (in a different role), and is idle. 12345 does not exist.



see "Source: Task ...." link on build page, e.g. 
https://chrome-swarming.appspot.com/task?id=4059d5c500e0af10&refresh=10&show_raw=1&wide_logs=true

it displays 3 task slices, each having different set of dimensions and different slice expirations. It did fallback to the 3rd slice.

the dimensions specified in override_builder_cfg are merged into the dimensions specified in the config file using the same merging logic used in builder mixins. It is documented here:
https://chromium.googlesource.com/infra/luci/luci-go/+/b4019a5efec0c7b2890bb5a7c5c8b2036ccdd9d5/buildbucket/proto/config/project_config.proto#210
I see that the first two slices have dimension "bot:swarm-cros-0", but at least in the swarming UI the dimension should be "id:swarm-cros-0".
Re #27:

The way I read the docs above, all dimensions would get overridden. But it seems more reasonable, that they only override the ones with the same name. This means that the buildbucket request must either specify a bot in the expected role, override the role as well, or (maybe) specify "role:"?

Re #28

They also specify "role:tryjob", when "swarm-cros-0" is not a member of "role:tryjob", so you can't satisfy both requirements.

I submitted two more jobs against "swarm-cros-1", which is part of the tryjob role. Both were assigned to random bots.

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933559204186162272

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933559210621567152
Try "240:id:swarm-cros-1" instead?
the resulting task must have compatible bot id and role, so yes, if overriding a bot id, please override role to be compatible, too. If role is removed ("role:"), then if bot is not found, the task may be executed on a bot of any role. 

Do you need to be able to specify incompatible bot id and role? buildbucket is less flexible than swarming in this regard and unlike swarming, it does not support that.


Mostly, I was just playing to make certain I understand the system.

However, I don't understand why #30 didn't work. Setting "id" caused the build request to be rejected with an error.
Related, but different, I tried adding this to our cr-buildbucket.cfg:

builder_mixins {
  name: "prod"
  dimensions: "240:role:prod"
  dimensions: "240:role:precq"
  dimensions: "role:tryjob"
}

But I get presubmit errors:

builder_mixins {
  name: "prod"
  dimensions: "240:role:prod"
  dimensions: "240:role:precq"
  dimensions: "role:tryjob"
}

Was that the wrong syntax?
Lane is right in comment 28, the dimension key is 'id', not 'bot'.
You found a bug with the 'id' slice being denied. That's an overly zealous check I added but is incorrect. The fix involves removing code. Sorry about that.

Remove this:
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/server/task_request.py?l=216
Sorry, presubmit error is:

Config validation: builder_mixin prod: dimension #2: duplicate key role

Config validation: builder_mixin prod: dimension #3: duplicate key role

Dimensions are AND, not OR. So you may want to use a "role:shared" value or something similar.
Is there any way to define a fallback?

The idea is that prod builds prefer prod role, unless all bots are busy.
What you want is '240:role:'. This means to remove the dimension from the slice after 240 seconds.
So... in my case, we currently have 4 roles.

  prod
  precq
  tryjob
  staging


I never want anyone to use staging unless they explicitly request it.
I never want tryjob builds to use anything else.
I want precq builds to prefer precq, but fallback to precq|tryjob.
I want prod builds to prefer prod, but fallback to prod|precq|tryjob.

On reflection, that's not what my config attempt above would have expressed.

Is there any way to express the behavior I'd like? 

Note: this is a nice to have efficiency, NOT critical behavior. The 'id' behavior above is what's critical.

Add the values "precq_tryjob" and "precq_prod_tryjob" to "role" key on each member. This is a bit clumsy and cannot scale indefinitely but it'll work great in practice.
So... like this?

builder_mixins {
  name: "prod"
  dimensions: "240:role:prod"
  dimensions: "role:prod_precq_tryjob"
}

Or does this already include the "Prefer prod" concept?

builder_mixins {
  name: "prod"
  dimensions: "role:prod_precq_tryjob"
}


current buildbucket implementation does not support duplicate dimensions keys.
swarming does, but even swarming wouldn't be able to serve `I want precq builds to prefer precq, but fallback to precq|tryjob`, i believe.
Would you be able to add a separate dimensions for this? e.g. can_be_used_by_precq_builds:(true|false). It would be straightforward and supported by both swarming and buildbucket
What I was expecting to happen is that a taskslice is defined with:
  dimensions: 240:role:prod

Then a second taskslice with
  dimensions: role:precq

(or maybe role:prod_precq, I'm not clear on what the semantics of _ are).

So... I'm not expecting duplicate dimensions keys to ever apply at the same time.

But, maybe what I want isn't possible. It was a use case I discussed with MA a while back, but it's also very much a secondary priority.

When the id: fix lands, our hard requirements should be met.

the reason dup keys are not supported in buildbucket is that it was unclear how "merging" is supposed to work in this case. A set of dimensions S1 is "merged" into another set S2 when a builder mixin is applied to a builder, or when dimensions specified in a build request are merged into dimensions specified in a config.
Today it is: a dimension in S1 replaces a dimensions in S2 with the same key (or just added, if no such dimension exists in S2). 

However, the way you describing your use case, looks like we can support dup keys:
all dimensions in S1 with key K replace all dimensions in S2 with key K.
Then we can have multiple dim values for a dim key, with or without same expiration.
I will make the change.
Thanks!

Also (and more importantly) when can we expect the "id" fix?
M-A to answer the "id" fix
Project Member

Comment 49 by bugdroid1@chromium.org, Oct 8

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-go.git/+/840d4327fcbf2654af02d63328ca255f944cc37c

commit 840d4327fcbf2654af02d63328ca255f944cc37c
Author: Nodir Turakulov <nodir@google.com>
Date: Mon Oct 08 19:43:16 2018

[buildbucket] Update dimensions docs

Document "<expiration_secs>:<key>:<value>" format for dimensions in the proto
file.

Document new dimension merging logic.

Bug:  880550 
Change-Id: Ic8316d532f150f99ec145ea25735f92cbd522cdf
Reviewed-on: https://chromium-review.googlesource.com/c/1268685
Reviewed-by: Don Garrett <dgarrett@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>

[modify] https://crrev.com/840d4327fcbf2654af02d63328ca255f944cc37c/buildbucket/proto/config/project_config.pb.go
[modify] https://crrev.com/840d4327fcbf2654af02d63328ca255f944cc37c/buildbucket/proto/config/project_config.proto

Project Member

Comment 50 by bugdroid1@chromium.org, Oct 9

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/a31bbe1906a91dfac65dc639bfc6e7184c9a7595

commit a31bbe1906a91dfac65dc639bfc6e7184c9a7595
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Oct 09 13:31:09 2018

[swarming] Remove enforcement of unique 'id' in TaskSlice's

There's no fundamental reason to deny this, and this is actually needed.

Small doc update in task_queues.py.

Bug:  880550 
Change-Id: I94f9f8391421362864bb5b38090ed5e6af91e7ec
Reviewed-on: https://chromium-review.googlesource.com/c/1270115
Auto-Submit: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Nodir Turakulov <nodir@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/a31bbe1906a91dfac65dc639bfc6e7184c9a7595/appengine/swarming/server/task_queues.py
[modify] https://crrev.com/a31bbe1906a91dfac65dc639bfc6e7184c9a7595/appengine/swarming/server/task_request.py

Fix is live. Don, please try again.
Thanks!
I tried to assign both of these to "swarm-cros-2" which was idle. One of them claims to be running on it, and one of them is pending.

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933127455791706320

https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Try/b8933127449921177072

Watching.
Looks like it's working! Both jobs are running, and the second moved to a new machine!

Excellent!
Here is an example of what the new build request should look like, in hacked up format:

https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1263435
That's rad!
Status: Fixed (was: Assigned)
Closing as Fixed as everything needed is deployed as version 3809-a31bbe1.
Project Member

Comment 58 by bugdroid1@chromium.org, Dec 26

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra/+/ef413eda3dd81f0e52ac44f40a3053b4057dc45a

commit ef413eda3dd81f0e52ac44f40a3053b4057dc45a
Author: Nodir Turakulov <nodir@google.com>
Date: Wed Dec 26 23:45:23 2018

Fix override_builder_cfg parameter application

override_builder_cfg allows overriding builder configuration.
It is implementation was not correct, but worked most of the time.
The bug is uncovered by failing test in
https://chromium-review.googlesource.com/c/infra/infra/+/1268921

Bug:  880550 
Change-Id: I5a19674b42bc54533e940c460799c57967e0f322
Reviewed-on: https://chromium-review.googlesource.com/c/1391371
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Nodir Turakulov <nodir@chromium.org>
Cr-Commit-Position: refs/heads/master@{#19724}
[modify] https://crrev.com/ef413eda3dd81f0e52ac44f40a3053b4057dc45a/appengine/cr-buildbucket/swarming/swarming.py

Sign in to add a comment