New issue
Advanced search Search tips

Issue 638245 link

Starred by 1 user

Issue metadata

Status: Duplicate
Merged: issue 705002
Owner: ----
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

ts_mon may break running swarming task when upgrading luci-config schema on swarming

Project Member Reported by mar...@chromium.org, Aug 16 2016

Issue description

Repro:
- Add new field in https://github.com/luci/luci-py/blob/master/appengine/swarming/proto/config.proto
- Roll new Swarming version to chromium-swarm.appspot.com
- Push new config.cfg in the git repo fetched by luci-config

Expected:
Seamless upgrade.

Actual:
If there were tasks still running on the old Swarming version, ts_mon will cause an exception in task_update when flushing. The last error was:
ConfigFormatError: 11:1 : Message type "SettingsCfg" has no field named "ui_client_id".
when https://github.com/luci/luci-py/commit/e02bb087183bbf2323a0229eda4266b51d5d74d6#diff-803a4899c7fea28eed7aee90b7f44fa6
was rolled in. This is because ts_mon flushes on arbitrary versions of the code, not only on the default version.
This killed the running tasks that hit this point repeatedly.

Workaround:
Wait for *all* task to complete when updating to a new schema before starting to use it. Which means a bit over 1 hour.
 
More accurately, ts_mon flushes on any request that happens >=60sec after the last flush on the same instance, regardless of what code processes that request. Do bots ever hit specific (non-default) version of the app?

It would also cross versions at the time of traffic migration, for the same reason.

If you don't hit specific versions in requests, would disabling ts_mon for the time of traffic migration for schema changes do the trick? It would only stop metrics for a couple of minutes, rather than waiting for tasks for 1h - which I assume might be much more disruptive.

Comment 2 by mar...@chromium.org, Aug 16 2016

By design, a bot that is running a task locks itself to the version that was used to get the task, e.g. 123-abc-dot-chromium-swarm.appspot.com. So yes, bots hit non-default version while running a task if the default version is changed while the task is running.
I see. Well, another way to do this is to roll only backwards-compatible schema changes. E.g. don't just delete or rename a field, but add a new one first, switch all usage to the new field, and after some time, delete the now unused field. It's more work, but eliminates the potential downtime.
Status: Available (was: Untriaged)
Components: -Internals>Metrics
Removing Internals>Metrics since this is not related to UMA or experiments.

Comment 6 by no...@chromium.org, Sep 13 2016

it was a backward-compatible change from the proto perspective: add new field was added

what needs to be done is to store the config in the datstore in binary format, not text format. Buildbucket had the same problem and this solution worked well.
https://chromium-review.googlesource.com/c/367886/
https://chromium-review.googlesource.com/c/368355/

Swarming uses components.config's feature to store the config in the datastore, so fixing this bug requires changes to components.config
https://cs.chromium.org/chromium/infra/luci/appengine/components/components/config/remote.py?q=store_last_good+file:remote&sq=package:chromium&l=121

Comment 7 by mar...@chromium.org, Mar 29 2017

Mergedinto: 705002
Status: Duplicate (was: Available)

Sign in to add a comment