We have a somewhat frequent source of downtime where:
1) Someone adds a new field in a luci-py GAE server proto
2) The person deploys everywhere
3) The person starts using the new field in a luci-config managed repo
The problem is that old versions could still be running, which is particularly true on Swarming, as bots version-lock the server for the duration of a task.
The current options are:
- Wait several hours between 2) and 3).
- Accept some HTTP 500.
Since there is no signal indicating whether instances of the old server version are still running, people occasionally start using the new fields too quickly.
A long-term fix is:
- Upon ingestion, the textproto is converted to a binary-encoded proto.
- The binary-encoded data is used as the canonical local cache.
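To illustrate why the binary form would help, here is a minimal toy sketch, not the luci-py implementation. It assumes that ingestion runs with the newest schema and that old instances only read the binary cache. The schemas, the field names (max_tasks, new_field) and all helper functions are hypothetical, and the hand-rolled varint codec only stands in for the real protobuf library. The point it shows is that a strict text parser rejects a field name it does not know, while a binary decoder can skip an unknown field number.

```python
# Toy model only: a real server would use the protobuf library rather than
# this hand-rolled codec. Schemas map field number -> field name.
NEW_SCHEMA = {1: "max_tasks", 2: "new_field"}  # hypothetical new server version
OLD_SCHEMA = {1: "max_tasks"}                  # hypothetical old server version


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_text(text, schema):
    """Strict text parser: raises on any field name missing from schema."""
    by_name = {name: num for num, name in schema.items()}
    fields = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(":")
        name = name.strip()
        if name not in by_name:
            raise ValueError("unknown field: %s" % name)
        fields[by_name[name]] = int(value)
    return fields


def encode_binary(fields):
    """Encode {field number: int} as key/value varint pairs (wire type 0)."""
    out = bytearray()
    for num, value in sorted(fields.items()):
        out += encode_varint(num << 3)  # key = (field number << 3) | wire type 0
        out += encode_varint(value)
    return bytes(out)


def decode_binary(buf, schema):
    """Decode varint fields, silently skipping field numbers not in schema."""
    pos = 0
    message = {}
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        if key & 7 != 0:
            raise ValueError("this toy decoder only handles varint fields")
        value, pos = decode_varint(buf, pos)
        num = key >> 3
        if num in schema:
            message[schema[num]] = value
    return message


CONFIG_TEXT = "max_tasks: 5\nnew_field: 7\n"

# Ingestion with the new schema: text is parsed once and cached as binary.
cached_blob = encode_binary(parse_text(CONFIG_TEXT, NEW_SCHEMA))

# An old instance reading the binary cache skips the field it does not know.
old_view = decode_binary(cached_blob, OLD_SCHEMA)
```

In this sketch, an old instance calling parse_text(CONFIG_TEXT, OLD_SCHEMA) raises ValueError, which corresponds to the HTTP 500 case, whereas decode_binary(cached_blob, OLD_SCHEMA) succeeds and simply omits new_field.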
Comment 1 by d...@chromium.org, Mar 24 2017