This is a step towards a master-less setup of drone++
The end goal of this bug is:
- Each shard is told its own board labels.
- [shard_heartbeat_v2] The shard reports all its shard labels to the master, which accepts a label only if no other existing shard already claims that board. The master responds with the list of labels that were accepted. [This can be a subset, see below]
- [shard_heartbeat_v2] shard also reports its own DUTs, which the master unconditionally accepts. skylab and DUTLeaser are responsible for ensuring that two shards can never lease the same DUT.
- Both RPC forwarding based on hosts (based on host.shard) and assigning of new jobs and HQEs to shards (based on shard label) in the master are unchanged.
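A rough sketch of the master side of shard_heartbeat_v2 as described above. This is illustrative only, not the autotest implementation: the MasterState class, its in-memory dicts and the method signature are assumptions made up for this example. It also assumes that a label omitted from a heartbeat is released by the master.

```python
class MasterState:
    """Hypothetical in-memory stand-in for the master's shard bookkeeping."""

    def __init__(self):
        self.board_owner = {}  # board label -> shard id that claims it
        self.host_shard = {}   # DUT hostname -> shard id (host.shard)

    def shard_heartbeat_v2(self, shard_id, claimed_labels, duts):
        """Return the subset of claimed_labels the master accepted."""
        claimed = set(claimed_labels)

        # Assumption: a label this shard held but no longer reports is released.
        held = [b for b, owner in self.board_owner.items() if owner == shard_id]
        for label in held:
            if label not in claimed:
                del self.board_owner[label]

        # Accept a label only if no other shard already claims that board.
        accepted = []
        for label in sorted(claimed):
            owner = self.board_owner.setdefault(label, shard_id)
            if owner == shard_id:
                accepted.append(label)

        # DUTs are accepted unconditionally; lease exclusivity is enforced
        # elsewhere (skylab / DUTLeaser), not here.
        for dut in duts:
            self.host_shard[dut] = shard_id

        return accepted
```

With this sketch, a second shard claiming an already-owned board simply gets a shorter accepted list back, while its DUTs are still recorded.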
Needs a few new RPCs:
- [shard:add_board_to_shard_v2,remove_board_from_shard_v2] This is how a shard is told what labels to claim. Today, we inject this mapping in the master via the add_board_to_shard and remove_board_from_shard RPCs, which will be deprecated.
- [master:purge_shard shard_id] This is necessary so that when a shard is removed entirely, another shard can then be added that will claim its labels and DUTs.
We could do away with master:purge_shard if the master simply accepted the latest shard-board mapping from shard_heartbeat_v2. But this is risky because two shards claiming the same board mapping would lead to difficult-to-diagnose nondeterminism. Instead, teardown of shards works as follows:
- If removing just a board from a shard, call remove_board_from_shard_v2 on the shard.
- The shard will report that it no longer claims that board in the next heartbeat.
- If removing the shard entirely, call master:purge_shard, and all the shard's board mappings will be synchronously removed.
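A minimal sketch of what master:purge_shard could do, assuming the master keeps a plain board-to-shard dict. The function signature and the dict are hypothetical, chosen only to illustrate the synchronous removal; DUT bookkeeping is left out.

```python
def purge_shard(board_owner, shard_id):
    """Synchronously drop every board mapping held by shard_id.

    board_owner is a hypothetical dict of board label -> shard id.
    Returns the sorted list of board labels that were released.
    """
    purged = sorted(b for b, owner in board_owner.items() if owner == shard_id)
    for board in purged:
        del board_owner[board]
    return purged


board_owner = {"boardX": "shardA", "boardY": "shardB", "boardZ": "shardA"}
purged = purge_shard(board_owner, "shardA")
# boardX and boardZ are now free for another shard to claim in its next heartbeat.
```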
------
The shards need to be resilient to some of their board mappings being rejected by the master. This can happen especially when moving a board from one shard to another:
- we call remove_board_from_shard_v2 (shardA, boardZ)
- we call add_board_to_shard_v2 (shardB, boardZ)
- shardB:shard_heartbeat_v2 claims boardZ, master rejects.
- shardB:shard_heartbeat_v2 will continue to claim boardZ, no failures
- But, when requesting DUTs from dut-leaser, shardB will NOT request DUTs for boardZ
- shardA:shard_heartbeat_v2 finally runs and relinquishes boardZ, after it has canceled all its DUT leases with dut-leaser
- shardB:shard_heartbeat_v2 next time succeeds in claiming boardZ
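The sequence above can be walked through with a small simulation. This is a sketch under assumptions, not real autotest code: FakeMaster, Shard and boards_to_lease are invented for illustration, and dut-leaser is modelled only by the list of boards a shard would request DUTs for.

```python
class FakeMaster:
    """Hypothetical master that tracks only board label -> shard id."""

    def __init__(self):
        self.board_owner = {}

    def shard_heartbeat_v2(self, shard_id, claimed):
        # Release boards this shard held but no longer claims.
        stale = [b for b, s in self.board_owner.items()
                 if s == shard_id and b not in claimed]
        for board in stale:
            del self.board_owner[board]
        # Accept each claim unless another shard already owns the board.
        return {b for b in claimed
                if self.board_owner.setdefault(b, shard_id) == shard_id}


class Shard:
    def __init__(self, shard_id, master):
        self.shard_id = shard_id
        self.master = master
        self.claimed = set()   # what this shard was told to claim
        self.accepted = set()  # what the master accepted last heartbeat

    def add_board_to_shard_v2(self, board):
        self.claimed.add(board)

    def remove_board_from_shard_v2(self, board):
        self.claimed.discard(board)

    def heartbeat(self):
        # A rejected claim is not an error; the shard just retries next time.
        self.accepted = self.master.shard_heartbeat_v2(
            self.shard_id, set(self.claimed))

    def boards_to_lease(self):
        # Only request DUTs for boards that are both claimed and accepted.
        return sorted(self.claimed & self.accepted)


master = FakeMaster()
shard_a, shard_b = Shard("shardA", master), Shard("shardB", master)
shard_a.add_board_to_shard_v2("boardZ")
shard_a.heartbeat()                           # shardA owns boardZ

shard_a.remove_board_from_shard_v2("boardZ")  # move begins
shard_b.add_board_to_shard_v2("boardZ")
shard_b.heartbeat()                           # master rejects: shardA still owns it
lease_while_rejected = shard_b.boards_to_lease()

shard_a.heartbeat()                           # shardA relinquishes boardZ
shard_b.heartbeat()                           # claim now succeeds
lease_after_handover = shard_b.boards_to_lease()
```

Here shardB keeps claiming boardZ across heartbeats without failing, but only starts requesting DUTs for it once the master has accepted the claim.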
Comment 1 by pprabhu@chromium.org, Nov 11 2017