Issue 784024

Starred by 1 user

Issue metadata

Status: WontFix
Owner:
Closed: Nov 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 784032




Make shards the source of truth about shard-DUT assignment

Project Member Reported by pprabhu@chromium.org, Nov 11 2017

Issue description

This is a step towards a master-less setup of drone++

The end goal of this bug is: 
- Each shard is told
  - its own board labels
- [shard_heartbeat_v2]  shard reports all its shard labels to master, which the master accepts only if no existing shard already claims that board. Master responds with the list of labels that were accepted. [This can be a subset, see below]
- [shard_heartbeat_v2]  shard also reports its own DUTs, which the master unconditionally accepts. skylab and DUTLeaser are responsible for ensuring that two shards can never lease the same DUT.
- Both RPC forwarding for hosts (based on host.shard) and assignment of new jobs and HQEs to shards (based on shard label) in the master are unchanged.
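
The master side of the proposed heartbeat could be sketched roughly as below. This is only an illustration of the acceptance rules listed above, not the actual implementation: the `Master` class, its in-memory dicts, and the argument names are all hypothetical, and the real master would persist this state in its database.

```python
class Master:
    """Hypothetical in-memory stand-in for the master's bookkeeping."""

    def __init__(self):
        self.label_owner = {}  # board label -> id of the shard that claims it
        self.host_shard = {}   # DUT hostname -> shard id (mirrors host.shard)

    def shard_heartbeat_v2(self, shard_id, claimed_labels, duts):
        """Return the subset of claimed_labels that the master accepted."""
        # Release labels this shard owned before but no longer claims.
        for label, owner in list(self.label_owner.items()):
            if owner == shard_id and label not in claimed_labels:
                del self.label_owner[label]

        # Accept a label only if no other shard already claims it.
        accepted = []
        for label in claimed_labels:
            if self.label_owner.setdefault(label, shard_id) == shard_id:
                accepted.append(label)

        # DUT reports are accepted unconditionally; preventing double
        # leases is assumed to be the job of skylab and the DUT leaser.
        for dut in duts:
            self.host_shard[dut] = shard_id
        return accepted


master = Master()
first = master.shard_heartbeat_v2("shardA", ["boardZ", "boardY"], ["dut1"])
second = master.shard_heartbeat_v2("shardB", ["boardZ", "boardX"], ["dut2"])
print(first)   # ['boardZ', 'boardY']
print(second)  # ['boardX']  (boardZ is still claimed by shardA)
```

Returning the accepted subset, rather than failing the whole heartbeat, is what lets a shard keep claiming a board that another shard has not yet given up.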

Needs a few new RPCs:
- [shard:add_board_to_shard_v2,remove_board_from_shard_v2] This is how a shard is told what labels to claim. Today, we inject this mapping in the master via the add_board_to_shard and remove_board_from_shard RPCs, which will be deprecated.
- [master:purge_shard shard_id] This is necessary so that when a shard is removed entirely, another shard can then be added that will claim its labels and DUTs.

We could do away with master:purge_shard if the master simply accepted the latest shard-board mapping from shard_heartbeat_v2. But this is risky, because two shards claiming the same board would lead to hard-to-diagnose nondeterminism. Instead, teardown of shards works as follows:
- If removing just a board from a shard, call remove_board_from_shard_v2 on the shard.
  - The shard will report that it no longer claims that board in the next heartbeat.
- If removing the shard entirely, call master:purge_shard, and all of the shard's board mappings will be synchronously removed.
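
As a rough sketch of the second case, purge_shard could drop every label owned by the shard in a single call, so that a replacement shard can claim those labels on its very next heartbeat. The `label_owner` dict (board label to owning shard id) and the function signature are assumptions made for illustration; how host.shard entries for the purged shard's DUTs get cleaned up is not covered here.

```python
def purge_shard(label_owner, shard_id):
    """Remove every board label owned by shard_id and return them sorted.

    label_owner is a hypothetical dict mapping board label -> shard id.
    The removal happens in one synchronous call, without waiting for a
    heartbeat from the shard being removed.
    """
    purged = sorted(
        label for label, owner in label_owner.items() if owner == shard_id
    )
    for label in purged:
        del label_owner[label]
    return purged


label_owner = {"boardZ": "shardA", "boardY": "shardA", "boardX": "shardB"}
print(purge_shard(label_owner, "shardA"))  # ['boardY', 'boardZ']
print(label_owner)                         # {'boardX': 'shardB'}
```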

------
The shards need to be resilient to some of their board mappings being rejected by the master. This can happen especially when moving a board from one shard to another:

- we call remove_board_from_shard_v2 (shardA, boardZ)
- we call add_board_to_shard_v2 (shardB, boardZ)
- shardB:shard_heartbeat_v2 claims boardZ, master rejects.
  - shardB:shard_heartbeat_v2 will continue to claim boardZ, no failures
  - But, when requesting DUTs from dut-leaser, shardB will NOT request DUTs for boardZ
- shardA:shard_heartbeat_v2 finally runs and relinquishes boardZ, after it has canceled all its DUT leases with dut-leaser
- shardB:shard_heartbeat_v2 next time succeeds in claiming boardZ
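
The sequence above can be walked through with a small simulation. Everything here is a hypothetical sketch: the `Shard` class, the `master_heartbeat` helper, and the `boards_to_lease` method are invented names, and the master's state is reduced to a single dict. The point is only that a shard keeps claiming a rejected board without failing, but leases DUTs only for boards the master has accepted.

```python
def master_heartbeat(label_owner, shard_id, claimed):
    """Hypothetical master check: return the accepted subset of claimed."""
    # Drop labels the shard owned earlier but has stopped claiming.
    stale = [l for l, o in label_owner.items() if o == shard_id and l not in claimed]
    for label in stale:
        del label_owner[label]
    # A label is accepted only if this shard is (or becomes) its owner.
    return {l for l in claimed if label_owner.setdefault(l, shard_id) == shard_id}


class Shard:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.claimed = set()   # boards this shard was told to claim
        self.accepted = set()  # boards the master accepted last heartbeat

    def add_board_to_shard_v2(self, board):
        self.claimed.add(board)

    def remove_board_from_shard_v2(self, board):
        self.claimed.discard(board)

    def shard_heartbeat_v2(self, label_owner):
        self.accepted = master_heartbeat(label_owner, self.shard_id, self.claimed)

    def boards_to_lease(self):
        # Only request DUTs for boards that are both claimed and accepted.
        return self.claimed & self.accepted


label_owner = {}
shard_a, shard_b = Shard("shardA"), Shard("shardB")
shard_a.add_board_to_shard_v2("boardZ")
shard_a.shard_heartbeat_v2(label_owner)       # shardA owns boardZ

shard_a.remove_board_from_shard_v2("boardZ")  # move starts
shard_b.add_board_to_shard_v2("boardZ")
shard_b.shard_heartbeat_v2(label_owner)       # master rejects the claim
leased_while_rejected = shard_b.boards_to_lease()

shard_a.shard_heartbeat_v2(label_owner)       # shardA relinquishes boardZ
shard_b.shard_heartbeat_v2(label_owner)       # claim now succeeds
leased_after_handover = shard_b.boards_to_lease()

print(leased_while_rejected)  # set()
print(leased_after_handover)  # {'boardZ'}
```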
 
Status: Assigned (was: Untriaged)
My plan is to add all the extra RPCs next week or so as shadow RPCs (they'll be parallel to existing RPCs, but will not actually be effective, hidden behind a flag).

I will add shadow tables in the database for this purpose as well, because I cannot have the shadow RPCs work without some actual DB updates (e.g., the shards need to insert their own labels in some table, so that they can report those to the master via the new heartbeat).

The actual switch cannot happen until the shards are ready to lease DUTs themselves.
In the meantime, I will have the heartbeat emit metrics about any discrepancies it finds wherever possible.
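
One possible shape for such a discrepancy check, while the new RPCs run in shadow mode, is to diff the labels the master currently assigns to a shard against the labels the shard reports. The function name and the result keys below are made up for illustration; the sketch only computes the difference and leaves the choice of metrics backend open.

```python
def find_label_discrepancies(master_labels, shard_labels):
    """Compare the master's view of a shard's labels with the shard's own.

    Both arguments are iterables of board labels. The returned dict is a
    hypothetical payload that a shadow heartbeat could turn into metrics.
    """
    master_labels, shard_labels = set(master_labels), set(shard_labels)
    return {
        "only_on_master": sorted(master_labels - shard_labels),
        "only_on_shard": sorted(shard_labels - master_labels),
    }


report = find_label_discrepancies(["boardY", "boardZ"], ["boardZ", "boardX"])
print(report)  # {'only_on_master': ['boardY'], 'only_on_shard': ['boardX']}
```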
Blockedon: 784032
Status: WontFix (was: Assigned)
