support removing boards from a shard
Issue description
The need to delete an entire shard in order to move boards away from it is a crippling limitation when trying to fail over or shuffle boards. I think it shouldn't be too hard to implement this ability as a small extension of existing behavior:
1) shard_client's heartbeat already sends its full set of "known hosts" in its heartbeat RPC to the master.
2) The master can easily check whether the "known hosts" according to the shard are ones that should actually belong to the shard.
3) For any of the shard's "known hosts" that are actually not supposed to belong to it, the shard can include these in a list of "invalid hosts" in the heartbeat response.
4) The shard client, upon receiving the heartbeat response, deletes from its local database any hosts it was told are invalid.
This allows us to easily propagate deletions back to the shards. Then we just need to add a shard_remove_boards RPC that actually accomplishes the deletion on the master, and we're set. Are there any weird races that make this a bad idea?
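Roughly, the shard-client half of this (steps 1 and 4) could look like the sketch below. This is only an illustration, not the actual shard_client.py code; the helper names (heartbeat_once, get_all_hosts, delete_host) and the incorrect_host_ids response field are assumed placeholders.

def heartbeat_once(afe_rpc, local_db):
    """Sketch of steps 1 and 4; afe_rpc and local_db are placeholder objects."""
    # 1) Report every host this shard currently knows about.
    known_host_ids = [host.id for host in local_db.get_all_hosts()]
    response = afe_rpc.shard_heartbeat(known_host_ids=known_host_ids)

    # 4) Delete locally any hosts the master says should not belong to this
    #    shard (reported back as "invalid"/incorrect host ids).
    for host_id in response.get('incorrect_host_ids', []):
        local_db.delete_host(host_id)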
Mar 23 2017
*3) When I said "the shard can", I obviously meant "the master can".
Mar 28 2017
Sounds like a (small) OKR candidate, marking as such
Mar 29 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a

commit b9077b9825143d43a63dc6c89b4f4d0c8facdf4a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Mar 29 18:42:15 2017

[autotest] shard_client remove incorrect hosts from shard

This is the first piece of the puzzle in making it possible to remove hosts from a running shard.

BUG=chromium:704445
TEST=shard_client_unittest, shard_client_integration_test

Change-Id: Ifd9d3fb754f2dfd593c8df29baa8bfcdadd41cf7
Reviewed-on: https://chromium-review.googlesource.com/457789
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a/scheduler/shard/shard_client.py
[modify] https://crrev.com/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a/scheduler/shard/shard_client_unittest.py
Mar 29 2017
The following revision refers to this bug: https://chrome-internal.googlesource.com/chromeos/overlays/overlay-strago-private/+/4067cd8fee4046f2a855e54d6408f4f548238878

commit 4067cd8fee4046f2a855e54d6408f4f548238878
Author: Hung-Te Lin <hungte@chromium.org>
Date: Wed Mar 29 18:42:18 2017
Apr 4 2017
Complication + request for comment.
The current shard_heartbeat has a race condition that would emerge if we allow labels to be removed from a shard. In particular, the heartbeat calls into this assign_to_shard code. (Relevant code from frontend/afe/models.py)
@classmethod
def assign_to_shard(cls, shard, known_ids):
    """Assigns hosts to a shard.

    For all labels that have been assigned to a shard, all hosts that
    have at least one of the shard's labels are assigned to the shard.
    Hosts that are assigned to the shard but aren't already present on the
    shard are returned.

    Any boards that are in |known_ids| but that do not belong to the shard
    are incorrect ids, which are also returned so that the shard can remove
    them locally.

    Board to shard mapping is many-to-one. Many different boards can be
    hosted in a shard. However, DUTs of a single board cannot be distributed
    into more than one shard.

    @param shard: The shard object to assign labels/hosts for.
    @param known_ids: List of all host-ids the shard already knows.
                      This is used to figure out which hosts should be sent
                      to the shard. If shard_ids were used instead, hosts
                      would only be transferred once, even if the client
                      failed persisting them.
                      The number of hosts usually lies in O(100), so the
                      overhead is acceptable.

    @returns a tuple of (hosts objects that should be sent to the shard,
                         incorrect host ids that should not belong to
                         the shard)
    """
    # <SNIP>
    hosts = []

    host_ids = set(Host.objects.filter(
        labels__in=shard.labels.all(),
        leased=False
        ).exclude(
        id__in=known_ids,
        ).values_list('pk', flat=True))

    if host_ids:
        Host.objects.filter(pk__in=host_ids).update(shard=shard)
        hosts = list(Host.objects.filter(pk__in=host_ids).all())

    invalid_host_ids = list(Host.objects.filter(
        id__in=known_ids
        ).exclude(
        shard=shard
        ).values_list('pk', flat=True))

    return hosts, invalid_host_ids
The race condition is that we first compute the host_ids that should get updated, and then, in a second query, we update those hosts with the shard they now belong to. If the shard's labels change between these two queries (in particular, if a label was removed), the update will be incorrect: hosts whose label was just removed from the shard would still be assigned back to it.
I pondered whether there was a way to use a last_updated_timestamp or similar on the shard table, and to add a check on it within this heartbeat/assign function. However, given that we compute host_ids and do the update in separate queries, I think it is generally impossible to avoid this race, at least with the queries as written.
Proposal 1: Combine the host_ids and host_update query into a single select-and-update query.
Proposal 2: Include a timestamp check in the update query.
My django-fu is weak so I might need some hints for either approach.
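For illustration, Proposal 1 might look roughly like the sketch below, reusing the Host/Shard models from the snippet above. This is only a sketch, not a worked-out change; note that for filters spanning related tables, Django may still split the update into a separate SELECT of pks followed by the UPDATE on some backends, in which case this narrows the race rather than eliminating it.

# Sketch of Proposal 1: express "which hosts" and the assignment as one
# queryset update instead of materializing host_ids first.
new_host_qs = Host.objects.filter(
        labels__in=shard.labels.all(),
        leased=False,
).exclude(id__in=known_ids)
new_host_qs.update(shard=shard)

# Hosts to send to the shard: whatever is now assigned to it that the shard
# does not already know about.
hosts = list(Host.objects.filter(shard=shard).exclude(id__in=known_ids))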
Apr 4 2017
Proposal 3: include the same labels__in and leased=False filters in the update query.
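Concretely, something like the following sketch (same assumptions as the snippet above; not necessarily the exact change that will land):

# Sketch of Proposal 3: keep the two-step shape, but repeat the label and
# lease filters in the UPDATE so hosts whose labels were removed in between
# are not re-assigned to the shard.
if host_ids:
    Host.objects.filter(
            pk__in=host_ids,
            labels__in=shard.labels.all(),
            leased=False,
    ).update(shard=shard)
    # Only return hosts that actually ended up assigned to this shard.
    hosts = list(Host.objects.filter(pk__in=host_ids, shard=shard))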
Apr 4 2017
3 is much cleaner; essentially it's similar to 1 but with less complication. I'm not sure whether shard.labels.all() is a dynamic query. Maybe we should cache that list before these two queries; otherwise, the two queries could use two different sets of shard labels.
Apr 4 2017
The shard.labels.all() won't be so complicated. Maybe similar to get_labels.startswith('board:'). Agree with dshi@, vote for proposal 3
Apr 4 2017
#3 sounds right. Pretty sure that shard.labels.all() returns a queryset, meaning that it doesn't actually execute the query until necessary, so yes, I think it is a dynamic query. shard.labels.all() being different between the two queries is in fact what we *want*.
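To make the laziness point concrete (illustrative snippet only, not code from the tree):

labels_qs = shard.labels.all()      # lazy: builds a query, runs no SQL yet
labels_snapshot = list(labels_qs)   # evaluates immediately: a fixed snapshot

# Passing the queryset to labels__in nests it inside the outer statement, so
# it reflects the shard's labels when that statement actually executes;
# passing labels_snapshot instead pins the label set captured above.
Host.objects.filter(labels__in=labels_qs, leased=False).update(shard=shard)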
Apr 4 2017
if host_ids:
    Host.objects.filter(pk__in=host_ids).update(shard=shard)
    hosts = list(Host.objects.filter(pk__in=host_ids).all())

If we change that to

if host_ids:
    Host.objects.filter(pk__in=host_ids).update(shard=shard)
    hosts = list(Host.objects.filter(pk__in=host_ids,
                                     labels__in=shard.labels.all(),
                                     leased=False).all())

note that the first filter, pk__in=host_ids, already limits the hosts to the collection computed earlier. If we use shard.labels.all() there and the label was already removed, might that cause an issue, since the query would return fewer hosts?
Apr 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/224b2205aa7d271c67f7c8041d9f86cba80a1928

commit 224b2205aa7d271c67f7c8041d9f86cba80a1928
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 06 08:42:17 2017

autotest: add a remove_boards_from_shard RPC

BUG=chromium:704445
TEST=unittests added, and they pass

Change-Id: I1dbbc68cbb29a341e9d35d1bdb322f1bb398da74
Reviewed-on: https://chromium-review.googlesource.com/467990
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/224b2205aa7d271c67f7c8041d9f86cba80a1928/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/224b2205aa7d271c67f7c8041d9f86cba80a1928/frontend/afe/rpc_interface.py
Apr 6 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf963c188531f899edded62641118140a97f21dd

commit bf963c188531f899edded62641118140a97f21dd
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 06 08:42:17 2017

autotest: fix race condition between heartbeat and label removal

BUG=chromium:704445
TEST=New test added. Test verified to fail prior to fix.

Change-Id: Ifdc49e1937d1a3580a8689d05ea135afc7a4d445
Reviewed-on: https://chromium-review.googlesource.com/468086
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/bf963c188531f899edded62641118140a97f21dd/frontend/afe/models.py
[modify] https://crrev.com/bf963c188531f899edded62641118140a97f21dd/frontend/afe/rpc_interface_unittest.py
Apr 10 2017
~/chromiumos/src/third_party/autotest/files/cli$ ./atest shard remove_board -w chromeos-autotest.hot -l board:quawks chromeos-shard2-staging.hot.corp.google.com
None

A moment later, in the shard heartbeat log, I see

04/10 15:54:25.287 INFO | shard_client:0328| Performing heartbeat.
04/10 15:54:25.291 INFO | shard_client:0296| Known jobs: []
04/10 15:54:25.294 INFO | shard_client:0302| Uploading jobs []
04/10 15:54:25.485 INFO | shard_client:0171| Heartbeat response contains hosts []
04/10 15:54:25.485 INFO | shard_client:0173| Heartbeat response contains jobs []
04/10 15:54:25.485 INFO | shard_client:0177| Heartbeat response contains suite_keyvals_for jobs []
04/10 15:54:25.486 INFO | shard_client:0180| Heartbeat response contains incorrect_host_ids [2, 12, 31] which will be deleted.
04/10 15:54:26.537 INFO | shard_client:0352| Heartbeat completed.
04/10 15:55:26.890 INFO | shard_client:0328| Performing heartbeat.

and verified on the shard's web UI that the hosts are no longer listed. Then, a few minutes later, I try

~/chromiumos/src/third_party/autotest/files/cli$ ./atest shard add_boards -w chromeos-autotest.hot -l board:quawks chromeos-shard2-staging.hot.corp.google.com
Added boards for shards: 'board:quawks', 'chromeos-shard2-staging.hot.corp.google.com'

and verify that after the next heartbeat, the boards reappear.

A few remaining things to fix:
- `atest shard remove_board` should output a bit more information upon success (and should encourage the user to wait a minute and verify that the boards are removed from the shard).
- The boards were in the READY state on the master after removal from the shard, and were also in the READY state when re-added to the shard. I'm not sure if this is problematic. Is it necessary to force any kind of verify jobs on them during these moves?
Apr 10 2017
I think we force a repair after a DUT is added to or removed from a shard.
Apr 10 2017
You think we do, or you think we should? I didn't see any action when I removed the DUT from the shard, which is no surprise since I didn't follow the delete_shard RPC flow. Forcing a job against the DUT at remove_board time sounds wrong, because the board isn't actually removed until the next heartbeat from that shard, so there could be a race between the master and any still-running job on the shard. Forcing a repair after adding a board to a shard makes sense, though I don't see any evidence that that is the case today.
Apr 10 2017
shuqianz@? What's the workflow for removing shards? I think repair is enforced after DUTs are removed from a shard. The special task will be queued on the DUT, so whenever the shard returns the DUT to the master in the next tick, repair will be executed before any test can run.
Apr 11 2017
As discussed in Issue 499865, a reboot is required on all DUTs removed from a shard to make sure the DUTs are still in ready status and own their testing logs in the master DB. Repair used to be enforced because it contained a reboot at the time. In order to remove that reboot legacy from repair, a 'reboot_after_shard_deletion' job is now kicked off instead of repair. A timeout has been added to the reboot_after_shard_deletion job to make sure it succeeds or is aborted in time (https://chromium-review.googlesource.com/c/462335/).
Apr 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/307731525ec0e337c31da5d25686d54faa17276e

commit 307731525ec0e337c31da5d25686d54faa17276e
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 13 03:22:39 2017

autotest: add 'atest shard remove_board' subcommand

BUG=chromium:704445
TEST=None

Change-Id: I51222c621d1aa20ae6ea23e0bf04ec1c86958029
Reviewed-on: https://chromium-review.googlesource.com/470046
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/307731525ec0e337c31da5d25686d54faa17276e/cli/shard.py
Apr 14 2017
We tried this in prod yesterday, and it worked.
Apr 14 2017
Docs updated https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/chromeos-admin/shard-management