
Issue 704445

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Feature
OKR

Blocking:
issue 706262




support removing boards from a shard

Project Member Reported by akes...@chromium.org, Mar 23 2017

Issue description

The need to delete an entire shard to move boards away from it is a crippling limitation when trying to fail over or shuffle boards.

I think it shouldn't be too hard to implement this ability, as a small extension of existing behavior.

1) shard_client's heartbeat already sends its full set of "known hosts" in the heartbeat rpc to the master.
2) The master can easily check whether the "known hosts" according to the shard are ones that should actually belong to the shard.
3) For any of the shard's "known hosts" that are not actually supposed to belong to it, the shard can include these in a list of "invalid hosts" in the heartbeat response.
4) The shard client, upon receiving the heartbeat response, deletes from its local database any hosts it was told are invalid.

This allows us to easily propagate deletions back to the shards. Then we just need to add a shard_remove_boards rpc which actually accomplishes the deletion on the master, and we're set.
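At their core, the four steps above are set arithmetic between the master's view and the shard's reported state. A minimal sketch in plain Python, with in-memory sets standing in for the databases (all names here are illustrative, not the actual autotest API):

```python
def master_heartbeat(shard_hosts, known_ids):
    """Master side: given the shard's reported known_ids, return
    (new host ids to send to the shard, invalid host ids the shard
    should delete).

    shard_hosts: set of host ids that *should* belong to the shard.
    known_ids:   host ids the shard reported in its heartbeat.
    """
    new_hosts = shard_hosts - set(known_ids)   # step 2: hosts the shard is missing
    invalid = set(known_ids) - shard_hosts     # step 3: hosts the shard shouldn't have
    return new_hosts, invalid


def shard_apply_response(local_db, new_hosts, invalid):
    """Shard side (step 4): add newly assigned hosts, drop invalid ones."""
    local_db |= new_hosts
    local_db -= invalid
    return local_db


# Example: the master thinks the shard owns {1, 2}; the shard still knows {2, 3}.
new, bad = master_heartbeat({1, 2}, [2, 3])
db = shard_apply_response({2, 3}, new, bad)
```

The real implementation would plumb these sets through the Django models and the heartbeat rpc payloads rather than in-memory sets.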

Are there any weird races that make this a bad idea?
 
Labels: -current-issue
*3) when I said "the shard can" I mean obviously "the master can"

Comment 3 by aut...@google.com, Mar 28 2017

Labels: -Type-Bug OKR Type-Feature

Comment 4 by aut...@google.com, Mar 28 2017

Sounds like a (small) OKR candidate, marking as such
Owner: akes...@chromium.org
Status: Started (was: Untriaged)
Project Member

Comment 6 by bugdroid1@chromium.org, Mar 29 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a

commit b9077b9825143d43a63dc6c89b4f4d0c8facdf4a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Mar 29 18:42:15 2017

[autotest] shard_client remove incorrect hosts from shard

This is the first piece of the puzzle in making it possible to remove
hosts from a running shard.

BUG= chromium:704445 
TEST=shard_client_unittest, shard_client_integration_test

Change-Id: Ifd9d3fb754f2dfd593c8df29baa8bfcdadd41cf7
Reviewed-on: https://chromium-review.googlesource.com/457789
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a/scheduler/shard/shard_client.py
[modify] https://crrev.com/b9077b9825143d43a63dc6c89b4f4d0c8facdf4a/scheduler/shard/shard_client_unittest.py

Project Member

Comment 7 by bugdroid1@chromium.org, Mar 29 2017

Project Member

Comment 8 by bugdroid1@chromium.org, Mar 29 2017

Complication + request for comment.

The current shard_heartbeat has a race condition that would emerge if we allow labels to be removed from a shard. In particular, the heartbeat calls into this assign_to_shard code. (Relevant code from frontend/afe/models.py)

    @classmethod
    def assign_to_shard(cls, shard, known_ids):
        """Assigns hosts to a shard.

        For all labels that have been assigned to a shard, all hosts that
        have at least one of the shard's labels are assigned to the shard.
        Hosts that are assigned to the shard but aren't already present on the
        shard are returned.

        Any hosts that are in |known_ids| but that do not belong to the shard
        have incorrect ids, which are also returned so that the shard can
        remove them locally.

        Board to shard mapping is many-to-one. Many different boards can be
        hosted in a shard. However, DUTs of a single board cannot be distributed
        into more than one shard.

        @param shard: The shard object to assign labels/hosts for.
        @param known_ids: List of all host-ids the shard already knows.
                          This is used to figure out which hosts should be sent
                          to the shard. If shard_ids were used instead, hosts
                          would only be transferred once, even if the client
                          failed persisting them.
                          The number of hosts usually lies in O(100), so the
                          overhead is acceptable.

        @returns a tuple of (host objects that should be sent to the shard,
                             incorrect host ids that should not belong to the
                             shard)
        """

        # <SNIP>
        hosts = []

        host_ids = set(Host.objects.filter(
            labels__in=shard.labels.all(),
            leased=False
            ).exclude(
            id__in=known_ids,
            ).values_list('pk', flat=True))

        if host_ids:
            Host.objects.filter(pk__in=host_ids).update(shard=shard)
            hosts = list(Host.objects.filter(pk__in=host_ids).all())

        invalid_host_ids = list(Host.objects.filter(
            id__in=known_ids
            ).exclude(
            shard=shard
            ).values_list('pk', flat=True))

        return hosts, invalid_host_ids

The race condition is that we first compute the host_ids that should get updated. Then in a second query, we update those hosts with the shard they now belong to. If the shard labels change between these two queries (in particular, if a label was removed), this will be incorrect.


I pondered whether there was a way to use a last_updated_timestamp or similar on the shard table, and to add a check on it within this heartbeat/assign function. However, given that we compute host_ids and do the update in separate queries, I think it is generally impossible to avoid this race, at least with these queries as written.

Proposal 1: Combine the host_ids and host_update queries into a single select-and-update query.
Proposal 2: Include a timestamp check in the update query.
Proposal 3: Include the same labels__in and leased=False filters in the update query.

My django-fu is weak so I might need some hints for any of these approaches.
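As a toy illustration of why re-applying the filters at update time (Proposal 3) closes the race: the real fix would operate on Django querysets in frontend/afe/models.py; everything below is a simplified in-memory stand-in with illustrative names.

```python
def update_hosts(hosts, host_ids, shard, shard_labels):
    """Toy version of Proposal 3: the update step re-checks the *current*
    labels, so a host whose board label was removed between the select and
    the update is skipped instead of being wrongly assigned.
    (Stand-in for a Django .filter(...).update(...) call, not real code.)
    """
    updated = []
    for h in hosts:
        if h['id'] in host_ids and (h['labels'] & shard_labels) and not h['leased']:
            h['shard'] = shard
            updated.append(h['id'])
    return updated


# Select phase: hosts 1 and 2 both match the shard's labels.
hosts = [
    {'id': 1, 'labels': {'board:quawks'}, 'leased': False, 'shard': None},
    {'id': 2, 'labels': {'board:quawks'}, 'leased': False, 'shard': None},
]
host_ids = {1, 2}

# The race: board:quawks is removed from host 2 before the update runs.
hosts[1]['labels'] = set()

updated = update_hosts(hosts, host_ids, 'shard2', {'board:quawks'})
```

With the re-applied filter, only host 1 is assigned; host 2 is left alone rather than being handed to a shard its label no longer points at.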

Comment 11 by dshi@chromium.org, Apr 4 2017

#3 is much cleaner. Essentially it's similar to #1 but with less complication.

I'm not sure if shard.labels.all() is a dynamic query. Maybe we should cache that list before these two queries. Otherwise, the two queries could use two different sets of shard labels.
The shard.labels.all() won't be so complicated. Maybe similar to get_labels.startswith('board:'). Agree with dshi@, vote for proposal 3
#3 sounds right.

Pretty sure that shard.labels.all() returns a queryset, meaning that it doesn't actually execute the query until necessary, so yes, I think it is a dynamic query.

shard.labels.all() being different between the two queries is in fact what we *want*.
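The laziness question can be checked against a plain generator function, which behaves analogously (a Django QuerySet runs its SQL when iterated, not when constructed). This is only an analogy, not Django itself:

```python
labels = {'board:quawks', 'board:peppy'}


def shard_labels_all():
    # Analogous to shard.labels.all(): describes the query now,
    # but only reads the data when actually iterated.
    for label in sorted(labels):
        yield label


query = shard_labels_all()      # nothing has been evaluated yet
labels.discard('board:peppy')   # a label is removed before evaluation
result = list(query)            # evaluation sees the *current* state
```

This is exactly the property being discussed: two evaluations of the same queryset at different times can see different label sets.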

Comment 14 by dshi@chromium.org, Apr 4 2017

        if host_ids:
            Host.objects.filter(pk__in=host_ids).update(shard=shard)
            hosts = list(Host.objects.filter(pk__in=host_ids).all())

If we change that to 

        if host_ids:
            Host.objects.filter(pk__in=host_ids).update(shard=shard)
            hosts = list(Host.objects.filter(pk__in=host_ids, labels__in=shard.labels.all(),  leased=False).all())

Note that the first filter "pk__in=host_ids" will limit the hosts to the collection from the earlier query. If we use shard.labels.all() there and the label was already removed, might that cause some issue, since the query will return fewer hosts?
Blocking: 706262
Project Member

Comment 16 by bugdroid1@chromium.org, Apr 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/224b2205aa7d271c67f7c8041d9f86cba80a1928

commit 224b2205aa7d271c67f7c8041d9f86cba80a1928
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 06 08:42:17 2017

autotest: add a remove_boards_from_shard RPC

BUG= chromium:704445 
TEST=unittests added, and they pass

Change-Id: I1dbbc68cbb29a341e9d35d1bdb322f1bb398da74
Reviewed-on: https://chromium-review.googlesource.com/467990
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/224b2205aa7d271c67f7c8041d9f86cba80a1928/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/224b2205aa7d271c67f7c8041d9f86cba80a1928/frontend/afe/rpc_interface.py

Project Member

Comment 18 by bugdroid1@chromium.org, Apr 6 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/bf963c188531f899edded62641118140a97f21dd

commit bf963c188531f899edded62641118140a97f21dd
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 06 08:42:17 2017

autotest: fix race condition between heartbeat and label removal

BUG= chromium:704445 
TEST=New test added. Test verified to fail prior to fix.

Change-Id: Ifdc49e1937d1a3580a8689d05ea135afc7a4d445
Reviewed-on: https://chromium-review.googlesource.com/468086
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/bf963c188531f899edded62641118140a97f21dd/frontend/afe/models.py
[modify] https://crrev.com/bf963c188531f899edded62641118140a97f21dd/frontend/afe/rpc_interface_unittest.py

~/chromiumos/src/third_party/autotest/files/cli$ ./atest shard remove_board -w chromeos-autotest.hot -l board:quawks chromeos-shard2-staging.hot.corp.google.com                                                                                                   
None


A moment later, on the shard heartbeat log, I see
04/10 15:54:25.287 INFO |      shard_client:0328| Performing heartbeat.
04/10 15:54:25.291 INFO |      shard_client:0296| Known jobs: []
04/10 15:54:25.294 INFO |      shard_client:0302| Uploading jobs []
04/10 15:54:25.485 INFO |      shard_client:0171| Heartbeat response contains hosts []
04/10 15:54:25.485 INFO |      shard_client:0173| Heartbeat response contains jobs []
04/10 15:54:25.485 INFO |      shard_client:0177| Heartbeat response contains suite_keyvals_for jobs []
04/10 15:54:25.486 INFO |      shard_client:0180| Heartbeat response contains incorrect_host_ids [2, 12, 31] which will be deleted.
04/10 15:54:26.537 INFO |      shard_client:0352| Heartbeat completed.
04/10 15:55:26.890 INFO |      shard_client:0328| Performing heartbeat.


And I verified on the shard's web UI that the hosts are no longer listed.


Then, a few minutes later, I try
~/chromiumos/src/third_party/autotest/files/cli$ ./atest shard add_boards -w chromeos-autotest.hot -l board:quawks chromeos-shard2-staging.hot.corp.google.com
Added boards for shards: 
        'board:quawks', 'chromeos-shard2-staging.hot.corp.google.com'

and verify that after next heartbeat, boards reappear.


A few remaining things to fix:
 - `atest shard remove_board` should output a bit more information upon success (and should encourage the user to wait a minute and verify that the boards have been removed from the shard)

 - the boards were in the READY state on the master after removal from the shard, and were also in the READY state when re-added to the shard. I'm not sure whether this is problematic. Is it necessary to force some kind of verify job on them during these moves?

Comment 20 by dshi@chromium.org, Apr 10 2017

I think we force a repair after dut is added/removed from a shard.
You think we do, or we should?

I didn't see any action when I removed the DUT from the shard, which is no surprise since I didn't follow the delete_shard rpc flow. Forcing a job against the DUT at remove_board time sounds wrong, because the board isn't actually removed until the next heartbeat from that shard, so there could be a race between the master and any still-running job on the shard.

Forcing repair after adding a board to a shard makes sense, though I don't see any evidence that this is the case today.

Comment 22 by dshi@chromium.org, Apr 10 2017

shuqianz@? What's the workflow for removing shards? I think repair is enforced after duts are removed from a shard. The special task will be queued on the dut, so whenever the shard returns the dut to the master in the next tick, repair will be executed before any test can run.
As discussed in Issue 499865, a reboot is required on all DUTs removed from a shard to make sure the DUTs are still in ready status and have their testing logs in the master DB.

Repair was enforced because it contained a reboot at the time. In order to remove the reboot legacy from repair, a 'reboot_after_shard_deletion' job is now kicked off instead of repair.

A timeout has been added to the reboot_after_shard_deletion job to make sure it succeeds or is aborted in time (https://chromium-review.googlesource.com/c/462335/).


Project Member

Comment 24 by bugdroid1@chromium.org, Apr 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/307731525ec0e337c31da5d25686d54faa17276e

commit 307731525ec0e337c31da5d25686d54faa17276e
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu Apr 13 03:22:39 2017

autotest: add 'atest shard remove_board' subcommand

BUG= chromium:704445 
TEST=None

Change-Id: I51222c621d1aa20ae6ea23e0bf04ec1c86958029
Reviewed-on: https://chromium-review.googlesource.com/470046
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/307731525ec0e337c31da5d25686d54faa17276e/cli/shard.py

We tried this in prod yesterday, and it worked.

Comment 28 by dchan@google.com, May 30 2017

Labels: VerifyIn-60
Labels: VerifyIn-61

Comment 30 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
