Issue 717811

Issue metadata

Status: Archived
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug

Blocked on:
issue 719628



cleanup: drop tables from (make board transfer between shards less bumpy, by using a temporary label lockout)

Project Member Reported by akes...@chromium.org, May 3 2017

Issue description

Problem:

Between running "atest remove_board -l board:foo shard1" and "atest add_board -l board:foo shard2", ownership of the jobs for that board (at least any new ones, probably all outstanding ones too?) falls back to the master. During that window, the master's host-scheduler starts turning afe_jobs into afe_hqes, and the master's scheduler picks up those hqes and starts running them. This leads to:
 - short term increased load on the master
 - longer effective time for board to move to new shard, because many DUTs may get taken by master hqes and not become Ready again for a while
 - potential for inconsistencies or other weirdness between shard1, shard2, and master view of jobs and hqes.


Idea:

1) Create an afe_lockedout_labels table with (label, lockout_end_time) columns and an index on lockout_end_time.
2) During the "atest shard remove_board" rpc, add an entry to that table on the master with lockout_end_time set 10 minutes in the future.
3) In host-scheduler, on each tick, query the afe_lockedout_labels table for all entries with lockout_end_time in the future. If any such entries are found, ignore all jobs with those labels during this tick.

This will effectively lead to a 10 minute "freeze" on new hqes for that board, so that no new work is created for it for a few minutes while it lives on the master. If the board is added to a new shard, the freeze will expire. Or if 10 minutes elapse, then we can assume that the board is intended to live on the master, and the freeze will also expire.
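
A minimal sketch of the three steps above, for illustration only. It uses an in-memory sqlite3 database as a stand-in for the master's real database, and the function names (create_lockout_table, lock_out_label, active_lockouts, schedulable_jobs) are hypothetical, not autotest APIs. Only the table and column names come from the proposal.

```python
import sqlite3
import time

LOCKOUT_SECONDS = 10 * 60  # the 10 minute freeze from the proposal


def create_lockout_table(conn):
    """Step 1: table with (label, lockout_end_time) and an index on the time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS afe_lockedout_labels ("
        "label TEXT PRIMARY KEY, lockout_end_time REAL NOT NULL)")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS lockout_end_time_idx "
        "ON afe_lockedout_labels (lockout_end_time)")


def lock_out_label(conn, label, now=None):
    """Step 2: called from the (hypothetical) remove_board rpc handler."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO afe_lockedout_labels VALUES (?, ?)",
        (label, now + LOCKOUT_SECONDS))


def active_lockouts(conn, now=None):
    """Step 3a: labels whose lockout_end_time is still in the future."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT label FROM afe_lockedout_labels WHERE lockout_end_time > ?",
        (now,))
    return {row[0] for row in rows}


def schedulable_jobs(jobs, locked_labels):
    """Step 3b: skip any job that carries a locked-out label this tick.

    jobs is an iterable of (job_id, labels) pairs; this shape is an
    assumption made for the sketch, not the scheduler's real data model.
    """
    return [job_id for job_id, labels in jobs
            if not (set(labels) & locked_labels)]
```

Because expiry is just a comparison against lockout_end_time, nothing has to clean up the row for the freeze to end. A stale entry simply stops matching once its time has passed.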
 
Idea 2: Instead of creating a separate afe_lockedout_labels table, just add a lockout_end_time column to the labels table itself. My question here is: does the labels table get synced to shards somehow? If so, will this column also get synced, or only certain columns? We want this behavior only on the master, not on the shards, so we don't want lockout_end_time propagating to shards.
Let's try to do this without a new table / column if at all possible.

Another idea in the bag: update 'atest remove_board' to take an argument that says "lock DUT on addition to master".
This way, when the DUTs fall back, they're all locked and cannot be used for any jobs. host_scheduler will fail to match any jobs to the hosts and just sit there.

This avoids adding a new concept to the mix.

The biggest risk here is that DUT locking doesn't have a timeout like the one proposed above (or does it?)
I'd say add a timeout to DUT locking, set by default to 1 day. This is good practice anyway since we have DUTs locked indefinitely that we tend to forget about.
I don't like DUT locking, mainly because some DUTs might have already been locked. So now you have to keep track of which ones were locked by this request vs. previously locked, so you know which to unlock.

Also, I think in the naive implementation DUT locking would propagate to the new shard.
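
To make the bookkeeping concern concrete, here is a rough sketch of DUT locking with a default 1 day timeout that records which hosts this request locked, so that only those are unlocked later. The Host class and all function names are hypothetical and exist only for this sketch; they are not autotest's host model or rpc interface.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LOCK_TIMEOUT = 24 * 60 * 60  # 1 day, as suggested above


@dataclass
class Host:
    # Hypothetical stand-in for a DUT record.
    hostname: str
    locked: bool = False
    lock_reason: str = ""
    lock_expires: Optional[float] = None  # None means no timeout


def lock_for_transfer(hosts, reason, now, timeout=DEFAULT_LOCK_TIMEOUT):
    """Lock every currently unlocked host; return the hostnames we locked."""
    newly_locked = []
    for host in hosts:
        if host.locked:
            continue  # locked by someone else earlier; leave it alone
        host.locked = True
        host.lock_reason = reason
        host.lock_expires = now + timeout
        newly_locked.append(host.hostname)
    return newly_locked


def _unlock(host):
    host.locked = False
    host.lock_reason = ""
    host.lock_expires = None


def unlock_transfer_locks(hosts, newly_locked):
    """Unlock only the hosts that lock_for_transfer locked."""
    names = set(newly_locked)
    for host in hosts:
        if host.hostname in names:
            _unlock(host)


def expire_locks(hosts, now):
    """Release any lock whose timeout has passed."""
    for host in hosts:
        if (host.locked and host.lock_expires is not None
                and now >= host.lock_expires):
            _unlock(host)
```

The list returned by lock_for_transfer is exactly the extra state the objection above is about: it has to be stored somewhere between remove_board and add_board.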

Comment 4 by aut...@google.com, May 5 2017

Labels: -current-issue
Owner: akes...@chromium.org
Blockedon: 719628
Issue 716854 has been merged into this issue.
Project Member

Comment 7 by bugdroid1@chromium.org, May 31 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6a156ff6e5abf85434454bc852b7b739479c82ba

commit 6a156ff6e5abf85434454bc852b7b739479c82ba
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed May 31 09:15:27 2017

autotest: add label lockout table

BUG= chromium:717811 
TEST=None

Change-Id: I2ad7c415228daa854bd04467be097efb06861fd0
Reviewed-on: https://chromium-review.googlesource.com/517392
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[add] https://crrev.com/6a156ff6e5abf85434454bc852b7b739479c82ba/frontend/migrations/116_add_label_lockout_table.py

Issue 716586 has been merged into this issue.
Labels: -Pri-2 Hotlist-Fixit Pri-3
Summary: cleanup: drop tables from (make board transfer between shards less bumpy, by using a temporary label lockout) (was: make board transfer between shards less bumpy, by using a temporary label lockout)
On further thought, I'm not as confident in the design of this.

Separately, there was a fix to scheduler crashloops caused by job inconsistencies that these migrations can introduce. I think that fix precludes the need for this work.
Project Member

Comment 10 by bugdroid1@chromium.org, Jul 25 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2abe495c0b9c59a4044688e461a0eeecb7c23a7a

commit 2abe495c0b9c59a4044688e461a0eeecb7c23a7a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Tue Jul 25 21:04:09 2017

autotest: drop unused label_lockout_table

BUG= chromium:717811 
TEST=None

Change-Id: I1d4848ca87e1eb37fc1df4c3ce55808631b1bc62
Reviewed-on: https://chromium-review.googlesource.com/583513
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[add] https://crrev.com/2abe495c0b9c59a4044688e461a0eeecb7c23a7a/frontend/migrations/117_drop_label_lockout_table.py

Status: Fixed (was: Untriaged)

Comment 12 by dchan@chromium.org, Jan 22 2018

Status: Archived (was: Fixed)
