cleanup: drop tables from (make board transfer between shards less bumpy, by using a temporary label lockout)
Issue description

Problem: During the time between when you run "atest remove_board -l board:foo shard1" and "atest add_board -l board:foo shard2", the jobs for that board (at least any new ones, probably all outstanding ones too?) fall back to the master for ownership. During that time, master's host-scheduler will start turning afe_jobs into afe_hqes, and master's scheduler will pick up those hqes and start running them.

This leads to:
- short term increased load on the master
- longer effective time for the board to move to the new shard, because many DUTs may get taken by master hqes and not become Ready again for a while
- potential for inconsistencies or other weirdness between the shard1, shard2, and master views of jobs and hqes.

Idea:
1) Create an afe_lockedout_labels table with (label, lockout_end_time) columns and an index on lockout_end_time.
2) During the "atest shard remove_board" rpc, add an entry with lockout_end_time 10 minutes in the future to that table on the master.
3) In host-scheduler, in each tick, query the afe_lockedout_labels table for all entries with lockout_end_time in the future. If any such entries are found, ignore all jobs with those labels during this tick.

This will effectively lead to a 10 minute "freeze" on new hqes for that board, so that no new work is created for it for a few minutes while it lives on the master. If the board is added to a new shard, the freeze will expire. Or if 10 minutes elapse, then we can assume that the board is intended to live on the master, and the freeze will also expire.
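A rough sketch of steps 1) to 3), to make the proposal concrete. This is not the actual autotest migration or scheduler code: it uses an in-memory sqlite3 database instead of the AFE database, and the helper names (create_lockout_table, lock_out_label, locked_out_labels, schedulable_jobs) and the (job_id, label) job tuples are made up for illustration. Only the table and column names come from the description above.

```python
import sqlite3

LOCKOUT_SECONDS = 10 * 60  # the 10 minute freeze from the proposal


def create_lockout_table(conn):
    # Step 1: (label, lockout_end_time) columns, indexed on the end time.
    conn.execute(
        "CREATE TABLE afe_lockedout_labels ("
        "label TEXT PRIMARY KEY, lockout_end_time INTEGER NOT NULL)")
    conn.execute(
        "CREATE INDEX lockout_end_time_idx "
        "ON afe_lockedout_labels (lockout_end_time)")


def lock_out_label(conn, label, now):
    # Step 2: would be called from the remove_board rpc on the master.
    # Re-running it for the same label simply restarts the freeze.
    conn.execute(
        "INSERT OR REPLACE INTO afe_lockedout_labels VALUES (?, ?)",
        (label, now + LOCKOUT_SECONDS))


def locked_out_labels(conn, now):
    # Step 3: run once per host-scheduler tick; expired rows are ignored.
    rows = conn.execute(
        "SELECT label FROM afe_lockedout_labels WHERE lockout_end_time > ?",
        (now,))
    return {row[0] for row in rows}


def schedulable_jobs(jobs, conn, now):
    # jobs: list of (job_id, label) pairs, a hypothetical stand-in for afe_jobs.
    frozen = locked_out_labels(conn, now)
    return [job for job in jobs if job[1] not in frozen]
```

With this shape nothing has to clean up after the freeze: a row whose lockout_end_time has passed just stops matching the per-tick query.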
,
May 3 2017
Let's try to do this without a new table / column if at all possible. Another idea in the bag: update 'atest remove_board' to take an argument that says "lock DUT on addition to master". This way, when the DUTs fall back, they're all locked and cannot be used for any jobs. host_scheduler will fail to match any jobs to the hosts and just sit there. This avoids adding a new concept to the mix. The biggest risk here is that DUT locking doesn't have a timeout like the one proposed above (or does it?). I'd say add a timeout to DUT locking, set by default to 1 day. This is good practice anyway, since we have DUTs locked indefinitely that we tend to forget about.
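For illustration, a lock with a timeout could look roughly like the sketch below. The Host class, the lock_expiry field, and both helpers are hypothetical and do not reflect how autotest stores host locks; the only thing taken from the comment above is the 1 day default.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_LOCK_SECONDS = 24 * 60 * 60  # the suggested 1 day default


@dataclass
class Host:
    hostname: str
    lock_expiry: Optional[int] = None  # None means the host is not locked


def lock_host(host, now, timeout=DEFAULT_LOCK_SECONDS):
    # Record when the lock ends instead of a plain locked flag.
    host.lock_expiry = now + timeout


def is_locked(host, now):
    # An expired lock counts as unlocked, so forgotten locks clear themselves.
    return host.lock_expiry is not None and now < host.lock_expiry
```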
,
May 3 2017
I don't like DUT locking, mainly because some DUTs might have already been locked. So now you have to keep track of which ones were locked by this request vs. previously locked, so you know which to unlock. Also, I think in the naive implementation DUT locking would propagate to the new shard.
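To show the bookkeeping this would require, here is a minimal sketch. The lock_state dict (hostname to locked flag) and both function names are hypothetical; this is not autotest code.

```python
def lock_for_transfer(lock_state, hostnames):
    """Lock the given hosts and return only the ones this call locked."""
    newly_locked = []
    for name in hostnames:
        if not lock_state.get(name, False):
            lock_state[name] = True
            newly_locked.append(name)
    return newly_locked


def unlock_after_transfer(lock_state, newly_locked):
    # Only undo our own locks; hosts locked before the transfer stay locked.
    for name in newly_locked:
        lock_state[name] = False
```

The list returned by lock_for_transfer has to be kept somewhere between the remove_board and add_board calls, which is exactly the extra state the comment above objects to.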
,
May 5 2017
,
May 8 2017
,
May 26 2017
Issue 716854 has been merged into this issue.
,
May 31 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/6a156ff6e5abf85434454bc852b7b739479c82ba

commit 6a156ff6e5abf85434454bc852b7b739479c82ba
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed May 31 09:15:27 2017

autotest: add label lockout table

BUG= chromium:717811
TEST=None

Change-Id: I2ad7c415228daa854bd04467be097efb06861fd0
Reviewed-on: https://chromium-review.googlesource.com/517392
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[add] https://crrev.com/6a156ff6e5abf85434454bc852b7b739479c82ba/frontend/migrations/116_add_label_lockout_table.py
,
Jun 12 2017
Issue 716586 has been merged into this issue.
,
Jul 24 2017
On further thought, I'm not as confident in the design of this. Separately, there was a fix to scheduler crashloops caused by job inconsistencies that these migrations can introduce. I think that fix precludes the need for this work.
,
Jul 25 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2abe495c0b9c59a4044688e461a0eeecb7c23a7a

commit 2abe495c0b9c59a4044688e461a0eeecb7c23a7a
Author: Aviv Keshet <akeshet@chromium.org>
Date: Tue Jul 25 21:04:09 2017

autotest: drop unused label_lockout_table

BUG= chromium:717811
TEST=None

Change-Id: I1d4848ca87e1eb37fc1df4c3ce55808631b1bc62
Reviewed-on: https://chromium-review.googlesource.com/583513
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[add] https://crrev.com/2abe495c0b9c59a4044688e461a0eeecb7c23a7a/frontend/migrations/117_drop_label_lockout_table.py
,
Aug 14 2017
,
Jan 22 2018
Comment 1 by akes...@chromium.org
, May 3 2017