atest shard delete is slow and unreliable
Issue description
The relevant code from rpc_interface.py:
def delete_shard(hostname):
    """Delete a shard and reclaim all resources from it.

    This claims back all assigned hosts from the shard.
    """
    shard = rpc_utils.retrieve_shard(shard_hostname=hostname)

    # Remove shard information.
    models.Host.objects.filter(shard=shard).update(shard=None)
    models.Job.objects.filter(shard=shard).update(shard=None)
    shard.labels.clear()
    shard.delete()
The part here that appears to be slow is the jobs query. When I run that manually via a mysql connection ("select * from afe_jobs where shard_id=145;") it takes a long time, despite the fact that there is a shard_id index in this table.
,
Feb 8 2018
The reason that query is so damn slow, while the shard heartbeat queries manage to be somewhat performant, is that the heartbeat queries use some kind of join that filters down to just incomplete jobs.
,
Feb 8 2018
And for the record:
MySQL [chromeos_autotest_db]> select count(*) from afe_jobs where shard_id=145;
+----------+
| count(*) |
+----------+
|  1600299 |
+----------+
1 row in set (3 min 28.53 sec)
,
Feb 9 2018
A hacky workaround is to do the update via database surgery that uses smaller updates, e.g. running the following query several times:
update afe_jobs set shard_id = NULL where shard_id=149 limit 100000;
,
Feb 9 2018
I think the delete_shard rpc can do the same. The shard setting on a completed job has no significance, so it doesn't matter if it's mismatched.
,
Feb 9 2018
There's a foreign key constraint from afe_jobs -> shard_table, so I think the update is necessary for all entries prior to deleting from shard_table.
,
Feb 9 2018
The delete_shard rpc is implemented via a django query. Django has no support for UPDATE ... LIMIT queries because it is apparently nonstandard SQL. We could change the delete_shard rpc to loop over a raw UPDATE ... LIMIT query.
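A rough sketch of what such a loop could look like, assuming a DB-API style cursor (for example the one returned by Django's django.db.connection.cursor()) talking to MySQL. The function name clear_shard_in_batches and the default batch size are made up for illustration, and this is not the code that eventually landed. It also assumes each statement is committed on its own (autocommit), so that no single transaction has to cover all the rows.

```python
def clear_shard_in_batches(cursor, shard_id, batch_size=100000):
    """Set shard_id to NULL on afe_jobs rows for one shard, a batch at a time.

    `cursor` is any DB-API cursor that accepts %s placeholders and exposes
    `rowcount` after an UPDATE (an assumption; MySQL-backed Django cursors do).
    Returns the total number of rows updated.
    """
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')

    # UPDATE ... LIMIT is MySQL-specific, which is why this is raw SQL.
    sql = 'UPDATE afe_jobs SET shard_id = NULL WHERE shard_id = %s LIMIT %s'
    total = 0
    while True:
        cursor.execute(sql, [shard_id, batch_size])
        # Some drivers report -1 when the count is unknown; treat that as 0.
        updated = max(cursor.rowcount, 0)
        total += updated
        # A short batch means no matching rows are left.
        if updated < batch_size:
            return total
```

Each iteration touches at most batch_size rows, so a failure partway through leaves earlier batches applied and the call can simply be retried.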
,
Feb 9 2018
Or we can leave the shard there and just flag it as disabled, so the foreign key constraint won't be broken. Then we only need to move the incomplete jobs to master.
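To make this alternative concrete, here is a toy in-memory sketch. The Shard and Job classes, the disabled flag, and retire_shard are all hypothetical stand-ins, not the real autotest models (which have no such flag as far as this thread shows). It only illustrates the idea: the shard row stays, and only incomplete jobs lose their shard assignment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shard:
    hostname: str
    disabled: bool = False  # hypothetical flag, not an existing column


@dataclass
class Job:
    job_id: int
    shard: Optional[Shard]
    complete: bool


def retire_shard(shard: Shard, jobs: List[Job]) -> int:
    """Mark the shard disabled and hand its incomplete jobs back to master.

    Completed jobs keep pointing at the (now disabled) shard, so nothing
    that references the shard has to be rewritten in bulk.
    Returns the number of jobs that were moved.
    """
    shard.disabled = True
    moved = 0
    for job in jobs:
        if job.shard is shard and not job.complete:
            job.shard = None  # None stands for "owned by master" here
            moved += 1
    return moved
```

With 1.6 million mostly finished jobs on a shard, this would touch only the handful that are still running.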
,
Feb 9 2018
I don't like leaving the jobs lying around either, because they might end up being reincarnated in a strange way if we revive a once-dead shard.
,
Feb 9 2018
For finished jobs, having the shard set to a disabled shard shouldn't cause any problem. The worst thing I can think of is the job view showing the old shard for those jobs, and that has no side effects.
,
Feb 10 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a9ab98d31018389691bbe9540cb7b147daf2c8ff

commit a9ab98d31018389691bbe9540cb7b147daf2c8ff
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sat Feb 10 05:37:14 2018

autotest: make shard deletion rpc more reliable with iterative deletion

BUG= chromium:810570
TEST=Tested interactively

Change-Id: I22ebd89895e1a59aeab36171338b33513b476f58
Reviewed-on: https://chromium-review.googlesource.com/910326
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/a9ab98d31018389691bbe9540cb7b147daf2c8ff/frontend/afe/rpc_interface.py
,
Feb 12 2018
Comment 1 by akes...@chromium.org, Feb 8 2018