Issue 810570

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




atest shard delete is slow and unreliable

Project Member Reported by akes...@chromium.org, Feb 8 2018

Issue description

The relevant code from rpc_interface.py:

def delete_shard(hostname):
    """Delete a shard and reclaim all resources from it.

    This claims back all assigned hosts from the shard.
    """
    shard = rpc_utils.retrieve_shard(shard_hostname=hostname)

    # Remove shard information.
    models.Host.objects.filter(shard=shard).update(shard=None)
    models.Job.objects.filter(shard=shard).update(shard=None)
    shard.labels.clear()
    shard.delete()


The slow part here appears to be the jobs query. When I run the equivalent query manually over a MySQL connection ("select * from afe_jobs where shard_id=145;"), it takes a long time, even though this table has an index on shard_id.
 
The query above *finally* returned. It took 5 minutes. :/

That said, I'm not convinced that it is safe to drop that part of the shard deletion cleanup.
The reason that query is so damn slow, while the shard heartbeat queries manage to be a little bit performant, is that the heartbeat queries use some kind of join that filters down to just the incomplete jobs.
And for the record:

MySQL [chromeos_autotest_db]> select count(*) from afe_jobs where shard_id=145;
+----------+
| count(*) |
+----------+
|  1600299 |
+----------+
1 row in set (3 min 28.53 sec)

A hacky workaround is to do the update via manual database surgery in smaller batches, e.g. by running the following query repeatedly until it no longer affects any rows:

update afe_jobs set shard_id = NULL where shard_id=149 limit 100000;


Comment 5 by dshi@chromium.org, Feb 9 2018

I think the delete_shard rpc can do the same. The shard setting on a completed job has no significance, even if it is mismatched.
There's a foreign key constraint from afe_jobs -> shard_table, so I think the update is necessary for all entries before deleting from shard_table.
The delete_shard rpc is implemented via a Django query. Django does not support UPDATE ... LIMIT queries, apparently because they are nonstandard SQL.

We could change the delete_shard rpc to loop over a raw UPDATE ... LIMIT query.
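
A minimal sketch of what such a loop might look like, for illustration only. This is not the code from the eventual fix. The function name clear_shard_from_jobs is hypothetical; the cursor is assumed to be a DB-API cursor (such as the one Django's connection.cursor() returns) that uses the %s parameter style and reports affected rows via rowcount, as MySQL drivers do. The default batch size just mirrors the manual workaround above.

```python
def clear_shard_from_jobs(cursor, shard_id, batch_size=100000):
    """Null out shard_id on afe_jobs in batches (hypothetical helper).

    Issues UPDATE ... LIMIT (MySQL-specific syntax) repeatedly, so that
    no single statement has to touch every job row of the shard.

    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cursor.execute(
            'UPDATE afe_jobs SET shard_id = NULL '
            'WHERE shard_id = %s LIMIT %s',
            [shard_id, batch_size])
        updated = cursor.rowcount
        # Stop once a batch touches no rows: nothing references the shard.
        if updated <= 0:
            break
        total += updated
    return total
```

Inside delete_shard this would presumably replace the single models.Job.objects.filter(shard=shard).update(shard=None) call, e.g. clear_shard_from_jobs(connection.cursor(), shard.id), before the shard row itself is deleted.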

Comment 8 by dshi@chromium.org, Feb 9 2018

Or we can leave the shard there and just flag it as disabled, so the foreign key constraint won't be broken. Then we only need to move the incomplete jobs to the master.
I don't like leaving the jobs lying around either, because they might end up being reincarnated in a strange way if we revive a once-dead shard.

Comment 10 by dshi@chromium.org, Feb 9 2018

For finished jobs, having the shard set to a disabled shard shouldn't cause any problems. The worst thing I can think of is the job view showing the old shard as the job's shard. That has no side effects.
Project Member

Comment 11 by bugdroid1@chromium.org, Feb 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a9ab98d31018389691bbe9540cb7b147daf2c8ff

commit a9ab98d31018389691bbe9540cb7b147daf2c8ff
Author: Aviv Keshet <akeshet@chromium.org>
Date: Sat Feb 10 05:37:14 2018

autotest: make shard deletion rpc more reliable with iterative deletion

BUG= chromium:810570 
TEST=Tested interactively

Change-Id: I22ebd89895e1a59aeab36171338b33513b476f58
Reviewed-on: https://chromium-review.googlesource.com/910326
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/a9ab98d31018389691bbe9540cb7b147daf2c8ff/frontend/afe/rpc_interface.py

Status: Fixed (was: Untriaged)
