
Issue 881908


Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




limit batch size of host and job transfer between master-shard

Project Member Reported by pprabhu@chromium.org, Sep 7

Issue description

This is a follow-up to the outage in https://bugs.chromium.org/p/chromium/issues/detail?id=880450#c15

Migrating a board with a large number of DUTs from master to shard led to an avalanche of jobs being migrated from master to shard. This locked up the shard_client on the shard and everything else on the shard came to a grinding halt.

We were lucky that this did not also lock the master processes / master DB.
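One way to picture the limit requested in the title is a cap on how many jobs move per heartbeat. This is a minimal sketch under assumptions, not the actual shard_client or master code: `take_batch`, `max_batch_size`, and the job ids are all hypothetical names and values chosen for illustration.

```python
from collections import deque


def take_batch(pending, max_batch_size):
    """Pop at most max_batch_size items from the front of the pending queue."""
    batch = []
    while pending and len(batch) < max_batch_size:
        batch.append(pending.popleft())
    return batch


# Hypothetical job ids waiting to move from master to shard.
pending_jobs = deque(range(1, 11))

# Each loop iteration stands in for one heartbeat; the cap of 4 is arbitrary.
heartbeats = []
while pending_jobs:
    heartbeats.append(take_batch(pending_jobs, max_batch_size=4))
```

With a cap like this, a large board would drain over several heartbeats instead of arriving as one avalanche, at the cost of a slower migration.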
 
Labels: -Pri-2 Pri-1
While removing a board from a shard, the host is deleted. That process takes forever because:

[1] All hosts are deleted within a single heartbeat
[2] This particular line takes forever: http://shortn/_wJLajAYDgB because it serially deletes all HQEs (there are 1000s?), even the ones that are already complete (issue 881491)
And while adding a board, this step takes forever in the heartbeat:

09/07 12:40:39.705 INFO |      shard_client:0321| Uploading jobs [235119729L, 235207358L]
09/07 12:45:20.133 INFO |            models:1981| None/234938000 (235671770) -> Queued
09/07 12:45:20.207 INFO |            models:1981| None/234938024 (235671793) -> Queued
09/07 12:45:20.262 INFO |            models:1981| None/234944272 (235678079) -> Queued
09/07 12:45:20.308 INFO |            models:1981| None/234944320 (235678128) -> Queued
09/07 12:45:20.354 INFO |            models:1981| None/234944364 (235678172) -> Queued
09/07 12:45:20.399 INFO |            models:1981| None/234944450 (235678260) -> Queued
09/07 12:45:20.461 INFO |            models:1981| None/234944487 (235678296) -> Queued
...



The step after "Uploading jobs" was stuck for 5 minutes in this case. When moving nami, it was stuck for ~15 minutes.
Then, all the newly obtained jobs for the newly obtained hosts were inserted quickly in this case. In the case of nami, that took well over half an hour.
Cc: akes...@chromium.org
Labels: -Chase-Pending Chase
Owner: zamorzaev@chromium.org
In the meeting I thought I had other bugs with similar context, but I can't find them, so I guess not.
Cc: pprabhu@chromium.org
Regarding #c2 bullet point [2], what should the right behavior be? 

Do we want to only restrict to the active hqes when performing delete?

Are the completed hqes not deleted automatically?
> Do we want to only restrict to the active hqes when performing delete?

Yes, that would be one option, sounds worth investigation.

> Are the completed hqes not deleted automatically?

No, they are not.
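The option discussed above (touch only the HQEs that are not yet complete, and do so in bounded chunks rather than one long serial pass) could look roughly like the sketch below. It uses plain dicts as stand-ins for HQE rows; the `complete` key, `incomplete_hqes`, `chunked`, and the chunk size are assumptions for illustration, not the real model code.

```python
def incomplete_hqes(hqes):
    """Return only the entries that are not already complete."""
    return [hqe for hqe in hqes if not hqe["complete"]]


def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical HQE rows: ids 1..12, where every third one is still incomplete.
hqes = [{"id": i, "complete": i % 3 != 0} for i in range(1, 13)]

# Only incomplete entries are processed, three at a time.
to_process = incomplete_hqes(hqes)
batches = [[hqe["id"] for hqe in chunk] for chunk in chunked(to_process, 3)]
```

Whether completed HQEs can safely be skipped during host deletion is exactly what is still under investigation here, so treat the filter as a question, not an answer.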
Still looking, no progress expected this week (deputy).
Labels: -Pri-1 -Chase Pri-2
workaround:
 - shard's remove_board marks invalid rather than deletes host
 - sentinel later comes by and deletes invalid hosts on shards
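The two-step workaround can be sketched as follows, with hosts modeled as plain dicts. This is an illustration under assumptions only: `remove_board` and `sentinel_sweep` here are hypothetical stand-ins, the `invalid` and `board` keys are placeholders, and the `max_deletes` cap is an addition of this sketch, not something the workaround specifies.

```python
def remove_board(hosts, board):
    """Mark every host of the board invalid; return how many were marked."""
    marked = 0
    for host in hosts:
        if host["board"] == board and not host["invalid"]:
            host["invalid"] = True
            marked += 1
    return marked


def sentinel_sweep(hosts, max_deletes):
    """Drop at most max_deletes invalid hosts; return the hosts that remain."""
    kept, deleted = [], 0
    for host in hosts:
        if host["invalid"] and deleted < max_deletes:
            deleted += 1  # stands in for the real (expensive) delete
        else:
            kept.append(host)
    return kept


# Hypothetical shard inventory: five nami hosts and two hosts of another board.
hosts = [{"name": "nami-%d" % i, "board": "nami", "invalid": False}
         for i in range(5)]
hosts += [{"name": "eve-%d" % i, "board": "eve", "invalid": False}
          for i in range(2)]

marked = remove_board(hosts, "nami")              # cheap: flags only
after_first = sentinel_sweep(hosts, max_deletes=2)
after_second = sentinel_sweep(after_first, max_deletes=2)
after_third = sentinel_sweep(after_second, max_deletes=2)
```

The point of the split is that the heartbeat only pays for flipping a flag, while the slow deletes happen later and outside the heartbeat.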
Status: Assigned (was: Untriaged)
