Limit batch size of host and job transfers between master and shard
Issue description

This is a follow-up to the outage in https://bugs.chromium.org/p/chromium/issues/detail?id=880450#c15. Migrating a board with a large number of DUTs from master to shard led to an avalanche of jobs being migrated from master to shard. This locked up the shard_client on the shard, and everything else on the shard came to a grinding halt. We were lucky that this did not also lock up the master processes / master DB.
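The fix the title asks for is a cap on how much one heartbeat may transfer. A minimal sketch of that idea in plain Python (the function and constant names here are illustrative, not the real shard_client API):

```python
# Illustrative sketch of capping per-heartbeat transfer size. The names
# (transfer_one_heartbeat, MAX_TRANSFER_PER_HEARTBEAT) are made up; they
# are not the real shard_client API.

MAX_TRANSFER_PER_HEARTBEAT = 100  # assumed cap, would be tuned in practice

def transfer_one_heartbeat(pending_ids, batch_size=MAX_TRANSFER_PER_HEARTBEAT):
    """Take at most batch_size items this heartbeat; defer the rest.

    Returns (batch_to_send, remainder). The remainder is picked up on
    subsequent heartbeats, so one huge board migration can no longer
    monopolize a single heartbeat.
    """
    return pending_ids[:batch_size], pending_ids[batch_size:]
```

The avalanche above corresponds to batch_size being effectively unbounded; any finite cap spreads the migration over several heartbeats instead.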
Sep 7
While removing a board from a shard, the host is deleted. That process takes forever because:
[1] All hosts are deleted within a single heartbeat.
[2] This particular line takes forever: http://shortn/_wJLajAYDgB because it serially deletes all HQEs (there are 1000s?), even the ones that are already complete (issue 881491).
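One possible mitigation for [2], sketched in plain Python rather than the real Django ORM: filter down to still-active HQEs before touching anything, so completed HQEs are skipped entirely. The status names and dict-based records below are assumptions, not the real schema:

```python
# Sketch only: restrict host-removal cleanup to active HQEs so the
# (much larger) set of already-complete HQEs is never iterated.
# Status names and record shape are illustrative assumptions.

ACTIVE_STATUSES = {'Queued', 'Starting', 'Running'}  # assumed "active" set

def active_hqes(hqes):
    """Return only the HQEs that still need cleanup when a host is removed."""
    return [hqe for hqe in hqes if hqe['status'] in ACTIVE_STATUSES]
```

With thousands of historical HQEs per host but only a handful active, this turns the dominant cost from "all HQEs ever" into "currently running work".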
Sep 7
And while adding a board, this step takes forever in the heartbeat:

09/07 12:40:39.705 INFO | shard_client:0321| Uploading jobs [235119729L, 235207358L]
09/07 12:45:20.133 INFO | models:1981| None/234938000 (235671770) -> Queued
09/07 12:45:20.207 INFO | models:1981| None/234938024 (235671793) -> Queued
09/07 12:45:20.262 INFO | models:1981| None/234944272 (235678079) -> Queued
09/07 12:45:20.308 INFO | models:1981| None/234944320 (235678128) -> Queued
09/07 12:45:20.354 INFO | models:1981| None/234944364 (235678172) -> Queued
09/07 12:45:20.399 INFO | models:1981| None/234944450 (235678260) -> Queued
09/07 12:45:20.461 INFO | models:1981| None/234944487 (235678296) -> Queued
...

The step after "Uploading jobs" was stuck for 5 minutes in this case; when moving nami, it was stuck for ~15 minutes. Then all the newly obtained jobs for the newly obtained hosts were inserted quickly in this case; for nami, that took well over half an hour.
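The gaps between adjacent log lines above suggest roughly one status update every ~50 ms, and at that serial rate the observed stalls are about what you'd expect. A back-of-envelope sketch (the 50 ms figure is read off the timestamps above, not a measurement):

```python
# Back-of-envelope cost of serial per-row updates, assuming ~50 ms per
# update as suggested by the timestamp gaps in the log above.

PER_UPDATE_MS = 50  # assumed; read off adjacent log-line timestamps

def serial_cost_seconds(n_rows, per_update_ms=PER_UPDATE_MS):
    """Seconds to process n_rows one at a time at per_update_ms each."""
    return n_rows * per_update_ms / 1000
```

At 50 ms per row, ~6000 rows already costs five minutes and ~36000 rows costs half an hour, which is in the ballpark of the stalls seen here. Batching the inserts/updates would amortize that per-row round-trip cost.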
Sep 17
In the meeting I thought I had other bugs with similar context, but I can't find them, so I guess not.
Sep 21
Regarding #c2 bullet point [2], what should the right behavior be? Do we want to restrict the delete to only the active HQEs? Are the completed HQEs not deleted automatically?
Sep 21
> Do we want to only restrict to the active HQEs when performing delete?

Yes, that would be one option; it sounds worth investigating.

> Are the completed HQEs not deleted automatically?

No, they are not.
Sep 24
Still looking; no progress expected this week (on deputy duty).
Oct 1
Workaround:
- the shard's remove_board marks the host invalid rather than deleting it
- a sentinel later comes by and deletes the invalid hosts on shards
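That workaround is essentially a soft-delete pattern: the expensive deletion is moved off the heartbeat's critical path. A minimal sketch in plain Python (the `invalid` flag, record shape, and function names are illustrative, not the real models):

```python
# Illustrative soft-delete sketch for the workaround: remove_board flips a
# flag quickly, and a separate sentinel performs the expensive deletion
# later. Names and record shape are assumptions, not the real schema.

def remove_board(hosts, board):
    """Fast path on the shard: mark matching hosts invalid, delete nothing."""
    for host in hosts:
        if host['board'] == board:
            host['invalid'] = True

def sentinel_sweep(hosts):
    """Slow path, off the heartbeat: actually drop the invalid hosts."""
    return [h for h in hosts if not h.get('invalid')]
```

The heartbeat now does O(hosts-on-board) flag flips instead of cascading HQE deletions, and the sentinel can delete at its own pace (and in bounded batches).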
Oct 25
Comment 1 by pprabhu@chromium.org, Sep 7