device_os and device_type missing from quarantined/dead android bots
Issue description

device_os and device_type dimensions are missing from 3 of our Android Go swarming bots in the chrome-swarming swarming pool:
build30-a7--device6: https://chrome-swarming.appspot.com/bot?id=build30-a7--device6&sort_stats=total%3Adesc
build31-a7--device1: https://chrome-swarming.appspot.com/bot?id=build31-a7--device1&sort_stats=total%3Adesc
build31-a7--device4: https://chrome-swarming.appspot.com/bot?id=build31-a7--device4&sort_stats=total%3Adesc

We are running into an issue with Android Go bots that are not showing the device_type and device_os dimensions, so our jobs fail to find them when querying for devices by those dimensions. This is especially problematic with our new device affinity implementation, which needs the complete set of bots to do its assignment. Note that this is only happening for bots that are quarantined or dead. Is this a known issue with Android?

When I look at the last tasks that completed on the bots, I see that those dimensions were present when the task executed: https://chrome-swarming.appspot.com/task?id=3d0ca82650024410&refresh=10&show_raw=1
May 1 2018
And for dead bots: Docker on the host machine only spawns containers for devices it currently sees. If a phone drops offline and completely disappears, its container will eventually get reaped and won't start back up, so the swarming bot will likewise go MIA.
May 1 2018
OK, that is kind of what I thought (and feared). It breaks our assumption in perf that we can query on a set of dimensions, get all eligible bots (dead, quarantined, alive, etc.), and then prune from there. Yes, healing the bots is necessary, but it doesn't fix this issue. I will have to see what we can do on our end in our device affinity algorithm to account for this with Android.
May 1 2018
This is actually potentially more problematic for the perf use case, since we trigger jobs on dead bots when there aren't enough healthy ones. I guess if we don't know about the bot, then we are just going to have to trigger on a dummy id so that we can see the shard and tests failing.
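
To make the failure mode concrete, here is a minimal sketch of the "query by dimensions, then prune" selection described above. It is not perf's actual device affinity code: the bot record layout (`bot_id`, `dimensions`, `is_dead`, `quarantined`), the function names, and the sample values are all assumptions made for illustration.

```python
from typing import Dict, List


def matches(bot: dict, required: Dict[str, str]) -> bool:
    """True if the bot reports every required dimension value.

    Assumes each bot record maps a dimension key to a list of values.
    """
    dims = bot.get("dimensions", {})
    return all(value in dims.get(key, []) for key, value in required.items())


def pick_bots(bots: List[dict], required: Dict[str, str], shards: int) -> List[str]:
    """Pick up to `shards` bot ids, preferring healthy bots over dead/quarantined ones."""
    eligible = [b for b in bots if matches(b, required)]
    # False sorts before True, so healthy bots come first; ties break on bot id.
    eligible.sort(
        key=lambda b: (
            bool(b.get("is_dead") or b.get("quarantined")),
            b["bot_id"],
        )
    )
    return [b["bot_id"] for b in eligible[:shards]]


# Hypothetical sample data: bot-c is dead and has lost its device dimensions.
SAMPLE_BOTS = [
    {"bot_id": "bot-a", "dimensions": {"pool": ["example"], "device_os": ["X"]}},
    {"bot_id": "bot-b", "is_dead": True,
     "dimensions": {"pool": ["example"], "device_os": ["X"]}},
    {"bot_id": "bot-c", "is_dead": True, "dimensions": {"pool": ["example"]}},
]
```

With this sample data, asking for three shards on `{"pool": "example", "device_os": "X"}` yields only `bot-a` and `bot-b`. The dead bot that lost its `device_os` dimension never enters the eligible set, so there is nothing to fall back to for the third shard.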
May 1 2018
"dummy id"? That sounds a lot more wonkers than simply persisting device dimensions on the bot after the phone dies, which is pretty trivial. I'm fine with adding that; I just wanted to see if there were any other options. I'll work on that change, then we can see if it fixes perf's use case here. If not, then yeah, we might need to think about it some more.
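
As a rough illustration of what "persisting device dimensions" could mean, the sketch below caches the last dimensions seen from a live device and falls back to that cache once the device disappears. It is not the change that was actually landed; the function name, the JSON cache file, and the dimension layout are assumptions.

```python
import json
import os
from typing import Dict, List, Optional

Dimensions = Dict[str, List[str]]


def resolve_dimensions(live: Optional[Dimensions], cache_path: str) -> Dimensions:
    """Return the dimensions a bot should report.

    `live` holds the dimensions read from the attached device, or None/empty
    when the device is not visible. `cache_path` is a hypothetical per-bot file.
    """
    if live:
        # Device is present: refresh the cache and report the live values.
        with open(cache_path, "w") as f:
            json.dump(live, f)
        return live
    if os.path.exists(cache_path):
        # Device is gone: report the last values seen while it was healthy.
        with open(cache_path) as f:
            return json.load(f)
    # Never saw the device, so there is nothing to report.
    return {}
```

The trade-off in this sketch is that a cached value can go stale, for example if a device is reflashed while its bot is offline; the cache is overwritten as soon as the device is seen again.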
May 1 2018
thanks sgtm
May 2 2018
The following revision refers to this bug: https://chrome-internal.googlesource.com/infradata/config/+/4196b285b4d88680ed2b5d91ebd816c691ad18f4 commit 4196b285b4d88680ed2b5d91ebd816c691ad18f4 Author: Ben Pastene <bpastene@chromium.org> Date: Wed May 02 18:46:32 2018
May 14 2018
This should be fixed with https://chrome-internal-review.googlesource.com/621131. As of writing this, https://chrome-swarming.appspot.com/bot?id=build191-b7--device1 is quarantined and shows its previous device type and OS dimensions.
Comment 1 by bpastene@chromium.org, May 1 2018