
Issue 838557

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug-Regression




device_os and device_type missing from quarantined/dead android bots

Project Member Reported by eyaich@chromium.org, May 1 2018

Issue description

The device_os and device_type dimensions are missing from 3 of our Android Go swarming bots in the chrome-swarming pool:

build30-a7--device6:
https://chrome-swarming.appspot.com/bot?id=build30-a7--device6&sort_stats=total%3Adesc

build31-a7--device1
https://chrome-swarming.appspot.com/bot?id=build31-a7--device1&sort_stats=total%3Adesc

build31-a7--device4
https://chrome-swarming.appspot.com/bot?id=build31-a7--device4&sort_stats=total%3Adesc

These Android Go bots are not showing the device_type and device_os dimensions, so our jobs fail to find them when querying for devices by those dimensions.

This is especially problematic with our new device affinity implementation that needs the complete set of bots to do its assignment.

Note this is only happening for the bots that are quarantined or dead.  Is this a known issue with android?

When I look at the last tasks that completed on the bots, I see that those dimensions were present when the task executed:
https://chrome-swarming.appspot.com/task?id=3d0ca82650024410&refresh=10&show_raw=1



 
This is WAI. The bot is quarantined because it can't talk to the device (it or the usb hub is offline). Since it can't talk to the device, it can't get device type or OS.

We could theoretically add a sticky-state file that records last known device type & OS, but I'd rather not add more persistent state to the bot's filesystem. Is a sufficient fix here to heal the bots? (The way you'd do it for any dead win/mac bot.)
And for dead bots: docker on the host machine only spawns containers for devices it currently sees. If a phone drops offline and completely disappears, its container will eventually get reaped and won't start back up, so the swarming bot will likewise go MIA.
ok that is kind of what I thought (and feared).  It breaks our assumption in perf that we can query on a set of dimensions and get all eligible bots (dead, quarantined, alive, etc) and then prune from there.

Yes healing the bots is necessary, but doesn't fix this issue.  I will have to see what we can do on our end on our device affinity algorithm to account for this with android.
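
To illustrate the query-then-prune assumption described above, here is a minimal sketch. The bot records and the `query_bots` helper are hypothetical stand-ins for illustration only, not the Swarming API; the dimension values are made up. It shows how a quarantined bot that no longer reports device_os and device_type drops out of the query before any pruning by state can happen.

```python
def query_bots(bots, required):
    """Return bots whose dimensions include every required key/value pair.

    `bots` is a list of dicts with hypothetical keys 'id', 'state' and
    'dimensions' (a dict mapping a dimension name to a list of values).
    """
    return [
        bot for bot in bots
        if all(value in bot["dimensions"].get(key, [])
               for key, value in required.items())
    ]


# Hypothetical data: the quarantined bot has lost its device dimensions.
bots = [
    {"id": "device-a", "state": "alive",
     "dimensions": {"pool": ["perf"], "device_os": ["O"], "device_type": ["go"]}},
    {"id": "device-b", "state": "quarantined",
     "dimensions": {"pool": ["perf"]}},
]

matched = query_bots(bots, {"pool": "perf", "device_type": "go"})
matched_ids = [bot["id"] for bot in matched]
```

With this data, `matched_ids` contains only "device-a": the quarantined bot is never seen by the caller, so it cannot be counted as eligible or pruned later.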


this is actually potentially more problematic for the perf use case since we trigger jobs on dead bots when there aren't enough healthy ones.  I guess if we don't know about the bot then we are just going to have to trigger on a dummy id so that we can see the shard and tests failing.
"dummy id"? That sounds a lot more wonkers than simply persisting device dimensions on the bot after the phone dies, which is pretty trivial. I'm fine with adding that, just wanted to see if there were any other options.

I'll work on that change, then we can see if it fixes perf's use-case here. If not, then yeah we might need to think about it some more.
Cc: -bpastene@chromium.org eyaich@chromium.org
Owner: bpastene@chromium.org
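
For readers following along, the idea of persisting device dimensions after the phone dies could look roughly like the sketch below. This is an assumption-laden illustration, not the actual change: the function name `resolve_dimensions`, the JSON cache file, and its path are all hypothetical.

```python
import json
import os


def resolve_dimensions(live_dims, cache_path):
    """Return device dimensions, falling back to the last known values.

    `live_dims` is the dict read from the device, or None/empty when the
    device cannot be reached. `cache_path` is a hypothetical sticky-state
    file on the bot's filesystem.
    """
    if live_dims:
        # Device is reachable: refresh the cache. Write to a temp file and
        # rename it so a crash never leaves a half-written cache behind.
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(live_dims, f)
        os.replace(tmp_path, cache_path)
        return dict(live_dims)
    try:
        # Device is offline: report whatever was last recorded.
        with open(cache_path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # No cache yet, or it is unreadable: nothing to report.
        return {}
```

A quarantined bot would then call `resolve_dimensions(None, cache_path)` and keep reporting the device type and OS it saw while the phone was still online.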
thanks sgtm
Project Member

Comment 7 by bugdroid1@chromium.org, May 2 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/4196b285b4d88680ed2b5d91ebd816c691ad18f4

commit 4196b285b4d88680ed2b5d91ebd816c691ad18f4
Author: Ben Pastene <bpastene@chromium.org>
Date: Wed May 02 18:46:32 2018

Status: Fixed (was: Untriaged)
This should be fixed with https://chrome-internal-review.googlesource.com/621131

As of writing this, https://chrome-swarming.appspot.com/bot?id=build191-b7--device1 is quarantined and shows its previous device type and OS dimensions.
