
Issue 862113

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Android
Pri: 1
Type: Bug




Querying swarming devices is flaky (sometimes devices are missing)

Project Member Reported by nednguyen@chromium.org, Jul 10

Issue description

Sometimes our build fails at the trigger step because the Swarming query does not return enough devices for all the shards.

Example:
https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus5X%20Perf/2007

https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus6%20WebView%20Perf/2048



In the first example, the trigger step was only able to retrieve the following bots:

Healthy bots: ['build211-b7--device1', 'build212-b7--device4', 'build212-b7--device3', 'build212-b7--device2', 'build211-b7--device5', 'build213-b7--device2', 'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6', 'build213-b7--device7', 'build213-b7--device4', 'build212-b7--device6', 'build211-b7--device7', 'build213-b7--device5']
Dead Bots: []

(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf%2F2007%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_Android_device_Nexus_5X%2F0%2Fstdout)

Normally, it would have found all of these bots:
Healthy bots: ['build212-b7--device3', 'build213-b7--device7', 'build212-b7--device4', 'build212-b7--device7', 'build211-b7--device1', 'build211-b7--device2', 'build211-b7--device3', 'build211-b7--device4', 'build211-b7--device5', 'build211-b7--device6', 'build211-b7--device7', 'build213-b7--device2', 'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6', 'build212-b7--device1', 'build213-b7--device4', 'build212-b7--device6', 'build212-b7--device5', 'build213-b7--device5', 'build212-b7--device2']
Dead Bots: []
(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf%2F2008%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_Android_device_Nexus_5X%2F0%2Fstdout)


Comparing the failed build with the working build, the Swarming query in the failed build did not return the following bots: ['build212-b7--device7', 'build212-b7--device5', 'build211-b7--device3', 'build211-b7--device4', 'build211-b7--device6', 'build212-b7--device1', 'build211-b7--device2']
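
The comparison above can be reproduced with a set difference. This is a minimal sketch: `healthy_failed` is copied from the build 2007 log, while `healthy_expected` is generated on the assumption (consistent with the build 2008 log) that hosts build211 through build213 each expose device1 through device7. The variable names are illustrative and do not come from the trigger script.

```python
# Healthy bots reported by the failed build (2007), copied from its log.
healthy_failed = [
    'build211-b7--device1', 'build212-b7--device4', 'build212-b7--device3',
    'build212-b7--device2', 'build211-b7--device5', 'build213-b7--device2',
    'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6',
    'build213-b7--device7', 'build213-b7--device4', 'build212-b7--device6',
    'build211-b7--device7', 'build213-b7--device5',
]

# Assumption: the full pool is device1..device7 on each of the three hosts,
# which matches the 21 bots listed in the working build (2008).
healthy_expected = [
    'build%d-b7--device%d' % (host, device)
    for host in (211, 212, 213)
    for device in range(1, 8)
]

# Bots present in the working build but absent from the failed one.
missing = sorted(set(healthy_expected) - set(healthy_failed))
print(len(healthy_failed), len(healthy_expected), len(missing))
print(missing)
```

The result contains seven bots, all on build211 and build212; none of the build213 devices are missing.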


I am not sure whether this is a Swarming infra problem or a device problem. The code that queries Swarming for devices is at https://cs.chromium.org/chromium/src/testing/trigger_scripts/perf_device_trigger.py?rcl=c13a893ebf9a176c754209b23955a471d0313afb&l=222

 
Cc: kbr@chromium.org
+kbr@ in case he also observed this problem 
Components: Infra>Client>Android
Status: Available (was: Untriaged)
Components: -Infra>Client>Android Infra>Client>Chrome
This probably isn't an infra bug. The base_test_triggerer.py script currently ignores busy bots:

https://cs.chromium.org/chromium/src/testing/trigger_scripts/base_test_triggerer.py?g=0&l=160

Does the Speed team expect busy bots to be returned from the query?

We currently ignore only dead and quarantined ones:
https://cs.chromium.org/chromium/src/testing/trigger_scripts/perf_device_trigger.py?l=227
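
To illustrate that filtering rule, here is a minimal sketch. The record shape (`bot_id`, `is_dead`, `quarantined`, `task_id`) is assumed for the example, `split_bots` is a hypothetical helper rather than the code in perf_device_trigger.py, and the sample bots are made up. It shows dead and quarantined bots being excluded while a busy bot still counts as healthy.

```python
def split_bots(bots):
    """Partition bot records into (healthy_ids, excluded_ids).

    Only dead or quarantined bots are excluded; a bot that is busy running
    a task (non-empty task_id in this assumed schema) is still healthy.
    """
    healthy, excluded = [], []
    for bot in bots:
        if bot.get('is_dead') or bot.get('quarantined'):
            excluded.append(bot['bot_id'])
        else:
            healthy.append(bot['bot_id'])
    return healthy, excluded


# Made-up sample records, not taken from any real Swarming response.
sample = [
    {'bot_id': 'example-bot-1', 'is_dead': False, 'quarantined': False, 'task_id': ''},
    {'bot_id': 'example-bot-2', 'is_dead': False, 'quarantined': False, 'task_id': 'abc123'},  # busy
    {'bot_id': 'example-bot-3', 'is_dead': True, 'quarantined': False, 'task_id': ''},
    {'bot_id': 'example-bot-4', 'is_dead': False, 'quarantined': True, 'task_id': ''},
]

healthy, excluded = split_bots(sample)
print('Healthy bots:', healthy)
print('Excluded bots:', excluded)
```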
Labels: OS-Android
Ah, I see. I was looking at query_swarming_for_bot_configs in base_test_triggerer.py.

Thinking about this more, my best guess is that the phones were in the process of rebooting. I've run some Android tests locally recently and the phones reboot quite a lot. Does that sound like a plausible reason?

Taking build212-b7--device7 as an example: the trigger step ran at 2:37 am PST, and the bot was restarting from 2:35 to 2:46. We restart the bots (and rerun device provisioning) every few hours. The build caught the bot in the middle of that process and missed it, I guess. (I didn't look at the other bots, but it's probably the same story.)

If the triggering depends on a set of bots being available, it really shouldn't fail fast. Let it sit and wait for a few minutes and only fail if they're not all up by then.
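
A minimal sketch of that wait-then-fail idea, under stated assumptions: `wait_for_bots` and its parameters are hypothetical, `query_healthy_bots` stands in for whatever callable performs the Swarming query, and the default timeout and poll interval are placeholders rather than recommended values.

```python
import time


def wait_for_bots(query_healthy_bots, required, timeout_s=600,
                  poll_interval_s=30, sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `required` healthy bots are returned.

    Returns the list of healthy bots, or raises RuntimeError if the
    deadline passes first. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout_s
    while True:
        healthy = query_healthy_bots()
        if len(healthy) >= required:
            return healthy
        if clock() >= deadline:
            raise RuntimeError(
                'Only %d of %d required bots were healthy after %d seconds'
                % (len(healthy), required, timeout_s))
        # Bots may be mid-restart; give them time before querying again.
        sleep(poll_interval_s)


# Usage with a fake query that reports more bots on each call, as if
# devices were coming back from a reboot.
responses = iter([['bot-a'], ['bot-a', 'bot-b'], ['bot-a', 'bot-b', 'bot-c']])
bots = wait_for_bots(lambda: next(responses), required=3,
                     sleep=lambda seconds: None)
print(bots)
```

With a restart window of roughly ten minutes, as in the build212-b7--device7 example, the timeout would need to be at least that long for the wait to help.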
I suspect this could be a dupe of issue 863072.
