Querying swarming devices is flaky (sometimes devices are missing)
Issue description

Sometimes our build fails at the trigger step because we cannot query enough devices for all the shards. Examples:
https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus5X%20Perf/2007
https://ci.chromium.org/buildbot/chromium.perf/Android%20Nexus6%20WebView%20Perf/2048

In the first example, the trigger step was only able to retrieve the following bots:

Healthy bots: ['build211-b7--device1', 'build212-b7--device4', 'build212-b7--device3', 'build212-b7--device2', 'build211-b7--device5', 'build213-b7--device2', 'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6', 'build213-b7--device7', 'build213-b7--device4', 'build212-b7--device6', 'build211-b7--device7', 'build213-b7--device5']
Dead Bots: []
(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf%2F2007%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_Android_device_Nexus_5X%2F0%2Fstdout)

Normally, it would have found all of these bots:

Healthy bots: ['build212-b7--device3', 'build213-b7--device7', 'build212-b7--device4', 'build212-b7--device7', 'build211-b7--device1', 'build211-b7--device2', 'build211-b7--device3', 'build211-b7--device4', 'build211-b7--device5', 'build211-b7--device6', 'build211-b7--device7', 'build213-b7--device2', 'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6', 'build212-b7--device1', 'build213-b7--device4', 'build212-b7--device6', 'build212-b7--device5', 'build213-b7--device5', 'build212-b7--device2']
Dead Bots: []
(https://logs.chromium.org/v/?s=chrome%2Fbb%2Fchromium.perf%2FAndroid_Nexus5X_Perf%2F2008%2F%2B%2Frecipes%2Fsteps%2Ftest_pre_run%2F0%2Fsteps%2Fs__trigger__performance_test_suite_on_Android_device_Nexus_5X%2F0%2Fstdout)

Comparing the failed build with the working build, the swarming query failed to retrieve data for the following bots:
['build212-b7--device7', 'build212-b7--device5', 'build211-b7--device3', 'build211-b7--device4', 'build211-b7--device6', 'build212-b7--device1', 'build211-b7--device2']

I am not sure whether this is a swarming infra problem or a device problem. The code that queries swarming devices is in https://cs.chromium.org/chromium/src/testing/trigger_scripts/perf_device_trigger.py?rcl=c13a893ebf9a176c754209b23955a471d0313afb&l=222
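For anyone who wants to reproduce the comparison, here is a small sketch of the set difference between the two "Healthy bots" lists quoted in the description. The bot names are copied from the two logs; the variable names are only illustrative and are not taken from the trigger scripts.

```python
# Bots reported healthy by the failed build (Android Nexus5X Perf #2007).
failed_build = [
    'build211-b7--device1', 'build212-b7--device4', 'build212-b7--device3',
    'build212-b7--device2', 'build211-b7--device5', 'build213-b7--device2',
    'build213-b7--device3', 'build213-b7--device1', 'build213-b7--device6',
    'build213-b7--device7', 'build213-b7--device4', 'build212-b7--device6',
    'build211-b7--device7', 'build213-b7--device5',
]

# Build #2008 listed device1..device7 on each of build211, build212, build213.
working_build = [
    'build%d-b7--device%d' % (host, device)
    for host in (211, 212, 213)
    for device in range(1, 8)
]

# Bots seen by the working build but not by the failed one.
missing = sorted(set(working_build) - set(failed_build))
for bot in missing:
    print(bot)
```

This prints the same seven bots listed in the description, all of them on build211 and build212.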
Jul 10
Jul 11
Jul 11
This probably isn't an infra bug. The base_test_triggerer.py script currently ignores busy bots: https://cs.chromium.org/chromium/src/testing/trigger_scripts/base_test_triggerer.py?g=0&l=160 Does the Speed team expect busy bots to be returned from the query?
Jul 11
We only currently ignore dead and quarantined ones: https://cs.chromium.org/chromium/src/testing/trigger_scripts/perf_device_trigger.py?l=227
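A rough sketch of that filtering, to make the behaviour concrete. This is not the code in perf_device_trigger.py. The record fields `bot_id`, `is_dead`, `quarantined` and `task_id` are assumed to look like those in a swarming bot listing, `split_bots` is a hypothetical helper, and dropping quarantined bots from both lists is a guess about how they are classified.

```python
def split_bots(bots):
    """Partition bot records into (healthy, dead) lists of bot IDs.

    Quarantined bots are dropped from both lists (an assumption).
    Busy bots are kept as healthy: only dead and quarantined are ignored.
    """
    healthy, dead = [], []
    for bot in bots:
        if bot.get('is_dead'):
            dead.append(bot['bot_id'])
        elif not bot.get('quarantined'):
            healthy.append(bot['bot_id'])
    return healthy, dead


# Made-up records for illustration only.
bots = [
    {'bot_id': 'build211-b7--device1'},
    {'bot_id': 'build211-b7--device2', 'task_id': 'hypothetical-task'},  # busy
    {'bot_id': 'build211-b7--device3', 'is_dead': True},
    {'bot_id': 'build211-b7--device4', 'quarantined': True},
]
healthy, dead = split_bots(bots)
print('Healthy bots: %s' % healthy)
print('Dead Bots: %s' % dead)
```

With this filter a busy bot still counts as healthy, so busy bots alone would not explain the missing devices.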
Jul 11
Ah, I see. I was looking at query_swarming_for_bot_configs in base_test_triggerer.py. Thinking about this more, my best guess is that the phones were in the process of rebooting. I've run some Android tests locally recently and the phones reboot quite a lot. Does that sound like a plausible reason?
Jul 12
Taking build212-b7--device7 as an example: the trigger step ran at 2:37 am PST. From 2:35 - 2:46, the bot was restarting. We restart the bots (and rerun device provisioning) every few hours. The build caught the bot in the middle of that process and missed it I guess. (Didn't look at the other bots, but it's prob the same story.) If the triggering depends on a set of bots being available, it really shouldn't fail fast. Let it sit and wait for a few minutes and only fail if they're not all up by then.
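The "wait a few minutes before failing" idea could look roughly like the sketch below. It is only an illustration (Python 3): `wait_for_bots`, `query_healthy_bots` and the timeout values are hypothetical, and the real query in the trigger script would have to be plugged in as the callable.

```python
import time


def wait_for_bots(query_healthy_bots, required, timeout_s=600, poll_s=30,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `required` bots are healthy, or time out.

    query_healthy_bots: callable returning the current list of healthy bot IDs.
    Raises RuntimeError if too few bots are up once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        healthy = query_healthy_bots()
        if len(healthy) >= required:
            return healthy
        if clock() >= deadline:
            raise RuntimeError('only %d of %d bots came up within %ds'
                               % (len(healthy), required, timeout_s))
        sleep(poll_s)


# Fake query: the third poll sees the bot that was restarting.
responses = iter([
    ['build212-b7--device5'],
    ['build212-b7--device5', 'build212-b7--device6'],
    ['build212-b7--device5', 'build212-b7--device6', 'build212-b7--device7'],
])
bots = wait_for_bots(lambda: next(responses), required=3,
                     sleep=lambda seconds: None)  # skip real sleeping here
print(bots)
```

A 10 minute default would cover the 2:35 - 2:46 restart window only partly, so the timeout should be chosen based on how long the restart and provisioning cycle actually takes.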
Jul 12
I suspect this could be a dupe of issue 863072.
Comment 1 by nednguyen@chromium.org, Jul 10