Ok, I added a get_states RPC to swarming which will allow this to work. Tested it on the dev swarming server and it seems to work.
It's a bit unclear how I'm going to use this in our collection code, though. Our code is currently structured so that every swarming task is independent. It'll probably take a bit of surgery to group everything together and collect all the tasks at once... Any suggestions?
I uploaded https://crrev.com/c/1170198, which is a proof-of-concept CL. The idea is that when you execute tests, you first group them by how their results are collected. Each group has its own collection function. Swarming would have its own function, which would call out to a special script that uses the 'get_states' API call.
Any thoughts on this approach?
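To make the grouping idea concrete, here is a rough sketch. It is not the code from the CL; every name in it (collect_swarming, collect_local, COLLECTORS, the "kind" field) is made up for illustration, and the swarming collector is only a placeholder for the script that would use 'get_states'.

```python
from collections import defaultdict


def collect_swarming(tests):
    # Hypothetical placeholder: the real version would call out to a script
    # that polls all of these tasks together via the 'get_states' RPC.
    return {test["name"]: "collected in one batch" for test in tests}


def collect_local(tests):
    # Hypothetical placeholder for tests that are collected one at a time.
    return {test["name"]: "collected individually" for test in tests}


# One collection function per group (assumed mapping, not from the CL).
COLLECTORS = {"swarming": collect_swarming, "local": collect_local}


def collect_all(tests):
    """Group tests by how they are collected, then run one collector per group."""
    groups = defaultdict(list)
    for test in tests:
        groups[test["kind"]].append(test)
    results = {}
    for kind, group in groups.items():
        results.update(COLLECTORS[kind](group))
    return results
```

With this shape, the swarming group is handed to a single function in one call, which is what would let it wait on all tasks at once instead of one test at a time.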
#6,
Yes, that's the plan. The only problem is the actual recipe code doesn't make it easy to do this, since each test is collected separately.
Some combination of https://crrev.com/c/1170198 and https://crrev.com/c/1170240 should make this happen. I'm OOO for a week but once I'm back I'll try to land this.
Ok, this is live on the staging bots!
https://ci.chromium.org/p/chromium/builders/luci.chromium.ci/linux-chromium-tests-staging-tests/5326 is a sample build. You can see it working correctly here.
You can see in the build that there's a ~30 minute time period where all the tasks are pending. I'm not sure what's going on there.
I think this is pretty much ready to promote to prod. The only thought I had was that maybe we should add some sort of step text or annotations to the collect tasks step which lists the number of tests we're waiting for, or maybe the shards we're waiting for? Something like
"Waiting for 10 tests to complete:"
and then maybe the step log lists like
webkit_layout_tests: 3/17 shards completed
base_unittests: 0/5 shards completed
I think it could be useful, but maybe not? Thoughts?
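A minimal sketch of how that step text and log could be built, assuming we have per-test shard counts available. The function name and the two dict arguments are hypothetical; nothing here is an existing recipe API.

```python
def format_collect_progress(shards_done, shards_total):
    """Return (step_text, log_lines) for the tests that still have pending shards.

    shards_done and shards_total are assumed to map test name -> shard count.
    """
    pending = [
        name for name, total in shards_total.items()
        if shards_done.get(name, 0) < total
    ]
    step_text = "Waiting for %d tests to complete:" % len(pending)
    log_lines = [
        "%s: %d/%d shards completed"
        % (name, shards_done.get(name, 0), shards_total[name])
        for name in pending
    ]
    return step_text, log_lines
```

Tests whose shards have all finished are left out of the log, so the list shrinks as the step progresses.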
Also of note: when we first start waiting, the collect tasks step looks like it takes longer than it actually does. This is because we start by sleeping for 1 second, then do exponential backoff. I think the sleep, which is implemented in Python, ends up showing up in milo as the step taking longer. That might be a bug, but it's probably just something strange going on with how annotations work. It could be confusing for users, but maybe not? Just something to think about.
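For reference, the polling pattern described here looks roughly like the sketch below. This is not the actual collection script: the function name, the backoff factor of 2, and the 60 second cap are assumptions, and get_pending stands in for whatever checks the task states.

```python
import time


def wait_for_tasks(get_pending, initial_delay=1.0, factor=2.0,
                   max_delay=60.0, sleep=time.sleep):
    """Poll until get_pending() returns a falsy value; return the delays slept.

    get_pending is a hypothetical callable that returns the number of tasks
    still running. The factor and max_delay values are assumed, not taken
    from the real code.
    """
    delay = initial_delay
    delays = []
    while get_pending():
        sleep(delay)  # This in-process sleep is what gets counted in the step time.
        delays.append(delay)
        delay = min(delay * factor, max_delay)
    return delays
```

The sleep is injectable so the loop can be exercised without actually waiting. Since the sleeping happens inside the step, all of that wall-clock time is attributed to the step.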
This is awesome!
A while ago, I found that 25% of CQ builder build time is spent waiting for Swarming to complete and process results (Dirk saw those numbers). I could find some time to get the new percentage once your change has been live for a while. Please feel free to ping me in a month!
I fixed it. I had to cancel all the tasks triggered by the builder: it had triggered a ton of tasks, and because of a bug it never waited for them to finish, so the bots processing the tasks became overloaded.
Comment 1 by martiniss@chromium.org, Jul 31
Status: Assigned (was: Started)