Issue 787109

Issue metadata

Status: WontFix
Owner:
Closed: Dec 2017
Cc:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug



Switch priority on FYI swarming bots

Project Member Reported by jam@chromium.org, Nov 20 2017

Issue description

https://build.chromium.org/p/chromium.fyi/builders/Mojo%20Linux?numbuilds=200

is our FYI bot for the network service. During the daytime, we often have long build cycles because our FYI swarming priority is low. For example, https://build.chromium.org/p/chromium.fyi/builders/Mojo%20Linux/builds/7439 has 35+ minutes of pending time for jobs.

We need this to cycle fast, as it helps us track down when regressions occur. We're not ready to move this out of FYI yet, but in the meantime, is it possible to set the priority on swarming tasks for this bot to be similar to the main waterfall? I couldn't find a way.

Dirk: please triage, thanks
 
As we've discussed before, jobs being scheduled late because of a lower priority is a last-ditch mechanism to prevent overload. I.e., most of the time we should have excess capacity and things shouldn't get delayed.

If they are getting delayed, it means that the CQ and the main waterfalls are using up all of the capacity. In that situation, I don't want things that aren't in those two classes to have equal priority.
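
To illustrate the point about priority only mattering under overload, here is a minimal sketch of a priority-ordered scheduler. It is not Swarming's actual scheduler: the `simulate` function, the task names, the durations, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
import heapq
from itertools import count


def simulate(tasks, bots):
    """Toy model: every task is submitted at t=0 and needs one bot.

    tasks: list of (name, priority, duration); a lower priority number
           is scheduled first (an assumed convention for this sketch).
    bots:  number of identical bots in the pool.
    Returns a dict mapping task name to the time it starts running.
    """
    seq = count()  # tie-breaker so equal priorities stay in FIFO order
    pending = [(prio, next(seq), name, dur) for name, prio, dur in tasks]
    heapq.heapify(pending)

    # free_at holds, for each bot, the time at which it becomes idle.
    free_at = [0] * bots
    heapq.heapify(free_at)

    start = {}
    while pending:
        _prio, _seq, name, dur = heapq.heappop(pending)
        t = heapq.heappop(free_at)  # earliest available bot
        start[name] = t
        heapq.heappush(free_at, t + dur)
    return start


# Hypothetical workload: one task per class, each taking 10 time units.
TASKS = [("fyi", 2, 10), ("cq", 0, 10), ("waterfall", 1, 10)]

enough_capacity = simulate(TASKS, bots=3)  # nothing waits
overloaded = simulate(TASKS, bots=1)       # the FYI task waits the longest
```

With three bots every task starts immediately, so the priority has no effect. With a single bot, the low-priority FYI task starts only after the other two finish, which is the delay described above.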

There are three possible ways forward here that I see:

1) Bring additional capacity online, so that these aren't being delayed. This is my preferred option, and we should be working on it today and tomorrow.

2) Move the bots to the main waterfall.

3) If we're not ready to move the bots to the main waterfall, move them to a different master that could be treated more on an equal footing with the main waterfall. I do not want bots on FYI to fall into this category, because it makes things operationally confusing. I also don't really want to set up a new waterfall, since that has its own costs. We could move the bots to LUCI, though, which is easier but still a little buggy.

Comment 2 by jam@chromium.org, Nov 20 2017

What if the CQ and waterfall are not using all the capacity, but combined with FYI bots they are? We might not care about most FYI bots getting slowed down a bit, but in our bot's case, we do. So specifying a swarming capacity just for that bot seems lighter weight than creating a new master. I'm not really enthusiastic about using LUCI for our bot at this point.

We shouldn't be using all of our capacity even for "CQ + Waterfall + FYI", which is why getting more capacity is the right answer and my top priority.

I understand your reluctance about setting up a new master and/or moving to LUCI. Changing the priority is certainly lighter weight, but it has a downside: different FYI bots would get different QoS (which I understand is what you want). That is a change from how we do things today, and something I don't particularly want to encourage.

Hopefully my plan to get more capacity in the next day or two is enough of a short-term answer. In a few weeks, we will be moving bots to LUCI in earnest, and at that point we can look into better longer-term answers.

Comment 4 by jam@chromium.org, Nov 21 2017

Ok, thanks for the explanations.

I wonder if removing most of the old navigation tests gives back enough capacity? I can't find a way to see the capacity usage of swarming bots (the status link is broken).

The graph I use for monitoring the capacity of the Linux swarming pool is http://shortn/_dXaTT4clmM . Yesterday we were clearly maxing things out. I'll keep an eye on it today.
Status: WontFix (was: Assigned)
Marking this as WontFix since we've added capacity and AFAIK don't have any current issues, and since I don't want to actually change the priority as per the discussion above.
