
Issue 691749

Starred by 3 users

Issue metadata

Status: Duplicate
Owner: ----
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

Blocked on:
issue 671372

swarming overloaded: base_unittests took 37 min because it was pending for 2319s

Reported by thakis@chromium.org (Project Member), Feb 13 2017

Issue description

base_unittests took 37 min on https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/389203, 2319s of which was spent in "pending". Looks like the Linux swarming slaves are currently swamped.

Comment 1 by iannu...@google.com, Feb 14 2017

Yeah, it looks like we have a TON of pending jobs right now. I'm looking into it.

Comment 2 by iannu...@google.com, Feb 14 2017

Mergedinto: 691759
Status: Duplicate (was: Untriaged)
Merging this for now.

Comment 3 by iannu...@google.com, Feb 14 2017

Status: Available (was: Duplicate)
Different issue

Comment 4 by iannu...@google.com, Feb 14 2017

Labels: -Pri-3 Pri-0
Bumping this to P0 because it's actually pretty important.

Comment 5 by s...@google.com, Feb 14 2017

Cc: vadimsh@chromium.org smut@chromium.org mar...@chromium.org
We spiked to 25,000 pending jobs recently, and right now there are still around 15,000.

Comment 6 by iannu...@google.com, Feb 14 2017

Yes, and they're falling, but I'm not making any headway on debugging the root cause.

Comment 7 by s...@google.com, Feb 14 2017

7500 of the pending tasks are from tryserver.chromium.linux, 4800 are from chromium.perf.

Comment 8 by iannu...@google.com, Feb 14 2017

I was just about to post that they're all in the 'chromium' project.

smut@, where did you find the breakdown by master?

Comment 9 by s...@google.com, Feb 14 2017

I used trial and error on the task list: apply various filters and see how many pending tasks there are.

For example, here's a filter for pending tasks on tryserver.chromium.linux. It says there are currently 7500 tasks for the selected filters:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&c=master&c=buildername&c=os&f=state%3APENDING&f=master%3Atryserver.chromium.linux&l=1000&s=created_ts%3Adesc
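
(For reference, a minimal sketch of scripting that trial-and-error: build the same tasklist link with a state:PENDING filter plus a per-master filter, one URL per master. The parameters mirror the link above; the two masters listed are just the ones discussed in this thread.)

  # Sketch: generate per-master "pending tasks" links like the one above.
  from urllib.parse import urlencode

  BASE = "https://chromium-swarm.appspot.com/tasklist"
  COLUMNS = ["name", "state", "created_ts", "user", "bot", "master", "buildername", "os"]

  def pending_tasklist_url(master):
      """Tasklist URL filtered to PENDING tasks on a single buildbot master."""
      params = [("c", col) for col in COLUMNS]
      params += [
          ("f", "state:PENDING"),
          ("f", "master:" + master),
          ("l", "1000"),              # page size
          ("s", "created_ts:desc"),   # newest first
      ]
      return BASE + "?" + urlencode(params)

  for master in ["tryserver.chromium.linux", "chromium.perf"]:
      print(master, pending_tasklist_url(master))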

Comment 10 by s...@google.com, Feb 14 2017

Cc: -vadimsh@chromium.org -mar...@chromium.org martiniss@chromium.org
Looks like chromium.perf was restarted and it interrupted a bunch of in-progress builds. Maybe a bunch of duplicate tasks got triggered when the bots started rebuilding?

https://uberchromegw.corp.google.com/i/chromium.perf/waterfall

+cc martiniss, looks like you restarted chromium.perf for https://chromium.googlesource.com/chromium/tools/build/+/885076f57154cdf2bc03d6e056c1831795387573. Did you use master manager? It should have waited for the in-progress builds to complete instead of interrupting them.
I remember the master restart from earlier today: https://chrome-internal.googlesource.com/infradata/master-manager/+/163363e98880615d4f040e007ced5ebe991050f1

I'm not sure how that could lead to duplicates, though.
I don't think chromium.perf would add *that* many pending builds. Some quick math:

16 builders * 80 tests per bot * 5 bots = 6400 jobs per build run on buildbot. So we would add a max of 12800 jobs if a build is interrupted and then re-scheduled.
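
(Restating that estimate as a tiny calculation; all of the inputs are the rough numbers above, not measured values.)

  # Back-of-the-envelope numbers from the comment above (all approximate).
  builders = 16
  tests_per_bot = 80
  bots_per_builder = 5

  jobs_per_run = builders * tests_per_bot * bots_per_builder
  print(jobs_per_run)      # 6400 swarming jobs for one full chromium.perf build run
  print(2 * jobs_per_run)  # 12800 if an interrupted run is fully re-scheduled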

Well, maybe I was wrong.

I do know that when we restart the master, we get a bunch of duplicate jobs, which overloads our bots. 
It looks like the upward trend started at ~10:40 AM (PST).
It does lead to duplicates because we trigger all jobs at the beginning of the build. The builds take about 6 hours on average (sadly), and the jobs run sequentially on the bots, so there are jobs pending until the very end. So if we interrupt a bot halfway through its build, about half of its jobs are still pending.

Then, we run another buildbot build once the master restarts, and trigger a whole new set of jobs.
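
(To put rough numbers on that scenario: all jobs are triggered up front and drain over the ~6 hour build, so an interrupt partway through leaves a proportional fraction still pending, and the re-triggered build adds a full new set. This is an illustration using the estimate from the quick math above, not measured data.)

  # Illustrative model of the restart scenario described above.
  jobs_per_run = 6400          # rough per-run estimate from the quick math
  fraction_elapsed = 0.5       # build interrupted about halfway through

  leftover_pending = int(jobs_per_run * (1 - fraction_elapsed))  # ~half still pending
  retriggered = jobs_per_run                                     # full new set after restart
  print(leftover_pending + retriggered)  # ~9600 pending jobs right after the restart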

Comment 15 by s...@google.com, Feb 14 2017

Well, hypothetically: if every builder had triggered tests but not yet collected the results, and the master was forcibly restarted and interrupted all their builds (which happened, judging by the amount of purple on chromium.perf), then all those builds could have re-run and triggered the same tests again.

But that doesn't seem like enough duplication to explain 4500 pending tasks for chromium.perf, and certainly doesn't explain 7500 pending tasks on tryserver.chromium.mac.
I think you meant on tryserver.chromium.linux?
Re #15: According to the quick math I did in #12, I think we average about 4500 pending tasks for chromium.perf :(

Comment 18 by s...@google.com, Feb 14 2017

Oh, the explanation in #12 definitely accounts for 4500 pending tasks on chromium.perf.

Maybe the 7500 pending on tryserver.chromium.linux are normal and the bulk of the 25,000 pending from earlier was chromium.perf? Not sure how to check which master had tasks pending historically.

Is there a way not to interrupt chromium.perf bots in the future? Would restarting chromium.perf at EOD help or would it still be running 6 hour builds at night?

Comment 19 by s...@google.com, Feb 14 2017

Yeah, regarding #16 I meant tryserver.chromium.linux, not tryserver.chromium.mac.
No, there isn't really a way to not interrupt chromium.perf. They're always running tests.

We're trying to drive cycle time down, which would help a lot. We're also trying to reduce the number of jobs we trigger, since we do trigger some unnecessarily; that would help with this too.

We could require that you do a long drain before you actually restart the master. It'd be annoying, but it would help.

I had also floated the idea that we should have something in the buildbot makefile to cancel any leftover swarming tasks when we start up a master.
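
(A rough sketch of what such a startup hook could look like. The two helpers here are hypothetical placeholders for whatever Swarming listing/cancel mechanism would actually be used; this is not an existing client API.)

  # Hypothetical startup hook: cancel swarming tasks left over from the
  # previous run of this buildbot master. Both helpers are placeholders.

  def list_pending_task_ids(master):
      """Placeholder: return ids of PENDING swarming tasks tagged with this master."""
      raise NotImplementedError("query Swarming for state:PENDING, master:%s" % master)

  def cancel_task(task_id):
      """Placeholder: cancel one swarming task by id."""
      raise NotImplementedError("call the Swarming cancel endpoint for %s" % task_id)

  def cancel_stale_tasks(master):
      stale = list_pending_task_ids(master)
      for task_id in stale:
          cancel_task(task_id)
      return len(stale)

  # Would run as part of master startup (e.g. invoked from the buildbot makefile):
  # cancel_stale_tasks("chromium.perf")
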
If you look at a 7d plot, the numbers are wildly above the norm, even with 4500 from perf. The initial spike also happens a lot earlier than the perf waterfall restart.
As a recap: it looks like swarming was actually working as intended; however, because of the 12.04 -> 14.04 Ubuntu migration, the 12.04 pool is severely under capacity :(. Completing the migration should fix this.
But the size of swarming's Precise pool hasn't significantly changed recently? http://shortn/_bdU63v8Em8
Blockedon: 671372
Hm... so today things look pretty sane. I'm really wondering if we got hit by some internal prod load testing stuff...

Comment 26 by s...@google.com, Feb 14 2017

Regarding #22, I didn't make any changes to Machine Provider yesterday. The whole 12.04 -> 14.04 migration will be today.
Did the migration happen? Should this be marked as fixed?
Cc: efoo@chromium.org
Unowned Pri-0 :( +efoo
Cc: iannucci@chromium.org phajdan.jr@chromium.org
Owner: sergeybe...@chromium.org
Assigning to the current trooper... why didn't the other troopers take this on as owner?
I think this might be fixed, but can't tell at the moment because tryserver.chromium.linux is down :(.
Status: Assigned (was: Available)
http://shortn/_z2alXMrdHH indeed shows a bunch of tryserver.chromium.linux swarming jobs pending today.
Just to rule out the obvious: the generated build load on the Linux tryserver seems to be as usual: http://shortn/_PwjIAIjb3D (requested builds per hour). So it's not that.
And the runtime for browser_tests seems to be as usual as well: http://shortn/_dxaboh1asN
Which raises the question: how did it work before?
OK, we did migrate our swarming fleet to Trusty recently: issue 664296.
What we found: all 1200 bots were converted to Trusty, and some 100+ bots died today. But that's just ~10% of our fleet and shouldn't really take us down...

Comment 40 by s...@google.com, Feb 16 2017

Sorry I didn't mention this when we were talking, but the 100 bots that died were completely normal. Leases are refreshed daily, and different leases expire at different times. The 100 or so that died had actually just expired, and it took a few minutes to get replacements. This is normal behavior and shouldn't have any significant impact on us.

As of writing, 1208/1210 Trusty VMs are available.
Owner: ----
Status: Available (was: Assigned)
I'm not actively working on this right now - making it available for the next trooper to see. It appears that issue 693668 may be contributing to the overload.
Looking at this now.
According to http://shortn/_v0jPabDEJF, the pending queues started growing again ~10 hours ago and reached 22k jobs. Right now they have dropped back to 5-8k, but that's still higher than the usual 2-4k.

There are 4181 pending builds total:
 - 1126 on tryserver.chromium.linux
 - 2167 on chromium.perf
Interestingly, the number of executors has steadily dropped according to http://shortn/_L0fLYTi4wQ. The timing doesn't seem to match, though (pending builds start to grow a few hours earlier), so perhaps this is a red herring. I am surprised that bots disappear from that graph instead of just being reported as dead.
Cc: sergeybe...@chromium.org
Sergey, can you please explain how you identified the most loaded pool involved in #34 and the most loaded step in #35?

According to http://shortn/_QFTty0lmNo, we've migrated back from 14.04 to 12.04. That also explains why the previous graph showed a dropping number of bots. I wonder if this could be related to the spike. Sana, why did we migrate back?
Labels: -Pri-0 Pri-1
The pending builds are back to normal. I'm going to decrease the priority on this.
Mergedinto: -691759 692708
Status: Duplicate (was: Available)
I think we can dup this into bug 692708.
