swarming overloaded: base_unittests took 37 min 'cause it was pending for 2319s
Issue description
base_unittests took 37 min on https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/389203, 2319s of that in "pending". Looks like swarming slaves on linux are currently swamped.
,
Feb 14 2017
,
Feb 14 2017
Different issue
,
Feb 14 2017
bumping this to p0 because it's actually pretty important
,
Feb 14 2017
We spiked to 25,000 pending jobs recently, and right now there are still like 15,000.
,
Feb 14 2017
yes, and they're falling, but I'm not making any headway on debugging the root cause.
,
Feb 14 2017
7500 of the pending tasks are from tryserver.chromium.linux, 4800 are from chromium.perf.
,
Feb 14 2017
I was just about to post that they're all in the 'chromium' project. smut@ where did you find the breakdown by master?
,
Feb 14 2017
I used trial and error on the task list. Apply various filters and see how many tasks there are pending. For example, here's a filter for pending tasks on tryserver.chromium.linux. It says for the selected filters there are currently 7500 tasks: https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=user&c=bot&c=master&c=buildername&c=os&f=state%3APENDING&f=master%3Atryserver.chromium.linux&l=1000&s=created_ts%3Adesc
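For reference, a minimal Python sketch that assembles the same kind of task-list filter URL; the column and filter parameters are copied from the URL above, and the exact set of filters the UI supports is an assumption rather than something verified here.

from urllib.parse import urlencode

def pending_tasks_url(master, limit=1000):
    # Same columns and filters as the URL above: pending tasks for one master,
    # newest first, up to `limit` rows.
    params = [
        ('c', 'name'), ('c', 'state'), ('c', 'created_ts'), ('c', 'user'),
        ('c', 'bot'), ('c', 'master'), ('c', 'buildername'), ('c', 'os'),
        ('f', 'state:PENDING'),
        ('f', 'master:%s' % master),
        ('l', str(limit)),
        ('s', 'created_ts:desc'),
    ]
    return 'https://chromium-swarm.appspot.com/tasklist?' + urlencode(params)

print(pending_tasks_url('tryserver.chromium.linux'))
print(pending_tasks_url('chromium.perf'))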
,
Feb 14 2017
Looks like chromium.perf was restarted and it interrupted a bunch of in-progress builds. Maybe a bunch of duplicate tasks got triggered when the bots started rebuilding? https://uberchromegw.corp.google.com/i/chromium.perf/waterfall +cc martiniss, looks like you restarted chromium.perf for https://chromium.googlesource.com/chromium/tools/build/+/885076f57154cdf2bc03d6e056c1831795387573. Did you use master manager? It should have waited for the in-progress builds to complete instead of interrupting them.
,
Feb 14 2017
I remember the master restart from earlier today: https://chrome-internal.googlesource.com/infradata/master-manager/+/163363e98880615d4f040e007ced5ebe991050f1. I'm not sure how that could lead to duplicates, though.
,
Feb 14 2017
I don't think chromium.perf would add *that* many pending builds. Some quick math: 16 builders * 80 tests per bot * 5 bots = 6,400 jobs per build run on buildbot. So we would add a max of 12,800 jobs if a build is interrupted and then re-scheduled. Well, maybe I was wrong. I do know that when we restart the master, we get a bunch of duplicate jobs, which overloads our bots.
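Spelling out that back-of-the-envelope math (all inputs are the rough figures quoted above, not measured values):

# Rough figures quoted above, not measured values.
builders = 16
bots_per_builder = 5
tests_per_bot = 80

tasks_per_waterfall_pass = builders * bots_per_builder * tests_per_bot
print(tasks_per_waterfall_pass)      # 6400 swarming jobs per full buildbot pass

# Worst case: an interrupted pass leaves its jobs behind and a re-scheduled
# pass triggers a full new set on top of them.
print(2 * tasks_per_waterfall_pass)  # 12800 jobs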
,
Feb 14 2017
It looks like the upward trend started at ~10:40 AM (PST)
,
Feb 14 2017
It does lead to duplicates because we trigger all jobs at the beginning of the build. The builds take about 6 hours on average (sadly), and each job sequentially runs on bots, so there are jobs pending until the very end. So, if we interrupt a bot halfway through its build, then we have about half of the jobs pending. Then, we run another buildbot build once the master restarts, and trigger a whole new set of jobs.
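A rough illustration of the duplication described here, reusing the ~6400-job figure from #12; the 50% interruption point is an assumption for the example, not a measurement:

# Illustrative only: 6400 comes from the quick math in #12, and "interrupted
# halfway through the build" is an assumed point, not a measurement.
tasks_per_build = 6400
fraction_left_at_restart = 0.5

orphaned = int(tasks_per_build * fraction_left_at_restart)  # jobs still pending from the old build
retriggered = tasks_per_build                               # fresh set once the new build starts
print(orphaned + retriggered)                               # ~9600 jobs in the queue after the restart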
,
Feb 14 2017
Well, hypothetically: if every builder had triggered tests but not yet collected the results, and the master was forcibly restarted and interrupted all their builds (which happened, judging by the amount of purple on chromium.perf), then all those builds could have re-run and triggered the same tests again. But that doesn't seem like enough duplication to explain 4500 pending tasks for chromium.perf, and it certainly doesn't explain 7500 pending tasks on tryserver.chromium.mac.
,
Feb 14 2017
I think you meant on tryserver.chromium.linux?
,
Feb 14 2017
Re #15: According to the quick math I did in #12, I think we average about 4500 pending tasks for chromium.perf :(
,
Feb 14 2017
Oh, with the explanation in #12 it definitely explains 4500 pending tasks on chromium.perf. Maybe the 7500 pending on tryserver.chromium.linux are normal and the bulk of the 25,000 pending from earlier was chromium.perf? Not sure how to check which master had tasks pending historically. Is there a way not to interrupt chromium.perf bots in the future? Would restarting chromium.perf at EOD help or would it still be running 6 hour builds at night?
,
Feb 14 2017
Yeah regarding #16 I meant tryserver.chromium.linux not tryserver.chromium.mac.
,
Feb 14 2017
No, there isn't really a way to not interrupt chromium.perf. They're always running tests. We're trying to drive cycle time down, which would help a lot. We're also trying to reduce the number of jobs we trigger, since we do trigger jobs unnecessarily, which would help with this. We could require a long drain before you actually restart the master. It'd be annoying, but it would help. I had also floated the idea that we should have something in the buildbot makefile to cancel any leftover swarming tasks when we start up a master.
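A hypothetical sketch of that last idea (cancel leftover tasks when a master starts up). The /tasks/list and /task/<id>/cancel endpoint paths and parameters are assumptions, and authentication is omitted entirely, so treat this as the shape of the idea rather than working tooling:

import json
import urllib.request

SWARMING = 'https://chromium-swarm.appspot.com/_ah/api/swarming/v1'  # assumed API root

def pending_task_ids(master):
    # Assumed endpoint: list PENDING tasks tagged with this master.
    url = '%s/tasks/list?state=PENDING&tags=master:%s' % (SWARMING, master)
    with urllib.request.urlopen(url) as resp:
        return [t['task_id'] for t in json.load(resp).get('items', [])]

def cancel_leftover_tasks(master):
    # Assumed endpoint: cancel one task by id. A real version would need
    # authenticated requests and error handling.
    for task_id in pending_task_ids(master):
        req = urllib.request.Request('%s/task/%s/cancel' % (SWARMING, task_id),
                                     data=b'{}', method='POST')
        urllib.request.urlopen(req)

# Would run from the master's startup step, e.g.:
# cancel_leftover_tasks('chromium.perf')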
,
Feb 14 2017
If you look at a 7d plot, the numbers are wildly over the norm, even with 4500 from perf. The initial spike also happens a lot earlier than the perf waterfall restart.
,
Feb 14 2017
As a recap: it looks like swarming was actually working as intended; however, because of the 12.04 -> 14.04 Ubuntu migration, the 12.04 pool is severely under capacity :(. Completing the migration should fix this.
,
Feb 14 2017
But the size of swarming's precise pool hasn't significantly changed recently? http://shortn/_bdU63v8Em8
,
Feb 14 2017
Hm... so today things look pretty sane. I'm really wondering if we got hit by some internal prod load testing stuff...
,
Feb 14 2017
Regarding #22, I didn't make any changes to Machine Provider yesterday. The whole 12.04 -> 14.04 migration will be today.
,
Feb 15 2017
Did the migration happen? Should this be marked as fixed?
,
Feb 15 2017
unowned Pri0 :( +efoo
,
Feb 15 2017
Assigning to current trooper ... why didn't the other troopers take this on as owner?
,
Feb 15 2017
I think this might be fixed, but can't tell at the moment because tryserver.chromium.linux is down :(.
,
Feb 15 2017
http://shortn/_z2alXMrdHH indeed shows a bunch of tryserver.chromium.linux swarming jobs pending today.
,
Feb 15 2017
However, only ~5K out of a total of ~37K pending jobs are from tryserver.chromium.linux: https://viceroy.corp.google.com/chrome_infra/Jobs/per_job?service_name=chromium-swarm&job_regexp=tryserver.chromium.linux%3A.*&duration=1d&refresh=-1 Hm...
,
Feb 15 2017
https://viceroy.corp.google.com/chrome_infra/Jobs/pools?duration=1d&job_regexp=tryserver.chromium.linux%3A.%2A&pool=cores%3A8%7Ccpu%3Ax86%7Ccpu%3Ax86-64%7Cgpu%3Anone%7Cmachine_type%3An1-standard-8%7Cos%3ALinux%7Cos%3AUbuntu%7Cos%3AUbuntu-14.04%7Cpool%3AChrome%7Czone%3Aus-central1-f&refresh=-1&service_name=chromium-swarm shows the largest pool involved (1024 bots) which is currently overloaded (!! ^o^ !!)
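Decoding the pool= parameter from that URL shows which bot dimensions define the 1024-bot pool (the encoded value below is copied straight from the URL; only the decoding code is new):

from urllib.parse import unquote

pool = ('cores%3A8%7Ccpu%3Ax86%7Ccpu%3Ax86-64%7Cgpu%3Anone%7C'
        'machine_type%3An1-standard-8%7Cos%3ALinux%7Cos%3AUbuntu%7C'
        'os%3AUbuntu-14.04%7Cpool%3AChrome%7Czone%3Aus-central1-f')

dimensions = {}
for pair in unquote(pool).split('|'):
    key, value = pair.split(':', 1)
    dimensions.setdefault(key, []).append(value)

print(dimensions)
# {'cores': ['8'], 'cpu': ['x86', 'x86-64'], 'gpu': ['none'],
#  'machine_type': ['n1-standard-8'], 'os': ['Linux', 'Ubuntu', 'Ubuntu-14.04'],
#  'pool': ['Chrome'], 'zone': ['us-central1-f']}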
,
Feb 16 2017
A good chunk of that pool is taken by browser_tests: https://viceroy.corp.google.com/chrome_infra/Jobs/per_job?service_name=chromium-swarm&job_regexp=tryserver.chromium.linux%3A.*%3Abrowser_tests&duration=1d&refresh=-1 About 50% of it.
,
Feb 16 2017
Just to rule out the obvious: the generated build load on the linux tryserver seems to be at its usual level: http://shortn/_PwjIAIjb3D (requested builds per hour). So it's not that.
,
Feb 16 2017
And the runtime for browser_tests seems to be at its usual level as well: http://shortn/_dxaboh1asN
,
Feb 16 2017
Which begs the question - how did it work before??
,
Feb 16 2017
OK, we did migrate our swarming fleet to Trusty recently: issue 664296. What we found: all 1200 bots were converted to Trusty, and some 100+ bots died today. But that's just ~10% of our fleet and shouldn't really take us down...
,
Feb 16 2017
Sorry I didn't mention this when we were talking, but the 100 that died was completely normal. Leases are refreshed daily and different leases expire at different times. The 100 or so that died actually just expired and it took a few minutes to get replacements. This is normal behavior and shouldn't have any significant impact on us. As of writing, 1208/1210 Trusty VMs are available.
,
Feb 22 2017
I'm not actively working on this right now - making available for the next trooper to see. It appears that issue 693668 may be contributing to the overload.
,
Feb 22 2017
Looking at this now.
,
Feb 22 2017
According to http://shortn/_v0jPabDEJF, the pending queues started growing again ~10 hours ago and reached 22k jobs. Right now they have dropped back to 5-8k, which is still higher than the usual 2-4k. There are 4181 pending builds total:
- 1126 on tryserver.chromium.linux
- 2167 on chromium.perf
,
Feb 22 2017
Interestingly, the number of executors has steadily dropped, according to http://shortn/_L0fLYTi4wQ. The timing doesn't seem to match though (pending builds started growing a few hours earlier), so perhaps this is a red herring. I am surprised that bots disappear from that graph instead of just being reported as dead?
,
Feb 22 2017
Sergey, can you please explain how you identified the most loaded pool in #34 and the most loaded step in #35? According to http://shortn/_QFTty0lmNo, we've migrated back from 14.04 to 12.04. That also explains why the previous graph showed a dropping number of bots. I wonder if this could be related to the spike. Sana, why did we migrate back?
,
Feb 22 2017
The pending builds are back to normal. I'm going to decrease priority on this.