Issue 675986

Starred by 1 user

Issue metadata

Status: WontFix
Owner: ----
Closed: Nov 7
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug-Regression

Blocking:
issue 691582




Restarting perf waterfall can lead to swarmed desktop bots oscillating between 100+ expiring jobs to none every other run

Project Member Reported by eyaich@chromium.org, Dec 20 2016

Issue description

There is something odd happening on a good chunk of the swarming bots, and it doesn't seem to be specific to a platform. We will have a couple of normal builds (i.e., 1-3 test failures) followed by a build with a large number of failing tests due to expiring jobs, not test failures. This happens sometimes every other run and sometimes every 3-5 runs.

Revision range first seen:
Seems to be happening sometime after revision 439341

Recent failure on Mac Retina Perf, 116 expired jobs:

https://uberchromegw.corp.google.com/i/chromium.perf/builders/Mac%20Retina%20Perf/builds/89

Mac Pro 10.11 Perf: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Mac%20Pro%2010.11%20Perf

Mac Air 10.11 Perf: 
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Mac%20Air%2010.11%20Perf

Linux Perf: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Linux%20Perf

Win 7 Nvidia GPU perf: 
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%207%20Nvidia%20GPU%20Perf

Win 7 x64 Perf: https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%207%20x64%20Perf

Win 8 Perf: 
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%208%20Perf

Win Zenbook Perf: 
https://uberchromegw.corp.google.com/i/chromium.perf/builders/Win%20Zenbook%20Perf
 
Status: Started (was: Untriaged)
I want to get some graphs from swarming.
Per chat with maruel@, the problem is that the builds are taking long enough that the triggered jobs are expiring. The short-term solution should be to extend the expiration time on swarming jobs.

I'll double check this is the problem. The chromium.perf waterfall is restarting now, and it takes a while to come back up.

Comment 4 by eyaich@chromium.org, Dec 20 2016

Wait, I don't think that makes sense. They were doing fine for a few weeks. Why all of a sudden are we having this much trouble, and why not consistently?

I don't think just extending the timeout is a good solution.  I think more investigation into which tests and when is necessary before we jump to that.
https://luci-milo.appspot.com/buildbot/chromium.perf/Win%20Zenbook%20Perf/83 is an example build which took 7 hours. The swarming task timeout is 6 hours, so it makes sense that tasks would expire.

I'm not sure how to get timing information for tests... Let me look around.
Ok, looking at some stuff...

https://chromium-swarm.appspot.com/task?id=33361ee331744110&refresh=10&show_raw=1 has a longer overhead than we had before. This is on a Win 10 Perf. So that doesn't look good. Let me confirm we used to have ~0 second overhead before.
Yup, I've seen ~20 second overhead on a lot of these bots.

Did we recently change the isolate? I know we removed the... tracing JSON data? And then re-added it?
Going to lunch, but that sounds suspect. I'll see if buildbot runtimes have increased, if I can. Data doesn't seem to be in Dremel....
I don't know what is happening...

To be clear, there's a big set of expired jobs that happened at about "Dec 17 13:27". This is a known issue with how master restarts work. Master restarts kill builds before they finish. This is a problem for perf, because the way perf works is that it triggers 10 (not the actual number, but for this example) swarming tasks per bot at the beginning of the run, and then collects their results over the course of the run. If the master gets restarted during a build, it has only waited for 5 of the 10 tasks to finish, so there are 5 more pending jobs. 10 more will get queued up once the next build starts running, and so we'll end up with a lot of expired jobs, because there is not enough capacity to handle the (mistakenly high) load on the bots.

Not sure what the best way to handle that case is :/
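
For illustration, here is a toy model of the restart scenario described above. It is only a sketch under assumptions: time is counted in abstract ticks, one bot runs one task per tick in FIFO order, and a task expires once it has been pending for expiration_ticks. The function name simulate and all of the numbers are hypothetical and are not the real swarming or perf waterfall configuration.

```python
from collections import deque


def simulate(build_lengths, tasks_per_build=10, expiration_ticks=10):
    """Return the number of expired tasks, indexed by the build that triggered them.

    build_lengths: ticks each build runs before the next build starts.
    A length smaller than tasks_per_build models a master restart that
    kills the build before all of its tasks were collected.
    All values are illustrative assumptions, not real waterfall settings.
    """
    pending = deque()  # (trigger_tick, build_index) for each pending task
    expired = [0] * len(build_lengths)
    now = 0
    for build, length in enumerate(build_lengths):
        # A build triggers all of its tasks up front.
        pending.extend((now, build) for _ in range(tasks_per_build))
        for _ in range(length):
            # Drop tasks that have been pending for too long.
            while pending and now - pending[0][0] >= expiration_ticks:
                _, owner = pending.popleft()
                expired[owner] += 1
            # The bot runs at most one task per tick.
            if pending:
                pending.popleft()
            now += 1
    # Tasks still pending after the last build are not counted.
    return expired


if __name__ == "__main__":
    print(simulate([10, 10, 10]))     # no restart
    print(simulate([10, 5, 10, 10]))  # second build killed halfway
```

In this model a run without restarts never expires anything, while killing the second build halfway leaves 5 tasks pending, so the bot has 15 tasks to get through during the next build and the ones at the back of the queue expire before it reaches them.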

This hasn't happened in a while, so I think my original hypothesis was correct. Lemme go through the builders and verify that things were semi-stable over the holidays.
Blocking: 691582
Owner: ----
Status: Available (was: Started)
This is no longer happening. 

It's known that restarting the waterfall will cause this. We should probably handle that better. But there's no unknown bug here, thankfully.
Labels: -Pri-1 Pri-2
Summary: Restarting perf waterfall can lead to swarmed desktop bots oscillating between 100+ expiring jobs to none every other run (was: Perf waterfall swarmed desktop bots oscillating between 100+ expiring jobs to none every other run)
Updated description and priority since we're now tracking the problem in #12
Project Member

Comment 14 by sheriffbot@chromium.org, Feb 16 2018

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available. If you change it back, also remove the "Hotlist-Recharge-Cold" label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Status: WontFix (was: Untriaged)
This regression has been open for half a year. It's not very actionable and the regression has been in all Chrome users' hands for months.
