Issue 682649


Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 691582




webrtc.peerconnection.reference fails because of not enough capacity

Project Member Reported by skyos...@chromium.org, Jan 19 2017

Issue description

Example failure:

https://build.chromium.org/p/chromium.perf/builders/Linux%20Perf/builds/264

webrtc.peerconnection.reference on (102b) GPU on Linux on Ubuntu-14.04
Bot id: 'build150-m1'
Run on OS: 'Ubuntu-14.04' ( 0 secs )
stdio [stdout]
outdir_json [logdog]
no_results_exc [logdog]
invalid_results_exc [logdog]
chartjson_info [logdog]
shard #0 expired, not enough capacity


 
Not sure if I should disable the test or not. Would that just move the failure elsewhere?
Owner: martiniss@chromium.org
Stephen, can you take a look? Is there documentation about what sheriffs should do here?
Status: Assigned (was: Untriaged)
These bugs are hard to diagnose, sadly. Usually the cause of this is that an earlier test started failing or taking longer.

In this case, what happened is that the triggered task (https://chromium-swarm.appspot.com/task?id=33cd772be045dc10&refresh=10&show_raw=1) just barely hit the expiration time set for it. It's unclear why this task didn't end up getting run; you have to look at the previous tasks to figure out why. I want to make a little script that grabs run times from swarming, so we can see which tests get worse over time. I'll bring this up in the speed/infra sync in about an hour.
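That script doesn't exist yet; a minimal sketch of the idea is below. The endpoint path, the name: tag filter, and the result field names are assumptions based on the public Swarming API rather than this bot's actual setup, and the request may need authentication:

# Hypothetical sketch: list recent Swarming tasks for one benchmark and print
# how long each one sat pending vs. how long it actually ran.
import requests
from datetime import datetime

SWARMING = "https://chromium-swarm.appspot.com/_ah/api/swarming/v1"
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"

def task_times(benchmark, limit=50):
    # "name:<benchmark>" is a guess at how these triggered tasks are tagged.
    resp = requests.get(
        "%s/tasks/list" % SWARMING,
        params={"tags": "name:%s" % benchmark, "limit": limit},
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        if "started_ts" not in item:
            continue  # expired or still-pending tasks never started
        created = datetime.strptime(item["created_ts"], TS_FORMAT)
        started = datetime.strptime(item["started_ts"], TS_FORMAT)
        pending = (started - created).total_seconds()
        yield item["task_id"], pending, item.get("duration", 0.0)

for task_id, pending, duration in task_times("webrtc.peerconnection.reference"):
    print("%s pending=%ds run=%ds" % (task_id, pending, duration))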

I'm probably the correct owner for this. 
Labels: -Pri-3 Pri-1
Status: Started (was: Assigned)
I did some spelunking today. Actually a lot of spelunking.

The root problem here is that the tests on the bot this test needs to run on are taking too long. This is a problem because we trigger all tests at the beginning of the run and give each triggered task an expiration of about 6 hours. That is usually enough, but not always; this is one of the cases where it isn't.
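To make the failure mode concrete, here is a purely illustrative calculation. The per-benchmark runtimes are made up; only the roughly 6-hour expiration comes from the description above. Because everything is triggered up front, a task's pending time is the combined runtime of whatever is queued ahead of it on the bot, and the task expires once that exceeds the window:

# Illustrative only: tasks triggered at the start of the build, run serially
# on one bot, with a fixed expiration window.
EXPIRATION_SECS = 6 * 60 * 60  # ~6 hours

queue = [
    ("benchmark_a", 7200),   # made-up runtimes, in seconds
    ("benchmark_b", 7800),
    ("benchmark_c", 7500),
    ("webrtc.peerconnection.reference", 900),
]

elapsed = 0
for name, runtime in queue:
    if elapsed > EXPIRATION_SECS:
        print("%s: expired after pending %ds (not enough capacity)" % (name, elapsed))
        continue
    print("%s: starts after pending %ds" % (name, elapsed))
    elapsed += runtime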

You can see that the total test runtime for this bot has gone up over time. Compare the pending time for the task run directly before this one in two different builds. https://chromium-swarm.appspot.com/task?id=3378cd2375e7aa10&refresh=10&show_raw=1 is from about 2 weeks ago; it had a pending time of 5h 14m 4s. https://chromium-swarm.appspot.com/task?id=33d02ad39455a010&refresh=10&show_raw=1 is from today; it had a pending time of 5h 52m 17s. So the pending time has grown by roughly 40 minutes in two weeks. If it keeps growing like that, the task directly before this one will start expiring as well.
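For reference, the gap between the two pending times quoted above works out like this:

from datetime import timedelta

two_weeks_ago = timedelta(hours=5, minutes=14, seconds=4)
today = timedelta(hours=5, minutes=52, seconds=17)

print(today - two_weeks_ago)       # 0:38:13 of growth over two weeks
print(timedelta(hours=6) - today)  # 0:07:43 left, assuming exactly a 6-hour expiration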

I'm trying to gather data about run times for all tests, so that we can start to track which tests are getting slower over time. There is some viceroy data available, but it is fairly hard to use (http://shortn/_T8OEtp1pFQ is an example for Linux Perf; I can't see any obvious cause of the time increase there). I also made a spreadsheet (https://docs.google.com/spreadsheets/d/1S-bt-2XhmbLlCEtYS9wKJbwhKKzzpQq6gm8WWLGqpA8/edit#gid=1804831563) with times for all the tests. Right now it includes disabled benchmarks in its mean calculations, so it isn't very useful yet, but I'm collecting more accurate data and hope to have it in usable shape by tomorrow.
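The skew from disabled benchmarks is straightforward to avoid once the means are computed in code; a sketch below, with a made-up record layout since the spreadsheet's real columns aren't shown here:

# Sketch: per-benchmark mean runtimes, skipping disabled benchmarks so they
# don't drag the averages down. The record layout and data are hypothetical.
from collections import defaultdict

runs = [
    # (benchmark, runtime_secs, enabled)
    ("webrtc.peerconnection.reference", 540, True),
    ("smoothness.top_25_smooth", 1800, True),
    ("some.disabled_benchmark", 0, False),
]

times_by_benchmark = defaultdict(list)
for benchmark, runtime, enabled in runs:
    if not enabled:
        continue
    times_by_benchmark[benchmark].append(runtime)

for benchmark, times in sorted(times_by_benchmark.items()):
    mean = sum(times) / float(len(times))
    print("%s: mean %.1fs over %d runs" % (benchmark, mean, len(times)))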

Generally, this problem needs to be solved at the swarming level. Increasing the task expiration would make these failures go away, but it would also increase cycle time for the builders, which causes its own problems. The real solution is to monitor these test times, alert when they grow, and keep working to make the tests faster.
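The monitoring could start out very simple; one possible check is sketched below. The baseline window and the 20% threshold are arbitrary placeholder choices, not an existing alerting policy:

# Sketch: flag a benchmark whose recent runtimes have grown noticeably
# relative to an older baseline. Thresholds are placeholders.
def regressed(samples, baseline_window=20, recent_window=5, threshold=1.2):
    """samples: runtimes in seconds, oldest first."""
    if len(samples) < baseline_window + recent_window:
        return False  # not enough history to judge
    baseline = samples[:baseline_window]
    recent = samples[-recent_window:]
    baseline_mean = sum(baseline) / float(len(baseline))
    recent_mean = sum(recent) / float(len(recent))
    return recent_mean > threshold * baseline_mean

# Example: a benchmark drifting from ~600s to ~780s trips the check.
history = [600] * 20 + [700, 740, 780, 790, 800]
print(regressed(history))  # True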
Blocking: 691582
Status: Fixed (was: Started)
The test is passing again after the timeouts were increased.
