Issue 662307

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 705104




"telemetry_gpu_unittests (with patch)" is flaky

Project Member Reported by chromium...@appspot.gserviceaccount.com, Nov 4 2016

Issue description

"telemetry_gpu_unittests (with patch)" is flaky.

This issue was created automatically by the chromium-try-flakes app. Please find the right owner to fix the respective test/step and assign this issue to them. If the step/test is infrastructure-related, please add Infra-Troopers label and change issue status to Untriaged. When done, please remove the issue from Sheriff Bug Queue by removing the Sheriff-Chromium label.

We have detected 8 recent flakes. List of all flakes can be found at https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR0ZWxlbWV0cnlfZ3B1X3VuaXR0ZXN0cyAod2l0aCBwYXRjaCkM.



This flaky test/step was previously tracked in issue 637200.
 

Comment 1 by xlai@chromium.org, Nov 7 2016

Labels: -Sheriff-Chromium Infra-Troopers
I looked at all of those flaky builds, and they all ended up purple, with telemetry_gpu_unittests reporting "TEST RESULTS WERE INVALID". I think this is an infra problem. Please check whether it is still happening; if not, this bug can be closed.
Owner: sergeybe...@chromium.org
Status: Assigned (was: Untriaged)
Will look a bit later today (swamped now)
The tests seem to fail due to a lack of capacity on Swarming: the few failing shards I looked at are shown as "Expired".

The largest pool used for this builder on swarming indeed seems to be overloaded:

http://vi/chrome_infra/Jobs/pools?duration=1d&job_regexp=tryserver.chromium.win.%2A&pool=cores%3A8%7Ccpu%3Ax86%7Ccpu%3Ax86-64%7Cgpu%3Anone%7Cmachine_type%3An1-highcpu-8%7Cos%3AWindows%7Cos%3AWindows-7-SP1%7Cpool%3AChrome&refresh=-1&service_name=chromium-swarm
There was a definite influx of expired tasks late Nov 3 - early Nov 4, which is when flakes were reported: 

http://vi/chrome_infra/Buildbot/per_builder?builder=win_chromium_rel_ng&duration=7d&job_regexp=tryserver.chromium.win.%2A&master=master.tryserver.chromium.win&refresh=-1&service_name=chromium-swarm&utc_end=1478564683#_VG_JypEDXSe

Not sure what the trigger was, but we are low on capacity, so for the time being this is probably unavoidable until we get more machines.
Cc: vhang@chromium.org phajdan.jr@chromium.org
+vhang@ and phajdan.jr@ - FYI for capacity of the windows swarming pool - see #c3. The pool is running pretty much at full capacity at peaks. We should be aiming for ~75% peak load on average.

Comment 6 by vhang@chromium.org, Nov 11 2016

Sergey,

Can you tell me how many bots are in each of the oversubscribed pools? Let's fight the fire by adding 10-20% more to each pool to see if that helps. I think we're taking too long to analyze the amount we need when we can quickly pad the pools and then analyze the numbers in detail later. Your thoughts?
Labels: -Pri-1 Pri-2
Apparently, this dropped off my radar for too long - sorry. I'll try to get back to this once other immediate fires are dealt with.

Comment 8 by chromium...@appspot.gserviceaccount.com, Dec 13 2016

Detected 3 new flakes for test/step "telemetry_gpu_unittests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR0ZWxlbWV0cnlfZ3B1X3VuaXR0ZXN0cyAod2l0aCBwYXRjaCkM. This message was posted automatically by the chromium-try-flakes app.
Looking at this again: the Swarming pool for this step hasn't changed, and it is just a single pool of 319 bots (https://goto.google.com/rwtytb).

I expect this pool to overload again once we are fully back from holidays...
Checked it again today - and sure enough, the pool is running at capacity again.

vhang: any chance we can increase this pool? Here's the current list of bots: http://shortn/_V3CTyDxZ0A

Some samples: vm1-m4, vm10-m4, vm103-m4, etc.

Owner: vhang@chromium.org
Updated pool link: https://goto.google.com/rwtytb

It is still close to full capacity, and is still expiring tasks: http://shortn/_llrRfJuioc

Assigning to vhang@ - please check if it is possible to add more capacity to the pool. Thanks!

Comment 12 by vhang@chromium.org, Mar 20 2017

How many more Win7 VMs would you like?
Ideally, I'd ask for another 100 bots. Is that feasible? Or as many as you can if that's too much.

I can't easily estimate the actual expected load, so my reasoning is this: the pool is maxing out now at 324 bots, and we want peaks to be at ~75% capacity, so let's add 30% on top (which comes out to ~100 bots). If the current peak is a true peak, it will then end up at roughly 75%.

In reality, the current true peak is likely higher, but we'll only know for sure once we add enough capacity.
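
As a rough check of the sizing arithmetic above, here is a minimal sketch. It assumes the observed peak load equals the full current pool of 324 busy bots, which is only a lower bound since the true peak is likely higher. The function names are hypothetical and are not part of any Swarming tooling.

```python
import math


def bots_to_add(current_bots: int, peak_busy_bots: int, target_peak: float) -> int:
    """Bots to add so that the observed peak lands at the target utilization.

    Assumption: peak_busy_bots is the true peak demand. If the pool is
    saturated, the real demand may be higher than what was observed.
    """
    if not 0 < target_peak <= 1:
        raise ValueError("target_peak must be in (0, 1]")
    required_total = math.ceil(peak_busy_bots / target_peak)
    return max(0, required_total - current_bots)


def peak_utilization(current_bots: int, peak_busy_bots: int, added_bots: int) -> float:
    """Fraction of the enlarged pool that is busy at the observed peak."""
    return peak_busy_bots / (current_bots + added_bots)


if __name__ == "__main__":
    pool = 324  # current pool size, saturated at peak (from this thread)
    print("bots needed for a 75% peak:", bots_to_add(pool, pool, 0.75))
    print("peak utilization after adding 100 bots:",
          round(peak_utilization(pool, pool, 100), 3))
```

Under these assumptions, hitting exactly 75% would take 108 additional bots, so the ~100 requested here would leave the observed peak slightly above the target.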

Thanks!

Comment 14 Deleted

Comment 15 by kbr@chromium.org, Mar 24 2017

Blockedon: 705104

Comment 16 by vhang@chromium.org, Mar 24 2017

Assigning to johnw to handle this.  We had some server shuffling in the golo and will have to wait for b/35753978 to be complete before we can free up the servers.

Comment 17 by kbr@chromium.org, Mar 24 2017

Sorry about this -- it looks like there has been a longstanding TODO in the recipe code that's preventing Swarming from de-duplicating these runs. I just filed P1 Issue 705104 about fixing this. This should ultimately reduce the load on the Swarming pool, but I'd be surprised if it was just this one target that's causing all of the problems.


Comment 18 by chromium...@appspot.gserviceaccount.com, May 7 2017

Detected 3 new flakes for test/step "telemetry_gpu_unittests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR0ZWxlbWV0cnlfZ3B1X3VuaXR0ZXN0cyAod2l0aCBwYXRjaCkM. This message was posted automatically by the chromium-try-flakes app.

Comment 19 by chromium...@appspot.gserviceaccount.com, May 8 2017

Detected 4 new flakes for test/step "telemetry_gpu_unittests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR0ZWxlbWV0cnlfZ3B1X3VuaXR0ZXN0cyAod2l0aCBwYXRjaCkM. This message was posted automatically by the chromium-try-flakes app.
I believe we are still blocked on b/35753978 for capacity. Based on comment #17, has the number of required slaves changed? 

Comment 21 by jo...@chromium.org, May 11 2017

Cc: dpranke@chromium.org

Comment 22 by chromium...@appspot.gserviceaccount.com, Jan 3 2018

Detected 3 new flakes for test/step "telemetry_gpu_unittests (with patch)". To see the actual flakes, please visit https://chromium-try-flakes.appspot.com/all_flake_occurrences?key=ahVzfmNocm9taXVtLXRyeS1mbGFrZXNyLwsSBUZsYWtlIiR0ZWxlbWV0cnlfZ3B1X3VuaXR0ZXN0cyAod2l0aCBwYXRjaCkM. This message was posted automatically by the chromium-try-flakes app.
