
Issue 635723

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Windows
Pri: 2
Type: Bug

Blocked on:
issue 637965
issue 634877
issue 636426




Detect misconfigured GPUs on Windows Perfbots

Project Member Reported by sullivan@chromium.org, Aug 9 2016

Issue description

In bug 634877, and also on build85-b1 on the chromium.perf.fyi waterfall, we're seeing issues where something is misconfigured and the GPU is not being used. We'll need to fix these individually, but we should also add some kind of test to catch this if it happens again. A build step (similar to host_info for Android) seems like the clearest way to catch this kind of problem.

Ken, does GPU team already have something we could use? Or do you have any advice on implementing something like this?
 

Comment 1 by kbr@chromium.org, Aug 9 2016

Blockedon: 634877
The Telemetry-based tests that run on bots like these:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Win7%20Release%20(NVIDIA)

in particular, the GpuProcess tests, might be reasonable tests to run in order to have a firm expectation that (a) a GPU process launches successfully on these machines and (b) that certain features like 2D canvas cause the GPU process to start running.

However, per comments on Issue 634877, it sounds like the machines are not actually misconfigured, but that the test is running on a machine where it shouldn't be. In this case, it seems reasonable for the test to detect that it should be able to run, can't run because the machine doesn't have a GPU, and fail.

Thanks, Ken! We will look into using the GpuProcess tests.

It does look like issue 634877 was a false alarm, but aschulman@ also saw a problem with new laptops on the FYI waterfall where they were going down a software rendering path when they shouldn't have been. Aaron, do you have more details on that?
Yes, while investigating which CL caused a power efficiency improvement, we noticed that our Dell Windows High DPI bot was using the software rendering path. We then looked at the bot's stdio and saw that the bot is not using GPU rendering because it is using the default Microsoft GPU driver instead of the proper Intel GPU driver. I've attached a snippet of the log from this bot that shows it is using the default Microsoft driver, not the proper Intel driver.


Attachment: snippet.txt (2.9 KB)
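A check for the situation described above (the default Microsoft driver in use instead of the Intel one) could look roughly like the sketch below. It is only an illustration: the function name, its inputs, and the list of generic driver names are assumptions, and the actual log format in snippet.txt is not reproduced in this issue.

```python
# Names Windows typically shows when no vendor driver is installed.
# This list is an assumption for illustration, not taken from snippet.txt.
GENERIC_DRIVER_MARKERS = (
    "microsoft basic display adapter",
    "microsoft basic render driver",
)


def check_gpu_driver(driver_description, expected_vendor):
    """Return (ok, message) for a driver description string.

    Both arguments are hypothetical inputs: driver_description is whatever
    string the bot reports for its GPU, expected_vendor is e.g. "Intel".
    """
    text = driver_description.strip().lower()
    if not text:
        return False, "No GPU driver reported"
    for marker in GENERIC_DRIVER_MARKERS:
        if marker in text:
            return False, (
                "Generic Microsoft driver in use (%r); expected a %s driver"
                % (driver_description, expected_vendor)
            )
    if expected_vendor.lower() not in text:
        return False, (
            "Unexpected GPU driver %r; expected a %s driver"
            % (driver_description, expected_vendor)
        )
    return True, "GPU driver looks correct: %r" % driver_description


if __name__ == "__main__":
    ok, message = check_gpu_driver("Microsoft Basic Display Adapter", "Intel")
    print(ok, message)
```

Returning a message alongside the boolean is meant to give a failing step a clear explanation rather than a bare failure.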
Components: -Infra>Labs Infra>Puppet
Owner: friedman@chromium.org
Status: Assigned (was: Untriaged)
Elliott, can you take a look at this?
Blockedon: 636426
I think there are actually 3 things we want here:

1) Fix the machine missing drivers. Elliott did that in bug 636426.
2) Infra labs has a setup process so we know that when we get 9 more of these laptops, and then later when we upgrade them to Windows 11 or whatever, they'll get the right drivers.
3) There is a test on the perf waterfall that goes red or purple with clear explanation if the setup in 2) isn't done correctly. Ken has some helpful comments in #1 for that.

Elliott, are you looking into 2) or both 2) and 3)? If it's just 2), can you file a separate bug for that?

Comment 6 by fried...@google.com, Aug 11 2016

Cc: pschmidt@chromium.org
Hey Peter, what did you do to get this working, and what do you recommend we do with puppet to ensure it stays working?
Blockedon: 637965

Comment 8 by benhenry@google.com, Aug 15 2016

Owner: pschmidt@chromium.org
Peter to respond to comment 6 and assign back to Elliott.

Comment 9 by pschm...@google.com, Aug 15 2016

Owner: friedman@chromium.org
Peter responds by saying:

It was odd that Windows 10 did not have driver support for this.

I pulled over the Intel graphics driver (HD5500) from HP's web site and installed it.
Not sure if it supports a silent install mode. I expect downloading the driver directly from Intel would have worked too (you have to know ahead of time which GPU model is installed to make sure you get the correct one).

I guess you could possibly install the driver using Chocolatey? I would think that you'd have to maintain a specific list of nodes to install it on, as some of our slaves have multiple video chipsets installed, making active GPU detection difficult.

Side note on Swarming bots: we DO report the driver version being used, but as a state, not as a dimension:
https://chromium-swarm.appspot.com/restricted/bots?dimensions=os%3AWindows
build108-m1 has gpu: GeForce GT 610 | Version: 2.4.1.0 | Version: 9.18.13.4788 
build108-m4 has gpu: AMD Radeon Graphics Processor (0x6779) | Version: 14.501.1003.0 | Version: 2.4.1.0

Both have two versions because they have two graphics cards. There's a bug where the two names are not correctly reported.

My point is that this could be monitored at the fleet level via the bot's state. You could then auto-quarantine in bot_config.py.
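As a rough sketch of the idea above, the reported gpu strings could be parsed and compared against a set of known-good driver versions to produce a quarantine reason. The function names and the allowed-version set are hypothetical, and how the resulting string would be handed to Swarming from bot_config.py is not shown here.

```python
def parse_gpu_state(entry):
    """Split 'Name | Version: a | Version: b' into (name, [versions])."""
    parts = [part.strip() for part in entry.split("|")]
    name = parts[0]
    prefix = "Version:"
    versions = [
        part[len(prefix):].strip() for part in parts[1:] if part.startswith(prefix)
    ]
    return name, versions


def quarantine_reason(gpu_entry, allowed_versions):
    """Return a human-readable reason string, or None if the bot looks fine.

    allowed_versions is a hypothetical set of known-good driver versions.
    Every reported version must be in it, since a bot with two graphics
    cards reports two versions in one entry.
    """
    name, versions = parse_gpu_state(gpu_entry)
    if not versions:
        return "No GPU driver version reported for %s" % name
    unexpected = [v for v in versions if v not in allowed_versions]
    if unexpected:
        return "Unexpected GPU driver version(s) for %s: %s" % (
            name,
            ", ".join(unexpected),
        )
    return None


if __name__ == "__main__":
    entry = "GeForce GT 610 | Version: 2.4.1.0 | Version: 9.18.13.4788"
    print(parse_gpu_state(entry))
    print(quarantine_reason(entry, {"9.18.13.4788"}))
```

Requiring every reported version to be allowed is a design choice for this sketch; a looser rule (at least one match) might suit bots whose second card is unused.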
Cc: sergeybe...@chromium.org
Components: Infra>Monitoring
Adding Sergey to see if there's a way to bubble this sort of information up currently. Also, this might be something we add on to the Crimson project at some point.
I really don't think Puppet is the way to solve this. Puppet will enforce the change at any time, most likely during a test. This would be bad.

The tests should check that all of these are set up correctly and error out if not (like Swarming does). We can easily put the driver in the image that gets applied to all machines when they are provisioned, but it's harder to make sure someone didn't change the driver or disable it by logging in with RDP or something like that.
Versions can be scraped on the server by ts_mon and reported as string metrics. It would be infinitely better to do that on the bots directly, but as far as I know, we still don't run ts_mon on Swarming bots... Or do we? bpastene@ did some work on that but ran into lots of challenges. I don't remember how it ended.
Labels: need-labs-startup
I agree with Elliott that Puppet isn't relevant here; it should be done with Swarming's native quarantine functionality in:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/master/services/swarming/bot_config_public.py#864

ts_mon is not strictly needed, you can get the full list of quarantined bots at:
https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&c=gpu&c=hidpi&c=first_seen&f=status%3Aquarantined&f=pool%3AChrome-perf&l=100&s=id%3Aasc

https://trooper-o-matic.appspot.com or https://sheriff-o-matic.appspot.com/ should fetch the list via the API and alert based on that; or use ts_mon, I have no opinion.
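For anyone scripting against the bot list, the query string of the botlist link above can be rebuilt from its parts. This is a minimal sketch: the helper name and its parameters are hypothetical, and it only reconstructs the web UI URL quoted above, not any Swarming API call.

```python
from urllib.parse import urlencode

BOTLIST_URL = "https://chromium-swarm.appspot.com/botlist"


def quarantined_bots_url(pool, columns, limit=100):
    """Build a botlist URL filtered to quarantined bots in one pool.

    Mirrors the query parameters of the link quoted in this comment:
    c = column, f = filter, l = limit, s = sort order.
    """
    params = [("c", column) for column in columns]
    params += [("f", "status:quarantined"), ("f", "pool:" + pool)]
    params += [("l", limit), ("s", "id:asc")]
    return BOTLIST_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(
        quarantined_bots_url(
            "Chrome-perf",
            ["id", "os", "task", "status", "gpu", "hidpi", "first_seen"],
        )
    )
```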
Please don't use trooper-o-matic, it's deprecated.

If we need alerts, please use the standard monitoring pipelines and tools - they work really well these days, no need to create new custom pipelines.

Native swarming queries can be very useful for deep-dive analysis, and should be added to the trooper playbook under the appropriate alert, when we have one.
Cc: friedman@chromium.org
Components: -Infra>Puppet
Owner: ----
Status: Available (was: Assigned)
Labels: Pri-2
If this is really a Pri-1, find an owner and update the priority.

This is the result of a bulk edit that moved high priority available bugs to a lower priority in an attempt to be more honest with bug filers.

Comment 19 by efoo@chromium.org, Apr 12 2017

Components: -Infra>Monitoring
Labels: Ops-AddMonitoring
Removed Infra>Monitoring since this is a Perf specific alert update. Added "Ops-AddMonitoring" label to track adding alerts tasks to services. 
Should this issue be duped on issue 639096?

Comment 21 by sheriffbot@chromium.org, Nov 22

Labels: Hotlist-Recharge-Cold
Status: Untriaged (was: Available)
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue.

Sorry for the inconvenience if the bug really should have been left as Available.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
