Detect misconfigured GPUs on Windows Perfbots
Issue description
In bug 634877, and also on build85-b1 on the chromium.perf.fyi waterfall, we're seeing issues where something is misconfigured and the GPU is not being used. We'll need to fix these individually, but we should also add some kind of test to catch this if it happens again. It seems like a build step (similar to host_info for Android) would be the clearest way to catch this kind of problem. Ken, does the GPU team already have something we could use? Or do you have any advice on implementing something like this?
,
Aug 9 2016
Thanks, Ken! We will look into using the GpuProcess tests. It does look like issue 634877 was a false alarm, but aschulman@ also saw a problem with new laptops on the FYI waterfall where they were going down a software rendering path when they shouldn't have been. Aaron, do you have more details on that?
,
Aug 9 2016
Yes. While investigating which CL caused a power efficiency improvement, we noticed that our Dell Windows High DPI bot was using the software rendering path. We then looked at the bot's stdio and saw that it is not using GPU rendering because it is running the default Microsoft GPU driver instead of the proper Intel GPU driver. I've attached a snippet of the log from this bot that shows which driver is in use.
,
Aug 9 2016
Elliott, can you take a look at this?
,
Aug 11 2016
I think there are actually 3 things we want here:
1) Fix the machine missing drivers. Elliott did that in bug 636426.
2) Infra labs has a setup process so we know that when we get 9 more of these laptops, and then later when we upgrade them to Windows 11 or whatever, they'll get the right drivers.
3) There is a test on the perf waterfall that goes red or purple with a clear explanation if the setup in 2) isn't done correctly. Ken has some helpful comments in #1 for that.
Elliott, are you looking into 2) or both 2) and 3)? If it's just 2), can you file a separate bug for that?
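A minimal sketch of what the check in 3) could look like, assuming the build step is handed the list of display adapter names reported by the machine. The function names and the marker strings are illustrative assumptions, not an existing Chromium tool; how the adapter names are collected is left out.

```python
# Hypothetical build-step check: fail loudly when a bot is running on
# Windows' generic fallback display driver instead of a vendor driver.

# Assumed substrings that identify the generic Microsoft drivers.
GENERIC_DRIVER_MARKERS = ("microsoft basic display", "microsoft basic render")


def check_gpu_devices(device_names):
    """Return (ok, message) for a list of display adapter names."""
    if not device_names:
        return False, "No display adapter reported; cannot verify GPU setup."
    generic = [
        name for name in device_names
        if any(marker in name.lower() for marker in GENERIC_DRIVER_MARKERS)
    ]
    if generic:
        return False, (
            "Generic Microsoft driver in use (%s); install the vendor GPU driver."
            % ", ".join(generic)
        )
    return True, "GPU drivers look OK: %s" % ", ".join(device_names)


def step_exit_code(device_names):
    """Print the explanation and return a process exit code (0 = green)."""
    ok, message = check_gpu_devices(device_names)
    print(message)
    return 0 if ok else 1
```

A wrapper script would call `sys.exit(step_exit_code(names))` so the step turns red with the explanation in its stdio. Flagging any generic adapter is a deliberate simplification; machines with several video chipsets may need a per-bot allowlist.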
,
Aug 11 2016
Hey Peter, what did you do to get this working, and what do you recommend we do with puppet to ensure it stays working?
,
Aug 15 2016
,
Aug 15 2016
Peter to respond to comment 6 and assign back to Elliott.
,
Aug 15 2016
Peter responds by saying: It was odd that Windows 10 did not have driver support for this. I pulled over the Intel graphics drivers (HD5500) from HP's web site and installed them. Not sure if they support a silent install mode. I expect downloading drivers directly from Intel would have worked too (you have to know ahead of time what GPU model is installed to make sure you get the correct one). I guess you could possibly install the driver using Chocolatey. I would think that you'd have to maintain a specific node list to install it on, as some of our slaves have multiple video chipsets installed, making active GPU detection difficult.
,
Aug 18 2016
Side note on Swarming bots: we DO report the driver version being used, but as state, not as a dimension: https://chromium-swarm.appspot.com/restricted/bots?dimensions=os%3AWindows
build108-m1 has gpu: GeForce GT 610 | Version: 2.4.1.0 | Version: 9.18.13.4788
build108-m4 has gpu: AMD Radeon Graphics Processor (0x6779) | Version: 14.501.1003.0 | Version: 2.4.1.0
Both have two versions because they have two graphics cards. There's a bug where the two names are not correctly reported. My point is that this could be monitored at the fleet level via the bot's state. You could then auto-quarantine in bot_config.py.
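A rough sketch of the auto-quarantine idea, assuming the gpu state is available as strings in the "Name | Version: x | Version: y" form quoted above. The function names and the `allowed_versions` policy table are hypothetical, and wiring the result into bot_config.py is not shown because the thread does not describe the hooks.

```python
# Hypothetical helper: decide whether a bot should be quarantined based on
# the GPU strings it reports in its state.


def parse_gpu_entry(entry):
    """Split 'Name | Version: a | Version: b' into (name, [versions])."""
    parts = [part.strip() for part in entry.split("|")]
    name = parts[0]
    versions = [
        part[len("Version:"):].strip()
        for part in parts[1:]
        if part.startswith("Version:")
    ]
    return name, versions


def quarantine_reason(gpu_entries, allowed_versions):
    """Return a human-readable reason to quarantine, or None if all is well.

    allowed_versions maps a GPU name to the set of acceptable driver versions.
    """
    if not gpu_entries:
        return "no GPU reported in bot state"
    for entry in gpu_entries:
        name, versions = parse_gpu_entry(entry)
        allowed = allowed_versions.get(name)
        if allowed is None:
            continue  # No policy for this GPU; do not quarantine on it.
        if not any(version in allowed for version in versions):
            return "unexpected driver for %s: %s" % (
                name, ", ".join(versions) or "none")
    return None
```

Because an entry can carry the versions of two cards under one name, the check accepts the bot if any reported version is allowed rather than requiring all of them to match.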
,
Aug 18 2016
Adding Sergey to see if there's a way to bubble this sort of information up currently. Also, this might be something we add-on to the Crimson project at some point.
,
Nov 1 2016
I really don't think Puppet is the way to solve this. Puppet will enforce the change at any time, most likely during a test. This would be bad. The tests should check that all of these are correctly set up and error out if not (like Swarming). We can easily put the driver in the image that gets applied to all machines when they are provisioned, but it's harder to make sure someone didn't change the driver or disable it by logging in with RDP or something like that.
,
Nov 1 2016
Versions can be scraped on the server by ts_mon and reported as string metrics. It would be infinitely better to do that on the bots directly, but as far as I know, we still don't run ts_mon on Swarming bots... Or do we? bpastene@ did some work on that but ran into lots of challenges. I don't remember how it ended.
,
Nov 9 2016
,
Nov 16 2016
I agree with Elliott that Puppet isn't relevant here. It should be done with Swarming's native quarantine functionality in:
https://chrome-internal.googlesource.com/infra/infra_internal.git/+/master/services/swarming/bot_config_public.py#864
ts_mon is not strictly needed; you can get the full list of quarantined bots at:
https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&c=gpu&c=hidpi&c=first_seen&f=status%3Aquarantined&f=pool%3AChrome-perf&l=100&s=id%3Aasc
https://trooper-o-matic.appspot.com or https://sheriff-o-matic.appspot.com/ should fetch the list via the API and alert based on that; or use ts_mon, I have no opinion.
,
Nov 16 2016
Please don't use trooper-o-matic, it's deprecated. If we need alerts, please use the standard monitoring pipelines and tools - they work really well these days, no need to create new custom pipelines. Native swarming queries can be very useful for deep-dive analysis, and should be added to the trooper playbook under the appropriate alert, when we have one.
,
Nov 17 2016
,
Jan 18 2017
If this is really a Pri-1, find an owner and update the priority. This is the result of a bulk edit that moved high priority available bugs to a lower priority in an attempt to be more honest with bug filers.
,
Apr 12 2017
Removed Infra>Monitoring since this is a Perf specific alert update. Added "Ops-AddMonitoring" label to track adding alerts tasks to services.
,
Nov 21 2017
Should this issue be duped on issue 639096?
,
Nov 22
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by kbr@chromium.org
, Aug 9 2016