Extremely high CPU temps on Pixel 2013 with NO fan activity.
Reported by
scottt...@gmail.com,
Feb 3 2018
Issue description
Chrome Version : 63.0.3239.140
OS Version: 10032.86.0
What steps will reproduce the problem?
1. Start the Pixel 2013 on battery power.
2. Run a WebGL demo or other CPU-stressing activity.
3. Watch the temps using a tool like COG.
4. The temps will rise into the 90-100C range with NO fan activity!
What is the expected result?
Fans should spool up to high speed attempting to regulate CPU temp
What happens instead of that?
The fans do not spool up and system temps rise to critical levels.
Please provide any additional information below. Attach a screenshot if
possible.
Running stable build
Version 63.0.3239.140 (Official Build) (64-bit)
This Pixel 2013 is like new in every way and has always run the stable build. Everything works perfectly, and it regulated its temperature well until sometime in the last month or two of OS updates. I haven't been using it as much recently and only this last weekend noticed the heat issue.
The fans work fine on this unit. The fans will spool up quickly and spin down smoothly often on boot if the system is already warm. Sometimes connecting the charger seems to wake the fans up and then they thermally regulate as normal with CPU temps. Other times, there is no fan activity. When running, they smoothly operate and vary rpm according to temperature.
I'm seeing things like this in the system log snip below. It seems really odd that temp_metrics would be setting the fan to 0 at the same time the CPU is going critical! Also, I can have a log full of temp metrics trying to set the rpm to 3000 all the while there is no fan activity.
2018-02-02T22:35:41.061835-05:00 NOTICE temp_metrics[3432]: Setting fan RPM (temps: 1:28:7:27:9:66:): 10 -> 0
2018-02-02T22:35:41.070891-05:00 NOTICE temp_metrics[3443]: Throttling (temps: 1:28:7:27:9:66:): 1801000 800000 1150 0 0x180aa00dd8088 # no throttling
2018-02-02T22:35:41.446680-05:00 CRIT kernel: [ 16.177727] CPU0: Package power limit notification (total events = 1)
2018-02-02T22:35:41.446703-05:00 CRIT kernel: [ 16.177730] CPU3: Package power limit notification (total events = 1)
2018-02-02T22:35:41.446706-05:00 CRIT kernel: [ 16.177732] CPU2: Package power limit notification (total events = 1)
2018-02-02T22:35:41.446721-05:00 CRIT kernel: [ 16.177737] CPU1: Package power limit notification (total events = 1)
2018-02-02T22:35:41.457666-05:00 INFO kernel: [ 16.188649] CPU1: Package power limit normal
2018-02-02T22:35:41.457681-05:00 INFO kernel: [ 16.188651] CPU0: Package power limit normal
2018-02-02T22:35:41.457683-05:00 INFO kernel: [ 16.188691] CPU3: Package power limit normal
2018-02-02T22:35:41.457691-05:00 INFO kernel: [ 16.188692] CPU2: Package power limit normal
2018-02-02T22:35:47.302744-05:00 INFO kernel: [ 21.601447] ca0132 DOWNLOAD OK :-) DSP IS RUNNING.
2018-02-02T22:35:53.897784-05:00 INFO kernel: [ 28.194199] tpm_tis tpm_tis: command 0x65 (size 22) returned code 0x0
2018-02-02T22:37:11.949797-05:00 NOTICE temp_metrics[4989]: Setting fan RPM (temps: 1:33:7:34:9:67:): 0 -> 3000
UserAgentString: Mozilla/5.0 (X11; CrOS x86_64 10032.86.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.140 Safari/537.36
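The "Setting fan RPM" lines in the log above can be picked apart mechanically when scanning a longer log. The following is a minimal sketch, not part of temp_metrics. The function name parse_fan_line is hypothetical, and treating the temps field as alternating sensor-id:reading pairs is an assumption based only on how the quoted lines look.

```python
import re

# Matches the "Setting fan RPM" lines quoted in the log above.
FAN_LINE = re.compile(
    r"temp_metrics\[\d+\]: Setting fan RPM \(temps: (?P<temps>[\d:]+)\): "
    r"(?P<old>\d+) -> (?P<new>\d+)"
)


def parse_fan_line(line):
    """Return (temps, old_rpm, new_rpm), or None if the line does not match."""
    m = FAN_LINE.search(line)
    if m is None:
        return None
    fields = [int(x) for x in m.group("temps").split(":") if x]
    # Assumption: the fields alternate between a sensor id and its reading.
    temps = dict(zip(fields[0::2], fields[1::2]))
    return temps, int(m.group("old")), int(m.group("new"))


line = ("2018-02-02T22:35:41.061835-05:00 NOTICE temp_metrics[3432]: "
        "Setting fan RPM (temps: 1:28:7:27:9:66:): 10 -> 0")
print(parse_fan_line(line))
```

Filtering a log for entries whose new RPM is 0 while one of the readings is high would make reports like the one above easier to spot.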
,
Feb 3 2018
To rule out anything else, I did a full USB recovery of the Pixel this morning. Again, the same issue.
- After the recovery, once the system had time to download and update any extensions, I listened and noted no fan activity.
- I ran a WebGL demo again and, according to COG, the heat was rising into the 90's.
- The system notified me that it needed to restart for a Flash update. I clicked restart and again listened for fans. There was no fan activity.
- At that point I shut down the Pixel, gave it about 5 seconds, and powered it back up. The fans spun into life and continued regulating from there.

They are running normally in the background as I type this. I am running the Pixel Shredder demo (link in OP) and the fans are audibly varying speed. The system climbed into the mid 80's (C) for a while (which is still rather hot), at which point they spun up and slowly cooled the CPU back down into the mid 60's. The fans spun back down at that point but continued to run quietly.

My Pixel (in Florida) usually runs the fans at a low RPM in normal conditions (ambient 75 F). The Pixel often starts up from a full power down with a quick spool up/down of the fans. When the fans decrease to inaudible after that, it appears they often do not come back up even when they should. If, however, the system is warmer and the fans remain running, then thermal regulation appears to work and the fans vary speed according to CPU temps.

My fans are clearly not faulty. It appears something may be killing the thermal monitoring process, or that process has a bug in it and simply thinks it is setting the fan speed. If that is happening, it's serious business, not just because it will damage Pixel Chromebooks but because a CPU at 100 C should not be anywhere near a Lithium Ion battery. Again, up until just recently this Pixel always ran the fans normally.

Please look into this. I will provide any logs or run any tests you wish.
Thanks,
Scott
,
Feb 3 2018
I would assume that, as with most modern architectures, there is a feedback system allowing ChromeOS to monitor the actual fan RPM vs the requested speed. Without this most modern computer architectures would be at risk of thermal damage. If so, this can't be working properly or the system should throw errors indicating a cooling system failure.
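As an illustration of the kind of feedback check described above, here is a minimal sketch that compares a requested RPM with a measured one. The function name, the 25% tolerance, and the idea that ChromeOS performs such a check at all are assumptions made for illustration; nothing in this thread confirms how (or whether) the system does this.

```python
def fan_looks_stalled(target_rpm, measured_rpm, tolerance=0.25):
    """Return True when a nonzero target was requested but the
    measured speed is well below it.

    tolerance is a hypothetical fraction of the target that the
    measured speed may fall short by before the fan is flagged.
    """
    if target_rpm <= 0:
        # A fan that was asked to stop cannot be "stalled".
        return False
    return measured_rpm < target_rpm * (1.0 - tolerance)


# The situation reported above: 3000 rpm requested, fan not turning.
print(fan_looks_stalled(3000, 0))
```

A real check would also need to allow the fan some spin-up time before flagging a failure.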
,
Feb 5 2018
Ruben, can you please take a look?
,
Feb 6 2018
From taking a first look at this:
- temp_metrics doesn't seem to crash or run into errors.
- After "fixing" the fan control (as outlined below), the WebGL aquarium demo triggers the expected up and down regulation of the fan speed.

At first, when I had just flashed the image, I also saw suspiciously low fan activity even though the CPU was being throttled by temp_metrics. This went away in the course of testing. However, I can reproduce the bad fan behavior by cold resetting the EC. Namely, after a cold reset, any request to set the fan rpm doesn't go through, and the rpm stays at 0. If I call ectool fanduty [0-100] once, then things start to work again. Maybe there are other ways to get the fan control running again (as the report outlines, maybe some charger action).

Unfortunately, I haven't been able to retrieve a servo-v1 connector here to get EC console access; maybe you guys can retrieve one? Mengqi, if you still have the link that you got, or that I left, could you take a look? Flash R63-10032.86.0 and check whether there are issues inside the EC with fan control, or with fan speed setting and retrieval. Namely, after an EC cold_reset, type "ectool pwmsetfanrpm 4000" in the AP console and note whether there are errors. What does the EC side say?

I don't think the issue is on the temp_metrics or EC image side of things, since neither of those has been touched in a while. I'll look more on my end to see if I can pinpoint some changes in ectool that might have introduced this issue.
,
Feb 6 2018
Another data point: there have been some changes around the number of fans in ectool over the last ~year, but even when the fan is unresponsive, I still get
$ ectool pwmgetnumfans
Number of fans = 1
as output.
,
Feb 6 2018
Thank you all for looking into this and nice job coconutruben on reproducing it. I knew there was something abnormal going on. This is the kind of problem only us engineers and testers ever notice so I didn't expect normal users to have picked up on it. Accordingly I found no references to it on Google+ or other forums. If I can be of any help with tests of my system just let me know. I also have an earlier HP 14" Chromebook with fans that I can test on. Scott
,
Feb 6 2018
On testing w/in google you could consider an idle lab resource as well if local unit unavailable. Obviously there's nothing like being there though to feel the fan ;)
h=$(atest host list -b link | grep False | grep pool:suites | cut -d' ' -f1 | head -1)
atest host mod -l -r 'debug crbug.com/808764 ' $h
#chroot
dut-control ${h}-servo.cros cold_reset:on sleep:1 cold_reset:off
Tried briefly and saw this,
# on dut
date ; ectool console | tail -2 ; date; ectool pwmsetfanrpm 4000 ; date ; ectool console | tail -40
Tue Feb 6 08:27:08 PST 2018
ioctl -1, errno 74 (Bad message), EC result 1 (INVALID_COMMAND)
[967.597563 HC 0x97]
Tue Feb 6 08:27:08 PST 2018
ioctl -1, errno 74 (Bad message), EC result 1 (INVALID_COMMAND)
ioctl -1, errno 74 (Bad message), EC result 6 (INVALID_VERSION)
Fan target RPM set for all fans.
Tue Feb 6 08:27:08 PST 2018
ioctl -1, errno 74 (Bad message), EC result 1 (INVALID_COMMAND)
[967.610476 HC 0x98]
[967.612118 HC 0x98]
[967.613715 HC 0x98]
[967.615156 HC 0x98]
[967.616697 HC 0x98]
[967.618230 HC 0x98]
[967.619753 HC 0x98]
[967.621303 HC 0x98]
[967.622810 HC 0x98]
[967.624332 HC 0x98]
[967.625837 HC 0x98]
[967.627450 HC 0x98]
[967.628949 HC 0x98]
[967.630485 HC 0x98]
[967.631916 HC 0x98]
[967.633442 HC 0x98]
[967.634974 HC 0x98]
[967.636509 HC 0x98]
[967.638053 HC 0x98]
[967.639581 HC 0x98]
[967.641189 HC 0x98]
[967.642624 HC 0x98]
[967.644188 HC 0x98]
[967.645781 HC 0x98]
[967.647296 HC 0x98]
[967.648527 HC 0x98]
[967.653822 HC 0x02]
[967.654532 HC 0x01]
[967.654972 HC 0x0b]
[967.655125 HC err 1]
[967.655571 HC 0x08]
[967.655724 HC err 6]
[967.656125 HC 0x08]
*[967.656477 HC 0x21]
[967.661807 HC 0x02]
[967.662457 HC 0x01]
[967.662940 HC 0x0b]
[967.663094 HC err 1]
[967.663526 HC 0x97]
It does seem like 'set' completes successfully ('*' above)
#define EC_CMD_PWM_SET_FAN_TARGET_RPM 0x21
but subsequent 'get' always says '0'
ectool pwmgetfanrpm
Current fan RPM: 0
Above was for
CHROMEOS_RELEASE_DESCRIPTION=10323.12.0 (Official Build) dev-channel link test
ectool version
ioctl -1, errno 74 (Bad message), EC result 1 (INVALID_COMMAND)
RO version: link_v1.2.145-352afa8
RW version: link_v1.2.145-352afa8
Firmware copy: RW
Build info: link_v1.2.145-352afa8 2015-11-18 13:00:18 @build169-m2
crossystem fwid
Google_Link.2695.1.169
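To read an EC console trace like the one above without eyeballing it, the host-command entries can be paired with any error that follows them. This is a rough sketch only. The function name is hypothetical, and the rule that an "HC err N" entry belongs to the command logged immediately before it is an assumption inferred from the trace, not something confirmed in this thread.

```python
import re

# One console entry: either "HC 0x21" (a host command) or "HC err 1".
HC_ENTRY = re.compile(r"\[[\d.]+ HC (?:err (?P<err>\d+)|(?P<cmd>0x[0-9a-f]+))\]")


def host_command_results(console_text):
    """Return [(command, error_or_None), ...] in console order."""
    results = []
    for m in HC_ENTRY.finditer(console_text):
        if m.group("cmd") is not None:
            results.append([int(m.group("cmd"), 16), None])
        elif results:
            # Assumption: an error entry refers to the preceding command.
            results[-1][1] = int(m.group("err"))
    return [tuple(r) for r in results]


sample = (
    "[967.654972 HC 0x0b] [967.655125 HC err 1] [967.655571 HC 0x08] "
    "[967.655724 HC err 6] [967.656125 HC 0x08] *[967.656477 HC 0x21] "
    "[967.661807 HC 0x02]"
)
for cmd, err in host_command_results(sample):
    print(hex(cmd), "ok" if err is None else "err %d" % err)
```

Under that assumption, command 0x21 (EC_CMD_PWM_SET_FAN_TARGET_RPM in the comment above) shows no error entry after it, which matches the observation that 'set' appears to complete.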
,
Feb 7 2018
I think the fan is not enabled, and triggering fanduty explicitly enables the fan. Alternatively, suspending the device and resuming it also makes the controls work again, since chipset_resume enables the fan, so that makes me believe this theory more. What makes me believe it less is that this should've shown up much earlier, unless something else was masking the behavior.

I made an EC image that inserted a pwm_enable_fan(1) call inside set_target_rpm, and I stopped being able to reproduce the issue. More of my thoughts on why this might or might not be the issue are at the end.
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#113

Given that an EC image update isn't really feasible, I'd propose a pre-script to temp_metrics that does one call to "ectool fanduty 0" to make sure the fan is enabled. If that's acceptable/good I'll push a CL for this after testing it out.

- - - - - -
More thoughts.
So from what I can tell, the code path to set the fan rpm never enables the fan.
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/common/pwm_commands.c#33
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#104
Namely, set_rpm_mode() doesn't have a code path that leaves the fan enabled.
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#75
Both command_fanset (the equivalent, but for the EC console) and fanduty explicitly enable the fan.
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#150
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#236
Resuming the chipset does enable the fan.
https://chromium.googlesource.com/chromiumos/platform/ec/+/firmware-link-2695.B/chip/lm4/pwm.c#402
,
Feb 7 2018
> Given that an ec image update isn't really feasible,
Well, once you have the EC patch, it's definitely feasible unless there is a strong dependency on the RO firmware.
That said, I'm not sure I understand your explanation of the cause of the bug. I might have forgotten some details of this platform, but on an x86 like this, HOOK_CHIPSET_RESUME is normally *also* called at power-up, i.e. the CPU state goes G3->S5->S3->S0 when you boot the machine, and pwm_resume() should run pwm_enable_fan(1) at that time. So what exactly is not happening?
I also see a mechanism to preserve the fan state across the sysjump of the software sync, but tricky stuff might happen there.
Is your issue happening with soft-sync enabled, disabled, or both?
What are your RO and RW EC versions?
,
Feb 8 2018
I think you're right, at least the sysjump might have something to do with this. At least, I'm seeing the following behavior:
<boot>
$ ectool fanduty 0
$ ectool pwmsetfanrpm 8000
<hear fans>
$ ectool reboot_ec RO
<fans stop>
$ ectool pwmsetfanrpm 8000
<nothing happens>
$ ectool fanduty 0
$ ectool pwmsetfanrpm 8000
<hear fans again>
I should note that if I do a reboot_ec RO if I'm already in RO, then the fans don't stop. Do we ignore a jump if it's to the same code?
My comment about the EC patch is that I thought it has to go through some qual and validation before we decide to push out a new EC image.
I'm seeing this with soft-sync enabled, and
RO version: link_v1.2.145-352afa8
RW version: link_v1.2.145-352afa8
same as Todd mentioned above. But when I got the device, and it had the normal RO that's shipped, I was also seeing this behavior. //didn't note down the version
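The sequence above can be restated as a tiny state model, which makes the hypothesis explicit: the RPM path only records a target, fanduty switches the fan on, and a jump to a different image switches it off. This is a toy sketch of the reported behavior, not the EC firmware; the class and method names are hypothetical stand-ins for the ectool commands.

```python
class ToyFan:
    """Toy model of the behavior reported in this thread."""

    def __init__(self):
        self.enabled = False
        self.target_rpm = 0

    def fanduty(self, percent):
        # Reported: fanduty explicitly enables the fan.
        # The duty value itself is not modelled here.
        self.enabled = True

    def pwmsetfanrpm(self, rpm):
        # Reported: this path stores the target but never enables the fan.
        self.target_rpm = rpm

    def sysjump(self):
        # Observed: the fan stops after a jump to a different image.
        self.enabled = False

    def spinning(self):
        return self.enabled and self.target_rpm > 0


fan = ToyFan()
fan.fanduty(0)
fan.pwmsetfanrpm(8000)
print("after boot + fanduty + set rpm:", fan.spinning())
fan.sysjump()
fan.pwmsetfanrpm(8000)
print("after sysjump + set rpm:", fan.spinning())
fan.fanduty(0)
fan.pwmsetfanrpm(8000)
print("after fanduty + set rpm:", fan.spinning())
```

The model reproduces the three observations in order (fans audible, nothing happens, fans audible again), which is why a single fanduty call is a plausible workaround if the hypothesis holds.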
,
Feb 8 2018
Thanks for the useful tests.
> $ ectool reboot_ec RO
> <fans stop>
It's not terribly good, but on a real system, we never do such a thing (the only sequence I know triggering a sysjump to RO is the full EC re-flashing for dogfood machines).
And actually it's impossible to do it on a write-protected machine; you would get EC_ERROR_ACCESS_DENIED.
> I should note that if I do a reboot_ec RO if I'm already in RO,
> then the fans don't stop.
> Do we ignore a jump if it's to the same code?
Yes, we do. This is the code in system_run_image_copy():
/* If system is already running the requested image, done */
if (system_get_image_copy() == copy)
return EC_SUCCESS;
> I'm seeing this with soft-sync enabled, and
> RO version: link_v1.2.145-352afa8
> RW version: link_v1.2.145-352afa8
Interesting,
At this point, you probably want to do the low-tech debugging :
just put a big fat trace in the 3 spots of chip/lm4/pwm.c accessing the LM4_FAN_FANCTL
and boot with soft-sync enabled (which requires a RW firmware with the proper RW EC image inside) and the servo connected.
,
Feb 12 2018
I was doing some more work on the Pixel this weekend. It certainly seems there is a strong correlation between having it plugged-in when booting and the fan working properly.
,
Feb 27 2018
Still experiencing this regularly. Just tonight the Pixel is very hard to use safely. I had to shut down any web pages with active content just to get the thing to cool down to the high 60's. I connected the charger and went through several power cycles hoping the fan would stay on. Even with core temps in the 90's, the fan starts up at boot running at a rather high rpm and only runs until shortly after booting. It then shuts down to 0 rpm with the temps climbing into the 80's and above. No fan activity after that initial shutdown. Finally, sometime after the third power cycle, the fan is now on and regulating temperature.
,
Mar 2 2018
Sorry I haven't looked into this for a bit - I was OOO. Anyhow, I got the right cables, and from what I can tell by using the fanset and faninfo commands, the following happens:
- when the device is in RO, the fan is enabled, and the EC jumps to RW, the fan gets disabled
- when the device is in RW, the fan is enabled, and the EC jumps to RO, the fan gets disabled.

I'll try to trace this a little further, but it seems to be what you were suggesting, Vincent: that something undesired happens during the RO->RW jump.

@Scott: could you try something out on your device? Namely, when you notice that the fan isn't kicking in, suspend and resume the device (closing the lid for a few seconds should do it), and then note whether the fan kicks in.
,
Mar 2 2018
I figured it was something like vacation so I tried to avoid being obnoxious. Hope it was a good one. I will run that test a few times this evening and post the results.
,
Mar 3 2018
Ran several tests. No power cycles, just suspended by folding the screen closed. All tests performed on battery power.
- Started Pixel
- Loaded up heavy WebGL demo allowing temps to climb into the high 80's
- No fan activity
- Shut down WebGL demo
- Suspend, wait several seconds
- On wake, fan spun up to high rpm, immediately settling down but not shutting off
- Shortly after login, the fan seemed to spin back down to zero
- Started WebGL demo again
- temps climbing to low 90's
- no fan activity
- Left WebGL demo running
- Suspend, wait several seconds
- On wake, light fan activity, quickly spun down to zero
- Stopped WebGL demo
- Started typing this.
- Suspend, wait several seconds
- On wake, high rpm fan for a few seconds then back to zero even though temps remain in the mid 70's.
- continued typing
- after a couple minutes it appeared the fan restarted and began running at a low rpm.
- Started WebGL demo again and watched temps climb into the 90's. The fan remained at the low rpm as though it got no further speed updates.
- Shut down WebGL demo to let system cool
- A couple minutes after shutting down demo, the fan spun up a bit again.

It's hard to find any consistent patterns in this. It's as though the fan only gets sporadic updates, remaining at whatever the last requested rpm was. If it was at zero, then it stays there for quite some time even if system temps climb rapidly into dangerous territory. If it's at a low rpm, it does the same. Then at some random time later, it may change rpm again. Right now it's running at a bit higher rpm and finally cooled the CPU down to the 60's.

If you want me to run more specific tests with or without the supply, just let me know.
,
Mar 3 2018
I almost convinced myself that some of what I was observing was just the hysteresis built into the fan control logic but there is no way the fan should ever remain off or at a low rpm when temps are in the 90's. At that point the fans should be cranked up to high speed.
,
Mar 3 2018
A few more use experiments today. I started it up and worked on a few things for about 10 minutes. Nothing with heavy CPU loading. No fan activity, but that wasn't really expected. I cranked up a WebGL demo and let it go. The temps continued to rise till it hit 101! Ouch... that's out of spec for the part. I think I actually saw the chip throttle itself at that point as the animation glitched. Needless to say I didn't let it continue to run at that temp. I shut down the demo and suspended the machine. After waiting about 10 seconds I woke it back up. The fan immediately started, but at a rather low rpm. I logged in and started up the WebGL demo. Again, the temps rose into the upper 90's, only this time the fan continued running at that low speed, having little effect on the overall core temps. It seems like shortly after boot or wake, the fan gets to a point where no further rpm updates are received. Yet on other occasions, something seems to start the fan monitoring back up again out of the blue.
,
Mar 4 2018
I should note those last tests were battery only as well.
,
Mar 5 2018
scottt492@gmail.com, thanks for doing these experiments. Ruben, would it be useful to get /var/log/messages from scottt492@gmail.com? That may help shed light on whether the issue is in reacting to changes in temperature (if the script has crashed, for example) vs trying to set the fan speed.
,
Mar 5 2018
No problem. Just give me a scenario you want tested and I'll capture the logs.
,
Mar 5 2018
Indeed, thanks scottt492@gmail.com :) Providing the logs would be great. In general, logs from any of the tests that you mentioned would be useful, especially if you can reproduce the scenario where the fans don't kick in at first, and after suspending they do kick in, but at low rpm. I think that there is an issue with fan activation, which shows itself by the fans never kicking in in some cases. I think there are two ways we can solve that issue, so I'll see if I can provide CLs for that today/tomorrow. However, that would not explain the fans not speeding up to a higher RPM, especially if we wait 2-3 minutes at a higher load. So I'm still trying to see what's going on there.
,
Mar 5 2018
I will see what I can provide. It's a hard problem to characterize, much less reproduce. 2-3 minutes at high loads, like those WebGL demos, is not something I want to entertain. I'm getting worried about the thermal stresses I am subjecting the system to. Most gamers would freak if their systems hit 90 C. I think we agree that, regardless of the hysteresis employed to avoid constantly changing fan speeds, there is a critical threshold at which the fan should be aggressively trying to reduce system temperatures. I think 80 plus Centigrade should easily meet that requirement. As it turns out, the absolute maximum temperature rating for the mobile i5 is 105. I haven't hit that yet and I don't really want to. These Pixel Chromebooks are sort of expensive. :-)
,
Mar 7 2018
I emailed a somewhat long log to coconutruben. It seemed a little large to post here but I can do that if you like.

I captured the problem this morning. I was working on the Pixel again, just doing low-load stuff that doesn't usually bring up the fan. I cranked up a web demo and let the temps rise to the low 90's - no fan activity.
https://experiments.withgoogle.com/chrome/the-polygon-shredder

I suspended the Pixel with the demo still running. I opened the lid and logged in with the demo still running and the fan never started up... Yikes! I suspended the Pixel again with the demo running. I opened the lid and the fan started. I logged in, and by that time the fan had gone back to a very low, almost inaudible speed. With the demo still running, the temps were climbing back into the 90's again with no speed changes for the fan. It just kept running at the low rpm, having little effect on the high temperatures. I emailed the log covering that time period.

Now... as I sit here typing this, the fan has come back of its own accord, of course. I will email that log as well.
,
Mar 7 2018
I emailed the logs for the low-load period where the fan started again. It remained at a low rpm so I decided to hit it with the demo again. The temps climbed into the 90's and the fan eventually changed its speed from a very low speed to a slightly higher speed, nowhere near what is necessary to deal with the high CPU temps. I finally shut the demo down to let things cool. Here is a short log of that activity.
2018-03-07T07:05:51.071582-05:00 NOTICE temp_metrics[15803]: Throttling (temps: 1:41:7:39:9:69:): 1800000 800000 900 0 0x180aa00dd8068 # disable turbo
2018-03-07T07:09:01.894171-05:00 NOTICE temp_metrics[16586]: Setting fan RPM (temps: 1:39:7:38:9:62:): 5500 -> 4000
2018-03-07T07:09:01.901649-05:00 NOTICE temp_metrics[16593]: Throttling (temps: 1:39:7:38:9:62:): 1801000 800000 1150 0 0x180aa00dd8068 # cap pkg to 13W
2018-03-07T07:09:01.922338-05:00 CRIT kernel: [ 1966.154975] CPU2: Package power limit notification (total events = 5566)
2018-03-07T07:09:01.922352-05:00 CRIT kernel: [ 1966.154976] CPU3: Package power limit notification (total events = 5566)
2018-03-07T07:09:01.922353-05:00 CRIT kernel: [ 1966.155003] CPU1: Package power limit notification (total events = 5566)
2018-03-07T07:09:01.922355-05:00 CRIT kernel: [ 1966.155005] CPU0: Package power limit notification (total events = 5566)
2018-03-07T07:09:01.933338-05:00 INFO kernel: [ 1966.166012] CPU3: Package power limit normal
2018-03-07T07:09:01.933353-05:00 INFO kernel: [ 1966.166014] CPU2: Package power limit normal
2018-03-07T07:09:01.933356-05:00 INFO kernel: [ 1966.166025] CPU0: Package power limit normal
2018-03-07T07:09:01.933365-05:00 INFO kernel: [ 1966.166026] CPU1: Package power limit normal
2018-03-07T07:09:11.994988-05:00 NOTICE temp_metrics[16676]: Throttling (temps: 1:39:7:38:9:66:): 1801000 800000 1150 0 0x180aa00dd8070 # cap pkg to 14W
2018-03-07T07:10:22.296433-05:00 NOTICE temp_metrics[16977]: Throttling (temps: 1:39:7:39:9:79:): 1801000 800000 1150 0 0x180aa00dd8078 # cap pkg to 15W
2018-03-07T07:10:32.328875-05:00 NOTICE temp_metrics[17042]: Throttling (temps: 1:39:7:38:9:80:): 1801000 800000 1150 0 0x180aa00dd8080 # cap pkg to 16W
2018-03-07T07:11:02.487233-05:00 NOTICE temp_metrics[17201]: Setting fan RPM (temps: 1:41:7:39:9:86:): 4000 -> 5500
2018-03-07T07:11:02.493551-05:00 NOTICE temp_metrics[17208]: Throttling (temps: 1:41:7:39:9:86:): 1801000 800000 1150 0 0x180aa00dd8078 # cap pkg to 15W
,
Mar 7 2018
In that last little segment I can tell you the fan was nowhere near 4000 to 5500 rpm. Again, all of this morning's tests were battery only.
,
Mar 14 2018
Did you get those email logs? Let me know if you need more or more specific tests.
,
Mar 15 2018
Thank you scottt492@gmail.com for the detailed logs, and for the pointer :) I have uploaded two CLs, crrev.com/c/964069 and crrev.com/c/964037, that I think can address part of this issue. When testing this I found that temp_metrics seems stable and doesn't crash. Your logs also show no crashes in temp_metrics. I do believe that there's an issue with the fan being disabled in some cases. I also found fan speeds in the 3000-4000 range to be almost impossible to hear, but that might just be the office environment here.
Now, the speed at which the device cools down, or the fan speed itself, might be another issue, mainly because
a) the fans react to skin temperature, not core temperature, and only every 10s, so there is an expected delay there;
b) calibration might not have been done with that intensive a workload in mind.
,
Mar 15 2018
Hopefully the simple change to temp_metrics.conf will be sufficient. Still though, that would not seem to cover the scenario I have observed where the fan seems to remain at a set rpm with no updates for long periods of time, even with CPU loads that would warrant speed increases. In that case, the fan is enabled but fails to get updates.

I assume by "skin" you mean the fan control system is reacting to the CPU package temp instead of core temps. There shouldn't be much thermal lag between the two, so I wouldn't think that would be part of the problem.

I would like to suggest something though. I understand the, probably arbitrary, 10 second period is really there to avoid having the fan constantly change speed, which would annoy the user. However, using that same time constant for both speed increases and decreases isn't really optimal, if that is currently the case. Fan speed increases are the critical action, whereas speed decreases are non-critical and done only to reduce noise and power usage when the thermal load doesn't require the extra cooling. Having the fan react more quickly to temperature increases and applying the fixed 10 second delay to temperature decreases would result in similar fan behavior from a user perspective. Rapid up-down fluctuations in fan speed would still be avoided, but the fan would cool the CPU more rapidly, avoiding the heat build-up that would then take longer to dissipate. Even non-technical users recognize when their lap gets warm. :-)

I'm sure there is some optimal algorithm that would scale the reaction time for fan speed increases with CPU package temperature, but I doubt that's necessary. Probably just applying something simple like a 3 second delay on increases and the standard 10 second delay on decreases would accomplish much the same. This could improve overall fan power usage to some small degree.
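The asymmetric delay suggested above could look roughly like the following. This is a sketch of the suggestion only, not how temp_metrics works. The class name and the "minimum time since the last change" interpretation are assumptions; the 3 second and 10 second figures are the ones proposed in the comment.

```python
class AsymmetricFanTimer:
    """Apply fan speed increases sooner than decreases (hypothetical)."""

    def __init__(self, up_delay_s=3.0, down_delay_s=10.0):
        self.up_delay_s = up_delay_s
        self.down_delay_s = down_delay_s
        self.current_rpm = 0
        self.last_change_s = None  # time of the last applied change

    def update(self, now_s, desired_rpm):
        """Return the RPM to apply at time now_s."""
        if desired_rpm == self.current_rpm:
            return self.current_rpm
        # Increases only wait for the short delay, decreases for the long one.
        if desired_rpm > self.current_rpm:
            delay = self.up_delay_s
        else:
            delay = self.down_delay_s
        if self.last_change_s is None or now_s - self.last_change_s >= delay:
            self.current_rpm = desired_rpm
            self.last_change_s = now_s
        return self.current_rpm


timer = AsymmetricFanTimer()
for now_s, desired in [(0, 3000), (2, 5500), (3, 5500), (8, 3000), (13, 3000)]:
    print(now_s, desired, timer.update(now_s, desired))
```

In this run the increase to 5500 is applied 3 seconds after the previous change, while the drop back to 3000 is held off until 10 seconds have passed, so rapid up-down flapping is still avoided.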
,
Mar 15 2018
> I assume by "skin" you mean the fan control system is reacting to the CPU package temp instead of core temps.
No, the skin temp is an estimation of the case temperature using the measurements from temperature IR sensors at various points.
> but the fan would cool the CPU more rapidly
The final goal of the thermal loop is not really to 'cool' the CPU (which is far from Tjmax in those cases) but to maintain the casing temperature within limits.
,
Mar 15 2018
The cooling system is designed with no direct response to CPU temps, but simply to avoid the user getting a warm lap? With CPU core and package temps available to the OS, should the cooling system base its actions on the temperatures of structures in the system that are bound to have significant thermal lag and unpredictable thermal gradients relative to the core temps? If the cooling system is not reacting to the CPU temps, how is it to properly regulate the highest dissipation element in the system? With HTML 5 and WebGL capable of delivering high-load content to any browser, this seems a recipe for trouble going forward.

I have seen multiple cores get right to the edge of the i5's 105C Tjmax on multiple occasions. That is certainly due to the bug behind this issue, but it also illustrates that a single point of failure allows the system to hit unsafe core temperatures with a lithium ion battery in close proximity. As an electrical engineer I certainly wouldn't be happy having that possibility in the wild.

I'm still a little stunned that the temp management system isn't really monitoring the heat source! Thermal gradients between the CPU and adjacent structures will vary significantly with ambient conditions. There is no way to guarantee safe core temps with such a strategy. Automotive cooling systems don't measure the temperature of the hood, as the internal engine temperatures are the important factor and the hood is not a reliable indicator of those temperatures.

Intel has a strict "no overclocking" policy for a reason. Running CPUs near their thermal limits reduces the lifespan of the part on top of increasing the risk of errors and crashes. It really sounds like there is room for improvement here beyond the bug fix.
,
Mar 16 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/ec/+/43d0769918a0c674423227bb9e81226a0dba6274

commit 43d0769918a0c674423227bb9e81226a0dba6274
Author: Ruben Rodriguez Buchillon <coconutruben@chromium.org>
Date: Fri Mar 16 22:56:23 2018

temp_metrics: use fanduty 0 to enable fan

If the fan is never enabled, temp_metrics itself has no code-path to enable the fan. This fixes this by calling fanduty 0 in the beginning of temp_metrics, since fanduty does explicitly enable the fan.

Note: This is a hack to avoid having to flash a new EC image. See crrev.com/c/964037 for a more fundamental fix to the same issue.

BRANCH=link
BUG= chromium:808764
TEST=couldn't reproduce issue with this version of temp_metrics.

Change-Id: I8a9b258ba7b50cf5180497d318f8d94454dab434
Signed-off-by: Ruben Rodriguez Buchillon <coconutruben@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/964069
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>

[modify] https://crrev.com/43d0769918a0c674423227bb9e81226a0dba6274/util/temp_metrics.conf
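The idea in the commit message, issuing one "ectool fanduty 0" before the regulation loop starts, can be sketched as follows. This is an illustration only and not the contents of temp_metrics.conf. The function name and the injectable run and start_loop parameters are hypothetical; "ectool fanduty 0" is the command named in the commit.

```python
import subprocess


def enable_fan_then(start_loop, run=subprocess.run):
    """Issue `ectool fanduty 0` once, then hand over to the loop.

    start_loop is a hypothetical callable standing in for whatever
    starts the regular temperature/fan regulation.
    """
    # Per the commit message, fanduty explicitly enables the fan.
    run(["ectool", "fanduty", "0"], check=False)
    return start_loop()


if __name__ == "__main__":
    # Dry run with a stand-in runner, so nothing is executed for real.
    enable_fan_then(lambda: print("regulation loop would start here"),
                    run=lambda cmd, check=False: print("would run:", cmd))
```

Passing the runner in as a parameter keeps the sketch testable on a machine without ectool.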
,
Mar 18 2018
Nice job Ruben! Looking forward to getting your fix rolled out to my Pixel. That will hopefully keep the i5 out of the danger zone.
,
Mar 19 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/platform/ec/+/2cf6a6ae8c15590f7cdf0cda153d45e5b49a632f

commit 2cf6a6ae8c15590f7cdf0cda153d45e5b49a632f
Author: Ruben Rodriguez Buchillon <coconutruben@chromium.org>
Date: Mon Mar 19 05:11:23 2018

temp_metrics: use fanduty 0 to enable fan

If the fan is never enabled, temp_metrics itself has no code-path to enable the fan. This fixes this by calling fanduty 0 in the beginning of temp_metrics, since fanduty does explicitly enable the fan.

Note: This is a hack to avoid having to flash a new EC image. See crrev.com/c/964037 for a more fundamental fix to the same issue.

BRANCH=link
BUG= chromium:808764
TEST=couldn't reproduce issue with this version of temp_metrics.

Change-Id: I8a9b258ba7b50cf5180497d318f8d94454dab434
Signed-off-by: Ruben Rodriguez Buchillon <coconutruben@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/964069
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Reviewed-by: Sameer Nanda <snanda@chromium.org>
(cherry picked from commit 43d0769918a0c674423227bb9e81226a0dba6274)
Reviewed-on: https://chromium-review.googlesource.com/967327
Commit-Queue: Furquan Shaikh <furquan@chromium.org>
Tested-by: Furquan Shaikh <furquan@chromium.org>
Trybot-Ready: Furquan Shaikh <furquan@chromium.org>
Reviewed-by: Furquan Shaikh <furquan@chromium.org>

[modify] https://crrev.com/2cf6a6ae8c15590f7cdf0cda153d45e5b49a632f/util/temp_metrics.conf
,
Mar 20 2018
Ruben, how will I know when this fix has been rolled out to the device? Thanks, Scott
,
Mar 20 2018
Barring any merges to earlier releases, I think this will go out with Chrome 67 (probably early June for stable channel, late April for beta channel, sooner for dev channel).
,
Mar 20 2018
Thanks
,
May 18 2018
Certainly looking forward to the fix on this. I did a simple transfer of photos from an SD card to a USB stick the other day and as usual, the fan sat there mute while the temps hit 92C! So, it doesn't require some beastly WebGL demo to create these high temp conditions.
,
Jun 21 2018
Is there any way to determine if this has rolled into the release builds yet? I was just writing some email this evening and the Pixel sat there in the low 80's (C) for about an hour with no fan running. When I rebooted, it was the same behavior observed in the examples above. When the Pixel restarted, the fan immediately spun up to cool the system and stayed on until the temps got back down into the 60's. It's still running normally as I type this.
,
Jun 21 2018
Can you tell me the exact version you're running?
,
Jun 21 2018
I was running this last night:
Version 66.0.3359.203 (Official Build) (64-bit)
Just updated to:
Version 67.0.3396.87 (Official Build) (64-bit)
I will test it again.
,
Jun 21 2018
Just ran this demo with Version 67.0.3396.87 (Official Build) (64-bit):
https://experiments.withgoogle.com/biomes
Temps hit 85 C before I shut it down. Upon reboot, the fan activated at high rpm for a moment and then shut down again. Seems the same as before.
,
Jun 21 2018
Ran it again up till all the cores were hovering at 100! I hate doing that. :-) The base of the unit at this point was about 110 F. The fan finally came on and started regulating temps down into the 80's. Maybe it's working now. Not sure. In any case it still seems like poor thermal regulation but at least there was fan activity. I'll run some more tests this evening to see if the fan activity is consistent.
,
Jun 21 2018
> The base of the unit at this point was about 110 F.
It's fine / expected for this machine if you are running an artificial intensive workload. The loud fan threshold starts at 42 C (107 F).
,
Jun 21 2018
By "loud" do you mean the high rpm threshold? There appears to be more fan activity, which is certainly good. I'll run more tests. However, even at those temps the fan never came close to sustaining its high rpm settings. The only time I actually heard the fan hit its max rpm was as the system booted. At other times it was audible, but probably at half max rpm or less.

As noted earlier in the thread, I was previously able to get the system temps very high doing a simple copy of multiple files to or from a flash drive. That certainly isn't an "artificial" workload, and neither is a WebGL demo. A raw floating point benchmark could be considered an artificial workload. A snazzy WebGL visual demo, however, is representative of what the web is becoming, be it cool data visualization or web gaming. These machines are designed with the primary function of consuming web content. They need to be able to do that and remain effectively thermally regulated right down to the core level.
,
Jun 28 2018
It does appear that the fan isn't completely dormant now. So that's progress! However, just now I started it up from a full shutdown and went straight to the "biomes" demo linked above (you have to select the arrows to get past the initial demo, as that doesn't work). The Pixel sat there for almost ten minutes with the cores hovering in the high 90's and frequently hitting 101 and 102! The fan finally went from barely audible (below 4000 rpm) to audible, but still nowhere close to its max rpm. It would spin up for a short period to 5500 rpm and then go back down to a low rpm, allowing the cores to again reach the 90's all across the board. Again, these same temperatures can be attained by simply copying a large number of files to or from a USB stick. The demo is just the easiest way to reproduce this scenario.

As I type this, the fan is running at a consistent 4000 to 5500 rpm according to the log and keeping temps in the low 70's, but of course there is no longer any significant load. Though its max rpm appears to be 9000 rpm (the highest value I've seen in the log), that speed is rarely ever employed, even when temps are almost max across all the cores! No matter how you approach it, letting a modern CPU like this spend long periods at or near Tjmax is poor thermal management when the fan isn't even near max rpm.

Somewhere along the way, the fan/chipset driver changed, as the Pixel once had a considerably more active cooling strategy from a user point of view. This thing would routinely spin the fan up and down to cool itself, and that never bothered me. The absence of that activity raised the red flag that started this issue. I believe there's a strong argument for further code changes in order to get the Pixel back to behaving well thermally.
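
For anyone who wants to check the "highest value I've seen in the log" observation on their own device, a minimal sketch along these lines might help. It assumes the log lines look like the `Setting fan RPM (temps: ...): 10 -> 0` sample quoted at the top of this report; the helper name `max_requested_rpm` is made up for illustration, and the number after the arrow is assumed to be the newly requested fan RPM.

```python
import re
from typing import Iterable, Optional

# Matches lines such as:
#   ... temp_metrics[3432]: Setting fan RPM (temps: 1:28:7:27:9:66:): 10 -> 0
# Group 1 is the previous value, group 2 the newly requested value.
_FAN_LINE = re.compile(r"Setting fan RPM \(temps: [^)]*\): (\d+) -> (\d+)")


def max_requested_rpm(lines: Iterable[str]) -> Optional[int]:
    """Return the largest requested fan RPM found in the log lines, or None."""
    targets = []
    for line in lines:
        match = _FAN_LINE.search(line)
        if match:
            targets.append(int(match.group(2)))
    return max(targets) if targets else None
```

Note that this only reports what temp_metrics asked for. As the original report points out, the log can request 3000 rpm while the fan stays silent, so the result is not proof of actual fan speed.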
,
Jun 28 2018
scottt492@gmail.com Thanks for being super helpful in debugging this. Given that the original issue being tracked in this bug (fans not spinning up) has been fixed, I am closing this bug as fixed.
,
Jun 28 2018
Agreed, the dead fan issue appears to be fixed. Thanks Ruben for your work on this!
Comment 1 by scottt...@gmail.com
, Feb 3 2018