From: "Zhang, Rui" <rui.zhang@intel.com>
To: Doug Smythies <dsmythies@telus.net>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>,
"srinivas.pandruvada@linux.intel.com"
<srinivas.pandruvada@linux.intel.com>,
"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver
Date: Mon, 18 Jan 2021 09:46:30 +0000
Message-ID: <e04c36aae6eb4cbb9b99799290016d58@intel.com>
In-Reply-To: <002601d6ec2a$36663da0$a332b8e0$@net>
Hi, Doug,
Thanks for testing this patch.
> -----Original Message-----
> From: Doug Smythies <dsmythies@telus.net>
> Sent: Sunday, January 17, 2021 1:08 AM
> To: Zhang, Rui <rui.zhang@intel.com>
> Cc: daniel.lezcano@linaro.org; srinivas.pandruvada@linux.intel.com;
> linux-pm@vger.kernel.org
> Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver
> Importance: High
>
> On 2021.01.15 Zhang Rui wrote:
> >
> > On Intel processors, the core frequency can be reduced below OS
> > request, when the current temperature reaches the TCC (Thermal Control
> > Circuit) activation temperature.
> >
> > The default TCC activation temperature is specified by
> > MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by specifying
> > an offset in degrees C, using the TCC Offset bits in the same MSR register.
> >
> > This patch introduces a cooling devices driver that utilizes the TCC
> > Offset feature. The bigger the current cooling state is, the lower the
> > effective TCC activation temperature is, so that the processors can be
> > throttled earlier before system critical overheats.
>
> Thank you for this useful patch.
> My systems don't need thermald or any other thermal control, but it is nice
> to have this extra margin to add to the critical stuff, as a backup.
> I also like to use the offset to test stuff.
>
> I use the internal power limit servo for power limiting, and that servo works
> very well indeed. Using this temperature offset as a way to servo the
> thermal operating limit does work, but tends to overshoot, oscillate, hold low
> excessively long (minutes).
Do you have a script to test and show the drawbacks of this feature?
It seems that it behaves differently on different platforms.
Maybe we can evaluate this on more platforms.
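A comparison across platforms could start from something like the following minimal sketch. It assumes the input is `turbostat --Summary` output with a `PkgTmp` column, as in the captures quoted below. The function names `parse_pkg_temps` and `overshoot_stats` are hypothetical, and the sample rows are copied from the first capture in this message.

```python
def parse_pkg_temps(lines):
    """Return the PkgTmp column (degrees C) from turbostat --Summary output."""
    temps = []
    col = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if "PkgTmp" in fields:
            col = fields.index("PkgTmp")  # the header row locates the column
            continue
        if col is None or len(fields) <= col:
            continue
        try:
            temps.append(int(fields[col]))
        except ValueError:
            continue  # skip rows that are not numeric samples
    return temps


def overshoot_stats(temps, trip_c):
    """Return (peak overshoot above trip_c, number of samples above trip_c)."""
    over = [t - trip_c for t in temps if t > trip_c]
    return (max(over) if over else 0, len(over))


# Sample rows taken from the first turbostat capture quoted in this message.
sample = """\
Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt
0.04 800 29 23 1.89 0.00
61.76 4546 4151 66 103.77 0.00
67.98 4572 4492 67 121.00 0.00
68.10 4489 4493 58 109.19 0.00
68.08 4262 4476 51 82.82 0.00
""".splitlines()

peak, count = overshoot_stats(parse_pkg_temps(sample), trip_c=55)
print(f"peak overshoot: {peak} C, samples above trip: {count}")
```

Running the same load and the same script on several machines would give comparable overshoot numbers. The trip point passed as `trip_c` has to be supplied by hand, since turbostat does not report it in these columns.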
> It also seems to limit CPU clock frequency
> reduction to the non-turbo limit, regardless of the desired maximum
> temperature.
>
> I am not familiar with the thermal stuff at all, and didn't know where to find
> the trip point knob. Anyway, found "cooling_devices11".
>
> I do not understand this:
>
> ~$ cat /sys/devices/virtual/thermal/cooling_device11/stats/trans_table
> cat: /sys/devices/virtual/thermal/cooling_device11/stats/trans_table: File too large
This is a known issue: the stats table cannot handle devices with too many cooling states, for example the 127 cooling states of the TCC Offset cooling device.
We can ignore this for now.
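As a rough illustration of why the table does not fit, here is a back-of-the-envelope sketch. It assumes the transition table prints one counter per (from, to) state pair and that a sysfs read is limited to a single page. The 9-byte cell width and the 4096-byte page size are assumptions for illustration, not values taken from the thermal core.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
CELL_BYTES = 9     # assumed width of one printed counter, e.g. "%8u "


def trans_table_bytes(n_states, cell_bytes=CELL_BYTES):
    """Rough lower bound on the text size of an n_states x n_states table."""
    return n_states * n_states * cell_bytes


def fits_in_page(n_states, page_size=PAGE_SIZE):
    """True if the estimated table text fits in one page."""
    return trans_table_bytes(n_states) < page_size


for n in (10, 127):
    print(n, trans_table_bytes(n), fits_in_page(n))
```

Because the size grows with the square of the state count, a device with a handful of states fits easily while one with 127 states is far over the limit under these assumptions.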
>
> Rather than enter the actual TCC offset, I would rather enter the desired trip
> point, and have the driver do the math to convert it to the offset.
Hmmm, a writable trip point? I need to think about this.
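For reference, the math being asked for is small. Below is a minimal sketch, assuming the effective activation temperature is TjMax minus the TCC offset, with TjMax read from bits 23:16 of MSR_IA32_TEMPERATURE_TARGET. The offset field position (starting at bit 24) and its width are model-dependent and are assumptions here. The helper names and the example values (TjMax 100, a 6-bit field) are hypothetical.

```python
TJMAX_SHIFT = 16        # TjMax field, bits 23:16
TJMAX_MASK = 0xFF
TCC_OFFSET_SHIFT = 24   # assumed start bit of the TCC Offset field


def tjmax_from_msr(msr_value):
    """Extract TjMax (degrees C) from a raw MSR value."""
    return (msr_value >> TJMAX_SHIFT) & TJMAX_MASK


def trip_to_offset(trip_c, tjmax_c, offset_bits):
    """Convert a desired trip point into a TCC offset, or raise ValueError."""
    max_offset = (1 << offset_bits) - 1
    offset = tjmax_c - trip_c
    if not 0 <= offset <= max_offset:
        raise ValueError(
            f"trip {trip_c} C needs offset {offset}, outside 0..{max_offset}")
    return offset


def apply_offset(msr_value, offset, offset_bits):
    """Return msr_value with the (assumed) TCC Offset field replaced."""
    mask = ((1 << offset_bits) - 1) << TCC_OFFSET_SHIFT
    return (msr_value & ~mask) | (offset << TCC_OFFSET_SHIFT)


# Hypothetical example: TjMax of 100 C, 6-bit offset field, 55 C trip point.
msr = 100 << TJMAX_SHIFT
offset = trip_to_offset(55, tjmax_from_msr(msr), offset_bits=6)
print(offset, hex(apply_offset(msr, offset, offset_bits=6)))
```

One consequence of the conversion is that the reachable trip points depend on the field width: with the hypothetical values above, nothing below 37 degrees C can be requested, so an interface that takes a trip point would need to reject or clamp such values.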
>
> Example step function overshoot, trip point set to 55 degrees C.
>
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
> Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt
> 0.07 800 45 24 1.89 0.00
> 0.04 800 29 23 1.89 0.00
> 61.76 4546 4151 66 103.77 0.00 < step function load applied on 4 of 6 cores
> 67.76 4570 4476 66 120.42 0.00
> 68.03 4567 4488 66 120.73 0.00
> 67.98 4572 4492 67 121.00 0.00 < 19 degrees over trip point
> 68.10 4489 4493 58 109.19 0.00 < this throttling is either the power servo or the temp servo.
> 68.08 4262 4476 51 82.82 0.00 < this throttling is the temp servo.
> 68.13 4143 4513 48 75.16 0.00
> 68.03 4086 4488 46 71.87 0.00 < It actually undershoots often, I don't know why.
> 68.12 4000 4505 46 67.02 0.00 < often it doesn't undershoot.
> 68.44 4000 4502 45 67.16 0.00
> 68.06 4000 4483 45 66.95 0.00
> 68.02 3973 4490 44 65.20 0.00
> 67.94 3900 4489 43 60.51 0.00
> 67.88 3900 4501 44 60.55 0.00
> 67.85 3900 4472 43 60.52 0.00
> 67.96 3900 4481 43 60.59 0.00
> 68.26 3900 4501 44 60.70 0.00
> 67.93 3900 4498 43 60.58 0.00
> 68.03 3900 4476 43 60.68 0.00
> 67.83 3900 4481 44 60.54 0.00
> 35.06 3895 2412 25 32.13 0.00 < load removed.
> 0.04 800 25 24 1.89 0.00
> 0.04 800 22 23 1.89 0.00
> 0.06 800 35 23 1.90 0.00
> 0.03 800 18 23 1.89 0.00
> 0.04 800 26 22 1.90 0.00
> 0.30 1927 44 23 1.97 0.00
> ^C0.10 800 25 23 1.91 0.00
>
> Example long time to recover:
> (actually, this example never recovers, unusual):
> Note: 3.7 GHz is the limit.
>
> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 30
> Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt
> 67.58 3700 134812 42 52.15 0.00 <<< the trip point was changed from 37 to 57 degrees
> 67.90 3700 134964 42 52.08 0.00
> 68.07 3700 134424 42 52.06 0.00
> 68.01 3700 134415 41 50.76 0.00
> 68.14 3700 134521 41 50.78 0.00
> 68.11 3700 134424 42 50.75 0.00
> 68.03 3700 134329 42 50.70 0.00
> 68.11 3700 134321 42 50.76 0.00
> 68.05 3700 134456 42 51.09 0.00
> 68.12 3700 134549 42 52.21 0.00
> 68.12 3700 134482 42 52.19 0.00
> 68.10 3700 134301 42 52.20 0.00
> 68.11 3700 134444 42 52.14 0.00
> 68.08 3700 134422 42 52.17 0.00
> 68.07 3700 134430 42 52.23 0.00
> 68.00 3700 134723 42 52.12 0.00
> 67.96 3711 135207 44 52.53 0.00 <<< It takes 8 minutes until the frequency goes above 3.7 GHz
> 68.05 3765 134519 42 54.34 0.00
> 68.11 3771 134461 43 54.60 0.00
> 67.83 3763 134867 43 54.26 0.00
> 67.93 3773 134577 43 54.78 0.00 <<< But it never recovers, Why not?
> ...
>
> For unknown reason the processor seems to now think it is not heavily
> loaded. From my MSR decoder:
>
> 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL
>
> From the book:
>
> > Autonomous Utilization-Based Frequency Control Status (R0) When set,
> > frequency is reduced below the operating system request because the
> > processor has detected that utilization is low.
>
> Which is not true.
>
> Anyway,
>
> Acked-by: Doug Smythies <dsmythies@telus.net>
>
thanks,
rui