* [lm-sensors] [PATCH 1/2] k8temp warn about errata
@ 2008-10-01 22:09 Rudolf Marek
2008-10-25 12:43 ` Jean Delvare
` (18 more replies)
0 siblings, 19 replies; 20+ messages in thread
From: Rudolf Marek @ 2008-10-01 22:09 UTC (permalink / raw)
To: lm-sensors
[-- Attachment #1: Type: text/plain, Size: 586 bytes --]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hello,
Following patch adds warning about wrong CPU temperature readouts on all fam f
rev f revision of CPUs.
Used switch statement, more code changes follows.
Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
It should "fix" http://bugzilla.kernel.org/show_bug.cgi?id=8866
Rudolf
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFI4/UF3J9wPJqZRNURAmtlAJ9+mLtSfm0pm5mPYM64KAJlkqNJSACfeUd/
E/rMMMRv+fY550oVBxk3lF4=
=zqkq
-----END PGP SIGNATURE-----
[-- Attachment #2: k8temp_add_warning.patch --]
[-- Type: text/x-diff, Size: 1356 bytes --]
Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
===================================================================
--- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.855284456 +0200
+++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-09-28 11:13:42.396117790 +0200
@@ -34,6 +34,7 @@
#define TEMP_FROM_REG(val) (((((val) >> 16) & 0xff) - 49) * 1000)
#define REG_TEMP 0xe4
+#define REG_CPUID 0xfc
#define SEL_PLACE 0x40
#define SEL_CORE 0x04
@@ -47,6 +48,7 @@
/* registers values */
u8 sensorsp; /* sensor presence bits - SEL_CORE & SEL_PLACE */
u32 temp[2][2]; /* core, place */
+ u8 fam;
};
static struct k8temp_data *k8temp_update_device(struct device *dev)
@@ -141,6 +143,7 @@
int err;
u8 scfg;
u32 temp;
+
struct k8temp_data *data;
u32 cpuid = cpuid_eax(1);
@@ -155,6 +158,18 @@
goto exit;
}
+ /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
+ pci_read_config_dword(pdev, REG_CPUID, &cpuid);
+
+ data->fam = (cpuid & 0x00000f00) >> 8;
+ data->fam += (cpuid & 0x00f00000) >> 20;
+
+ switch (data->fam) {
+ case 0xf:
+ dev_warn(&pdev->dev, "Temperature readouts might be wrong"
+ " - check errata #141\n");
+ }
+
pci_read_config_byte(pdev, REG_TEMP, &scfg);
scfg &= ~(SEL_PLACE | SEL_CORE); /* Select sensor 0, core0 */
pci_write_config_byte(pdev, REG_TEMP, scfg);
[-- Attachment #3: k8temp_add_warning.patch.sig --]
[-- Type: application/octet-stream, Size: 65 bytes --]
[-- Attachment #4: Type: text/plain, Size: 153 bytes --]
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
@ 2008-10-25 12:43 ` Jean Delvare
2008-10-27 1:18 ` Jean-Marc Spaggiari
` (17 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-10-25 12:43 UTC (permalink / raw)
To: lm-sensors
Hi Rudolf,
On Thu, 02 Oct 2008 00:09:09 +0200, Rudolf Marek wrote:
> Following patch adds warning about wrong CPU temperature readouts on all fam f
> rev f revision of CPUs.
>
> Used switch statement, more code changes follows.
>
> Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
>
> It should "fix" http://bugzilla.kernel.org/show_bug.cgi?idˆ66
Thanks for working on this. Review:
> Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
> =================================> --- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.855284456 +0200
> +++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-09-28 11:13:42.396117790 +0200
> @@ -34,6 +34,7 @@
>
> #define TEMP_FROM_REG(val) (((((val) >> 16) & 0xff) - 49) * 1000)
> #define REG_TEMP 0xe4
> +#define REG_CPUID 0xfc
> #define SEL_PLACE 0x40
> #define SEL_CORE 0x04
>
> @@ -47,6 +48,7 @@
> /* registers values */
> u8 sensorsp; /* sensor presence bits - SEL_CORE & SEL_PLACE */
> u32 temp[2][2]; /* core, place */
> + u8 fam;
> };
>
> static struct k8temp_data *k8temp_update_device(struct device *dev)
> @@ -141,6 +143,7 @@
> int err;
> u8 scfg;
> u32 temp;
> +
> struct k8temp_data *data;
> u32 cpuid = cpuid_eax(1);
>
Unrelated whitespace change, please revert.
> @@ -155,6 +158,18 @@
> goto exit;
> }
>
> + /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
That's tricky. It took me some time to figure out why the warning
wouldn't be displayed on all family 0Fh CPUs ;)
> + pci_read_config_dword(pdev, REG_CPUID, &cpuid);
> +
> + data->fam = (cpuid & 0x00000f00) >> 8;
> + data->fam += (cpuid & 0x00f00000) >> 20;
The extended family code is an 8-bit value, so this should be:
+ data->fam += (cpuid & 0x0ff00000) >> 20;
I don't know if the result is guaranteed to fit in a u8.
> +
> + switch (data->fam) {
> + case 0xf:
Coding style: case is normally aligned on switch (see Chapter 1:
Indentation of Documentation/CodingStyle.)
> + dev_warn(&pdev->dev, "Temperature readouts might be wrong"
> + " - check errata #141\n");
> + }
> +
> pci_read_config_byte(pdev, REG_TEMP, &scfg);
> scfg &= ~(SEL_PLACE | SEL_CORE); /* Select sensor 0, core0 */
> pci_write_config_byte(pdev, REG_TEMP, scfg);
This is a good start, but I don't think this is enough. I don't expect
desktop users to search the kernel logs for that kind of information.
The situation is made worse by the fact that the k8temp driver is one
of the few hwmon drivers which auto-loads, so many users see the
reported temperature, not just those who ran sensors-detect. So I'd
rather implement a more "active" mechanism. I have several ideas in
mind and would welcome comments on them.
If all revision F and later CPUs are affected by the errata and the
temperature reported is never correct, we should simply blacklist these
CPUs. I guess this isn't the case, otherwise you'd have proposed that
we do that long ago.
If most but not all of these CPUs are affected, then it would make
sense to disable them by default but give the user a chance to still
enable them (using a module parameter.)
If there are more revision F and later CPUs with working thermal
diodes, and working ones can be told from non-working ones based on the
exact revision, we could implement blacklisting and/or whitelisting
base on the revision.
Does anyone have an idea about the ratio of working/broken revision F
and later CPU models?
Does the status (broken or not) of the thermal diode depend on the
exact CPU revision, or is it purely random?
Alternatively, or additionally, we could use a heuristic to detect
broken diodes. I don't much like this in general, as monitoring should
normally not assume anything on what it is monitoring, but in this
specific case I think it makes some sense because the thermal sensor in
question is not a generic chip and brokenness is so widely spread.
When looking at reports for broken K8 diodes, for example at:
http://bugzilla.kernel.org/show_bug.cgi?idˆ66
it is immediately visible to a human that:
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +17 C
Core0 Temp: +3 C
Core1 Temp: +21 C
Core1 Temp: +5 C
are NOT reasonable temperatures for a CPU.
Other obviously wrong outputs (gathered on the lm-sensors mailing list):
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +4°C
Core0 Temp: -1°C
Core1 Temp: -2°C
Core1 Temp: -20°C
k8temp-pci-00c3
Adapter: PCI adapter
temp1: -9°C
temp2: -20°C
temp3: -18°C
temp4: -14°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: -6°C
Core0 Temp: -11°C
Core1 Temp: +5°C
Core1 Temp: -13°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: -1°C
Core0 Temp: +1°C
Core1 Temp: +6°C
Core1 Temp: +6°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +9°C
Core0 Temp: +0°C
Core1 Temp: +3°C
Core1 Temp: +11°C
The following outputs, OTOH, look pretty good (also gathered on the
lm-sensors mailing list):
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +42.0°C
Core0 Temp: +39.0°C
Core1 Temp: +36.0°C
Core1 Temp: +39.0°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +43.0°C
Core1 Temp: +37.0°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +41.0°C
Core1 Temp: +46.0°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +38.0°C
Core1 Temp: +25.0°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +31.0°C
Core1 Temp: +32.0°C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +32°C
Core1 Temp: +35°C
k8temp-pci-00c3
Adapter: PCI adapter
Core1 Temp: +43°C
Core2 Temp: +46°C
There were also a few reports which I'm not sure if there are good or
bad:
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +26 C
Core1 Temp: +24 C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +17°C
Core1 Temp: +25°C
So I am under the impression that it should not be too difficult to
come up with a heuristic to spot obviously broken sensors:
* Apparently all affected CPUs have 4 thermal sensors (I don't know if
this is because all revision F and later CPUs have 4 thermal sensors?)
* Average of the 4 sensors is below 15 degrees C.
* Minimum of the 4 sensors is below 5 degrees C
So we could build a heuristic based on the number of sensors and either
the average or the minimum temperature and reject by default all
revision F and later CPUs which do not pass the test. We would of
course provide a way (module parameter) for the user to change the
temperature limit or maybe disable the heuristic entirely.
Of course this isn't a perfect solution, as there are still CPUs in the
gray zone for which no heuristic can say whether they are good or bad.
Also, there is a small risk that a CPU which is at the limit will be
rejected sometimes but not always, so hwmon device numbering may change
from one boot to the next. However, I think that a good heuristic would
solve way more problems than it would cause.
Comments anyone?
Thanks,
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
2008-10-25 12:43 ` Jean Delvare
@ 2008-10-27 1:18 ` Jean-Marc Spaggiari
2008-11-09 19:56 ` Rudolf Marek
` (16 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean-Marc Spaggiari @ 2008-10-27 1:18 UTC (permalink / raw)
To: lm-sensors
2008/10/25 Jean Delvare <khali@linux-fr.org>:
> Hi Rudolf,
>
> On Thu, 02 Oct 2008 00:09:09 +0200, Rudolf Marek wrote:
>> Following patch adds warning about wrong CPU temperature readouts on all fam f
>> rev f revision of CPUs.
>>
>> Used switch statement, more code changes follows.
>>
>> Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
>>
>> It should "fix" http://bugzilla.kernel.org/show_bug.cgi?idˆ66
>
> Thanks for working on this. Review:
>
>> Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
>> =================================>> --- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.855284456 +0200
>> +++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-09-28 11:13:42.396117790 +0200
>> @@ -34,6 +34,7 @@
>>
>> #define TEMP_FROM_REG(val) (((((val) >> 16) & 0xff) - 49) * 1000)
>> #define REG_TEMP 0xe4
>> +#define REG_CPUID 0xfc
>> #define SEL_PLACE 0x40
>> #define SEL_CORE 0x04
>>
>> @@ -47,6 +48,7 @@
>> /* registers values */
>> u8 sensorsp; /* sensor presence bits - SEL_CORE & SEL_PLACE */
>> u32 temp[2][2]; /* core, place */
>> + u8 fam;
>> };
>>
>> static struct k8temp_data *k8temp_update_device(struct device *dev)
>> @@ -141,6 +143,7 @@
>> int err;
>> u8 scfg;
>> u32 temp;
>> +
>> struct k8temp_data *data;
>> u32 cpuid = cpuid_eax(1);
>>
>
> Unrelated whitespace change, please revert.
>
>> @@ -155,6 +158,18 @@
>> goto exit;
>> }
>>
>> + /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
>
> That's tricky. It took me some time to figure out why the warning
> wouldn't be displayed on all family 0Fh CPUs ;)
>
>> + pci_read_config_dword(pdev, REG_CPUID, &cpuid);
>> +
>> + data->fam = (cpuid & 0x00000f00) >> 8;
>> + data->fam += (cpuid & 0x00f00000) >> 20;
>
> The extended family code is an 8-bit value, so this should be:
>
> + data->fam += (cpuid & 0x0ff00000) >> 20;
>
> I don't know if the result is guaranteed to fit in a u8.
>
>> +
>> + switch (data->fam) {
>> + case 0xf:
>
> Coding style: case is normally aligned on switch (see Chapter 1:
> Indentation of Documentation/CodingStyle.)
>
>> + dev_warn(&pdev->dev, "Temperature readouts might be wrong"
>> + " - check errata #141\n");
>> + }
>> +
>> pci_read_config_byte(pdev, REG_TEMP, &scfg);
>> scfg &= ~(SEL_PLACE | SEL_CORE); /* Select sensor 0, core0 */
>> pci_write_config_byte(pdev, REG_TEMP, scfg);
>
> This is a good start, but I don't think this is enough. I don't expect
> desktop users to search the kernel logs for that kind of information.
> The situation is made worse by the fact that the k8temp driver is one
> of the few hwmon drivers which auto-loads, so many users see the
> reported temperature, not just those who ran sensors-detect. So I'd
> rather implement a more "active" mechanism. I have several ideas in
> mind and would welcome comments on them.
>
> If all revision F and later CPUs are affected by the errata and the
> temperature reported is never correct, we should simply blacklist these
> CPUs. I guess this isn't the case, otherwise you'd have proposed that
> we do that long ago.
>
> If most but not all of these CPUs are affected, then it would make
> sense to disable them by default but give the user a chance to still
> enable them (using a module parameter.)
>
> If there are more revision F and later CPUs with working thermal
> diodes, and working ones can be told from non-working ones based on the
> exact revision, we could implement blacklisting and/or whitelisting
> base on the revision.
>
> Does anyone have an idea about the ratio of working/broken revision F
> and later CPU models?
>
> Does the status (broken or not) of the thermal diode depend on the
> exact CPU revision, or is it purely random?
>
> Alternatively, or additionally, we could use a heuristic to detect
> broken diodes. I don't much like this in general, as monitoring should
> normally not assume anything on what it is monitoring, but in this
> specific case I think it makes some sense because the thermal sensor in
> question is not a generic chip and brokenness is so widely spread.
>
> When looking at reports for broken K8 diodes, for example at:
> http://bugzilla.kernel.org/show_bug.cgi?idˆ66
> it is immediately visible to a human that:
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +17 C
> Core0 Temp: +3 C
> Core1 Temp: +21 C
> Core1 Temp: +5 C
>
> are NOT reasonable temperatures for a CPU.
>
> Other obviously wrong outputs (gathered on the lm-sensors mailing list):
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +4°C
> Core0 Temp: -1°C
> Core1 Temp: -2°C
> Core1 Temp: -20°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> temp1: -9°C
> temp2: -20°C
> temp3: -18°C
> temp4: -14°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: -6°C
> Core0 Temp: -11°C
> Core1 Temp: +5°C
> Core1 Temp: -13°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: -1°C
> Core0 Temp: +1°C
> Core1 Temp: +6°C
> Core1 Temp: +6°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +9°C
> Core0 Temp: +0°C
> Core1 Temp: +3°C
> Core1 Temp: +11°C
>
> The following outputs, OTOH, look pretty good (also gathered on the
> lm-sensors mailing list):
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +42.0°C
> Core0 Temp: +39.0°C
> Core1 Temp: +36.0°C
> Core1 Temp: +39.0°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +43.0°C
> Core1 Temp: +37.0°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +41.0°C
> Core1 Temp: +46.0°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +38.0°C
> Core1 Temp: +25.0°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +31.0°C
> Core1 Temp: +32.0°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +32°C
> Core1 Temp: +35°C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core1 Temp: +43°C
> Core2 Temp: +46°C
>
> There were also a few reports which I'm not sure if there are good or
> bad:
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +26 C
> Core1 Temp: +24 C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +17°C
> Core1 Temp: +25°C
>
> So I am under the impression that it should not be too difficult to
> come up with a heuristic to spot obviously broken sensors:
> * Apparently all affected CPUs have 4 thermal sensors (I don't know if
> this is because all revision F and later CPUs have 4 thermal sensors?)
> * Average of the 4 sensors is below 15 degrees C.
> * Minimum of the 4 sensors is below 5 degrees C
>
> So we could build a heuristic based on the number of sensors and either
> the average or the minimum temperature and reject by default all
> revision F and later CPUs which do not pass the test. We would of
> course provide a way (module parameter) for the user to change the
> temperature limit or maybe disable the heuristic entirely.
>
> Of course this isn't a perfect solution, as there are still CPUs in the
> gray zone for which no heuristic can say whether they are good or bad.
> Also, there is a small risk that a CPU which is at the limit will be
> rejected sometimes but not always, so hwmon device numbering may change
> from one boot to the next. However, I think that a good heuristic would
> solve way more problems than it would cause.
>
> Comments anyone?
>
> Thanks,
> --
> Jean Delvare
Sound like a good idea to me
For those values:
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +26 C
Core1 Temp: +24 C
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +17°C
Core1 Temp: +25°C
I think they are correct. In winter, I place my computer outside
(-10C/-15C). The idle core temp can be pretty low. So it's important
to give the user a way to force the sensor activation or adjust the
temperatures level for the heuristic.
So in this statement:
* Average of the 4 sensors is below 15 degrees C.
* Minimum of the 4 sensors is below 5 degrees C
The user need to be able to update 15 and 5.
Also, since in all examples the average temperature is lower than 15
degrees C (-4,75; -15,25; -6,25; 3; 5,75) I will recommand to put it
closer to 10 degrees C.
Just my 2¢.
--
JM
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
2008-10-25 12:43 ` Jean Delvare
2008-10-27 1:18 ` Jean-Marc Spaggiari
@ 2008-11-09 19:56 ` Rudolf Marek
2008-11-09 20:47 ` Jean Delvare
` (15 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Rudolf Marek @ 2008-11-09 19:56 UTC (permalink / raw)
To: lm-sensors
[-- Attachment #1: Type: text/plain, Size: 2681 bytes --]
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Jean
Thanks for the review. I'm attaching fixed version.
Following patch adds warning about wrong CPU temperature readouts on all fam f
rev f revision of CPUs.
Used switch statement, more code changes follows.
Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
> If all revision F and later CPUs are affected by the errata and the
> temperature reported is never correct, we should simply blacklist these
> CPUs. I guess this isn't the case, otherwise you'd have proposed that
> we do that long ago.
Yes.
> If most but not all of these CPUs are affected, then it would make
> sense to disable them by default but give the user a chance to still
> enable them (using a module parameter.)
I tried hard to get this info from AMD. All versions are affected, but some may
be fixed/not tested.
> If there are more revision F and later CPUs with working thermal
> diodes, and working ones can be told from non-working ones based on the
> exact revision, we could implement blacklisting and/or whitelisting
> base on the revision.
The errata is for all revs.
> Does anyone have an idea about the ratio of working/broken revision F
> and later CPU models?
Dont sorry.
> Does the status (broken or not) of the thermal diode depend on the
> exact CPU revision, or is it purely random?
Perhaps random.
> Alternatively, or additionally, we could use a heuristic to detect
> broken diodes. I don't much like this in general, as monitoring should
> normally not assume anything on what it is monitoring, but in this
> specific case I think it makes some sense because the thermal sensor in
> question is not a generic chip and brokenness is so widely spread.
I agree. This could work. However I would like to first get the fam 10h and fam
11h support in, because otherwise I would need to rewrite the patches again and
changing lot of at the same time is no good.
So please can you accept the patch as it is now and have a look on second one
too? I fixed the case indent and we need to write some docs too, but first check
code.
Andreas can you test please the second patch too? I'm especially curious if you
see the high limit when running sensors command.
Once second patch is in, I would like to rewrite the driver with the help of
heuristic and put there more generic logic for "place/core" selectors and temp1_
etc sysfs files creation logic.
Thanks,
Rudolf
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkkXQIYACgkQ3J9wPJqZRNVYYQCfdonOHvRKV8guroQtPq6VA6ZF
/W0AniSg/LYun9ZfsF2lnVlWjrVIY/6P
=qoH1
-----END PGP SIGNATURE-----
[-- Attachment #2: k8temp_add_warning.patch --]
[-- Type: text/x-diff, Size: 1239 bytes --]
Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
===================================================================
--- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.000000000 +0200
+++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-11-02 12:05:58.125785703 +0100
@@ -34,6 +34,7 @@
#define TEMP_FROM_REG(val) (((((val) >> 16) & 0xff) - 49) * 1000)
#define REG_TEMP 0xe4
+#define REG_CPUID 0xfc
#define SEL_PLACE 0x40
#define SEL_CORE 0x04
@@ -47,6 +48,7 @@
/* registers values */
u8 sensorsp; /* sensor presence bits - SEL_CORE & SEL_PLACE */
u32 temp[2][2]; /* core, place */
+ u8 fam;
};
static struct k8temp_data *k8temp_update_device(struct device *dev)
@@ -155,6 +157,18 @@
goto exit;
}
+ /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
+ pci_read_config_dword(pdev, REG_CPUID, &cpuid);
+
+ data->fam = (cpuid & 0x00000f00) >> 8;
+ data->fam += (cpuid & 0x0ff00000) >> 20;
+
+ switch (data->fam) {
+ case 0xf:
+ dev_warn(&pdev->dev, "Temperature readouts might be wrong"
+ " - check errata #141\n");
+ }
+
pci_read_config_byte(pdev, REG_TEMP, &scfg);
scfg &= ~(SEL_PLACE | SEL_CORE); /* Select sensor 0, core0 */
pci_write_config_byte(pdev, REG_TEMP, scfg);
[-- Attachment #3: Type: text/plain, Size: 153 bytes --]
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (2 preceding siblings ...)
2008-11-09 19:56 ` Rudolf Marek
@ 2008-11-09 20:47 ` Jean Delvare
2008-11-09 20:54 ` Jean Delvare
` (14 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-09 20:47 UTC (permalink / raw)
To: lm-sensors
Hi Jean-Marc,
Thanks for your comments.
On Sun, 26 Oct 2008 21:18:42 -0400, Jean-Marc Spaggiari wrote:
> Sound like a good idea to me
>
> For those values:
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +26 C
> Core1 Temp: +24 C
>
> k8temp-pci-00c3
> Adapter: PCI adapter
> Core0 Temp: +17°C
> Core1 Temp: +25°C
>
> I think they are correct. In winter, I place my computer outside
> (-10C/-15C). The idle core temp can be pretty low.
They can be correct, but the fact is that we can't be certain (that's
more or less the definition of a heuristic.) Maybe these machines had
working diodes and were in a very cool environment. Or maybe these
machines had broken diodes and were in a very hot environment.
> So it's important
> to give the user a way to force the sensor activation or adjust the
> temperatures level for the heuristic.
Definitely, yes.
> So in this statement:
> * Average of the 4 sensors is below 15 degrees C.
> * Minimum of the 4 sensors is below 5 degrees C
>
> The user need to be able to update 15 and 5.
This was part of my original proposal :) Either that of a boolean
option to simply force the driver to bind.
> Also, since in all examples the average temperature is lower than 15
> degrees C (-4,75; -15,25; -6,25; 3; 5,75) I will recommand to put it
> closer to 10 degrees C.
You forgot the first entry in the list:
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: +17 C
Core0 Temp: +3 C
Core1 Temp: +21 C
Core1 Temp: +5 C
That one has an average temperature of 11.5 degrees C, so a limit of 10
degrees C would consider it good - which it clearly isn't. That being
said, there's no perfect limit, so maybe it is OK if the above is
considered OK by the driver even if I'm certain it isn't.
OTOH, the above is an interesting case. What strikes me here is not the
temperature values themselves but the difference between them which is
pretty unrealistic. So maybe this can be another check in the heuristic.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (3 preceding siblings ...)
2008-11-09 20:47 ` Jean Delvare
@ 2008-11-09 20:54 ` Jean Delvare
2008-11-10 1:27 ` Jordan Crouse
` (13 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-09 20:54 UTC (permalink / raw)
To: lm-sensors
Hi Rudolf,
On Sun, 09 Nov 2008 20:56:54 +0100, Rudolf Marek wrote:
> Thanks for the review. I'm attaching fixed version.
>
> Following patch adds warning about wrong CPU temperature readouts on all fam f
> rev f revision of CPUs.
>
> Used switch statement, more code changes follows.
>
> Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
I've applied it, thanks. Note that I added a "break" at the end of the
case 0xf, otherwise it's an invitation to get things wrong in the next
patch...
> > If all revision F and later CPUs are affected by the errata and the
> > temperature reported is never correct, we should simply blacklist these
> > CPUs. I guess this isn't the case, otherwise you'd have proposed that
> > we do that long ago.
>
> Yes.
>
> > If most but not all of these CPUs are affected, then it would make
> > sense to disable them by default but give the user a chance to still
> > enable them (using a module parameter.)
>
> I tried hard to get this info from AMD. All versions are affected, but some may
> be fixed/not tested.
>
> > If there are more revision F and later CPUs with working thermal
> > diodes, and working ones can be told from non-working ones based on the
> > exact revision, we could implement blacklisting and/or whitelisting
> > base on the revision.
>
> The errata is for all revs.
For what it's worth, Jordan Crouse seems to think that blacklisting on
a per-revision basis may still work.
> > Does anyone have an idea about the ratio of working/broken revision F
> > and later CPU models?
>
> Dont sorry.
>
> > Does the status (broken or not) of the thermal diode depend on the
> > exact CPU revision, or is it purely random?
>
> Perhaps random.
>
> > Alternatively, or additionally, we could use a heuristic to detect
> > broken diodes. I don't much like this in general, as monitoring should
> > normally not assume anything on what it is monitoring, but in this
> > specific case I think it makes some sense because the thermal sensor in
> > question is not a generic chip and brokenness is so widely spread.
>
> I agree. This could work. However I would like to first get the fam 10h and fam
> 11h support in, because otherwise I would need to rewrite the patches again and
> changing lot of at the same time is no good.
I understand.
> So please can you accept the patch as it is now and have a look on second one
> too? I fixed the case indent and we need to write some docs too, but first check
> code.
This second patch is larger, I will try to find some time to review it
but I can't promise anything. Also note that I am not familiar with the
Fam 10h CPUs at all and I don't have any so I can't test the code.
> Andreas can you test please the second patch too? I'm especially curious if you
> see the high limit when running sensors command.
>
> Once second patch is in, I would like to rewrite the driver with the help of
> heuristic and put there more generic logic for "place/core" selectors and temp1_
> etc sysfs files creation logic.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (4 preceding siblings ...)
2008-11-09 20:54 ` Jean Delvare
@ 2008-11-10 1:27 ` Jordan Crouse
2008-11-10 9:38 ` Jean Delvare
` (12 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jordan Crouse @ 2008-11-10 1:27 UTC (permalink / raw)
To: lm-sensors
Jean Delvare wrote:
> Hi Rudolf,
>
> On Sun, 09 Nov 2008 20:56:54 +0100, Rudolf Marek wrote:
>> Thanks for the review. I'm attaching fixed version.
>>
>> Following patch adds warning about wrong CPU temperature readouts on all fam f
>> rev f revision of CPUs.
>>
>> Used switch statement, more code changes follows.
>>
>> Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
>
> I've applied it, thanks. Note that I added a "break" at the end of the
> case 0xf, otherwise it's an invitation to get things wrong in the next
> patch...
>
>>> If all revision F and later CPUs are affected by the errata and the
>>> temperature reported is never correct, we should simply blacklist these
>>> CPUs. I guess this isn't the case, otherwise you'd have proposed that
>>> we do that long ago.
>> Yes.
>>
>>> If most but not all of these CPUs are affected, then it would make
>>> sense to disable them by default but give the user a chance to still
>>> enable them (using a module parameter.)
>> I tried hard to get this info from AMD. All versions are affected, but some may
>> be fixed/not tested.
>>
>>> If there are more revision F and later CPUs with working thermal
>>> diodes, and working ones can be told from non-working ones based on the
>>> exact revision, we could implement blacklisting and/or whitelisting
>>> base on the revision.
>> The errata is for all revs.
>
> For what it's worth, Jordan Crouse seems to think that blacklisting on
> a per-revision basis may still work.
I think it can. A much larger sample would probably need to be taken to
be completely sure - but I hope that we'll find that the problem is
deterministic enough for a blacklist. I think we would agree that a
blacklist would be the more user friendly solution.
Jordan
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (5 preceding siblings ...)
2008-11-10 1:27 ` Jordan Crouse
@ 2008-11-10 9:38 ` Jean Delvare
2008-11-18 8:41 ` Andreas Herrmann
` (11 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-10 9:38 UTC (permalink / raw)
To: lm-sensors
Hi Jordan,
On Sun, 09 Nov 2008 18:27:53 -0700, Jordan Crouse wrote:
> Jean Delvare wrote:
> > For what it's worth, Jordan Crouse seems to think that blacklisting on
> > a per-revision basis may still work.
>
> I think it can. A much larger sample would probably need to be taken to
> be completely sure - but I hope that we'll find that the problem is
> deterministic enough for a blacklist. I think we would agree that a
> blacklist would be the more user friendly solution.
OK, but then we should probably extend Rudolf's patch to ask users
potentially affected by the errata to report to us. The report should
include the CPUID information (e.g. contents of /proc/cpuinfo), the
output of sensors (or the raw temperature values from sysfs), and
whether or not the user thinks the temperature values are correct.
If we don't do that, I fear it will take forever before we can complete
the blacklist.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (6 preceding siblings ...)
2008-11-10 9:38 ` Jean Delvare
@ 2008-11-18 8:41 ` Andreas Herrmann
2008-11-18 8:55 ` Andreas Herrmann
` (10 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 8:41 UTC (permalink / raw)
To: lm-sensors
On Thu, Oct 02, 2008 at 12:09:09AM +0200, Rudolf Marek wrote:
Hi Rudolf,
Sorry for the late reply. But up to now, I didn't find time for review
and test. Now I've done that. I've two patches that I'd like to add
on top of your patches.
BTW, Tctl_max is inadvertently missing in the power and thermal data
sheet for family 0x11. It should be published with one of the next versions
of that document. For currently available family 0x11 CPUs it's 100degC.
I've tested your code with my patches on top of it on
various family 0x10 models and family 0x11 revB1.
Thanks.
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (7 preceding siblings ...)
2008-11-18 8:41 ` Andreas Herrmann
@ 2008-11-18 8:55 ` Andreas Herrmann
2008-11-18 9:06 ` Jean Delvare
` (9 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 8:55 UTC (permalink / raw)
To: lm-sensors
On Sun, Nov 09, 2008 at 09:54:39PM +0100, Jean Delvare wrote:
> Hi Rudolf,
>
> On Sun, 09 Nov 2008 20:56:54 +0100, Rudolf Marek wrote:
> > Thanks for the review. I'm attaching fixed version.
> >
> > Following patch adds warning about wrong CPU temperature readouts on all fam f
> > rev f revision of CPUs.
> >
> > Used switch statement, more code changes follows.
> >
> > Signed-off-by: Rudolf Marek <r.marek@assembler.cz>
>
> I've applied it, thanks. Note that I added a "break" at the end of the
> case 0xf, otherwise it's an invitation to get things wrong in the next
> patch...
Hi Jean,
your git tree is at git://jdelvare.pck.nerim.net/jdelvare-2.6, correct?
At least that's what you have used in your pull requests.
Did you apply Rudolf's patch to this git tree?
I just want to know against which tree I need to create patches when adapting
k8temp.
Thanks,
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (8 preceding siblings ...)
2008-11-18 8:55 ` Andreas Herrmann
@ 2008-11-18 9:06 ` Jean Delvare
2008-11-18 9:25 ` Andreas Herrmann
` (8 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-18 9:06 UTC (permalink / raw)
To: lm-sensors
Hi Andreas,
On Tue, 18 Nov 2008 09:55:29 +0100, Andreas Herrmann wrote:
> your git tree is at git://jdelvare.pck.nerim.net/jdelvare-2.6, correct?
> At least that's what you have used in your pull requests.
No this is incorrect. This git tree is a temporary thing I use to let
Linus pull my patches, but the actual work is done in a quilt tree at:
http://khali.linux-fr.org/devel/linux-2.6/jdelvare-hwmon/
That being said, please keep in mind that this is as unofficial as you
can get, given that I am not the maintainer of the hwmon subsystem.
> Did you apply Rudolf's patch to this git tree?
>
> I just want to know against which tree I need to create patches when adapting
> k8temp.
For now I applied:
http://khali.linux-fr.org/devel/linux-2.6/jdelvare-hwmon/hwmon-k8temp-01-warn-about-errata.patch
But everyone involved seems to agree that it isn't sufficient and we
should prevent the driver from binding to broken devices if possible,
using either a revision-based blacklist, or a temperature value-based
heuristic. I didn't yet have the time to work on this though, nor did
anyone else apparently.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (9 preceding siblings ...)
2008-11-18 9:06 ` Jean Delvare
@ 2008-11-18 9:25 ` Andreas Herrmann
2008-11-18 10:10 ` Andreas Herrmann
` (7 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 9:25 UTC (permalink / raw)
To: lm-sensors
On Mon, Nov 10, 2008 at 10:38:06AM +0100, Jean Delvare wrote:
> On Sun, 09 Nov 2008 18:27:53 -0700, Jordan Crouse wrote:
> > Jean Delvare wrote:
> > > For what it's worth, Jordan Crouse seems to think that blacklisting on
> > > a per-revision basis may still work.
> >
> > I think it can. A much larger sample would probably need to be taken to
> > be completely sure - but I hope that we'll find that the problem is
> > deterministic enough for a blacklist. I think we would agree that a
> > blacklist would be the more user friendly solution.
>
> OK, but then we should probably extend Rudolf's patch to ask users
> potentially affected by the errata to report to us. The report should
> include the CPUID information (e.g. contents of /proc/cpuinfo), the
> output of sensors (or the raw temperature values from sysfs), and
> whether or not the user thinks the temperature values are correct.
How do you find out whether your system is affected by this erratum?
The erratum is: due to inaccuracy of reported temperature (doesn't
meet a certain accuracy threshold) triggering of HTC/STC feature is
inaccurate.
I am just curious how you'd like to determine the accuracy of the
thermal sensor ... As an example, if the sensor reports 70 degC when
the true temperature is 65 degC -- is it worth to blacklist it?
Jordan, do you know more details about the deviation of the reported
temparature sensor values from the real ones?
I'd prefer not to blacklist but to keep the warning about potential
inaccurate temperature values as introduced by Rudolf. I use k8temp on
my private machines -- Athlon X2 and a Turion X2 (both are revF CPUs
and thus affected by erratum 141). I admit, this is more or less a
gimmick but I would miss it (if blacklisted).
Regards,
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (10 preceding siblings ...)
2008-11-18 9:25 ` Andreas Herrmann
@ 2008-11-18 10:10 ` Andreas Herrmann
2008-11-18 10:40 ` Jean Delvare
` (6 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 10:10 UTC (permalink / raw)
To: lm-sensors
On Sun, Nov 09, 2008 at 08:56:54PM +0100, Rudolf Marek wrote:
> > If most but not all of these CPUs are affected, then it would make
> > sense to disable them by default but give the user a chance to still
> > enable them (using a module parameter.)
>
> I tried hard to get this info from AMD. All versions are affected, but some may
> be fixed/not tested.
> > If there are more revision F and later CPUs with working thermal
> > diodes, and working ones can be told from non-working ones based on the
> > exact revision, we could implement blacklisting and/or whitelisting
> > base on the revision.
>
> The errata is for all revs.
IMHO it is not possible to do blacklisting/whitelisting. CPU revision
guide for family 0xf says all models/steppgings of revG and revF CPUs
are affected by erratum 141.
> > Does anyone have an idea about the ratio of working/broken revision F
> > and later CPU models?
>
> Dont sorry.
No idea (yet). Will see, whether it's possible to get additional information
about this.
<snip>
> Andreas can you test please the second patch too? I'm especially curious if you
> see the high limit when running sensors command.
What do you mean by "if you see the high limit when running sensors
command". (What lm_sensors version should I use? Any additional patches?)
I've patched lm_sensors on some of my test machines. With
that version I get for a dual socket machine following values.
fam10temp-pci-00c3
Adapter: PCI adapter
Core Temp: +18 C
fam10temp-pci-00cb
Adapter: PCI adapter
Core Temp: +14 C
Of course the temperature limit is shown in sysfs:
(family 0x11:)
neon ~ # find /sys -name temp1*
/sys/devices/pci0000:00/0000:00:18.3/temp1_input
/sys/devices/pci0000:00/0000:00:18.3/temp1_max
neon ~ # find /sys -name temp1* | xargs cat
56500
100000
(family 0x10:)
brandon sys # find /sys -name temp1*
/sys/devices/pci0000:00/0000:00:18.3/temp1_input
/sys/devices/pci0000:00/0000:00:18.3/temp1_max
/sys/devices/pci0000:00/0000:00:19.3/temp1_input
/sys/devices/pci0000:00/0000:00:19.3/temp1_max
brandon sys # find /sys -name temp1* | xargs cat
19000
70000
16500
70000
(and another family 0x10:)
attila # find /sys -name temp1*
/sys/devices/pci0000:00/0000:00:18.3/temp1_input
/sys/devices/pci0000:00/0000:00:18.3/temp1_max
attila # find /sys -name temp1* | xargs cat
26125
70000
> Once second patch is in, I would like to rewrite the driver with the help of
> heuristic and put there more generic logic for "place/core" selectors and temp1_
> etc sysfs files creation logic.
BTW, what is the submission chain for hwmon -- especially k8temp?
Is it "patch -> you -> Jean -> Linus"?
Regards,
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (11 preceding siblings ...)
2008-11-18 10:10 ` Andreas Herrmann
@ 2008-11-18 10:40 ` Jean Delvare
2008-11-18 11:18 ` Andreas Herrmann
` (5 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-18 10:40 UTC (permalink / raw)
To: lm-sensors
Hi Andreas,
On Tue, 18 Nov 2008 10:25:01 +0100, Andreas Herrmann wrote:
> On Mon, Nov 10, 2008 at 10:38:06AM +0100, Jean Delvare wrote:
> > On Sun, 09 Nov 2008 18:27:53 -0700, Jordan Crouse wrote:
> > > Jean Delvare wrote:
> > > > For what it's worth, Jordan Crouse seems to think that blacklisting on
> > > > a per-revision basis may still work.
> > >
> > > I think it can. A much larger sample would probably need to be taken to
> > > be completely sure - but I hope that we'll find that the problem is
> > > deterministic enough for a blacklist. I think we would agree that a
> > > blacklist would be the more user friendly solution.
> >
> > OK, but then we should probably extend Rudolf's patch to ask users
> > potentially affected by the errata to report to us. The report should
> > include the CPUID information (e.g. contents of /proc/cpuinfo), the
> > output of sensors (or the raw temperature values from sysfs), and
> > whether or not the user thinks the temperature values are correct.
>
> How do you find out whether your system is affected by this erratum?
> The erratum is: due to inaccuracy of reported temperature (doesn't
> meet a certain accuracy threshold) triggering of HTC/STC feature is
> inaccurate.
>
> I am just curious how you'd like to determine the accuracy of the
> thermal sensor ... As an example, if the sensor reports 70 degC when
> the true temperature is 65 degC -- is it worth to blacklist it?
If we are 100% certain that this is the case then yes, I would
blacklist it.
The whole point of hardware monitoring is to report accurate values.
Additional software can be used to take actions based on temperature
values, for example fan speed regulation or CPU frequency changes. If
you can't trust the temperature readings then these operations become
dangerous.
That being said, I guess that the example above is essentially
theoretical? Most cases we've seen so far were not off by 5 degrees.
They were plain wrong, with reported temperatures being in the -20 to
+15 degrees C.
> Jordan, do you know more details about the deviation of the reported
> temparature sensor values from the real ones?
>
> I'd prefer not to blacklist but to keep the warning about potential
> inaccurate temperature values as introduced by Rudolf. I use k8temp on
> my private machines -- Athlon X2 and a Turion X2 (both are revF CPUs
> and thus affected by erratum 141). I admit, this is more or less a
> gimmick but I would miss it (if blacklisted).
Whatever we end up with, we will add a module parameter to let the user
force the driver binding, exactly for advanced users like you. My point
with the blacklist is that we should not report knowingly incorrect
values to the user _by default_, especially given that the k8temp
driver loads automatically. I think it is better to not report anything
by default rather than potentially wrong values. But I also agree that
we must provide a way to bypass the tests, if nothing else, because our
blacklist or heuristic may be incorrect.
Please keep in mind that most end users do not read the kernel logs
(thankfully - they really shouldn't have to.)
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (12 preceding siblings ...)
2008-11-18 10:40 ` Jean Delvare
@ 2008-11-18 11:18 ` Andreas Herrmann
2008-11-18 11:26 ` Andreas Herrmann
` (4 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 11:18 UTC (permalink / raw)
To: lm-sensors
On Tue, Nov 18, 2008 at 11:40:52AM +0100, Jean Delvare wrote:
> On Tue, 18 Nov 2008 10:25:01 +0100, Andreas Herrmann wrote:
> > I am just curious how you'd like to determine the accuracy of the
> > thermal sensor ... As an example, if the sensor reports 70 degC when
> > the true temperature is 65 degC -- is it worth to blacklist it?
>
> If we are 100% certain that this is the case then yes, I would
> blacklist it.
>
> The whole point of hardware monitoring is to report accurate values.
>
> Additional software can be used to take actions based on temperature
> values, for example fan speed regulation or CPU frequency changes. If
> you can't trust the temperature readings then these operations become
> dangerous.
Then we need to blacklist it. The suggested workaround for this erratum is
to use "temperature measurements from an analog thermal diode" for thermal
control ...
If the system should be designed not to rely on the thermal sensor
(due to accuracy issue) the OS shouldn't use it either.
> That being said, I guess that the example above is essentially
> theoretical? Most cases we've seen so far were not off by 5 degrees.
> They were plain wrong, with reported temperatures being in the -20 to
> +15 degrees C.
Was this observed with k8temp or other sensor chips?
> > Jordan, do you know more details about the deviation of the reported
> > temparature sensor values from the real ones?
> >
> > I'd prefer not to blacklist but to keep the warning about potential
> > inaccurate temperature values as introduced by Rudolf. I use k8temp on
> > my private machines -- Athlon X2 and a Turion X2 (both are revF CPUs
> > and thus affected by erratum 141). I admit, this is more or less a
> > gimmick but I would miss it (if blacklisted).
>
> Whatever we end up with, we will add a module parameter to let the user
> force the driver binding, exactly for advanced users like you.
Ok, a module parameter should do it.
> My point with the blacklist is that we should not report knowingly
> incorrect values to the user _by default_, especially given that the
> k8temp driver loads automatically. I think it is better to not
> report anything by default rather than potentially wrong values. But
> I also agree that we must provide a way to bypass the tests, if
> nothing else, because our blacklist or heuristic may be incorrect.
Regards,
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (13 preceding siblings ...)
2008-11-18 11:18 ` Andreas Herrmann
@ 2008-11-18 11:26 ` Andreas Herrmann
2008-11-18 12:33 ` Jean Delvare
` (3 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Andreas Herrmann @ 2008-11-18 11:26 UTC (permalink / raw)
To: lm-sensors
Hi Rudolf,
On Thu, Oct 02, 2008 at 12:09:09AM +0200, Rudolf Marek wrote:
> Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
> =================================> --- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.855284456 +0200
> +++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-09-28 11:13:42.396117790 +0200
> @@ -155,6 +158,18 @@
> goto exit;
> }
>
> + /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
> + pci_read_config_dword(pdev, REG_CPUID, &cpuid);
I am just curious whether you have tested this on a CPU revision prior revF.
Because "BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
Opteron Processors" suggests that this register exists for those older CPUs:
"CPUID Fn[8000_0001,0000_0001]_EAX Family, Model, Feature Identifiers
This register provides identical information to Function 3, Offset FCh."
(I don't have access to such a CPU model at the moment and thus can't
double-check this now.)
> +
> + data->fam = (cpuid & 0x00000f00) >> 8;
> + data->fam += (cpuid & 0x00f00000) >> 20;
> +
> + switch (data->fam) {
> + case 0xf:
> + dev_warn(&pdev->dev, "Temperature readouts might be wrong"
> + " - check errata #141\n");
> + }
> +
> pci_read_config_byte(pdev, REG_TEMP, &scfg);
> scfg &= ~(SEL_PLACE | SEL_CORE); /* Select sensor 0, core0 */
> pci_write_config_byte(pdev, REG_TEMP, scfg);
Regards,
Andreas
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (14 preceding siblings ...)
2008-11-18 11:26 ` Andreas Herrmann
@ 2008-11-18 12:33 ` Jean Delvare
2008-11-18 12:57 ` Jean Delvare
` (2 subsequent siblings)
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-18 12:33 UTC (permalink / raw)
To: lm-sensors
On Tue, 18 Nov 2008 12:18:45 +0100, Andreas Herrmann wrote:
> On Tue, Nov 18, 2008 at 11:40:52AM +0100, Jean Delvare wrote:
> > The whole point of hardware monitoring is to report accurate values.
> >
> > Additional software can be used to take actions based on temperature
> > values, for example fan speed regulation or CPU frequency changes. If
> > you can't trust the temperature readings then these operations become
> > dangerous.
>
> Then we need to blacklist it. The suggested workaround for this erratum is
> to use "temperature measurements from an analog thermal diode" for thermal
> control ...
> If the system should be designed not to rely on the thermal sensor
> (due to accuracy issue) the OS shouldn't use it either.
My point exactly.
> > That being said, I guess that the example above is essentially
> > theoretical? Most cases we've seen so far were not off by 5 degrees.
> > They were plain wrong, with reported temperatures being in the -20 to
> > +15 degrees C.
>
> Was this observed with k8temp or other sensor chips?
k8temp. See my list of obviously incorrect values at:
http://lists.lm-sensors.org/pipermail/lm-sensors/2008-October/024584.html
After seeing these values, I admit I had a smile when seeing how AMD
had formulated the errata (sensor doesn't meet a certain accuracy
threshold...)
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (15 preceding siblings ...)
2008-11-18 12:33 ` Jean Delvare
@ 2008-11-18 12:57 ` Jean Delvare
2008-11-18 17:46 ` Jordan Crouse
2008-11-18 18:15 ` Jean Delvare
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-18 12:57 UTC (permalink / raw)
To: lm-sensors
On Tue, 18 Nov 2008 12:26:12 +0100, Andreas Herrmann wrote:
> Hi Rudolf,
>
> On Thu, Oct 02, 2008 at 12:09:09AM +0200, Rudolf Marek wrote:
> > Index: linux-2.6.27-rc7/drivers/hwmon/k8temp.c
> > =================================> > --- linux-2.6.27-rc7.orig/drivers/hwmon/k8temp.c 2008-09-28 11:01:45.855284456 +0200
> > +++ linux-2.6.27-rc7/drivers/hwmon/k8temp.c 2008-09-28 11:13:42.396117790 +0200
> > @@ -155,6 +158,18 @@
> > goto exit;
> > }
> >
> > + /* get real PCI based cpuid, prior revF of fam 0Fh, this reg is 0 */
> > + pci_read_config_dword(pdev, REG_CPUID, &cpuid);
>
> I am just curious whether you have tested this on a CPU revision prior revF.
> Because "BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
> Opteron Processors" suggests that this register exists for those older CPUs:
>
> "CPUID Fn[8000_0001,0000_0001]_EAX Family, Model, Feature Identifiers
>
> This register provides identical information to Function 3, Offset FCh."
>
> (I don't have access to such a CPU model at the moment and thus can't
> double-check this now.)
For what it's worth, I have the following K8 CPU models here: 15/31/0
and 15/28/0, and both of them have value 0x00 in Function 3, Offset FCh.
That doesn't mean that all pre-rev.F CPUs do though... and it would
probably be trivial to handle both cases in the code if we have any
doubt.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (16 preceding siblings ...)
2008-11-18 12:57 ` Jean Delvare
@ 2008-11-18 17:46 ` Jordan Crouse
2008-11-18 18:15 ` Jean Delvare
18 siblings, 0 replies; 20+ messages in thread
From: Jordan Crouse @ 2008-11-18 17:46 UTC (permalink / raw)
To: lm-sensors
Andreas Herrmann wrote:
> On Mon, Nov 10, 2008 at 10:38:06AM +0100, Jean Delvare wrote:
>> On Sun, 09 Nov 2008 18:27:53 -0700, Jordan Crouse wrote:
>>> Jean Delvare wrote:
>>>> For what it's worth, Jordan Crouse seems to think that blacklisting on
>>>> a per-revision basis may still work.
>>> I think it can. A much larger sample would probably need to be taken to
>>> be completely sure - but I hope that we'll find that the problem is
>>> deterministic enough for a blacklist. I think we would agree that a
>>> blacklist would be the more user friendly solution.
>> OK, but then we should probably extend Rudolf's patch to ask users
>> potentially affected by the errata to report to us. The report should
>> include the CPUID information (e.g. contents of /proc/cpuinfo), the
>> output of sensors (or the raw temperature values from sysfs), and
>> whether or not the user thinks the temperature values are correct.
>
> How do you find out whether your system is affected by this erratum?
> The erratum is: due to inaccuracy of reported temperature (doesn't
> meet a certain accuracy threshold) triggering of HTC/STC feature is
> inaccurate.
>
> I am just curious how you'd like to determine the accuracy of the
> thermal sensor ... As an example, if the sensor reports 70 degC when
> the true temperature is 65 degC -- is it worth to blacklist it?
>
> Jordan, do you know more details about the deviation of the reported
> temparature sensor values from the real ones?
Nothing scientific. The AMD errata team might know more about it. I
can identify the obviously broken sensors - my athlon X2 system for
example tells me the cores are 7C and 3C respectfully, but I don't know
if you could tell the difference between a well working sensor and a
marginally working sensor, especially with differing work conditions.
The best you could do is to figure out how cold the core could possibly
run, and then omit anything under that.
You might do a better job if you could compare the core temperature
against the system monitor - they should only differ by a few degrees (I
think there is some math about how much the external and internal diodes
should differ). That said, thats not the sort of math you could do in
the kernel driver, you would need the user land to find the other sensor
and do the calculations.
> I'd prefer not to blacklist but to keep the warning about potential
> inaccurate temperature values as introduced by Rudolf. I use k8temp on
> my private machines -- Athlon X2 and a Turion X2 (both are revF CPUs
> and thus affected by erratum 141). I admit, this is more or less a
> gimmick but I would miss it (if blacklisted).
Well, it is a gimmick, and thats important to keep in mind. No offense
at all to Rudolf, this is a very nice driver, but in the end the value
is not critical to system performance, especially since it cannot
trigger the HTC/STC. Nearly every system I have ever seen relies on an
external sensor. Thats why I was voting for the blacklist, since it
would omit the obviously flawed processors, and it would keep the users
from getting too worked up. I would rather have them concentrate their
attention on the external sensors rather then exert a lot of effort to
read the K8 temps, which in the end are "just for fun".
Jordan
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [lm-sensors] [PATCH 1/2] k8temp warn about errata
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
` (17 preceding siblings ...)
2008-11-18 17:46 ` Jordan Crouse
@ 2008-11-18 18:15 ` Jean Delvare
18 siblings, 0 replies; 20+ messages in thread
From: Jean Delvare @ 2008-11-18 18:15 UTC (permalink / raw)
To: lm-sensors
On Tue, 18 Nov 2008 10:46:03 -0700, Jordan Crouse wrote:
> Nothing scientific. The AMD errata team might know more about it. I
> can identify the obviously broken sensors - my athlon X2 system for
> example tells me the cores are 7C and 3C respectfully, but I don't know
> if you could tell the difference between a well working sensor and a
> marginally working sensor, especially with differing work conditions.
> The best you could do is to figure out how cold the core could possibly
> run, and then omit anything under that.
>
> You might do a better job if you could compare the core temperature
> against the system monitor - they should only differ by a few degrees (I
> think there is some math about how much the external and internal diodes
> should differ). That said, thats not the sort of math you could do in
> the kernel driver, you would need the user land to find the other sensor
> and do the calculations.
This is the problem: the k8temp driver doesn't have access to other
sensors for comparison purposes. Even if it had, it wouldn't know which
sensor corresponds to the CPU (this is motherboard-specific.) So while
this can be used by the user to determine whether his CPU sensors are
working or not, this cannot be used as a way to automatically discard
CPUs with broken sensors.
--
Jean Delvare
_______________________________________________
lm-sensors mailing list
lm-sensors@lm-sensors.org
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2008-11-18 18:15 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-01 22:09 [lm-sensors] [PATCH 1/2] k8temp warn about errata Rudolf Marek
2008-10-25 12:43 ` Jean Delvare
2008-10-27 1:18 ` Jean-Marc Spaggiari
2008-11-09 19:56 ` Rudolf Marek
2008-11-09 20:47 ` Jean Delvare
2008-11-09 20:54 ` Jean Delvare
2008-11-10 1:27 ` Jordan Crouse
2008-11-10 9:38 ` Jean Delvare
2008-11-18 8:41 ` Andreas Herrmann
2008-11-18 8:55 ` Andreas Herrmann
2008-11-18 9:06 ` Jean Delvare
2008-11-18 9:25 ` Andreas Herrmann
2008-11-18 10:10 ` Andreas Herrmann
2008-11-18 10:40 ` Jean Delvare
2008-11-18 11:18 ` Andreas Herrmann
2008-11-18 11:26 ` Andreas Herrmann
2008-11-18 12:33 ` Jean Delvare
2008-11-18 12:57 ` Jean Delvare
2008-11-18 17:46 ` Jordan Crouse
2008-11-18 18:15 ` Jean Delvare
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.