public inbox for linux-kernel@vger.kernel.org
* MCE/Package power limit notification
@ 2011-11-23 15:09 Udo Steinberg
  2011-11-28 22:46 ` Luck, Tony
  0 siblings, 1 reply; 6+ messages in thread
From: Udo Steinberg @ 2011-11-23 15:09 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Tony Luck


Hi,

After half an hour of compiling code on an Intel SNB machine, at which time
the machine was more or less permanently running with turbo boost active,
I've gotten the following in my dmesg with Linux-3.1.0:

CPU2: Package power limit notification (total events = 1)
CPU3: Package power limit notification (total events = 1)
CPU1: Package power limit notification (total events = 1)
CPU0: Package power limit notification (total events = 1)
CPU2: Package power limit normal
CPU3: Package power limit normal
CPU1: Package power limit normal
CPU0: Package power limit normal
[Hardware Error]: Machine check events logged
CPU1: Package power limit notification (total events = 655)
CPU2: Package power limit notification (total events = 655)
CPU3: Package power limit notification (total events = 655)
CPU0: Package power limit notification (total events = 655)
CPU3: Package power limit normal
CPU2: Package power limit normal
CPU1: Package power limit normal
CPU0: Package power limit normal
[Hardware Error]: Machine check events logged

Below is the output from mcelog. Is there anything that can be done about it?

Cheers,

	- Udo

mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 0
CPU 3 THERMAL EVENT TSC 179fecf42d7 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 1
CPU 1 THERMAL EVENT TSC 179fecf5d95 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 2
CPU 0 THERMAL EVENT TSC 179fecf72fb 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 3
CPU 2 THERMAL EVENT TSC 179fecf88a9 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c0000000880c0c00 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 4
CPU 3 THERMAL EVENT TSC 179fef2e964 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 5
CPU 1 THERMAL EVENT TSC 179fef2fd56 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 6
CPU 0 THERMAL EVENT TSC 179fef30d22 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 7
CPU 2 THERMAL EVENT TSC 179fef311bd 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c0000000880c0800 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 8
CPU 2 THERMAL EVENT TSC 31a70403488 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 9
CPU 3 THERMAL EVENT TSC 31a7040468b 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 10
CPU 0 THERMAL EVENT TSC 31a704059b1 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 11
CPU 1 THERMAL EVENT TSC 31a70407082 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c000000088080c02 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 12
CPU 2 THERMAL EVENT TSC 31a705416f8 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 2 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 13
CPU 1 THERMAL EVENT TSC 31a70542907 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 1 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 1 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 14
CPU 0 THERMAL EVENT TSC 31a70543945 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 0 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 0 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
mcelog: Unsupported new Family 6 Model 2a CPU: only decoding architectural errors
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
MCE 15
CPU 3 THERMAL EVENT TSC 31a70543cab 
TIME 1322060688 Wed Nov 23 16:04:48 2011
Processor 3 below trip temperature. Throttling disabled
STATUS c000000088080802 MCGSTATUS 0
MCGCAP c07 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 42
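
The STATUS words in the mcelog records above can be unpacked by hand. The
sketch below is only an illustration. It assumes that the kernel stores the raw
thermal status MSR value in STATUS and tags the event type in bits 63:62 (as
therm_throt.c appears to do in kernels of this era), and that bits 0, 1, 10, 11
and 22:16 follow the thermal status layout described in the Intel SDM. The
function and field names are made up for this example.

```python
# Sketch only: the bit layout is an assumption taken from the Intel SDM and
# the kernel's therm_throt.c; verify it against your kernel and CPU.

# Assumed meaning of bits 63:62, which the kernel ORs into the logged value.
EVENT_KIND = {
    0: "core throttled",
    1: "core power limit",
    2: "package throttled",
    3: "package power limit",
}


def decode_therm_status(status: int) -> dict:
    """Split an mcelog thermal STATUS word into the assumed fields."""
    return {
        "kind": EVENT_KIND[(status >> 62) & 0x3],
        "throttling_now": bool(status & (1 << 0)),       # thermal status
        "throttling_logged": bool(status & (1 << 1)),    # sticky log bit
        "power_limit_now": bool(status & (1 << 10)),     # power limit status
        "power_limit_logged": bool(status & (1 << 11)),  # sticky log bit
        "degrees_below_tjmax": (status >> 16) & 0x7F,    # digital readout
    }


# The distinct STATUS values that appear in the mcelog output above.
for raw in ("c0000000880c0c00", "c0000000880c0800",
            "c000000088080c02", "c000000088080802"):
    print(raw, decode_therm_status(int(raw, 16)))
```

Under those assumptions every record is tagged as a package power limit event
and has the throttling status bit (bit 0) clear, which is consistent with
mcelog printing "below trip temperature" for each of them.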


* RE: MCE/Package power limit notification
  2011-11-23 15:09 MCE/Package power limit notification Udo Steinberg
@ 2011-11-28 22:46 ` Luck, Tony
  2011-11-28 22:50   ` Yu, Fenghua
  0 siblings, 1 reply; 6+ messages in thread
From: Luck, Tony @ 2011-11-28 22:46 UTC (permalink / raw)
  To: Udo Steinberg, Linux Kernel Mailing List; +Cc: Yu, Fenghua

> After half an hour of compiling code on an Intel SNB machine, at which time
> the machine was more or less permanently running with turbo boost active,
> I've gotten the following in my dmesg with Linux-3.1.0:
>
> CPU2: Package power limit notification (total events = 1)
> CPU3: Package power limit notification (total events = 1)
> CPU1: Package power limit notification (total events = 1)
> CPU0: Package power limit notification (total events = 1)
> ...

Udo,

Fenghua is looking at options to tone down these messages - reporting
as machine checks is unduly scary.

-Tony


* RE: MCE/Package power limit notification
  2011-11-28 22:46 ` Luck, Tony
@ 2011-11-28 22:50   ` Yu, Fenghua
  2011-11-29 21:24     ` Udo Steinberg
  0 siblings, 1 reply; 6+ messages in thread
From: Yu, Fenghua @ 2011-11-28 22:50 UTC (permalink / raw)
  To: Luck, Tony, Udo Steinberg, Linux Kernel Mailing List

> -----Original Message-----
> From: Luck, Tony
> Sent: Monday, November 28, 2011 2:46 PM
> To: Udo Steinberg; Linux Kernel Mailing List
> Cc: Yu, Fenghua
> Subject: RE: MCE/Package power limit notification
> 
> > After half an hour of compiling code on an Intel SNB machine, at
> which time
> > the machine was more or less permanently running with turbo boost
> active,
> > I've gotten the following in my dmesg with Linux-3.1.0:
> >
> > CPU2: Package power limit notification (total events = 1)
> > CPU3: Package power limit notification (total events = 1)
> > CPU1: Package power limit notification (total events = 1)
> > CPU0: Package power limit notification (total events = 1)
> > ...
> 
> Udo,
> 
> Fenghua is looking at options to tone down these messages - reporting
> as machine checks is unduly scary.

Hi, Udo,

I sent out a patch to remove the mcelog info. Could you try it and see if it works for you?
https://lkml.org/lkml/2011/11/14/239

Thanks.

-Fenghua


* Re: MCE/Package power limit notification
  2011-11-28 22:50   ` Yu, Fenghua
@ 2011-11-29 21:24     ` Udo Steinberg
  2011-11-29 21:43       ` Fenghua Yu
  0 siblings, 1 reply; 6+ messages in thread
From: Udo Steinberg @ 2011-11-29 21:24 UTC (permalink / raw)
  To: Yu, Fenghua; +Cc: Luck, Tony, Linux Kernel Mailing List


On Mon, 28 Nov 2011 14:50:47 -0800 Yu, Fenghua (YF) wrote:

YF> I sent out a patch to remove the mcelog info. Could you try it and see if it works for you?
YF> https://lkml.org/lkml/2011/11/14/239
YF> 
YF> Thanks.
YF> 
YF> -Fenghua

Hi Fenghua,

Thanks for the patch. It works and eliminates the MCE warnings. What exactly
are the BIOS issues mentioned in the patch description? Is BIOS programming
some MSRs the wrong way?

Cheers,

	- Udo


* Re: MCE/Package power limit notification
  2011-11-29 21:24     ` Udo Steinberg
@ 2011-11-29 21:43       ` Fenghua Yu
  2011-11-29 22:19         ` Udo Steinberg
  0 siblings, 1 reply; 6+ messages in thread
From: Fenghua Yu @ 2011-11-29 21:43 UTC (permalink / raw)
  To: Udo Steinberg; +Cc: Yu, Fenghua, Luck, Tony, Linux Kernel Mailing List

On Tue, Nov 29, 2011 at 01:24:24PM -0800, Udo Steinberg wrote:
> On Mon, 28 Nov 2011 14:50:47 -0800 Yu, Fenghua (YF) wrote:
> 
> YF> I sent out a patch to remove the mcelog info. Could you try it and see if it works for you?
> YF> https://lkml.org/lkml/2011/11/14/239
> YF> 
> YF> Thanks.
> YF> 
> YF> -Fenghua
> 
> Hi Fenghua,
> 
> Thanks for the patch. It works and eliminates the MCE warnings. What exactly
> are the BIOS issues mentioned in the patch description? Is BIOS programming
> some MSRs the wrong way?

Hi, Udo,

Could you please check counters in /sys/devices/system/cpu/cpu#/thermal_throttle
and see which counters report the thermal events?
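
A minimal sketch for collecting those counters in one go, assuming the sysfs
layout named above (one thermal_throttle directory per CPU containing files
whose names end in _count). The function name and the root parameter are
illustrative and not part of any kernel interface.

```python
from pathlib import Path


def read_throttle_counters(root="/sys/devices/system/cpu"):
    """Return {relative_path: count} for every thermal_throttle counter found."""
    base = Path(root)
    counters = {}
    for path in sorted(base.glob("cpu[0-9]*/thermal_throttle/*_count")):
        try:
            counters[str(path.relative_to(base))] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that are unreadable or not plain integers.
            continue
    return counters


if __name__ == "__main__":
    for name, value in read_throttle_counters().items():
        print(f"{name} : {value}")
```

Run on the affected machine, this prints one "path : count" line per counter,
which makes it easy to see whether the core or the package counters are the
ones increasing.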

The idea of the patch is to stop logging these events through mcelog and to
count them in the respective counters instead. That way the events are no
longer reported as scary hardware errors but are still captured in the counters.

I think the BIOS/firmware sets up the power limit or thermal throttling
incorrectly and therefore triggers the events incorrectly. You could try an
updated BIOS to see whether the events go away.

Thanks.

-Fenghua


* Re: MCE/Package power limit notification
  2011-11-29 21:43       ` Fenghua Yu
@ 2011-11-29 22:19         ` Udo Steinberg
  0 siblings, 0 replies; 6+ messages in thread
From: Udo Steinberg @ 2011-11-29 22:19 UTC (permalink / raw)
  To: Fenghua Yu; +Cc: Luck, Tony, Linux Kernel Mailing List


Hi Fenghua,

On Tue, 29 Nov 2011 13:43:43 -0800 Fenghua Yu (FY) wrote:

FY> Could you please check counters in /sys/devices/system/cpu/cpu#/thermal_throttle
FY> and see which counters report the thermal events?

cpu0/thermal_throttle/core_power_limit_count : 0
cpu0/thermal_throttle/core_throttle_count : 102536
cpu0/thermal_throttle/package_power_limit_count : 384
cpu0/thermal_throttle/package_throttle_count : 183429
cpu1/thermal_throttle/core_power_limit_count : 0
cpu1/thermal_throttle/core_throttle_count : 102536
cpu1/thermal_throttle/package_power_limit_count : 384
cpu1/thermal_throttle/package_throttle_count : 183429
cpu2/thermal_throttle/core_power_limit_count : 0
cpu2/thermal_throttle/core_throttle_count : 104859
cpu2/thermal_throttle/package_power_limit_count : 384
cpu2/thermal_throttle/package_throttle_count : 183429
cpu3/thermal_throttle/core_power_limit_count : 0
cpu3/thermal_throttle/core_throttle_count : 104859
cpu3/thermal_throttle/package_power_limit_count : 384
cpu3/thermal_throttle/package_throttle_count : 183429

FY> The thought of the patch is to remove the errors in mcelog and report the errors
FY> in respective counters. Therefore, the events are not reported as scary hardware
FY> issues but are still captured in counters.

I'm still seeing the following messages:

CPU2: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU3: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU1: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU0: Package temperature above threshold, cpu clock throttled (total events = 146147)
CPU0: Package temperature/speed normal
CPU2: Package temperature/speed normal
CPU1: Package temperature/speed normal
CPU3: Package temperature/speed normal
CPU3: Core temperature above threshold, cpu clock throttled (total events = 81740)
CPU2: Core temperature above threshold, cpu clock throttled (total events = 81740)
CPU2: Core temperature/speed normal
CPU3: Core temperature/speed normal
[Hardware Error]: Machine check events logged

FY> I think BIOS/firmware sets up power limit or thermal throttle incorrectly and
FY> triggers events incorrectly. You may try updated BIOS to see if the events go
FY> away.

I'm running the latest BIOS on my Lenovo Thinkpad X220. Someone should talk
to Lenovo about getting this fixed. My machine reports:

DMI: LENOVO 4290W4H/4290W4H, BIOS 8DET54WW (1.24 ) 10/18/2011

Cheers,

	- Udo
