* [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
@ 2011-11-14 21:11 Fenghua Yu
2011-12-05 13:18 ` Borislav Petkov
0 siblings, 1 reply; 9+ messages in thread
From: Fenghua Yu @ 2011-11-14 21:11 UTC (permalink / raw)
To: H Peter Anvin, Thomas Gleixner, Ingo Molnar, Andrew Morton,
Tony Luck, Len Brown
Cc: linux-kernel, x86, Fenghua Yu
From: Fenghua Yu <fenghua.yu@intel.com>
Because of BIOS issues, some platforms report mce errors after power limit and
thermal throttle events. Customers are concerned about the mce errors. Although
BIOS need to fix the issues eventually, the events should not be viewed as mce
errors in the first place.
This patch doesn't log power limit and package level thermal throttle events
in mce log. When the events happen, only count them in respective counters in
sysfs.
For legacy reason, core level thermal throttle is still logged in mce log and
counted in counter in sysfs.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
---
arch/x86/kernel/cpu/mcheck/therm_throt.c | 29 +++++++----------------------
1 files changed, 7 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 787e06c..ce04b58 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -323,17 +323,6 @@ device_initcall(thermal_throttle_init_device);
#endif /* CONFIG_SYSFS */
-/*
- * Set up the most two significant bit to notify mce log that this thermal
- * event type.
- * This is a temp solution. May be changed in the future with mce log
- * infrasture.
- */
-#define CORE_THROTTLED (0)
-#define CORE_POWER_LIMIT ((__u64)1 << 62)
-#define PACKAGE_THROTTLED ((__u64)2 << 62)
-#define PACKAGE_POWER_LIMIT ((__u64)3 << 62)
-
static void notify_thresholds(__u64 msr_val)
{
/* check whether the interrupt handler is defined;
@@ -363,27 +352,23 @@ static void intel_thermal_interrupt(void)
if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
THERMAL_THROTTLING_EVENT,
CORE_LEVEL) != 0)
- mce_log_therm_throt_event(CORE_THROTTLED | msr_val);
+ mce_log_therm_throt_event(msr_val);
if (this_cpu_has(X86_FEATURE_PLN))
- if (therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
+ therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
POWER_LIMIT_EVENT,
- CORE_LEVEL) != 0)
- mce_log_therm_throt_event(CORE_POWER_LIMIT | msr_val);
+ CORE_LEVEL);
if (this_cpu_has(X86_FEATURE_PTS)) {
rdmsrl(MSR_IA32_PACKAGE_THERM_STATUS, msr_val);
- if (therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT,
+ therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT,
THERMAL_THROTTLING_EVENT,
- PACKAGE_LEVEL) != 0)
- mce_log_therm_throt_event(PACKAGE_THROTTLED | msr_val);
+ PACKAGE_LEVEL);
if (this_cpu_has(X86_FEATURE_PLN))
- if (therm_throt_process(msr_val &
+ therm_throt_process(msr_val &
PACKAGE_THERM_STATUS_POWER_LIMIT,
POWER_LIMIT_EVENT,
- PACKAGE_LEVEL) != 0)
- mce_log_therm_throt_event(PACKAGE_POWER_LIMIT
- | msr_val);
+ PACKAGE_LEVEL);
}
}
--
1.6.0.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-11-14 21:11 [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log Fenghua Yu
@ 2011-12-05 13:18 ` Borislav Petkov
2011-12-05 22:18 ` Luck, Tony
0 siblings, 1 reply; 9+ messages in thread
From: Borislav Petkov @ 2011-12-05 13:18 UTC (permalink / raw)
To: Tony Luck
Cc: Fenghua Yu, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Len Brown, linux-kernel, x86
This looks like a sane improvement, Tony I'm assuming you're handling this?
On Mon, Nov 14, 2011 at 01:11:22PM -0800, Fenghua Yu wrote:
> From: Fenghua Yu <fenghua.yu@intel.com>
>
> Because of BIOS issues, some platforms report mce errors after power limit and
> thermal throttle events. Customers are concerned about the mce errors. Although
> BIOS need to fix the issues eventually, the events should not be viewed as mce
> errors in the first place.
>
> This patch doesn't log power limit and package level thermal throttle events
> in mce log. When the events happen, only count them in respective counters in
> sysfs.
>
> For legacy reason, core level thermal throttle is still logged in mce log and
> counted in counter in sysfs.
>
> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
> ---
> arch/x86/kernel/cpu/mcheck/therm_throt.c | 29 +++++++----------------------
> 1 files changed, 7 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> index 787e06c..ce04b58 100644
> --- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
> +++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
> @@ -323,17 +323,6 @@ device_initcall(thermal_throttle_init_device);
>
> #endif /* CONFIG_SYSFS */
>
> -/*
> - * Set up the most two significant bit to notify mce log that this thermal
> - * event type.
> - * This is a temp solution. May be changed in the future with mce log
> - * infrasture.
> - */
> -#define CORE_THROTTLED (0)
> -#define CORE_POWER_LIMIT ((__u64)1 << 62)
> -#define PACKAGE_THROTTLED ((__u64)2 << 62)
> -#define PACKAGE_POWER_LIMIT ((__u64)3 << 62)
> -
> static void notify_thresholds(__u64 msr_val)
> {
> /* check whether the interrupt handler is defined;
> @@ -363,27 +352,23 @@ static void intel_thermal_interrupt(void)
> if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> THERMAL_THROTTLING_EVENT,
> CORE_LEVEL) != 0)
> - mce_log_therm_throt_event(CORE_THROTTLED | msr_val);
> + mce_log_therm_throt_event(msr_val);
>
> if (this_cpu_has(X86_FEATURE_PLN))
> - if (therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
> + therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
> POWER_LIMIT_EVENT,
> - CORE_LEVEL) != 0)
> - mce_log_therm_throt_event(CORE_POWER_LIMIT | msr_val);
> + CORE_LEVEL);
>
> if (this_cpu_has(X86_FEATURE_PTS)) {
> rdmsrl(MSR_IA32_PACKAGE_THERM_STATUS, msr_val);
> - if (therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT,
> + therm_throt_process(msr_val & PACKAGE_THERM_STATUS_PROCHOT,
> THERMAL_THROTTLING_EVENT,
> - PACKAGE_LEVEL) != 0)
> - mce_log_therm_throt_event(PACKAGE_THROTTLED | msr_val);
> + PACKAGE_LEVEL);
> if (this_cpu_has(X86_FEATURE_PLN))
> - if (therm_throt_process(msr_val &
> + therm_throt_process(msr_val &
> PACKAGE_THERM_STATUS_POWER_LIMIT,
> POWER_LIMIT_EVENT,
> - PACKAGE_LEVEL) != 0)
> - mce_log_therm_throt_event(PACKAGE_POWER_LIMIT
> - | msr_val);
> + PACKAGE_LEVEL);
> }
> }
>
> --
> 1.6.0.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-05 13:18 ` Borislav Petkov
@ 2011-12-05 22:18 ` Luck, Tony
2011-12-06 15:31 ` Borislav Petkov
0 siblings, 1 reply; 9+ messages in thread
From: Luck, Tony @ 2011-12-05 22:18 UTC (permalink / raw)
To: Borislav Petkov
Cc: Yu, Fenghua, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
This looks like a sane improvement, Tony I'm assuming you're handling this?
I can push this - but it would be nice to have some other comments from
outside of Intel (we've been beaten up written-by Intel, acked-by: Intel
too many times).
Can I count that as an "Acked-by:"?
-Tony
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-05 22:18 ` Luck, Tony
@ 2011-12-06 15:31 ` Borislav Petkov
2011-12-06 17:48 ` Yu, Fenghua
0 siblings, 1 reply; 9+ messages in thread
From: Borislav Petkov @ 2011-12-06 15:31 UTC (permalink / raw)
To: Luck, Tony
Cc: Yu, Fenghua, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
On Mon, Dec 05, 2011 at 02:18:26PM -0800, Luck, Tony wrote:
> This looks like a sane improvement, Tony I'm assuming you're handling this?
>
> I can push this - but it would be nice to have some other comments from
> outside of Intel (we've been beaten up written-by Intel, acked-by: Intel
> too many times).
>
> Can I count that as an "Acked-by:"?
Well, from looking at Udo's feedback here
http://marc.info/?l=linux-kernel&m=132206108724885
this patch looks like it helps in his case partially. I'm still not sure
about all the printks in therm_throt_process(), do you guys feel the
need to notify the user that the CPU clock was throttled/power limits
were applied, or rather have something more subtle...?
Hmmm.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-06 15:31 ` Borislav Petkov
@ 2011-12-06 17:48 ` Yu, Fenghua
2011-12-06 19:06 ` Borislav Petkov
0 siblings, 1 reply; 9+ messages in thread
From: Yu, Fenghua @ 2011-12-06 17:48 UTC (permalink / raw)
To: Borislav Petkov, Luck, Tony
Cc: H Peter Anvin, Thomas Gleixner, Ingo Molnar, Andrew Morton,
Brown, Len, linux-kernel, x86
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@amd64.org]
> Sent: Tuesday, December 06, 2011 7:31 AM
> To: Luck, Tony
> Cc: Yu, Fenghua; H Peter Anvin; Thomas Gleixner; Ingo Molnar; Andrew
> Morton; Brown, Len; linux-kernel; x86
> Subject: Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit
> and package level thermal throttle event in mce log
>
> On Mon, Dec 05, 2011 at 02:18:26PM -0800, Luck, Tony wrote:
> > This looks like a sane improvement, Tony I'm assuming you're handling
> this?
> >
> > I can push this - but it would be nice to have some other comments
> from
> > outside of Intel (we've been beaten up written-by Intel, acked-by:
> Intel
> > too many times).
> >
> > Can I count that as an "Acked-by:"?
>
> Well, from looking at Udo's feedback here
>
> http://marc.info/?l=linux-kernel&m=132206108724885
>
> this patch looks like it helps in his case partially. I'm still not
> sure
> about all the printks in therm_throt_process(), do you guys feel the
> need to notify the user that the CPU clock was throttled/power limits
> were applied, or rather have something more subtle...?
The printk is one way to notify users about the power limit and thermal throttle. The printk only dumps the events in an interval (300*HZ).
Another way is to count the events in /sys/devices/system/cpu/cpu#/thermal_throttle. In this way, kernel logs every interrupt on any cpu into respective counters. User application can poll the counters and get more accurate and timely information for the events.
As explained in this patch, core level thermal throttle is still logged in mcelog for legacy reason after this patch is applied.
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-06 17:48 ` Yu, Fenghua
@ 2011-12-06 19:06 ` Borislav Petkov
2011-12-06 19:26 ` Tony Luck
2011-12-06 19:27 ` Yu, Fenghua
0 siblings, 2 replies; 9+ messages in thread
From: Borislav Petkov @ 2011-12-06 19:06 UTC (permalink / raw)
To: Yu, Fenghua
Cc: Luck, Tony, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
On Tue, Dec 06, 2011 at 09:48:41AM -0800, Yu, Fenghua wrote:
> The printk is one way to notify users about the power limit and
> thermal throttle. The printk only dumps the events in an interval
> (300*HZ).
>
> Another way is to count the events in
> /sys/devices/system/cpu/cpu#/thermal_throttle. In this way, kernel
> logs every interrupt on any cpu into respective counters. User
> application can poll the counters and get more accurate and timely
> information for the events.
>
> As explained in this patch, core level thermal throttle is still
> logged in mcelog for legacy reason after this patch is applied.
I can see all that. Still, I'm questioning the need for those printks. A
user application polling the counters is a much better solution, IMHO,
than spamming the logs. IOW, is there a strong reason to have this -
even ratelimited - information in the logs and unnerve users, or, would
it be better to collect this info somewhere queitly and present it only
when something requests it?
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-06 19:06 ` Borislav Petkov
@ 2011-12-06 19:26 ` Tony Luck
2011-12-06 19:56 ` Borislav Petkov
2011-12-06 19:27 ` Yu, Fenghua
1 sibling, 1 reply; 9+ messages in thread
From: Tony Luck @ 2011-12-06 19:26 UTC (permalink / raw)
To: Borislav Petkov
Cc: Yu, Fenghua, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
On Tue, Dec 6, 2011 at 11:06 AM, Borislav Petkov <bp@amd64.org> wrote:
> I can see all that. Still, I'm questioning the need for those printks. A
> user application polling the counters is a much better solution, IMHO,
> than spamming the logs. IOW, is there a strong reason to have this -
> even ratelimited - information in the logs and unnerve users, or, would
> it be better to collect this info somewhere queitly and present it only
> when something requests it?
Striking the right balance here is hard - if one has a BIOS that set the
thresholds at "interesting" values - then you certainly don't want to the
console to be spammed with a lot of useless junk.
But if there is a real problem - then having someone tell you later that
you should have been checking some obscure file in /sys to see that
some thermal/power limit events were being seen may not go over very
well.
When we have some comprehensive system health monitoring daemon
that does check these files, and can be configured to raise suitable alerts,
then the printks can go away.
-Tony
^ permalink raw reply [flat|nested] 9+ messages in thread
* RE: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-06 19:06 ` Borislav Petkov
2011-12-06 19:26 ` Tony Luck
@ 2011-12-06 19:27 ` Yu, Fenghua
1 sibling, 0 replies; 9+ messages in thread
From: Yu, Fenghua @ 2011-12-06 19:27 UTC (permalink / raw)
To: Borislav Petkov
Cc: Luck, Tony, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@amd64.org]
> Sent: Tuesday, December 06, 2011 11:07 AM
> To: Yu, Fenghua
> Cc: Luck, Tony; H Peter Anvin; Thomas Gleixner; Ingo Molnar; Andrew
> Morton; Brown, Len; linux-kernel; x86
> Subject: Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit
> and package level thermal throttle event in mce log
>
> On Tue, Dec 06, 2011 at 09:48:41AM -0800, Yu, Fenghua wrote:
>
> > The printk is one way to notify users about the power limit and
> > thermal throttle. The printk only dumps the events in an interval
> > (300*HZ).
> >
> > Another way is to count the events in
> > /sys/devices/system/cpu/cpu#/thermal_throttle. In this way, kernel
> > logs every interrupt on any cpu into respective counters. User
> > application can poll the counters and get more accurate and timely
> > information for the events.
> >
> > As explained in this patch, core level thermal throttle is still
> > logged in mcelog for legacy reason after this patch is applied.
>
> I can see all that. Still, I'm questioning the need for those printks.
> A
> user application polling the counters is a much better solution, IMHO,
> than spamming the logs. IOW, is there a strong reason to have this -
> even ratelimited - information in the logs and unnerve users, or, would
> it be better to collect this info somewhere queitly and present it only
> when something requests it?
The printks are legacy code. People may have expect that info already. Removing them may cause
Others' complain.
Plus thermal throttle and power limit shouldn't happen too offen and/or for too long time. So the printks shouldn't print a lot of times in normal case and should spam dmesg. If the printks
do print a lot of times for a long period, user does need to pay attention to the events. One
concrete example is like what happened on x220. Please note the information printed by the
printks were noticed and referred, i.e. the info is useful in this case. From that perspective, the printks are necessary.
Having said that, the purpose of THIS patch is to remove thermal throttle and power limit events from mcelog to avoid people thinking scary hardware errors. If the printks are thought unnecessary, that should be in another patch.
Thanks.
-Fenghua
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log
2011-12-06 19:26 ` Tony Luck
@ 2011-12-06 19:56 ` Borislav Petkov
0 siblings, 0 replies; 9+ messages in thread
From: Borislav Petkov @ 2011-12-06 19:56 UTC (permalink / raw)
To: Tony Luck
Cc: Yu, Fenghua, H Peter Anvin, Thomas Gleixner, Ingo Molnar,
Andrew Morton, Brown, Len, linux-kernel, x86
On Tue, Dec 06, 2011 at 11:26:03AM -0800, Tony Luck wrote:
> On Tue, Dec 6, 2011 at 11:06 AM, Borislav Petkov <bp@amd64.org> wrote:
> > I can see all that. Still, I'm questioning the need for those printks. A
> > user application polling the counters is a much better solution, IMHO,
> > than spamming the logs. IOW, is there a strong reason to have this -
> > even ratelimited - information in the logs and unnerve users, or, would
> > it be better to collect this info somewhere queitly and present it only
> > when something requests it?
>
> Striking the right balance here is hard - if one has a BIOS that set the
> thresholds at "interesting" values - then you certainly don't want to the
> console to be spammed with a lot of useless junk.
>
> But if there is a real problem - then having someone tell you later that
> you should have been checking some obscure file in /sys to see that
> some thermal/power limit events were being seen may not go over very
> well.
Agreed.
> When we have some comprehensive system health monitoring daemon that
> does check these files, and can be configured to raise suitable
> alerts, then the printks can go away.
Ok, that makes sense, actually. A follow-up: what recovery handling are
you thinking of here, maybe force-suspend the box or disable boosting or
whatever?
All I'm saying is, how does one take care of the real problem you
mention above? I hope you're seeing my point here: I'm simply
questioning the fact whether printk's are optimal here.
But, before we completely drift off, to answer your original question:
I'm fine with the patch, it is Intel-only anyway so if you guys feel it
is a step in the right direction, you can have my ACK.
The printks story sounds like something we'll not be solving today
anyway, so... :-)
Thanks.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2011-12-06 19:56 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-14 21:11 [PATCH] x86/mcheck/therm_throt.c: Don't log power limit and package level thermal throttle event in mce log Fenghua Yu
2011-12-05 13:18 ` Borislav Petkov
2011-12-05 22:18 ` Luck, Tony
2011-12-06 15:31 ` Borislav Petkov
2011-12-06 17:48 ` Yu, Fenghua
2011-12-06 19:06 ` Borislav Petkov
2011-12-06 19:26 ` Tony Luck
2011-12-06 19:56 ` Borislav Petkov
2011-12-06 19:27 ` Yu, Fenghua
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox