xen-devel.lists.xenproject.org archive mirror
* [PATCH] perf: Check all MSRs before passing hw check
@ 2013-03-15 12:20 George Dunlap
  2013-03-15 12:50 ` Jan Beulich
  2013-03-18  8:42 ` Ingo Molnar
  0 siblings, 2 replies; 8+ messages in thread
From: George Dunlap @ 2013-03-15 12:20 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Thomas Gleixner, H. Peter Anvin, x86, Konrad Wilk

check_hw_exists has a number of checks which go to two exit paths:
msr_fail and bios_fail.  Checks classified as msr_fail will cause
check_hw_exists() to return false, causing the PMU not to be used;
bios_fail checks will only cause a warning to be printed, but will
return true.

The problem is that if there are both msr failures and bios failures,
and the routine hits a bios_fail check first, it will exit early and
return true, not finishing the rest of the msr checks.  If those msrs
are in fact broken, it will cause them to be used erroneously.

This changeset causes check_hw_exists() to go through all of the msr
checks, failing and returning false if any of them fail.

This problem affects kernels as far back as 3.2, and should thus be
considered for backport.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
CC: Konrad Wilk <konrad.wilk@oracle.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
---
 arch/x86/kernel/cpu/perf_event.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6774c17..df30c9a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -182,6 +182,7 @@ static bool check_hw_exists(void)
 {
 	u64 val, val_new = ~0;
 	int i, reg, ret = 0;
+	int bios_fail = 0;
 
 	/*
 	 * Check to see if the BIOS enabled any of the counters, if so
@@ -193,7 +194,7 @@ static bool check_hw_exists(void)
 		if (ret)
 			goto msr_fail;
 		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
-			goto bios_fail;
+			bios_fail = 1;
 	}
 
 	if (x86_pmu.num_counters_fixed) {
@@ -203,7 +204,7 @@ static bool check_hw_exists(void)
 			goto msr_fail;
 		for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
 			if (val & (0x03 << i*4))
-				goto bios_fail;
+				bios_fail = 1;
 		}
 	}
 
@@ -221,14 +222,13 @@ static bool check_hw_exists(void)
 	if (ret || val != val_new)
 		goto msr_fail;
 
-	return true;
-
-bios_fail:
-	/*
-	 * We still allow the PMU driver to operate:
-	 */
-	printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
-	printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
+	if (bios_fail) {
+		/*
+		 * We still allow the PMU driver to operate:
+		 */
+		printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
+		printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
+	}
 
 	return true;
 
-- 
1.7.9.5


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-15 12:20 [PATCH] perf: Check all MSRs before passing hw check George Dunlap
@ 2013-03-15 12:50 ` Jan Beulich
  2013-03-15 14:43   ` George Dunlap
  2013-03-18  8:42 ` Ingo Molnar
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2013-03-15 12:50 UTC (permalink / raw)
  To: George Dunlap; +Cc: Thomas Gleixner, x86, H. Peter Anvin, Konrad Wilk, xen-devel

>>> On 15.03.13 at 13:20, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -182,6 +182,7 @@ static bool check_hw_exists(void)
>  {
>  	u64 val, val_new = ~0;
>  	int i, reg, ret = 0;
> +	int bios_fail = 0;
>  
>  	/*
>  	 * Check to see if the BIOS enabled any of the counters, if so
> @@ -193,7 +194,7 @@ static bool check_hw_exists(void)
>  		if (ret)
>  			goto msr_fail;
>  		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> -			goto bios_fail;
> +			bios_fail = 1;
>  	}
>  
>  	if (x86_pmu.num_counters_fixed) {
> @@ -203,7 +204,7 @@ static bool check_hw_exists(void)
>  			goto msr_fail;
>  		for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
>  			if (val & (0x03 << i*4))
> -				goto bios_fail;
> +				bios_fail = 1;
>  		}
>  	}
>  
> @@ -221,14 +222,13 @@ static bool check_hw_exists(void)
>  	if (ret || val != val_new)
>  		goto msr_fail;
>  
> -	return true;
> -
> -bios_fail:
> -	/*
> -	 * We still allow the PMU driver to operate:
> -	 */
> -	printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
> -	printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
> +	if (bios_fail) {
> +		/*
> +		 * We still allow the PMU driver to operate:
> +		 */
> +		printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
> +		printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);

The values being printed here are not meaningful anymore with this
patch.

Jan

> +	}
>  
>  	return true;
>  


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-15 12:50 ` Jan Beulich
@ 2013-03-15 14:43   ` George Dunlap
  2013-03-15 15:25     ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2013-03-15 14:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Thomas Gleixner, x86@kernel.org, H. Peter Anvin, Konrad Wilk,
	xen-devel@lists.xen.org

On 15/03/13 12:50, Jan Beulich wrote:
>>>> On 15.03.13 at 13:20, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> --- a/arch/x86/kernel/cpu/perf_event.c
>> +++ b/arch/x86/kernel/cpu/perf_event.c
>> @@ -182,6 +182,7 @@ static bool check_hw_exists(void)
>>   {
>>   	u64 val, val_new = ~0;
>>   	int i, reg, ret = 0;
>> +	int bios_fail = 0;
>>   
>>   	/*
>>   	 * Check to see if the BIOS enabled any of the counters, if so
>> @@ -193,7 +194,7 @@ static bool check_hw_exists(void)
>>   		if (ret)
>>   			goto msr_fail;
>>   		if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
>> -			goto bios_fail;
>> +			bios_fail = 1;
>>   	}
>>   
>>   	if (x86_pmu.num_counters_fixed) {
>> @@ -203,7 +204,7 @@ static bool check_hw_exists(void)
>>   			goto msr_fail;
>>   		for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
>>   			if (val & (0x03 << i*4))
>> -				goto bios_fail;
>> +				bios_fail = 1;
>>   		}
>>   	}
>>   
>> @@ -221,14 +222,13 @@ static bool check_hw_exists(void)
>>   	if (ret || val != val_new)
>>   		goto msr_fail;
>>   
>> -	return true;
>> -
>> -bios_fail:
>> -	/*
>> -	 * We still allow the PMU driver to operate:
>> -	 */
>> -	printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
>> -	printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
>> +	if (bios_fail) {
>> +		/*
>> +		 * We still allow the PMU driver to operate:
>> +		 */
>> +		printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
>> +		printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
> The values being printed here are not meaningful anymore with this
> patch.

Right -- then I guess the options are:
1. Don't print any values
2. Print values on the first broken MSR detected (but not subsequent ones)
3. Print values on all broken MSRs.

#2 is the current behavior, and won't risk spamming the console.
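
A minimal userspace sketch of option 2, not the kernel code itself. The names msr_probe, hw_check and check_eventsels are hypothetical stand-ins; only the bit tested by ARCH_PERFMON_EVENTSEL_ENABLE (bit 22) is taken from the kernel. The idea is to latch the first offending MSR and its value in dedicated variables, so that later checks can keep reusing reg and val without making the warning meaningless:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as the kernel's ARCH_PERFMON_EVENTSEL_ENABLE (bit 22). */
#define EVENTSEL_ENABLE (1ULL << 22)

/* Hypothetical stand-in for one MSR read: address, value, and read status. */
struct msr_probe {
	unsigned int reg;
	uint64_t val;
	int read_err;	/* non-zero if the read faulted */
};

/* Hypothetical result type; the real function only returns bool. */
struct hw_check {
	bool usable;		/* false if any MSR read failed */
	bool bios_fail;		/* true if any counter was already enabled */
	unsigned int reg_fail;	/* first offending MSR, valid if bios_fail */
	uint64_t val_fail;	/* its value, valid if bios_fail */
};

static struct hw_check check_eventsels(const struct msr_probe *p, int n)
{
	struct hw_check r = { .usable = true };
	int i;

	for (i = 0; i < n; i++) {
		if (p[i].read_err) {
			r.usable = false;	/* msr_fail: give up on the PMU */
			return r;
		}
		/* Record only the first offender, then keep checking. */
		if ((p[i].val & EVENTSEL_ENABLE) && !r.bios_fail) {
			r.bios_fail = true;
			r.reg_fail = p[i].reg;
			r.val_fail = p[i].val;
		}
	}
	return r;
}
```

With this shape the warning would print reg_fail and val_fail once, after all checks have passed, which keeps the console output to a single report.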

  -George


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-15 14:43   ` George Dunlap
@ 2013-03-15 15:25     ` Jan Beulich
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2013-03-15 15:25 UTC (permalink / raw)
  To: George Dunlap
  Cc: Thomas Gleixner, x86@kernel.org, H. Peter Anvin, Konrad Wilk,
	xen-devel@lists.xen.org

>>> On 15.03.13 at 15:43, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 15/03/13 12:50, Jan Beulich wrote:
>>>>> On 15.03.13 at 13:20, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>> +	if (bios_fail) {
>>> +		/*
>>> +		 * We still allow the PMU driver to operate:
>>> +		 */
>>> +		printk(KERN_CONT "Broken BIOS detected, complain to your hardware vendor.\n");
>>> +		printk(KERN_ERR FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x is %Lx)\n", reg, val);
>> The values being printed here are not meaningful anymore with this
>> patch.
> 
> Right -- then I guess the options are:
> 1. Don't print any values
> 2. Print values on the first broken MSR detected (but not subsequent ones)
> 3. Print values on all broken MSRs.
> 
> #2 is the current behavior, and won't risk spamming the console.

I tend to agree that #2 is best, but the x86/perf maintainers
will have the final say anyway.

Jan


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-15 12:20 [PATCH] perf: Check all MSRs before passing hw check George Dunlap
  2013-03-15 12:50 ` Jan Beulich
@ 2013-03-18  8:42 ` Ingo Molnar
  2013-03-18 10:40   ` George Dunlap
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2013-03-18  8:42 UTC (permalink / raw)
  To: George Dunlap, Peter Zijlstra
  Cc: Thomas Gleixner, Konrad Wilk, H. Peter Anvin, x86, xen-devel


* George Dunlap <george.dunlap@eu.citrix.com> wrote:

> check_hw_exists has a number of checks which go to two exit paths:
> msr_fail and bios_fail.  Checks classified as msr_fail will cause
> check_hw_exists() to return false, causing the PMU not to be used;
> bios_fail checks will only cause a warning to be printed, but will
> return true.
> 
> The problem is that if there are both msr failures and bios failures,
> and the routine hits a bios_fail check first, it will exit early and
> return true, not finishing the rest of the msr checks.  If those msrs
> are in fact broken, it will cause them to be used erroneously.
> 
> This changeset causes check_hw_exists() to go through all of the msr
> checks, failing and returning false if any of them fail.
> 
> This problem affects kernels as far back as 3.2, and should thus be
> considered for backport.
> 
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> CC: Konrad Wilk <konrad.wilk@oracle.com>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: x86@kernel.org
> ---
>  arch/x86/kernel/cpu/perf_event.c |   20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)

What is missing is a description of what specific platform this gets 
triggered on and exactly why. Is some hw feature emulation missing that 
causes the check to fail?

Thanks,

	Ingo


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-18  8:42 ` Ingo Molnar
@ 2013-03-18 10:40   ` George Dunlap
  2013-03-18 10:53     ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2013-03-18 10:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Konrad Wilk, x86@kernel.org,
	xen-devel@lists.xen.org, H. Peter Anvin, Thomas Gleixner

On 18/03/13 08:42, Ingo Molnar wrote:
> * George Dunlap <george.dunlap@eu.citrix.com> wrote:
>
>> check_hw_exists has a number of checks which go to two exit paths:
>> msr_fail and bios_fail.  Checks classified as msr_fail will cause
>> check_hw_exists() to return false, causing the PMU not to be used;
>> bios_fail checks will only cause a warning to be printed, but will
>> return true.
>>
>> The problem is that if there are both msr failures and bios failures,
>> and the routine hits a bios_fail check first, it will exit early and
>> return true, not finishing the rest of the msr checks.  If those msrs
>> are in fact broken, it will cause them to be used erroneously.
>>
>> This changeset causes check_hw_exists() to go through all of the msr
>> checks, failing and returning false if any of them fail.
>>
>> This problem affects kernels as far back as 3.2, and should thus be
>> considered for backport.
>>
>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>> CC: Konrad Wilk <konrad.wilk@oracle.com>
>> CC: Thomas Gleixner <tglx@linutronix.de>
>> CC: "H. Peter Anvin" <hpa@zytor.com>
>> CC: x86@kernel.org
>> ---
>>   arch/x86/kernel/cpu/perf_event.c |   20 ++++++++++----------
>>   1 file changed, 10 insertions(+), 10 deletions(-)
> What is missing is a description of what specific platform this gets
> triggered on and exactly why. Is some hw feature emulation missing that
> causes the check to fail?

Remember, there are two checks failing: the second one is supposed to 
fail and disable the PMU entirely, but it's not getting there because 
when the first one fails, it skips the rest but returns "success" anyway.

The warning on the first check is as follows:

[    8.131985] Performance Events: Broken BIOS detected, complain to your hardware vendor.
[    8.139997] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 530076)

c0010000 is the AMD MSR_K7_EVNTSEL0; the check it's failing is:
   if (val & ARCH_PERFMON_EVENTSEL_ENABLE)

So it discovers that one of the performance counters is already enabled
-- worth a warning, but by itself not worth disabling the PMU.  This is
most likely to be exactly what the warning message says: a buggy BIOS
that leaves perfcounters enabled for some reason.

The second check is supposed to detect that the PMU is actually not 
usable -- in my case because it's running virtualized (under Xen).
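
To make the control flow concrete, here is a simplified userspace model, not the kernel function. The struct fake_pmu and both function names are hypothetical, and I am assuming the check that should fail is the write-and-read-back test at the end of check_hw_exists():

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as the kernel's ARCH_PERFMON_EVENTSEL_ENABLE (bit 22). */
#define EVENTSEL_ENABLE (1ULL << 22)

/* Hypothetical model of the state the checks look at. */
struct fake_pmu {
	uint64_t eventsel0;	/* value read from the event-select MSR */
	bool counter_writable;	/* does a write to the counter MSR stick? */
};

/* Before the patch: the BIOS check jumps out and skips the MSR check. */
static bool check_old(const struct fake_pmu *pmu)
{
	if (pmu->eventsel0 & EVENTSEL_ENABLE)
		return true;	/* old "goto bios_fail": warn, report success */
	if (!pmu->counter_writable)
		return false;	/* "goto msr_fail" */
	return true;
}

/* After the patch: the BIOS problem is only noted, every check still runs. */
static bool check_new(const struct fake_pmu *pmu, bool *bios_fail)
{
	*bios_fail = (pmu->eventsel0 & EVENTSEL_ENABLE) != 0;
	if (!pmu->counter_writable)
		return false;	/* "goto msr_fail" is now reachable */
	return true;
}
```

Feeding both functions the value from the log above (0x530076, which has bit 22 set) together with a counter that does not hold writes, check_old() reports a usable PMU while check_new() rejects it.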

  -George


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-18 10:40   ` George Dunlap
@ 2013-03-18 10:53     ` Ingo Molnar
  2013-03-18 10:55       ` George Dunlap
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2013-03-18 10:53 UTC (permalink / raw)
  To: George Dunlap
  Cc: Peter Zijlstra, Konrad Wilk, x86@kernel.org,
	xen-devel@lists.xen.org, H. Peter Anvin, Thomas Gleixner


* George Dunlap <george.dunlap@eu.citrix.com> wrote:

> On 18/03/13 08:42, Ingo Molnar wrote:
> >* George Dunlap <george.dunlap@eu.citrix.com> wrote:
> >
> >>check_hw_exists has a number of checks which go to two exit paths:
> >>msr_fail and bios_fail.  Checks classified as msr_fail will cause
> >>check_hw_exists() to return false, causing the PMU not to be used;
> >>bios_fail checks will only cause a warning to be printed, but will
> >>return true.
> >>
> >>The problem is that if there are both msr failures and bios failures,
> >>and the routine hits a bios_fail check first, it will exit early and
> >>return true, not finishing the rest of the msr checks.  If those msrs
> >>are in fact broken, it will cause them to be used erroneously.
> >>
> >>This changeset causes check_hw_exists() to go through all of the msr
> >>checks, failing and returning false if any of them fail.
> >>
> >>This problem affects kernels as far back as 3.2, and should thus be
> >>considered for backport.
> >>
> >>Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> >>CC: Konrad Wilk <konrad.wilk@oracle.com>
> >>CC: Thomas Gleixner <tglx@linutronix.de>
> >>CC: "H. Peter Anvin" <hpa@zytor.com>
> >>CC: x86@kernel.org
> >>---
> >>  arch/x86/kernel/cpu/perf_event.c |   20 ++++++++++----------
> >>  1 file changed, 10 insertions(+), 10 deletions(-)
> >What is missing is a description of what specific platform this gets
> >triggered on and exactly why. Is some hw feature emulation missing that
> >causes the check to fail?
> 
> Remember, there are two checks failing: the second one is supposed
> to fail and disable the PMU entirely, but it's not getting there
> because when the first one fails, it skips the rest but returns
> "success" anyway.
> 
> The warning on the first check is as follows:
> 
> [    8.131985] Performance Events: Broken BIOS detected, complain to your hardware vendor.
> [    8.139997] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 530076)
> 
> c0010000 is the AMD  MSR_K7_EVNTSEL0; the check it's failing is:
>   if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
> 
> So it discovers that one of the performance counters is already
> enabled -- worth a warning, but by itself not worth disabling the
> PMU.  This is most likely to be exactly what the warning message
> says: a buggy BIOS that leaves perfcounters enabled for some
> reason.
> 
> The second check is supposed to detect that the PMU is actually not
> usable -- in my case because it's running virtualized (under Xen).

I got the logic from your original description - what I wanted was for the 
specific messages to be included in the patch changelog, plus a 
description of what misbehaved before the patch and what behaves better 
after the patch - on your specific system.

In other words, please use the customary changelog style we use in the 
kernel:

  " Current code does (A), this has a problem when (B).
    We can improve this doing (C), because (D)."

Thanks,

        Ingo


* Re: [PATCH] perf: Check all MSRs before passing hw check
  2013-03-18 10:53     ` Ingo Molnar
@ 2013-03-18 10:55       ` George Dunlap
  0 siblings, 0 replies; 8+ messages in thread
From: George Dunlap @ 2013-03-18 10:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Konrad Wilk, x86@kernel.org,
	xen-devel@lists.xen.org, H. Peter Anvin, Thomas Gleixner

On 18/03/13 10:53, Ingo Molnar wrote:
> * George Dunlap <george.dunlap@eu.citrix.com> wrote:
>
>> On 18/03/13 08:42, Ingo Molnar wrote:
>>> * George Dunlap <george.dunlap@eu.citrix.com> wrote:
>>>
>>>> check_hw_exists has a number of checks which go to two exit paths:
>>>> msr_fail and bios_fail.  Checks classified as msr_fail will cause
>>>> check_hw_exists() to return false, causing the PMU not to be used;
>>>> bios_fail checks will only cause a warning to be printed, but will
>>>> return true.
>>>>
>>>> The problem is that if there are both msr failures and bios failures,
>>>> and the routine hits a bios_fail check first, it will exit early and
>>>> return true, not finishing the rest of the msr checks.  If those msrs
>>>> are in fact broken, it will cause them to be used erroneously.
>>>>
>>>> This changeset causes check_hw_exists() to go through all of the msr
>>>> checks, failing and returning false if any of them fail.
>>>>
>>>> This problem affects kernels as far back as 3.2, and should thus be
>>>> considered for backport.
>>>>
>>>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>>>> CC: Konrad Wilk <konrad.wilk@oracle.com>
>>>> CC: Thomas Gleixner <tglx@linutronix.de>
>>>> CC: "H. Peter Anvin" <hpa@zytor.com>
>>>> CC: x86@kernel.org
>>>> ---
>>>>   arch/x86/kernel/cpu/perf_event.c |   20 ++++++++++----------
>>>>   1 file changed, 10 insertions(+), 10 deletions(-)
>>> What is missing is a description of what specific platform this gets
>>> triggered on and exactly why. Is some hw feature emulation missing that
>>> causes the check to fail?
>> Remember, there are two checks failing: the second one is supposed
>> to fail and disable the PMU entirely, but it's not getting there
>> because when the first one fails, it skips the rest but returns
>> "success" anyway.
>>
>> The warning on the first check is as follows:
>>
>> [    8.131985] Performance Events: Broken BIOS detected, complain to your hardware vendor.
>> [    8.139997] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010000 is 530076)
>>
>> c0010000 is the AMD  MSR_K7_EVNTSEL0; the check it's failing is:
>>    if (val & ARCH_PERFMON_EVENTSEL_ENABLE)
>>
>> So it discovers that one of the performance counters is already
>> enabled -- worth a warning, but by itself not worth disabling the
>> PMU.  This is most likely to be exactly what the warning message
>> says: a buggy BIOS that leaves perfcounters enabled for some
>> reason.
>>
>> The second check is supposed to detect that the PMU is actually not
>> usable -- in my case because it's running virtualized (under Xen).
> I got the logic from your original description - what I wanted was for the
> specific messages to be included in the patch changelog, plus a
> description of what misbehaved before the patch and what behaves better
> after the patch - on your specific system.
>
> In other words, please use the customary changelog style we use in the
> kernel:
>
>    " Current code does (A), this has a problem when (B).
>      We can improve this doing (C), because (D)."

Right, got it. Stand by for v2.

  -George

