[PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
@ 2012-05-22 16:07 Andre Przywara
  2012-05-22 16:52 ` Jeremy Fitzhardinge
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Andre Przywara @ 2012-05-22 16:07 UTC (permalink / raw)
  To: Jan Beulich, Jeremy Fitzhardinge; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]

Hi,

while testing some APERF/MPERF semantics I discovered that this feature 
is enabled in Xen Dom0, but is not reliable.
The Linux kernel's scheduler uses this feature if it sees the CPUID bit, 
leading to costly RDMSR traps (a few 100,000s during a kernel compile) 
and bogus values due to VCPU migration during the measurement.
The attached patch explicitly disables this CPU capability inside the 
Linux kernel, I couldn't measure any APERF/MPERF reads anymore with the 
patch applied.
I am not sure if the PVOPS code is the right place to fix this, we could 
as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
Also when the Dom0 VCPUs are pinned, we could allow this, but I am not 
sure if it's worth to do so.

Awaiting your comments.

Regards,
Andre.

P.S. Of course this doesn't fix pure userland software like cpupower, 
but I would consider this in the user's responsibility to not use these 
tools in Dom0, but instead use xenpm.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

[-- Attachment #2: xenpv_aperfmperf.patch --]
[-- Type: text/x-patch, Size: 712 bytes --]

commit e802e47d85314b4541288e4a19d057e2ea885a28
Author: Andre Przywara <andre.przywara@amd.com>
Date:   Tue May 22 15:13:07 2012 +0200

    filter APERFMPERF feature in Xen to avoid kernel internal usage

    Signed-off-by: Andre Przywara <andre.przywara@amd.com>

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 95dccce..71252d5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx,
 		*dx = cpuid_leaf5_edx_val;
 		return;

+	case 6:
+		/* Disabling APERFMPERF for kernel usage */
+		maskecx = ~(1U << 0);
+		break;
+
 	case 0xb:
 		/* Suppress extended topology stuff */
 		maskebx = 0;

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 16:07 [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels Andre Przywara
@ 2012-05-22 16:52 ` Jeremy Fitzhardinge
  2012-05-22 17:08   ` Malcolm Crossley
  2012-05-22 20:46   ` Andre Przywara
  2012-05-22 17:18 ` Konrad Rzeszutek Wilk
  2012-05-23  7:34 ` Jan Beulich
  2 siblings, 2 replies; 21+ messages in thread
From: Jeremy Fitzhardinge @ 2012-05-22 16:52 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel, Konrad Rzeszutek Wilk

On 05/22/2012 09:07 AM, Andre Przywara wrote:
> Hi,
>
> while testing some APERF/MPERF semantics I discovered that this
> feature is enabled in Xen Dom0, but is not reliable.
> The Linux kernel's scheduler uses this feature if it sees the CPUID
> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
> compile) and bogus values due to VCPU migration during the measurement.
> The attached patch explicitly disables this CPU capability inside the
> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with
> the patch applied.
> I am not sure if the PVOPS code is the right place to fix this, we
> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not
> sure if it's worth to do so.

Seems reasonable to me.  Do all those RDMSR traps have a measurable
performance effect?

Also, is there a symbolic constant for that bit?

    J

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 16:52 ` Jeremy Fitzhardinge
@ 2012-05-22 17:08   ` Malcolm Crossley
  2012-05-23  8:10     ` Jan Beulich
  2012-05-22 20:46   ` Andre Przywara
  1 sibling, 1 reply; 21+ messages in thread
From: Malcolm Crossley @ 2012-05-22 17:08 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1: Type: text/plain, Size: 1390 bytes --]

On 22/05/12 17:52, Jeremy Fitzhardinge wrote:
> On 05/22/2012 09:07 AM, Andre Przywara wrote:
>> Hi,
>>
>> while testing some APERF/MPERF semantics I discovered that this
>> feature is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>> compile) and bogus values due to VCPU migration during the measurement.
>> The attached patch explicitly disables this CPU capability inside the
>> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with
>> the patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we
>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not
>> sure if it's worth to do so.
> Seems reasonable to me.  Do all those RDMSR traps have a measurable
> performance effect?
>
> Also, is there a symbolic constant for that bit?
>
>      J
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Hi,

I've attached a patch which masks the matching CPUID leaves in the Xen pv_cpuid function.
Should the logic in pv_cpuid be changed to only pass through explictly allowed CPUID leaves rather than masking
them using case statements?

Malcolm


[-- Attachment #2: mask-thermal-and-power-management-leaf.patch --]
[-- Type: text/x-patch, Size: 944 bytes --]

# HG changeset patch
# Parent 238900a4ed227d04c164d4cd12dfc66f7a25b946
[PATCH] Xen: Prevent unsupported CPUID leaves from being passed through to dom0 PV guest.

The PV CPUID emulation code is passing through CPUID leaves which do match the unsupported case statement. CPUID features should be passed through when Xen can safely support those features.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>

diff -r 238900a4ed22 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -900,6 +900,8 @@ static void pv_cpuid(struct cpu_user_reg
         break;
 
     case 0x00000005: /* MONITOR/MWAIT */
+    case 0x00000006: /* Thermal and Power Management leaf */
+    case 0x00000009: /* Direct Cache Access Information Leaf */
     case 0x0000000a: /* Architectural Performance Monitor Features */
     case 0x0000000b: /* Extended Topology Enumeration */
     case 0x8000000a: /* SVM revision and features */

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 16:07 [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels Andre Przywara
  2012-05-22 16:52 ` Jeremy Fitzhardinge
@ 2012-05-22 17:18 ` Konrad Rzeszutek Wilk
  2012-05-22 21:02   ` Andre Przywara
  2012-05-23  7:34 ` Jan Beulich
  2 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-22 17:18 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel

On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
> Hi,
> 
> while testing some APERF/MPERF semantics I discovered that this
> feature is enabled in Xen Dom0, but is not reliable.
> The Linux kernel's scheduler uses this feature if it sees the CPUID
> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
> compile) and bogus values due to VCPU migration during the

Can you point me to the Linux scheduler code that does this? Thanks.

> measurement.
> The attached patch explicitly disables this CPU capability inside
> the Linux kernel, I couldn't measure any APERF/MPERF reads anymore
> with the patch applied.
> I am not sure if the PVOPS code is the right place to fix this, we
> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
> Also when the Dom0 VCPUs are pinned, we could allow this, but I am
> not sure if it's worth to do so.
> 
> Awaiting your comments.
> 
> Regards,
> Andre.
> 
> P.S. Of course this doesn't fix pure userland software like
> cpupower, but I would consider this in the user's responsibility to

Which would not work anymore as the cpufreq support is disabled
when it boots under Xen.

> not use these tools in Dom0, but instead use xenpm.
> 
> -- 
> Andre Przywara
> AMD-Operating System Research Center (OSRC), Dresden, Germany

> commit e802e47d85314b4541288e4a19d057e2ea885a28
> Author: Andre Przywara <andre.przywara@amd.com>
> Date:   Tue May 22 15:13:07 2012 +0200
> 
>     filter APERFMPERF feature in Xen to avoid kernel internal usage
>     
>     Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> 
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index 95dccce..71252d5 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>  		*dx = cpuid_leaf5_edx_val;
>  		return;
>  
> +	case 6:
> +		/* Disabling APERFMPERF for kernel usage */
> +		maskecx = ~(1U << 0);
> +		break;
> +
>  	case 0xb:
>  		/* Suppress extended topology stuff */
>  		maskebx = 0;

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 16:52 ` Jeremy Fitzhardinge
  2012-05-22 17:08   ` Malcolm Crossley
@ 2012-05-22 20:46   ` Andre Przywara
  1 sibling, 0 replies; 21+ messages in thread
From: Andre Przywara @ 2012-05-22 20:46 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Konrad Rzeszutek Wilk

On 05/22/2012 06:52 PM, Jeremy Fitzhardinge wrote:
> On 05/22/2012 09:07 AM, Andre Przywara wrote:
>> Hi,
>>
>> while testing some APERF/MPERF semantics I discovered that this
>> feature is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>> compile) and bogus values due to VCPU migration during the measurement.
>> The attached patch explicitly disables this CPU capability inside the
>> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with
>> the patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we
>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not
>> sure if it's worth to do so.
>
> Seems reasonable to me.  Do all those RDMSR traps have a measurable
> performance effect?

I haven't tested this. I was more concerned about the invalid readouts 
from the MSRs from Dom0 side. One of our tests complained that the 
second of two consecutive APERF reads was actually lower than the first, 
violating the strict monotony requirement. Even if these negative values 
would be rare, I assume the differences are somewhat random if the 
readouts happen to come from different pCPUs.
Start "watch cpupower monitor" and compile a kernel in Dom0 to see what 
I mean...

>
> Also, is there a symbolic constant for that bit?

APERFMPERF is one of the scattered bits 
(arch/x86/kernel/cpu/scattered.c), so the existing constant uses the 
artificial Linux numbering. But right, for the next version I'd create a 
symbolic constant.

Regards,
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 21:02   ` Andre Przywara
@ 2012-05-22 21:00     ` Konrad Rzeszutek Wilk
  2012-05-22 22:44       ` Andre Przywara
  0 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-22 21:00 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel

On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
> >On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
> >>Hi,
> >>
> >>while testing some APERF/MPERF semantics I discovered that this
> >>feature is enabled in Xen Dom0, but is not reliable.
> >>The Linux kernel's scheduler uses this feature if it sees the CPUID
> >>bit, leading to costly RDMSR traps (a few 100,000s during a kernel
> >>compile) and bogus values due to VCPU migration during the
> >
> >Can you point me to the Linux scheduler code that does this? Thanks.
> 
> arch/x86/kernel/cpu/sched.c contains code to read out and compute
> APERF/MPERF registers. I added a Xen debug-key to dump a usage
> counter added in traps.c and thus could prove that it is actually
> the kernel that accesses these registers.
> As far as I understood this the idea is to learn about boosting and
> down-clocking (P-states) to get a fairer view on the actual
> computing time a process consumed.

Looks like its looking for this:

X86_FEATURE_APERFMPERF

Perhaps masking that should do it? Something along this in enlighten.c:

         cpuid_leaf1_edx_mask =
                 ~((1 << X86_FEATURE_MCE)  |  /* disable MCE */
                   (1 << X86_FEATURE_MCA)  |  /* disable MCA */
                   (1 << X86_FEATURE_MTRR) |  /* disable MTRR */
                   (1 << X86_FEATURE_ACC));   /* thermal monitoring

would be more appropiate?

Or is that attribute on a different leaf?
> 
> >>measurement.
> >>The attached patch explicitly disables this CPU capability inside
> >>the Linux kernel, I couldn't measure any APERF/MPERF reads anymore
> >>with the patch applied.
> >>I am not sure if the PVOPS code is the right place to fix this, we
> >>could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
> >>Also when the Dom0 VCPUs are pinned, we could allow this, but I am
> >>not sure if it's worth to do so.
> >>
> >>Awaiting your comments.
> >>
> >>Regards,
> >>Andre.
> >>
> >>P.S. Of course this doesn't fix pure userland software like
> >>cpupower, but I would consider this in the user's responsibility to
> >
> >Which would not work anymore as the cpufreq support is disabled
> >when it boots under Xen.
> 
> Do you mean with "anymore" in a future kernel? I tested this on
> 3.4.0 and cpupower monitor worked fine. Right, cpufreq is not
> enabled, but cpupower uses the /dev/cpu/<n>/msr device file to
> directly read the MSRs. So I get this output if run on an idle Dom0:

Ahh. Neat. Will have to play with that.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 17:18 ` Konrad Rzeszutek Wilk
@ 2012-05-22 21:02   ` Andre Przywara
  2012-05-22 21:00     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 21+ messages in thread
From: Andre Przywara @ 2012-05-22 21:02 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jeremy Fitzhardinge, xen-devel

On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>> Hi,
>>
>> while testing some APERF/MPERF semantics I discovered that this
>> feature is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>> compile) and bogus values due to VCPU migration during the
>
> Can you point me to the Linux scheduler code that does this? Thanks.

arch/x86/kernel/cpu/sched.c contains code to read out and compute 
APERF/MPERF registers. I added a Xen debug-key to dump a usage counter 
added in traps.c and thus could prove that it is actually the kernel 
that accesses these registers.
As far as I understood this the idea is to learn about boosting and 
down-clocking (P-states) to get a fairer view on the actual computing 
time a process consumed.

>> measurement.
>> The attached patch explicitly disables this CPU capability inside
>> the Linux kernel, I couldn't measure any APERF/MPERF reads anymore
>> with the patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we
>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am
>> not sure if it's worth to do so.
>>
>> Awaiting your comments.
>>
>> Regards,
>> Andre.
>>
>> P.S. Of course this doesn't fix pure userland software like
>> cpupower, but I would consider this in the user's responsibility to
>
> Which would not work anymore as the cpufreq support is disabled
> when it boots under Xen.

Do you mean with "anymore" in a future kernel? I tested this on 3.4.0 
and cpupower monitor worked fine. Right, cpufreq is not enabled, but 
cpupower uses the /dev/cpu/<n>/msr device file to directly read the 
MSRs. So I get this output if run on an idle Dom0:

     |Mperf
CPU | C0   | Cx   | Freq
    0|  0.10| 99.90|  3473
....

Regards,
Andre.

>
>> not use these tools in Dom0, but instead use xenpm.
>>
>> --
>> Andre Przywara
>> AMD-Operating System Research Center (OSRC), Dresden, Germany
>
>> commit e802e47d85314b4541288e4a19d057e2ea885a28
>> Author: Andre Przywara<andre.przywara@amd.com>
>> Date:   Tue May 22 15:13:07 2012 +0200
>>
>>      filter APERFMPERF feature in Xen to avoid kernel internal usage
>>
>>      Signed-off-by: Andre Przywara<andre.przywara@amd.com>
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 95dccce..71252d5 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>>   		*dx = cpuid_leaf5_edx_val;
>>   		return;
>>
>> +	case 6:
>> +		/* Disabling APERFMPERF for kernel usage */
>> +		maskecx = ~(1U<<  0);
>> +		break;
>> +
>>   	case 0xb:
>>   		/* Suppress extended topology stuff */
>>   		maskebx = 0;
>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
>


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 21:00     ` Konrad Rzeszutek Wilk
@ 2012-05-22 22:44       ` Andre Przywara
  2012-05-23 13:26         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 21+ messages in thread
From: Andre Przywara @ 2012-05-22 22:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jeremy Fitzhardinge, xen-devel

On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>>>> Hi,
>>>>
>>>> while testing some APERF/MPERF semantics I discovered that this
>>>> feature is enabled in Xen Dom0, but is not reliable.
>>>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>>>> compile) and bogus values due to VCPU migration during the
>>>
>>> Can you point me to the Linux scheduler code that does this? Thanks.
>>
>> arch/x86/kernel/cpu/sched.c contains code to read out and compute
>> APERF/MPERF registers. I added a Xen debug-key to dump a usage
>> counter added in traps.c and thus could prove that it is actually
>> the kernel that accesses these registers.
>> As far as I understood this the idea is to learn about boosting and
>> down-clocking (P-states) to get a fairer view on the actual
>> computing time a process consumed.
>
> Looks like its looking for this:
>
> X86_FEATURE_APERFMPERF
>
> Perhaps masking that should do it? Something along this in enlighten.c:
>
>           cpuid_leaf1_edx_mask =
>                   ~((1<<  X86_FEATURE_MCE)  |  /* disable MCE */
>                     (1<<  X86_FEATURE_MCA)  |  /* disable MCA */
>                     (1<<  X86_FEATURE_MTRR) |  /* disable MTRR */
>                     (1<<  X86_FEATURE_ACC));   /* thermal monitoring
>
> would be more appropiate?
>
> Or is that attribute on a different leaf?

Right, it is bit 0 on level 6. That's why I couldn't use any of the 
predefined masks and I didn't feel like inventing a new one just for 
this single bit.
We could as well explicitly use clear_cpu_cap somewhere, but I didn't 
find any code place in the Xen tree already doing this, instead it looks 
like it belongs to where I put it (we handle leaf 5 in a special way 
already here)

>>
>>>> measurement.
>>>> The attached patch explicitly disables this CPU capability inside
>>>> the Linux kernel, I couldn't measure any APERF/MPERF reads anymore
>>>> with the patch applied.
>>>> I am not sure if the PVOPS code is the right place to fix this, we
>>>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>>>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am
>>>> not sure if it's worth to do so.
>>>>
>>>> Awaiting your comments.
>>>>
>>>> Regards,
>>>> Andre.
>>>>
>>>> P.S. Of course this doesn't fix pure userland software like
>>>> cpupower, but I would consider this in the user's responsibility to
>>>
>>> Which would not work anymore as the cpufreq support is disabled
>>> when it boots under Xen.
>>
>> Do you mean with "anymore" in a future kernel? I tested this on
>> 3.4.0 and cpupower monitor worked fine. Right, cpufreq is not
>> enabled, but cpupower uses the /dev/cpu/<n>/msr device file to
>> directly read the MSRs. So I get this output if run on an idle Dom0:
>
> Ahh. Neat. Will have to play with that.

Bad news is we cannot forbid cpupower querying the feature directly 
using the CPUID instruction in PV guests. Only we could patch it to use 
/proc/cpuinfo readout instead, as this reflects the kernel view of 
available features. With my patch aperfmperf is no longer there.

Regards,
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 16:07 [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels Andre Przywara
  2012-05-22 16:52 ` Jeremy Fitzhardinge
  2012-05-22 17:18 ` Konrad Rzeszutek Wilk
@ 2012-05-23  7:34 ` Jan Beulich
  2012-05-23  9:14   ` Andre Przywara
  2012-05-23 11:11   ` Andrew Cooper
  2 siblings, 2 replies; 21+ messages in thread
From: Jan Beulich @ 2012-05-23  7:34 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 22.05.12 at 18:07, Andre Przywara <andre.przywara@amd.com> wrote:
> while testing some APERF/MPERF semantics I discovered that this feature 
> is enabled in Xen Dom0, but is not reliable.
> The Linux kernel's scheduler uses this feature if it sees the CPUID bit, 
> leading to costly RDMSR traps (a few 100,000s during a kernel compile) 
> and bogus values due to VCPU migration during the measurement.
> The attached patch explicitly disables this CPU capability inside the 
> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with the 
> patch applied.
> I am not sure if the PVOPS code is the right place to fix this, we could 
> as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not 
> sure if it's worth to do so.
> 
> Awaiting your comments.

First of all I'm of the opinion that this indeed should not be
masked in the hypervisor - there's no reason to disallow the
guest to read these registers (but we should of course deny
writes as long as Xen is controlling P-states, which we do).

Next I'd like to note that in our kernels we simply don't build
arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being
suppressed, there's no consumer of the feature flag in our
kernels.

So I would think that your suggested change is appropriate,
but I'm adding Konrad to Cc as these days he's the one to pick
this up.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 17:08   ` Malcolm Crossley
@ 2012-05-23  8:10     ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2012-05-23  8:10 UTC (permalink / raw)
  To: Malcolm Crossley; +Cc: xen-devel

>>> On 22.05.12 at 19:08, Malcolm Crossley <malcolm.crossley@citrix.com> wrote:
> On 22/05/12 17:52, Jeremy Fitzhardinge wrote:
>> On 05/22/2012 09:07 AM, Andre Przywara wrote:
>>> Hi,
>>>
>>> while testing some APERF/MPERF semantics I discovered that this
>>> feature is enabled in Xen Dom0, but is not reliable.
>>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>>> compile) and bogus values due to VCPU migration during the measurement.
>>> The attached patch explicitly disables this CPU capability inside the
>>> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with
>>> the patch applied.
>>> I am not sure if the PVOPS code is the right place to fix this, we
>>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not
>>> sure if it's worth to do so.
>> Seems reasonable to me.  Do all those RDMSR traps have a measurable
>> performance effect?
>>
>> Also, is there a symbolic constant for that bit?
>>
>>      J
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org 
>> http://lists.xen.org/xen-devel 
> Hi,
> 
> I've attached a patch which masks the matching CPUID leaves in the Xen 
> pv_cpuid function.
> Should the logic in pv_cpuid be changed to only pass through explictly 
> allowed CPUID leaves rather than masking
> them using case statements?
> 
> Malcolm

As said in another reply, I don't think we should mask the feature
in the hypervisor (and certainly not when "cpufreq=dom0-kernel").
Furthermore your patch does this only for Dom0.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23  7:34 ` Jan Beulich
@ 2012-05-23  9:14   ` Andre Przywara
  2012-05-23  9:43     ` Jan Beulich
  2012-05-23 11:11   ` Andrew Cooper
  1 sibling, 1 reply; 21+ messages in thread
From: Andre Przywara @ 2012-05-23  9:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

On 05/23/2012 09:34 AM, Jan Beulich wrote:
>>>> On 22.05.12 at 18:07, Andre Przywara<andre.przywara@amd.com>  wrote:
>> while testing some APERF/MPERF semantics I discovered that this feature
>> is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID bit,
>> leading to costly RDMSR traps (a few 100,000s during a kernel compile)
>> and bogus values due to VCPU migration during the measurement.
>> The attached patch explicitly disables this CPU capability inside the
>> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with the
>> patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we could
>> as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not
>> sure if it's worth to do so.
>>
>> Awaiting your comments.
>
> First of all I'm of the opinion that this indeed should not be
> masked in the hypervisor - there's no reason to disallow the
> guest to read these registers (but we should of course deny
> writes as long as Xen is controlling P-states, which we do).

Ok. Thanks for the acknowledgment.

> Next I'd like to note that in our kernels we simply don't build
> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being
> suppressed, there's no consumer of the feature flag in our
> kernels.

With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked 
upstream as well as the repos on kernel.opensuse.org. In all of them 
sched.o is unconditionally included in the Makefile.
So is there a build patch to exclude this file for builds of distro Xen 
kernels?

Regards,
Andre.


> So I would think that your suggested change is appropriate,
> but I'm adding Konrad to Cc as these days he's the one to pick
> this up.
>
> Jan

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23  9:14   ` Andre Przywara
@ 2012-05-23  9:43     ` Jan Beulich
  2012-05-23  9:52       ` Andre Przywara
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-05-23  9:43 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 23.05.12 at 11:14, Andre Przywara <andre.przywara@amd.com> wrote:
> On 05/23/2012 09:34 AM, Jan Beulich wrote:
>> Next I'd like to note that in our kernels we simply don't build
>> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being
>> suppressed, there's no consumer of the feature flag in our
>> kernels.
> 
> With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked 
> upstream as well as the repos on kernel.opensuse.org. In all of them 
> sched.o is unconditionally included in the Makefile.
> So is there a build patch to exclude this file for builds of distro Xen 
> kernels?

Did you perhaps overlook

disabled-obj-$(CONFIG_XEN) := hypervisor.o mshyperv.o perfctr-watchdog.o \
			      perf_event.o sched.o vmware.o

in that very Makefile?

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23  9:43     ` Jan Beulich
@ 2012-05-23  9:52       ` Andre Przywara
  2012-05-23 10:01         ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Andre Przywara @ 2012-05-23  9:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

On 05/23/2012 11:43 AM, Jan Beulich wrote:
>>>> On 23.05.12 at 11:14, Andre Przywara<andre.przywara@amd.com>  wrote:
>> On 05/23/2012 09:34 AM, Jan Beulich wrote:
>>> Next I'd like to note that in our kernels we simply don't build
>>> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being
>>> suppressed, there's no consumer of the feature flag in our
>>> kernels.
>>
>> With "our kernels" you mean OpenSuSE/SLES kernels? I quickly checked
>> upstream as well as the repos on kernel.opensuse.org. In all of them
>> sched.o is unconditionally included in the Makefile.
>> So is there a build patch to exclude this file for builds of distro Xen
>> kernels?
>
> Did you perhaps overlook
>
> disabled-obj-$(CONFIG_XEN) := hypervisor.o mshyperv.o perfctr-watchdog.o \
> 			      perf_event.o sched.o vmware.o
>
> in that very Makefile?

No, I just utterly ignored it ;-)

Thanks for the hint.
But this works only for SuSE since you build different kernels for Xen 
and without Xen, right?
Starting with 3.x I now have only a single kernel image which I use for 
both native and as Dom0.

So the patch to disable APERFMPERF is still useful, at least for 
upstream, right?

Regards,
Andre.


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23  9:52       ` Andre Przywara
@ 2012-05-23 10:01         ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2012-05-23 10:01 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel, Konrad Rzeszutek Wilk

>>> On 23.05.12 at 11:52, Andre Przywara <andre.przywara@amd.com> wrote:
> But this works only for SuSE since you build different kernels for Xen 
> and without Xen, right?
> Starting with 3.x I now have only a single kernel image which I use for 
> both native and as Dom0.
> 
> So the patch to disable APERFMPERF is still useful, at least for 
> upstream, right?

Oh yes, absolutely. I just wanted to show (not the least to
myself) that we're already taking care of this issue.

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23  7:34 ` Jan Beulich
  2012-05-23  9:14   ` Andre Przywara
@ 2012-05-23 11:11   ` Andrew Cooper
  2012-05-23 12:18     ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2012-05-23 11:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andre Przywara, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk

On 23/05/12 08:34, Jan Beulich wrote:
>>>> On 22.05.12 at 18:07, Andre Przywara <andre.przywara@amd.com> wrote:
>> while testing some APERF/MPERF semantics I discovered that this feature 
>> is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID bit, 
>> leading to costly RDMSR traps (a few 100,000s during a kernel compile) 
>> and bogus values due to VCPU migration during the measurement.
>> The attached patch explicitly disables this CPU capability inside the 
>> Linux kernel, I couldn't measure any APERF/MPERF reads anymore with the 
>> patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we could 
>> as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am not 
>> sure if it's worth to do so.
>>
>> Awaiting your comments.
> First of all I'm of the opinion that this indeed should not be
> masked in the hypervisor - there's no reason to disallow the
> guest to read these registers (but we should of course deny
> writes as long as Xen is controlling P-states, which we do).

I am sorry but I am going to have to disagree with you on this point.

We should not be advertising this feature to any guest at all if we
can't provide an implementation which works as native expects.  Else we
are failing in our job of virtualisation.

There is 'dom0_vcpus_pin'[1] which identity pins dom0 vcpus, and
prevents update of the affinity masks, and appears to conditionally
allow access to certain MSRs.  I think it would be fine to expose this
feature iff dom0s vcpus are pinned in this fashion.  That way, the
measurement should succeed, even if dom0 only has read access to the MSRs.

The same logic applies to domU guests, although there is currently no
way I can see to get domU domains into a state where advertising it
would be safe.

~Andrew

[1] I am however about to submit a patch (for inclusion after the
feature freeze) which adds more semantics to this command line option. 
There is currently no way ask Xen to pin dom0s vcpus from domain create
time but allow their affinity masks to be updated later.  The XenServer
performance team have been experimenting and have found performance
benefits from being able to do this.

>
> Next I'd like to note that in our kernels we simply don't build
> arch/x86/kernel/cpu/sched.o. Together with CPU_FREQ being
> suppressed, there's no consumer of the feature flag in our
> kernels.
>
> So I would think that your suggested change is appropriate,
> but I'm adding Konrad to Cc as these days he's the one to pick
> this up.
>
> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23 11:11   ` Andrew Cooper
@ 2012-05-23 12:18     ` Jan Beulich
  2012-05-23 13:21       ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2012-05-23 12:18 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Andre Przywara, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk

>>> On 23.05.12 at 13:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> On 23/05/12 08:34, Jan Beulich wrote:
>> First of all I'm of the opinion that this indeed should not be
>> masked in the hypervisor - there's no reason to disallow the
>> guest to read these registers (but we should of course deny
>> writes as long as Xen is controlling P-states, which we do).
> 
> I am sorry but I am going to have to disagree with you on this point.
> 
> We should not be advertising this feature to any guest at all if we
> can't provide an implementation which works as native expects.  Else we
> are failing in our job of virtualisation.

That's perhaps a matter of the position you take - for HVM, I
would agree with yours, but there's many more aspects (not
the least related to accessing other MSRs) that we fail to
"properly" virtualize for PV guests - my position is that it is the
nature of PV that guest kernels have to be aware of being
virtualized (and hence stay away from doing certain things
unless [they think] they know what they're doing).

> There is 'dom0_vcpus_pin'[1] which identity pins dom0 vcpus, and
> prevents update of the affinity masks, and appears to conditionally
> allow access to certain MSRs.  I think it would be fine to expose this
> feature iff dom0s vcpus are pinned in this fashion.  That way, the
> measurement should succeed, even if dom0 only has read access to the MSRs.

Restricting it to this case would be too restrictive - it really
makes sense at any time where the vCPU's affinity has exactly
one bit set (or to be precise, the intersection of it and the set
of online pCPU-s).

Jan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23 12:18     ` Jan Beulich
@ 2012-05-23 13:21       ` Andrew Cooper
  2012-05-23 13:31         ` Andre Przywara
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Cooper @ 2012-05-23 13:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andre Przywara, Jeremy Fitzhardinge, xen-devel,
	Konrad Rzeszutek Wilk

On 23/05/12 13:18, Jan Beulich wrote:
>>>> On 23.05.12 at 13:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> On 23/05/12 08:34, Jan Beulich wrote:
>>> First of all I'm of the opinion that this indeed should not be
>>> masked in the hypervisor - there's no reason to disallow the
>>> guest to read these registers (but we should of course deny
>>> writes as long as Xen is controlling P-states, which we do).
>> I am sorry but I am going to have to disagree with you on this point.
>>
>> We should not be advertising this feature to any guest at all if we
>> can't provide an implementation which works as native expects.  Else we
>> are failing in our job of virtualisation.
> That's perhaps a matter of the position you take - for HVM, I
> would agree with yours, but there's many more aspects (not
> the least related to accessing other MSRs) that we fail to
> "properly" virtualize for PV guests - my position is that it is the
> nature of PV that guest kernels have to be aware of being
> virtualized (and hence stay away from doing certain things
> unless [they think] they know what they're doing).
>
>> There is 'dom0_vcpus_pin'[1] which identity pins dom0 vcpus, and
>> prevents update of the affinity masks, and appears to conditionally
>> allow access to certain MSRs.  I think it would be fine to expose this
>> feature iff dom0s vcpus are pinned in this fashion.  That way, the
>> measurement should succeed, even if dom0 only has read access to the MSRs.
> Restricting it to this case would be too restrictive - it really
> makes sense at any time where the vCPU's affinity has exactly
> one bit set (or to be precise, the intersection of it and the set
> of online pCPU-s).
>
> Jan
>

That is unfortunately too lax.  You also need to be able to guarantee
that the affinity mask is not updated (and vcpu rescheduled) while in
the middle of a measurement.  Xen cant sensibly work out if or when a
guest is taking a measurement, nor can dom0.  So the only safe solution
I can see is for Xen to prevent the affinity masks from ever being
updated.  With more thought, this would also preclude migration of a
guest to another host.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-22 22:44       ` Andre Przywara
@ 2012-05-23 13:26         ` Konrad Rzeszutek Wilk
  2012-05-24 13:24           ` Andre Przywara
  0 siblings, 1 reply; 21+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-05-23 13:26 UTC (permalink / raw)
  To: Andre Przywara; +Cc: Jeremy Fitzhardinge, xen-devel

On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:
> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
> >On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
> >>On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
> >>>On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
> >>>>Hi,
> >>>>
> >>>>while testing some APERF/MPERF semantics I discovered that this
> >>>>feature is enabled in Xen Dom0, but is not reliable.
> >>>>The Linux kernel's scheduler uses this feature if it sees the CPUID
> >>>>bit, leading to costly RDMSR traps (a few 100,000s during a kernel
> >>>>compile) and bogus values due to VCPU migration during the
> >>>
> >>>Can you point me to the Linux scheduler code that does this? Thanks.
> >>
> >>arch/x86/kernel/cpu/sched.c contains code to read out and compute
> >>APERF/MPERF registers. I added a Xen debug-key to dump a usage
> >>counter added in traps.c and thus could prove that it is actually
> >>the kernel that accesses these registers.
> >>As far as I understood this the idea is to learn about boosting and
> >>down-clocking (P-states) to get a fairer view on the actual
> >>computing time a process consumed.
> >
> >Looks like its looking for this:
> >
> >X86_FEATURE_APERFMPERF
> >
> >Perhaps masking that should do it? Something along this in enlighten.c:
> >
> >          cpuid_leaf1_edx_mask =
> >                  ~((1<<  X86_FEATURE_MCE)  |  /* disable MCE */
> >                    (1<<  X86_FEATURE_MCA)  |  /* disable MCA */
> >                    (1<<  X86_FEATURE_MTRR) |  /* disable MTRR */
> >                    (1<<  X86_FEATURE_ACC));   /* thermal monitoring
> >
> >would be more appropiate?
> >
> >Or is that attribute on a different leaf?
> 
> Right, it is bit 0 on level 6. That's why I couldn't use any of the
> predefined masks and I didn't feel like inventing a new one just for
> this single bit.
> We could as well explicitly use clear_cpu_cap somewhere, but I
> didn't find any code place in the Xen tree already doing this,
> instead it looks like it belongs to where I put it (we handle leaf 5
> in a special way already here)

OK, can you resend the patch please, looking similar to what you sent
earlier, but do use a #define if possible (you can have the #define
in that file) and an comment explaining why this is neccessary -
and point to the Linux source code that uses this.

Thanks!
.. snip..
> >>>>P.S. Of course this doesn't fix pure userland software like
> >>>>cpupower, but I would consider this in the user's responsibility to
> >>>
> >>>Which would not work anymore as the cpufreq support is disabled
> >>>when it boots under Xen.
> >>
> >>Do you mean with "anymore" in a future kernel? I tested this on
> >>3.4.0 and cpupower monitor worked fine. Right, cpufreq is not
> >>enabled, but cpupower uses the /dev/cpu/<n>/msr device file to
> >>directly read the MSRs. So I get this output if run on an idle Dom0:
> >
> >Ahh. Neat. Will have to play with that.
> 
> Bad news is we cannot forbid cpupower querying the feature directly
> using the CPUID instruction in PV guests. Only we could patch it to
> use /proc/cpuinfo readout instead, as this reflects the kernel view
> of available features. With my patch aperfmperf is no longer there.

Looks like a patch to cpupower should be cooked up too?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23 13:21       ` Andrew Cooper
@ 2012-05-23 13:31         ` Andre Przywara
  0 siblings, 0 replies; 21+ messages in thread
From: Andre Przywara @ 2012-05-23 13:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Jeremy Fitzhardinge, xen-devel, Jan Beulich,
	Konrad Rzeszutek Wilk

On 05/23/2012 03:21 PM, Andrew Cooper wrote:
> On 23/05/12 13:18, Jan Beulich wrote:
>>>>> On 23.05.12 at 13:11, Andrew Cooper<andrew.cooper3@citrix.com>  wrote:
>>> On 23/05/12 08:34, Jan Beulich wrote:
>>>> First of all I'm of the opinion that this indeed should not be
>>>> masked in the hypervisor - there's no reason to disallow the
>>>> guest to read these registers (but we should of course deny
>>>> writes as long as Xen is controlling P-states, which we do).
>>> I am sorry but I am going to have to disagree with you on this point.
>>>
>>> We should not be advertising this feature to any guest at all if we
>>> can't provide an implementation which works as native expects.  Else we
>>> are failing in our job of virtualisation.
>> That's perhaps a matter of the position you take - for HVM, I
>> would agree with yours, but there's many more aspects (not
>> the least related to accessing other MSRs) that we fail to
>> "properly" virtualize for PV guests - my position is that it is the
>> nature of PV that guest kernels have to be aware of being
>> virtualized (and hence stay away from doing certain things
>> unless [they think] they know what they're doing).
>>
>>> There is 'dom0_vcpus_pin'[1] which identity pins dom0 vcpus, and
>>> prevents update of the affinity masks, and appears to conditionally
>>> allow access to certain MSRs.  I think it would be fine to expose this
>>> feature iff dom0s vcpus are pinned in this fashion.  That way, the
>>> measurement should succeed, even if dom0 only has read access to the MSRs.
>> Restricting it to this case would be too restrictive - it really
>> makes sense at any time where the vCPU's affinity has exactly
>> one bit set (or to be precise, the intersection of it and the set
>> of online pCPU-s).
>>
>> Jan
>>
>
> That is unfortunately too lax.  You also need to be able to guarantee
> that the affinity mask is not updated (and vcpu rescheduled) while in
> the middle of a measurement.  Xen cant sensibly work out if or when a
> guest is taking a measurement, nor can dom0.  So the only safe solution
> I can see is for Xen to prevent the affinity masks from ever being
> updated.  With more thought, this would also preclude migration of a
> guest to another host.

Iff we really care about this feature, we could as well emulate it:
On every VCPU migration we calculate the difference between the two 
pCPU's values of APERF and MPERF. On the trap this value is added to the 
current MSR value. Similar to what is done with the TSC in HVM.
We trap on every MSR access anyway, so the performance impact is only 
four HV rdmsrs on every VCPU migration.

Only I am not sure if this is really a problem we should solve or if 
wouldn't be easier for us and clearer to the user to just discourage 
those accesses.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-23 13:26         ` Konrad Rzeszutek Wilk
@ 2012-05-24 13:24           ` Andre Przywara
  2012-05-29 10:54             ` Andre Przywara
  0 siblings, 1 reply; 21+ messages in thread
From: Andre Przywara @ 2012-05-24 13:24 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jeremy Fitzhardinge, xen-devel

On 05/23/2012 03:26 PM, Konrad Rzeszutek Wilk wrote:
> On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:
>> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
>>>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>>>>>> Hi,
>>>>>>
>>>>>> while testing some APERF/MPERF semantics I discovered that this
>>>>>> feature is enabled in Xen Dom0, but is not reliable.
>>>>>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>>>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>>>>>> compile) and bogus values due to VCPU migration during the
>>>>>
>>>>> Can you point me to the Linux scheduler code that does this? Thanks.
>>>>
>>>> arch/x86/kernel/cpu/sched.c contains code to read out and compute
>>>> APERF/MPERF registers. I added a Xen debug-key to dump a usage
>>>> counter added in traps.c and thus could prove that it is actually
>>>> the kernel that accesses these registers.
>>>> As far as I understood this the idea is to learn about boosting and
>>>> down-clocking (P-states) to get a fairer view on the actual
>>>> computing time a process consumed.
>>>
>>> Looks like its looking for this:
>>>
>>> X86_FEATURE_APERFMPERF
>>>
>>> Perhaps masking that should do it? Something along this in enlighten.c:
>>>
>>>           cpuid_leaf1_edx_mask =
>>>                   ~((1<<   X86_FEATURE_MCE)  |  /* disable MCE */
>>>                     (1<<   X86_FEATURE_MCA)  |  /* disable MCA */
>>>                     (1<<   X86_FEATURE_MTRR) |  /* disable MTRR */
>>>                     (1<<   X86_FEATURE_ACC));   /* thermal monitoring
>>>
>>> would be more appropiate?
>>>
>>> Or is that attribute on a different leaf?
>>
>> Right, it is bit 0 on level 6. That's why I couldn't use any of the
>> predefined masks and I didn't feel like inventing a new one just for
>> this single bit.
>> We could as well explicitly use clear_cpu_cap somewhere, but I
>> didn't find any code place in the Xen tree already doing this,
>> instead it looks like it belongs to where I put it (we handle leaf 5
>> in a special way already here)
>
> OK, can you resend the patch please, looking similar to what you sent
> earlier, but do use a #define if possible (you can have the #define
> in that file) and an comment explaining why this is neccessary -
> and point to the Linux source code that uses this.

Well, I was about to do this and wanted to see if this has any 
performance impact - only to discover that 3.4 does not trigger it 
anymore. After some debugging it turns out the guy reading APERF/MPERF 
was not arch/x86/kernel/cpu/sched.c, but drivers/cpufreq/mperf.c. So 
with disabling cpufreq the only real user is gone already.
So the patch is kind of pointless as it on 3.4 with cpufreq already 
disabled. Remains to be investigated why sched.c is not called (I added 
a usage counter, it stays at zero).
To avoid future mis-uses of APERF/MPERF by the kernel, I'd like to add 
the patch anyway. I will send it again when I have a clearer picture of 
this.

.. snip..
> Looks like a patch to cpupower should be cooked up too?

I will contact the author.

Regards,
Andre.

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
  2012-05-24 13:24           ` Andre Przywara
@ 2012-05-29 10:54             ` Andre Przywara
  0 siblings, 0 replies; 21+ messages in thread
From: Andre Przywara @ 2012-05-29 10:54 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Jeremy Fitzhardinge, xen-devel

On 05/24/2012 03:24 PM, Andre Przywara wrote:
> On 05/23/2012 03:26 PM, Konrad Rzeszutek Wilk wrote:
>> On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:
>>> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
>>>>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> while testing some APERF/MPERF semantics I discovered that this
>>>>>>> feature is enabled in Xen Dom0, but is not reliable.
>>>>>>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>>>>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>>>>>>> compile) and bogus values due to VCPU migration during the
>>>>>>
>>>>>> Can you point me to the Linux scheduler code that does this? Thanks.
>>>>>
>>>>> arch/x86/kernel/cpu/sched.c contains code to read out and compute
>>>>> APERF/MPERF registers. I added a Xen debug-key to dump a usage
>>>>> counter added in traps.c and thus could prove that it is actually
>>>>> the kernel that accesses these registers.
>>>>> As far as I understood this the idea is to learn about boosting and
>>>>> down-clocking (P-states) to get a fairer view on the actual
>>>>> computing time a process consumed.
>>>>
>>>> Looks like its looking for this:
>>>>
>>>> X86_FEATURE_APERFMPERF
>>>>
>>>> Perhaps masking that should do it? Something along this in enlighten.c:
>>>>
>>>> cpuid_leaf1_edx_mask =
>>>> ~((1<< X86_FEATURE_MCE) | /* disable MCE */
>>>> (1<< X86_FEATURE_MCA) | /* disable MCA */
>>>> (1<< X86_FEATURE_MTRR) | /* disable MTRR */
>>>> (1<< X86_FEATURE_ACC)); /* thermal monitoring
>>>>
>>>> would be more appropiate?
>>>>
>>>> Or is that attribute on a different leaf?
>>>
>>> Right, it is bit 0 on level 6. That's why I couldn't use any of the
>>> predefined masks and I didn't feel like inventing a new one just for
>>> this single bit.
>>> We could as well explicitly use clear_cpu_cap somewhere, but I
>>> didn't find any code place in the Xen tree already doing this,
>>> instead it looks like it belongs to where I put it (we handle leaf 5
>>> in a special way already here)
>>
>> OK, can you resend the patch please, looking similar to what you sent
>> earlier, but do use a #define if possible (you can have the #define
>> in that file) and an comment explaining why this is neccessary -
>> and point to the Linux source code that uses this.
>
> Well, I was about to do this and wanted to see if this has any
> performance impact - only to discover that 3.4 does not trigger it
> anymore. After some debugging it turns out the guy reading APERF/MPERF
> was not arch/x86/kernel/cpu/sched.c, but drivers/cpufreq/mperf.c. So
> with disabling cpufreq the only real user is gone already.
> So the patch is kind of pointless as it on 3.4 with cpufreq already
> disabled. Remains to be investigated why sched.c is not called (I added
> a usage counter, it stays at zero).

The scheduler code accessing the MSRs is disabled by default:
kernel/sched/features.h: SCHED_FEAT(ARCH_POWER, false)
I enabled this and saw the traps.
So the only kernel user remaining is/was cpufreq (and /dev/cpu/<n>/msr).

> To avoid future mis-uses of APERF/MPERF by the kernel, I'd like to add
> the patch anyway. I will send it again when I have a clearer picture of
> this.

The patch will arrive in a minute. Although this is kind of pointless 
for 3.4, I put in a stable tag. Also I find the removed aperfmperf flag 
from /proc/cpuinfo useful, this can be handy for Xen aware power 
management tools.

> .. snip..
>> Looks like a patch to cpupower should be cooked up too?
>
> I will contact the author.

Have done this. We are still discussing the best solution with Thomas 
Renninger and Jan Beulich, but a simple warning (and maybe a hint to 
xenpm) looks the best for the time being. Not sure if proper 
emulation/virtualization is worth the effort.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2012-05-29 10:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-22 16:07 [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels Andre Przywara
2012-05-22 16:52 ` Jeremy Fitzhardinge
2012-05-22 17:08   ` Malcolm Crossley
2012-05-23  8:10     ` Jan Beulich
2012-05-22 20:46   ` Andre Przywara
2012-05-22 17:18 ` Konrad Rzeszutek Wilk
2012-05-22 21:02   ` Andre Przywara
2012-05-22 21:00     ` Konrad Rzeszutek Wilk
2012-05-22 22:44       ` Andre Przywara
2012-05-23 13:26         ` Konrad Rzeszutek Wilk
2012-05-24 13:24           ` Andre Przywara
2012-05-29 10:54             ` Andre Przywara
2012-05-23  7:34 ` Jan Beulich
2012-05-23  9:14   ` Andre Przywara
2012-05-23  9:43     ` Jan Beulich
2012-05-23  9:52       ` Andre Przywara
2012-05-23 10:01         ` Jan Beulich
2012-05-23 11:11   ` Andrew Cooper
2012-05-23 12:18     ` Jan Beulich
2012-05-23 13:21       ` Andrew Cooper
2012-05-23 13:31         ` Andre Przywara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).