From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
Date: Tue, 22 May 2012 23:02:01 +0200
Message-ID: <4FBBFEC9.6040100@amd.com>
References: <4FBBB9AF.6020704@amd.com> <20120522171858.GB19601@phenom.dumpdata.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20120522171858.GB19601@phenom.dumpdata.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>> Hi,
>>
>> while testing some APERF/MPERF semantics I discovered that this
>> feature is enabled in Xen Dom0, but is not reliable.
>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>> compile) and bogus values due to VCPU migration during the
>
> Can you point me to the Linux scheduler code that does this? Thanks.

arch/x86/kernel/cpu/sched.c contains code that reads the APERF/MPERF
registers and computes with their values. I added a Xen debug-key to
dump a usage counter that I added in traps.c, and could thus prove that
it is actually the kernel that accesses these registers.
As far as I understand it, the idea is to learn about boosting and
down-clocking (P-states) to get a fairer view of the actual computing
time a process consumed.

>> measurement.
>> The attached patch explicitly disables this CPU capability inside
>> the Linux kernel, I couldn't measure any APERF/MPERF reads anymore
>> with the patch applied.
>> I am not sure if the PVOPS code is the right place to fix this, we
>> could as well do it in the HV's xen/arch/x86/traps.c:pv_cpuid().
>> Also when the Dom0 VCPUs are pinned, we could allow this, but I am
>> not sure if it's worth to do so.
>>
>> Awaiting your comments.
>>
>> Regards,
>> Andre.
>>
>> P.S. Of course this doesn't fix pure userland software like
>> cpupower, but I would consider this in the user's responsibility to
>
> Which would not work anymore as the cpufreq support is disabled
> when it boots under Xen.

By "anymore", do you mean in a future kernel? I tested this on 3.4.0
and cpupower monitor worked fine. Right, cpufreq is not enabled, but
cpupower uses the /dev/cpu//msr device file to read the MSRs directly.
So I get this output when it is run on an idle Dom0:

    |Mperf
CPU | C0   | Cx    | Freq
   0|  0.10| 99.90|  3473
....

Regards,
Andre.

>
>> not use these tools in Dom0, but instead use xenpm.
>>
>> --
>> Andre Przywara
>> AMD-Operating System Research Center (OSRC), Dresden, Germany
>
>> commit e802e47d85314b4541288e4a19d057e2ea885a28
>> Author: Andre Przywara
>> Date:   Tue May 22 15:13:07 2012 +0200
>>
>>     filter APERFMPERF feature in Xen to avoid kernel internal usage
>>
>>     Signed-off-by: Andre Przywara
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index 95dccce..71252d5 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -240,6 +240,11 @@ static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>>  		*dx = cpuid_leaf5_edx_val;
>>  		return;
>>
>> +	case 6:
>> +		/* Disabling APERFMPERF for kernel usage */
>> +		maskecx = ~(1U<< 0);
>> +		break;
>> +
>>  	case 0xb:
>>  		/* Suppress extended topology stuff */
>>  		maskebx = 0;
>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
>
>

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12
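
To illustrate what the "case 6" hunk in the quoted patch does, here is a
minimal standalone sketch in C. It is not the kernel code: filter_ecx(),
clear_bit_mask() and has_aperfmperf() are hypothetical helpers that only
mimic the maskecx step of xen_cpuid(). The sketch assumes that CPUID leaf 6,
ECX bit 0 is the APERF/MPERF capability bit, which is the bit the patch
clears.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 6, ECX bit 0 advertises the APERF/MPERF MSRs. */
#define CPUID_LEAF_THERMAL_POWER 6u
#define APERFMPERF_ECX_BIT       0u

/* Hypothetical helper: build a mask that clears exactly one feature bit. */
static uint32_t clear_bit_mask(unsigned int bit)
{
    assert(bit < 32);
    return ~(UINT32_C(1) << bit);
}

/*
 * Hypothetical stand-in for the ECX filtering step in xen_cpuid():
 * returns the ECX value a PV kernel would see for the given leaf.
 */
static uint32_t filter_ecx(uint32_t leaf, uint32_t raw_ecx)
{
    uint32_t maskecx = ~UINT32_C(0); /* default: pass every bit through */

    switch (leaf) {
    case CPUID_LEAF_THERMAL_POWER:
        /* Same effect as "maskecx = ~(1U << 0)" in the patch. */
        maskecx = clear_bit_mask(APERFMPERF_ECX_BIT);
        break;
    default:
        break;
    }
    return raw_ecx & maskecx;
}

/* Hypothetical helper: test the capability bit in a leaf 6 ECX value. */
static int has_aperfmperf(uint32_t leaf6_ecx)
{
    return (int)((leaf6_ecx >> APERFMPERF_ECX_BIT) & 1u);
}
```

With a raw leaf 6 ECX value that has bit 0 set, has_aperfmperf() reports the
capability before filtering and no longer reports it on the result of
filter_ecx(), while other leaves and other ECX bits pass through unchanged.
Userland tools that read the MSRs directly are not affected by such a mask.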