From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
Date: Wed, 23 May 2012 09:26:14 -0400
Message-ID: <20120523132614.GA15660@phenom.dumpdata.com>
References: <4FBBB9AF.6020704@amd.com> <20120522171858.GB19601@phenom.dumpdata.com> <4FBBFEC9.6040100@amd.com> <20120522210031.GA25983@phenom.dumpdata.com> <4FBC16B7.9080307@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path:
Content-Disposition: inline
In-Reply-To: <4FBC16B7.9080307@amd.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Andre Przywara
Cc: Jeremy Fitzhardinge , xen-devel
List-Id: xen-devel@lists.xenproject.org

On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:
> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
> >On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
> >>On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
> >>>On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
> >>>>Hi,
> >>>>
> >>>>while testing some APERF/MPERF semantics I discovered that this
> >>>>feature is enabled in Xen Dom0, but is not reliable.
> >>>>The Linux kernel's scheduler uses this feature if it sees the CPUID
> >>>>bit, leading to costly RDMSR traps (a few 100,000s during a kernel
> >>>>compile) and bogus values due to VCPU migration during the
> >>>
> >>>Can you point me to the Linux scheduler code that does this? Thanks.
> >>
> >>arch/x86/kernel/cpu/sched.c contains code to read out and compute
> >>APERF/MPERF registers. I added a Xen debug-key to dump a usage
> >>counter added in traps.c and thus could prove that it is actually
> >>the kernel that accesses these registers.
> >>As far as I understood this the idea is to learn about boosting and
> >>down-clocking (P-states) to get a fairer view on the actual
> >>computing time a process consumed.
> >
> >Looks like it's looking for this:
> >
> >X86_FEATURE_APERFMPERF
> >
> >Perhaps masking that should do it? Something along this in enlighten.c:
> >
> >        cpuid_leaf1_edx_mask =
> >                ~((1<< X86_FEATURE_MCE) | /* disable MCE */
> >                (1<< X86_FEATURE_MCA) | /* disable MCA */
> >                (1<< X86_FEATURE_MTRR) | /* disable MTRR */
> >                (1<< X86_FEATURE_ACC)); /* thermal monitoring */
> >
> >would be more appropriate?
> >
> >Or is that attribute on a different leaf?
>
> Right, it is bit 0 on level 6. That's why I couldn't use any of the
> predefined masks and I didn't feel like inventing a new one just for
> this single bit.
> We could as well explicitly use clear_cpu_cap somewhere, but I
> didn't find any code place in the Xen tree already doing this,
> instead it looks like it belongs to where I put it (we handle leaf 5
> in a special way already here)

OK, can you resend the patch please, looking similar to what you sent
earlier, but do use a #define if possible (you can have the #define in
that file) and a comment explaining why this is necessary - and point
to the Linux source code that uses this.

Thanks!

.. snip..
> >>>>P.S. Of course this doesn't fix pure userland software like
> >>>>cpupower, but I would consider this in the user's responsibility to
> >>>
> >>>Which would not work anymore as the cpufreq support is disabled
> >>>when it boots under Xen.
> >>
> >>Do you mean with "anymore" in a future kernel? I tested this on
> >>3.4.0 and cpupower monitor worked fine. Right, cpufreq is not
> >>enabled, but cpupower uses the /dev/cpu//msr device file to
> >>directly read the MSRs. So I get this output if run on an idle Dom0:
> >
> >Ahh. Neat. Will have to play with that.
>
> Bad news is we cannot forbid cpupower querying the feature directly
> using the CPUID instruction in PV guests.
> Only we could patch it to
> use /proc/cpuinfo readout instead, as this reflects the kernel view
> of available features. With my patch aperfmperf is no longer there.

Looks like a patch to cpupower should be cooked up too?
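
Something along these lines, perhaps. This is an untested userland sketch
of the /proc/cpuinfo check Andre describes, not actual cpupower code: both
helper names are made up for illustration, and the only thing taken from
this thread is that the flag shows up as "aperfmperf" in the kernel's view
of the CPU features. It assumes the flags line fits in the buffer and only
looks at the first CPU's entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not an existing cpupower function): return 1 if
 * `flag` appears as a whole whitespace-separated token on a
 * /proc/cpuinfo-style "flags : ..." line, 0 otherwise.
 */
static int cpuinfo_line_has_flag(const char *line, const char *flag)
{
	const char *p;
	size_t len;

	assert(line && flag);
	len = strlen(flag);
	if (len == 0)
		return 0;

	/* Skip the "flags :" key if there is one. */
	p = strchr(line, ':');
	p = p ? p + 1 : line;

	while ((p = strstr(p, flag)) != NULL) {
		int starts = (p == line) || p[-1] == ' ' ||
			     p[-1] == '\t' || p[-1] == ':';
		char end = p[len];
		int ends = end == '\0' || end == ' ' ||
			   end == '\t' || end == '\n';

		if (starts && ends)
			return 1;
		p += len;	/* substring of a longer flag, keep looking */
	}
	return 0;
}

/*
 * Hypothetical helper: scan a cpuinfo-style file for `flag`.
 * Returns 1 if found, 0 if not, -1 if the file cannot be opened.
 */
static int cpuinfo_file_has_flag(const char *path, const char *flag)
{
	char buf[4096];	/* assumption: the flags line fits in here */
	FILE *f = fopen(path, "r");
	int found = 0;

	if (!f)
		return -1;

	while (fgets(buf, sizeof(buf), f)) {
		if (strncmp(buf, "flags", 5) != 0)
			continue;
		found = cpuinfo_line_has_flag(buf, flag);
		break;	/* first CPU only */
	}
	fclose(f);
	return found;
}
```

The tool would then call cpuinfo_file_has_flag("/proc/cpuinfo", "aperfmperf")
and skip the APERF/MPERF monitor when that returns 0, instead of trusting
the raw CPUID instruction, which a PV guest still sees unmasked.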