From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: [PATCH] RFC: Linux: disable APERF/MPERF feature in PV kernels
Date: Tue, 29 May 2012 12:54:04 +0200
Message-ID: <4FC4AACC.4000909@amd.com>
References: <4FBBB9AF.6020704@amd.com> <20120522171858.GB19601@phenom.dumpdata.com> <4FBBFEC9.6040100@amd.com> <20120522210031.GA25983@phenom.dumpdata.com> <4FBC16B7.9080307@amd.com> <20120523132614.GA15660@phenom.dumpdata.com> <4FBE36AA.3070903@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
In-Reply-To: <4FBE36AA.3070903@amd.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Konrad Rzeszutek Wilk
Cc: Jeremy Fitzhardinge, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 05/24/2012 03:24 PM, Andre Przywara wrote:
> On 05/23/2012 03:26 PM, Konrad Rzeszutek Wilk wrote:
>> On Wed, May 23, 2012 at 12:44:07AM +0200, Andre Przywara wrote:
>>> On 05/22/2012 11:00 PM, Konrad Rzeszutek Wilk wrote:
>>>> On Tue, May 22, 2012 at 11:02:01PM +0200, Andre Przywara wrote:
>>>>> On 05/22/2012 07:18 PM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, May 22, 2012 at 06:07:11PM +0200, Andre Przywara wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> while testing some APERF/MPERF semantics I discovered that this
>>>>>>> feature is enabled in Xen Dom0, but is not reliable.
>>>>>>> The Linux kernel's scheduler uses this feature if it sees the CPUID
>>>>>>> bit, leading to costly RDMSR traps (a few 100,000s during a kernel
>>>>>>> compile) and bogus values due to VCPU migration during the
>>>>>>
>>>>>> Can you point me to the Linux scheduler code that does this? Thanks.
>>>>>
>>>>> arch/x86/kernel/cpu/sched.c contains code to read out and compute
>>>>> APERF/MPERF registers.
>>>>> I added a Xen debug-key to dump a usage
>>>>> counter added in traps.c and thus could prove that it is actually
>>>>> the kernel that accesses these registers.
>>>>> As far as I understood, the idea is to learn about boosting and
>>>>> down-clocking (P-states) to get a fairer view on the actual
>>>>> computing time a process consumed.
>>>>
>>>> Looks like it's looking for this:
>>>>
>>>> X86_FEATURE_APERFMPERF
>>>>
>>>> Perhaps masking that should do it? Something along these lines in
>>>> enlighten.c:
>>>>
>>>> cpuid_leaf1_edx_mask =
>>>>         ~((1 << X86_FEATURE_MCE) |  /* disable MCE */
>>>>           (1 << X86_FEATURE_MCA) |  /* disable MCA */
>>>>           (1 << X86_FEATURE_MTRR) | /* disable MTRR */
>>>>           (1 << X86_FEATURE_ACC));  /* thermal monitoring */
>>>>
>>>> would be more appropriate?
>>>>
>>>> Or is that attribute on a different leaf?
>>>
>>> Right, it is bit 0 on level 6. That's why I couldn't use any of the
>>> predefined masks and I didn't feel like inventing a new one just for
>>> this single bit.
>>> We could as well explicitly use clear_cpu_cap somewhere, but I
>>> didn't find any place in the Xen tree already doing this;
>>> instead it looks like it belongs where I put it (we handle leaf 5
>>> in a special way already here)
>>
>> OK, can you resend the patch please, looking similar to what you sent
>> earlier, but do use a #define if possible (you can have the #define
>> in that file) and a comment explaining why this is necessary -
>> and point to the Linux source code that uses this.
>
> Well, I was about to do this and wanted to see if this has any
> performance impact - only to discover that 3.4 does not trigger it
> anymore. After some debugging it turns out the guy reading APERF/MPERF
> was not arch/x86/kernel/cpu/sched.c, but drivers/cpufreq/mperf.c. So
> with disabling cpufreq the only real user is gone already.
> So the patch is kind of pointless on 3.4 with cpufreq already
> disabled.
> Remains to be investigated why sched.c is not called (I added
> a usage counter, it stays at zero).

The scheduler code accessing the MSRs is disabled by default:
kernel/sched/features.h: SCHED_FEAT(ARCH_POWER, false)
I enabled this and saw the traps. So the only kernel user remaining
is/was cpufreq (and /dev/cpu//msr).

> To avoid future mis-uses of APERF/MPERF by the kernel, I'd like to add
> the patch anyway. I will send it again when I have a clearer picture of
> this.

The patch will arrive in a minute. Although this is kind of pointless
for 3.4, I put in a stable tag. I also find the removal of the
aperfmperf flag from /proc/cpuinfo useful; it can be handy for
Xen-aware power management tools.

> .. snip..
>> Looks like a patch to cpupower should be cooked up too?
>
> I will contact the author.

Have done this. We are still discussing the best solution with Thomas
Renninger and Jan Beulich, but a simple warning (and maybe a hint to
xenpm) looks best for the time being. Not sure if proper
emulation/virtualization is worth the effort.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
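
For illustration only, the masking idea discussed in this thread (hide
bit 0 of CPUID level 6, ECX, from the PV kernel) could be sketched in
plain user-space C roughly as follows. This is not the patch itself:
the macro names and the helper sketch_filter_ecx() are made up for this
example, and the real change would sit in the Xen CPUID handling of the
kernel rather than in a standalone function.

```c
#include <stdint.h>

/*
 * CPUID leaf 6 is the thermal and power management leaf; bit 0 of ECX
 * advertises the APERF/MPERF MSRs. Both macro names below are
 * hypothetical and exist only for this sketch.
 */
#define SKETCH_CPUID_THERM_POWER_LEAF 6u
#define SKETCH_APERFMPERF_BIT         0u

/*
 * Hypothetical helper: given the CPUID leaf that was queried and the
 * raw ECX value, return the ECX value the guest kernel should see.
 * Only leaf 6 is touched; all other leaves pass through unchanged.
 */
static uint32_t sketch_filter_ecx(uint32_t leaf, uint32_t ecx)
{
    if (leaf == SKETCH_CPUID_THERM_POWER_LEAF)
        ecx &= ~(UINT32_C(1) << SKETCH_APERFMPERF_BIT);
    return ecx;
}
```

With the bit cleared, a kernel that tests the feature flag before
touching the MSRs would never issue the trapping RDMSR, and the
aperfmperf flag would no longer show up in /proc/cpuinfo.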