From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754520Ab0LMIXE (ORCPT ); Mon, 13 Dec 2010 03:23:04 -0500 Received: from mga11.intel.com ([192.55.52.93]:17821 "EHLO mga11.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751908Ab0LMIXB (ORCPT ); Mon, 13 Dec 2010 03:23:01 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.59,335,1288594800"; d="scan'208";a="635734991" Subject: Re: [RFC PATCH 2/3 v3] perf: Implement Nehalem uncore pmu From: Lin Ming To: Stephane Eranian Cc: Peter Zijlstra , Andi Kleen , Ingo Molnar , Frederic Weisbecker , Arjan van de Ven , lkml In-Reply-To: References: <1291267223.2405.314.camel@minggr.sh.intel.com> <1291922467.6803.44.camel@twins> <1291978036.6803.95.camel@twins> Content-Type: text/plain; charset="UTF-8" Date: Mon, 13 Dec 2010 16:27:28 +0800 Message-ID: <1292228848.10384.107.camel@minggr.sh.intel.com> Mime-Version: 1.0 X-Mailer: Evolution 2.30.2 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, 2010-12-11 at 13:49 +0800, Stephane Eranian wrote: > Hi, > > Ok, so I have an explanation for what we are seeing. In fact, what > bothered me in all > of this is that I did not recall ever running into this problem of > double-interrupt with HT > when I implemented the perfmon support for uncore sampling. The reason > is in fact > real simple. > > You are getting interrupts on both threads because you have enabled uncore PMU > on all CPUs, in uncore_cpu_starting(): > + rdmsrl(MSR_IA32_DEBUGCTLMSR, val); > + wrmsrl(MSR_IA32_DEBUGCTLMSR, val | DEBUGCTLMSR_ENABLE_UNCORE_PMI); Yeah! Thanks for the catch. > > You need to do that only on ONE of the two siblings. In fact, you want > that only ONCE > per socket. You can do this on the first CPU to use the uncore to add > something a bit > more dynamic and to make sure you have some control over where the > overhead is applied. > Once you do that, only one CPU/socket will get the interrupt and all > will be fine. I'll update the patches. Thanks, Lin Ming > > > > On Fri, Dec 10, 2010 at 11:47 AM, Peter Zijlstra wrote: > > On Fri, 2010-12-10 at 00:46 +0100, Stephane Eranian wrote: > >> Hi, > >> > >> So I have tested this patch a bit on WSM and as I expected there > >> are issues with sampling. > >> > >> When HT is on, both siblings CPUs get the interrupt. The HW does not > >> allow you to only point interrupts to a single HT thread (CPU). > > > > Egads, how ugly :/ > > > >> I did verify that indeed both threads get the interrupt and that you have a > >> race condition. Both sibling CPUs stop uncore, get the status. They may get > >> the same overflow status. Both will pass the uncore->active_mask because > >> it's shared among siblings cores. Thus, you have a race for the whole > >> interrupt handler execution. > >> > >> You need some serialization in there. But the patch does not address this. > >> The problem is different from the back-to-back interrupt issue that > >> Don worked on. > >> The per-cpu marked/handled trick cannot work to avoid this problem. > >> > >> You cannot simply say "the lowest indexed" CPU of a sibling pair > >> handles the interrupt > >> because you don't know if this in an uncore intr, core interrupt or > >> something else. You > >> need to check. That means each HT thread needs to check uncore > >> ovfl_status. IF the > >> status is zero, then return. Otherwise, you need to do a 2nd level > >> check before you can > >> execute the handler. You need to know if the sibling CPU has already > >> "consumed" that > >> interrupt. > >> > >> I think you need some sort of generation counter per physical core and > >> per HT thread. > >> On interrupt, you could do something along the line of: > >> if (mycpu->intr_count == mysibling->intr_count) { > >> then mycpu->intr_count++ > >> execute intr_handler() > >> } else { > >> mycpu->intr_count++ > >> return; > >> } > >> Of course, the above needs some atomicity and ad locking > > > > Does that guarantee that the same sibling handles all interrupts? Since > > a lot of the infrastructure uses local*_t we're not good with cross-cpu > > stuff. > > > > Damn what a mess.. we need to serialize enough for both cpus to at least > > see the overflow bit.. maybe something like: > > > > > > struct intel_percore { > > ... > > atomic_t uncore_barrier; > > }; > > > > void uncore_barrier(void) > > { > > struct intel_percore *percore = this_cpu_ptr(cpu_hw_events)->percore; > > int armed; > > > > armed = atomic_cmpxchg(&percore->uncore_barrier, 0, 1) == 0; > > if (armed) { > > /* we armed, it, now wait for completion */ > > while (atomic_read(&percore->uncore_barrier)) > > cpu_relax(); > > } else { > > /* our sibling must have, decrement it */ > > if (atomic_cmpxchg(&percore->uncore_barrier, 1, 0) != 1) > > BUG(); > > } > > } > > > > Then have something like: > > > > handle_uncore_interrupt() > > { > > u64 overflow = rdmsrl(MSR_UNCORE_PERF_GLOBAL_OVF_STATUS); > > int cpu; > > > > if (!overflow) > > return 0; /* not our interrupt to handle */ > > > > uncore_barrier(); /* wait so our sibling will also observe the overflow */ > > > > cpu = smp_processor_id(); > > if (cpu != cpumask_first(topology_thread_cpumask(cpu))) > > return 1; /* our sibling will handle it, eat the NMI */ > > > > /* OK, we've got an overflow and we're the first CPU in the thread mask */ > > > > ... do fancy stuff ... > > > > return 1; /* we handled it, eat the NMI */ > > } > > > > > >> (but I don't think you can use locks in NMI context). > > > > You can, as long as they're never used from !NMI, its icky, but it > > works. > > > >> This makes me wonder if vectoring uncore to NMI is really needed, > >> given you cannot > >> correlated to an IP, incl. a kernel IP. If we were to vector to a > >> dedicated (lower prio) > >> vector, then we could use the trick of saying the lowest indexed CPU in a pair > >> handles the interrupt (because we would already know this is an uncore > >> interrupt). > >> This would be much simpler. Price: not samples in kernel's critical > >> section. But those > >> are useless anyway with uncore events. > > > > But the uncore uses the same PMI line, right? You cannot point the > > uncore to another vector. /me goes find the docs -- ok, its too early in > > the morning to clearly parse that... > > > > Besides, people asked for the sampling thing didn't they (also we need > > it to fold the count to avoid overflow on 48bit). Also the PAPI people > > even want per-task uncore counters because that's the only thing PAPI > > can do. > >