From mboxrd@z Thu Jan 1 00:00:00 1970
From: Boris Ostrovsky
Subject: Re: [PATCH v1 00/13] x86/PMU: Xen PMU PV support
Date: Thu, 12 Sep 2013 10:58:33 -0400
Message-ID: <5231D699.9050809@oracle.com>
References: <1378826470-4085-1-git-send-email-boris.ostrovsky@oracle.com>
 <522F584002000078000F2203@nat28.tlf.novell.com>
 <522F3F0F.9060207@oracle.com>
 <5230A1D3.8070805@eu.citrix.com>
 <5230B4F1.5060207@oracle.com>
 <52318BC3.4070907@eu.citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail6.bemta4.messagelabs.com ([85.158.143.247])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1VK8KE-0001MA-Le
 for xen-devel@lists.xenproject.org; Thu, 12 Sep 2013 14:56:54 +0000
In-Reply-To: <52318BC3.4070907@eu.citrix.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: George Dunlap
Cc: suravee.suthikulpanit@amd.com, jacob.shin@amd.com, eddie.dong@intel.com,
 dietmar.hahn@ts.fujitsu.com, Jan Beulich, jun.nakajima@intel.com, xen-devel
List-Id: xen-devel@lists.xenproject.org

On 09/12/2013 05:39 AM, George Dunlap wrote:
> On 11/09/13 19:22, Boris Ostrovsky wrote:
>> On 09/11/2013 01:01 PM, George Dunlap wrote:
>>> On 10/09/13 16:47, Boris Ostrovsky wrote:
>>>> On 09/10/2013 11:34 AM, Jan Beulich wrote:
>>>>>>>> On 10.09.13 at 17:20, Boris Ostrovsky
>>>>>>>> wrote:
>>>>>> This version has following limitations:
>>>>>> * For accurate profiling of dom0/Xen dom0 VCPUs should be pinned.
>>>>>> * Hypervisor code is only profiled on processors that have
>>>>>> running dom0 VCPUs
>>>>>> on them.
>>>>> With that I assume this is an RFC rather than full-fledged
>>>>> submission?
>>>>
>>>> I was thinking that this would be something like stage 1
>>>> implementation (and
>>>> probably should have mentioned this in the cover letter).
>>>>
>>>> For this stage I wanted to confine all changes on Linux side to xen
>>>> subtrees.
>>>> Properly addressing the above limitation would likely require
>>>> changes in non-xen
>>>> sources (change in perf file format, remote MSR access etc.).
>>>
>>> I think having the vpmu stuff for PV guests is a great idea, and
>>> from a quick skim through I don't have any problems with the general
>>> approach. (Obviously some more detailed review will be needed.)
>>>
>>> However, I'm not a fan of this method of collecting perf stuff for
>>> Xen and other VMs together in the cpu buffers for dom0. I think
>>> it's ugly, fragile, and non-scalable, and I would prefer to see if
>>> we could implement the same feature (allowing perf to analyze Xen
>>> and other vcpus) some other way. And I would rather not use it as a
>>> "stage 1", for fear that it would become entrenched.
>>
>> I can see how collecting samples for other domains may be
>> questionable now (DOM0_PRIV mode) since at this stage there is no way
>> to distinguish between samples for non-priviledged domains.
>>
>> But why do you think that getting data for both dom0 and Xen is
>> problematic? Someone has to process Xen's samples and who would do
>> this if not dom0? We could store samples in separate files (e.g.
>> perf.data.dom0 and perf.data.xen) but that's toolstack's job.
>
> It's not so much about dom0 collecting the samples and passing them on
> to the analysis tools; this is already what xenalyze does, in
> essence. It's about the requirement of having the dom0 vcpus pinned
> 1-1 to physical cpus: both limiting the flexibility for scheduling,
> and limiting the configuration flexibility wrt having dom0 vcpus <
> pcpus. That is what seems an ugly hack to me -- having dom0 sort of
> try to do something that requires hypervisor-level privileges and
> making a bit of a mess of it.

I probably should have explained the limitations better in the original
message.

Pinning:

The only reason this version requires pinning is that I haven't
provided hooks in the Linux perf code to store both the PCPU and the
VCPU of a sample in perf_sample_data. I didn't do this because it would
need to be done outside of arch/x86/xen, and I decided not to go there
for this stage. So for now perf still only knows about CPUs, not PCPUs
or VCPUs.

Note that the hypervisor already provides information about both
P/VCPUs to dom0 (*), so when I fix what I described above in Linux
(kernel and perf toolstack) the right association of P/VCPUs will start
working.

And pinning is not really *required*. If you don't pin, you will not
get an accurate distribution of hypervisor samples in perf. For
instance, if Xen's foo() was sampled on PCPU0 and then on PCPU1 while
dom0's VCPU0 was running on each of them, perf will assume that both
samples were taken on CPU0 (note again: CPU0, not P- or VCPU0).

#VCPUs < #PCPUs

This is different from pinning. The issue here is that tools (e.g.
perf) need to access the PMU's MSRs. They do it with something like
wrmsr(msr, value), and they assume that they are programming the PMU on
the current processor. So if a dom0 VCPU never runs on some PCPU, it
currently cannot program the PMU there. One way to address this could
be to have wrmsr_cpu(cpu, msr, value). Presumably on bare metal this
would be patched over with a regular wrmsr.

(*) Well, it doesn't, because I forgot to add this to the code (it's
one line, really), but I will in the next version.

>
> I'm unfortunately not familiar enough with the perf system to know
> exactly what it is that Linux needs to do (why, for example, you think
> it would need remote MSR access if dom0 weren't pinned),

Remote MSR access is needed not because of pinning but because the tool
(perf, or any other tool for that matter) needs to program the PMU on
non-dom0 processors.
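
To make the wrmsr_cpu(cpu, msr, value) idea a bit more concrete, here
is a rough user-space sketch in C. It is only an illustration, not
kernel or Xen code: the simulated register array, NR_PCPUS,
NR_PMU_MSRS, MSR_PMU_BASE, current_pcpu and rdmsr_cpu() are all
hypothetical names made up for the example, and how a real
implementation would reach the remote PCPU is left open.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PCPUS     4      /* hypothetical: physical CPUs in the host */
#define NR_PMU_MSRS  8      /* hypothetical: simulated PMU MSRs per PCPU */
#define MSR_PMU_BASE 0x100u /* hypothetical index, not a real register */

/* Simulated per-PCPU PMU register file (stands in for real hardware). */
uint64_t pmu_msr[NR_PCPUS][NR_PMU_MSRS];

/* PCPU that the calling dom0 VCPU currently runs on (simulated). */
unsigned int current_pcpu;

/* Returns nonzero if msr falls inside the simulated PMU MSR range. */
static int msr_valid(uint32_t msr)
{
    return msr >= MSR_PMU_BASE && msr < MSR_PMU_BASE + NR_PMU_MSRS;
}

/* Plain wrmsr: can only program the processor we are running on. */
void wrmsr(uint32_t msr, uint64_t value)
{
    assert(msr_valid(msr));
    pmu_msr[current_pcpu][msr - MSR_PMU_BASE] = value;
}

/*
 * wrmsr_cpu: program the PMU on an explicitly named PCPU, even one
 * that no dom0 VCPU ever runs on. Returns 0 on success, -1 if the
 * cpu or msr argument is out of range.
 */
int wrmsr_cpu(unsigned int cpu, uint32_t msr, uint64_t value)
{
    if (cpu >= NR_PCPUS || !msr_valid(msr))
        return -1;
    pmu_msr[cpu][msr - MSR_PMU_BASE] = value;
    return 0;
}

/* Hypothetical read-back helper so the effect can be inspected. */
uint64_t rdmsr_cpu(unsigned int cpu, uint32_t msr)
{
    assert(cpu < NR_PCPUS && msr_valid(msr));
    return pmu_msr[cpu][msr - MSR_PMU_BASE];
}
```

With current_pcpu set to 0, wrmsr() only ever changes PCPU0's
registers, while wrmsr_cpu(3, ...) can program PCPU3 even though no
dom0 VCPU is running there.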

> and how hard would be for Xen just to do that work, and provide an
> "adapter" that would translate Xen-specific stuff into something perf
> could consume. Would it be possible, for example, for dom0 to specify
> what needed to be collected, for Xen to generate the samples in a
> Xen-specific format, and then have something in dom0 that would
> separate the samples into one file per domain that look similar enough
> to a trace file that the perf system could consume it?

Perf calculates the sampling period on each sample and writes the
resulting value into the counter MSR (I haven't yet looked at how it
uses other performance facilities such as PEBS, IBS and such).
Processing sample data is done by the toolstack and is relatively easy;
we don't need a Xen-specific format (once we fix the pinning issue so
we know to whom a sample belongs). Programming the PMU HW from existing
perf code is the challenge.

Thanks.
-boris