From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
To: George Dunlap <george.dunlap@eu.citrix.com>
Cc: suravee.suthikulpanit@amd.com, jacob.shin@amd.com,
eddie.dong@intel.com, dietmar.hahn@ts.fujitsu.com,
Jan Beulich <JBeulich@suse.com>,
jun.nakajima@intel.com,
xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH v1 00/13] x86/PMU: Xen PMU PV support
Date: Thu, 12 Sep 2013 10:58:33 -0400
Message-ID: <5231D699.9050809@oracle.com>
In-Reply-To: <52318BC3.4070907@eu.citrix.com>
On 09/12/2013 05:39 AM, George Dunlap wrote:
> On 11/09/13 19:22, Boris Ostrovsky wrote:
>> On 09/11/2013 01:01 PM, George Dunlap wrote:
>>> On 10/09/13 16:47, Boris Ostrovsky wrote:
>>>> On 09/10/2013 11:34 AM, Jan Beulich wrote:
>>>>>>>> On 10.09.13 at 17:20, Boris Ostrovsky
>>>>>>>> <boris.ostrovsky@oracle.com> wrote:
>>>>>> This version has following limitations:
>>>>>> * For accurate profiling of dom0/Xen, dom0 VCPUs should be pinned.
>>>>>> * Hypervisor code is only profiled on processors that have running
>>>>>> dom0 VCPUs on them.
>>>>> With that I assume this is an RFC rather than a full-fledged
>>>>> submission?
>>>>
>>>> I was thinking that this would be something like a stage 1
>>>> implementation (and I probably should have mentioned this in the
>>>> cover letter).
>>>>
>>>> For this stage I wanted to confine all changes on the Linux side to
>>>> the xen subtrees. Properly addressing the above limitation would
>>>> likely require changes in non-xen sources (change in the perf file
>>>> format, remote MSR access, etc.).
>>>
>>> I think having the vpmu stuff for PV guests is a great idea, and
>>> from a quick skim through I don't have any problems with the general
>>> approach. (Obviously some more detailed review will be needed.)
>>>
>>> However, I'm not a fan of this method of collecting perf stuff for
>>> Xen and other VMs together in the cpu buffers for dom0. I think
>>> it's ugly, fragile, and non-scalable, and I would prefer to see if
>>> we could implement the same feature (allowing perf to analyze Xen
>>> and other vcpus) some other way. And I would rather not use it as a
>>> "stage 1", for fear that it would become entrenched.
>>
>> I can see how collecting samples for other domains may be
>> questionable now (DOM0_PRIV mode), since at this stage there is no
>> way to distinguish between samples from different non-privileged
>> domains.
>>
>> But why do you think that getting data for both dom0 and Xen is
>> problematic? Someone has to process Xen's samples, and who would do
>> this if not dom0? We could store samples in separate files (e.g.
>> perf.data.dom0 and perf.data.xen), but that's the toolstack's job.
>
> It's not so much about dom0 collecting the samples and passing them on
> to the analysis tools; this is already what xenalyze does, in
> essence. It's about the requirement of having the dom0 vcpus pinned
> 1-1 to physical cpus: both limiting the flexibility for scheduling,
> and limiting the configuration flexibility wrt having dom0 vcpus <
> pcpus. That is what seems an ugly hack to me -- having dom0 sort of
> try to do something that requires hypervisor-level privileges and
> making a bit of a mess of it.
I probably should have explained the limitations better in the
original message.
Pinning:
The only reason this version requires pinning is that I haven't
provided hooks in the Linux perf code to store both the PCPU and the
VCPU of a sample in perf_sample_data. I didn't do this because it would
need to be done outside of arch/x86/xen, and I decided not to go there
for this stage. So for now perf still only knows about CPUs, not PCPUs
or VCPUs.
Note that the hypervisor already provides information about both
P/VCPUs to dom0 (*), so when I fix what I described above in Linux
(kernel and perf toolstack) the correct P/VCPU association will start
working.
And pinning is not really *required*. If you don't pin, you just won't
get an accurate distribution of hypervisor samples in perf. For
instance, if Xen's foo() was sampled on PCPU0 and then on PCPU1 while
dom0's VCPU0 was running on each of them, perf will assume that both
samples were taken on CPU0. (Note again: CPU0, not P- or VCPU0.)
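To illustrate the idea (a sketch only -- these names are made up, not
the layout in the proposed xenpmu.h): each sample handed to dom0 would
carry both identifiers, something like

    #include <stdint.h>

    /* Hypothetical sample identification -- illustrative names only.
     * The point is that a sample records both the physical CPU the PMU
     * interrupt fired on and the VCPU that was running at the time. */
    struct pmu_sample_id {
        uint32_t domain_id;  /* owning domain, for the multi-domain case */
        uint32_t vcpu_id;    /* VCPU that was current at the overflow */
        uint32_t pcpu_id;    /* physical CPU the sample was taken on */
    };

With that, the foo() example above gets attributed correctly no matter
where dom0's VCPUs happen to be scheduled.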
#VCPUs < #PCPUs
This is different from pinning. The issue here is that tools (e.g.
perf) need to access the PMU's MSRs, and they do it with something like
wrmsr(msr, value), assuming that they are programming the PMU on the
current processor. So if a dom0 VCPU never runs on some PCPU, it
currently cannot program the PMU there. One way to address this could
be to have a wrmsr_cpu(cpu, msr, value), which presumably would be
patched over with a regular wrmsr on bare metal.
(*) Well, it doesn't yet, because I forgot to add this to the code
(it's one line, really), but I will in the next version.
>
> I'm unfortunately not familiar enough with the perf system to know
> exactly what it is that Linux needs to do (why, for example, you think
> it would need remote MSR access if dom0 weren't pinned),
Remote MSR access is needed not because of pinning but because the tool
(perf, or any other tool for that matter) needs to program the PMU on
processors where no dom0 VCPU is running.
> and how hard it would be for Xen just to do that work, and provide an
> "adapter" that would translate Xen-specific stuff into something perf
> could consume. Would it be possible, for example, for dom0 to specify
> what needed to be collected, for Xen to generate the samples in a
> Xen-specific format, and then have something in dom0 that would
> separate the samples into one file per domain that look similar enough
> to a trace file that the perf system could consume it?
Perf recalculates the sampling period on each sample and writes the
resulting value into the counter MSR (I haven't yet looked at how it
uses other performance facilities such as PEBS, IBS and the like).
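In other words, on each overflow the counter is reloaded with the
negated period so that it overflows again after 'period' events --
roughly what x86_perf_event_set_period() does (paraphrased from memory,
not a verbatim quote of the kernel code):

    #include <stdint.h>

    /* Reload value for a performance counter: negated period, masked to
     * the counter width.  Sketch of the idiom, not the actual perf code. */
    static inline uint64_t pmc_reload_value(int64_t period, uint64_t cntval_mask)
    {
            return (uint64_t)(-period) & cntval_mask;
    }

It is that per-sample write which has to land on the right physical
CPU's MSR, and that is exactly what becomes awkward without remote MSR
access.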
Processing sample data is done by the toolstack and is relatively easy;
we don't need a Xen-specific format (once we fix the pinning issue so
that we know to whom a sample belongs).
Programming the PMU hardware from the existing perf code is the
challenge.
Thanks.
-boris
Thread overview: 30+ messages
2013-09-10 15:20 [PATCH v1 00/13] x86/PMU: Xen PMU PV support Boris Ostrovsky
2013-09-10 15:20 ` [PATCH v1 01/13] Export hypervisor symbols Boris Ostrovsky
2013-09-11 7:51 ` Jan Beulich
2013-09-11 13:55 ` Boris Ostrovsky
2013-09-11 14:12 ` Jan Beulich
2013-09-11 14:57 ` Boris Ostrovsky
2013-09-11 16:01 ` Jan Beulich
2013-09-10 15:20 ` [PATCH v1 02/13] Set VCPU's is_running flag closer to when the VCPU is dispatched Boris Ostrovsky
2013-09-11 7:58 ` Jan Beulich
2013-09-10 15:21 ` [PATCH v1 03/13] x86/PMU: Stop AMD counters when called from vpmu_save_force() Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 04/13] x86/VPMU: Minor VPMU cleanup Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 05/13] intel/VPMU: Clean up Intel VPMU code Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 06/13] x86/PMU: Add public xenpmu.h Boris Ostrovsky
2013-09-11 8:13 ` Jan Beulich
2013-09-11 14:03 ` Boris Ostrovsky
2013-09-11 14:16 ` Jan Beulich
2013-09-11 8:37 ` Ian Campbell
2013-09-10 15:21 ` [PATCH v1 07/13] x86/PMU: Make vpmu not HVM-specific Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 08/13] x86/PMU: Interface for setting PMU mode and flags Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 09/13] x86/PMU: Initialize PMU for PV guests Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 10/13] x86/PMU: Add support for PMU registes handling on " Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 11/13] x86/PMU: Handle PMU interrupts for " Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 12/13] x86/PMU: Save VPMU state for PV guests during context switch Boris Ostrovsky
2013-09-10 15:21 ` [PATCH v1 13/13] x86/PMU: Move vpmu files up from hvm directory Boris Ostrovsky
2013-09-10 15:34 ` [PATCH v1 00/13] x86/PMU: Xen PMU PV support Jan Beulich
2013-09-10 15:47 ` Boris Ostrovsky
2013-09-11 17:01 ` George Dunlap
2013-09-11 18:22 ` Boris Ostrovsky
2013-09-12 9:39 ` George Dunlap
2013-09-12 14:58 ` Boris Ostrovsky [this message]