* KVM PMU virtualization
From: Jes Sorensen @ 2010-02-25 15:04 UTC (permalink / raw)
To: KVM General
Cc: Peter Zijlstra, Avi Kivity, Zachary Amsden, Gleb Natapov,
Ingo Molnar, ming.m.lin, Zhang, Yanmin
Hi,
It looks like several of us have been looking at how to use the PMU
for virtualization. Rather than continuing to have discussions in
smaller groups, I think it is a good idea we move it to the mailing
lists to see what we can share and avoid duplicate efforts.
There are really two separate things to handle:
1) Add support to perf to allow it to monitor a KVM guest from the
host.
2) Allow guests access to the PMU (or an emulated PMU), making it
possible to run perf on applications running within the guest.
I know some of you have been looking at 1) and I am currently working
on 2). I have been looking at various approaches, including whether it
is feasible to share the PMU between the host and multiple guests. For
now I am going to focus on allowing one guest to take control of the
PMU, then later hopefully adding support for multiplexing it between
multiple guests.
Eventually we will see proper hardware PMU virtualization from Intel and
AMD (admittedly I have only looked at the Intel specs so far), and by
then we should be able to allow the host as well as the guests to share the PMU.
If anybody else is working on this, I'd love to hear about it so we can
coordinate our efforts. The main purpose of this mail was really to
bring the discussion to the mailing list to avoid duplicated efforts.
Cheers,
Jes
PS: I'll be AFK all of next week, so it may take a few days for me to
reply to follow-up discussions.
* Re: KVM PMU virtualization
From: Jan Kiszka @ 2010-02-25 15:44 UTC (permalink / raw)
To: Jes Sorensen
Cc: KVM General, Peter Zijlstra, Avi Kivity, Zachary Amsden,
Gleb Natapov, Ingo Molnar, ming.m.lin, Zhang, Yanmin
Jes Sorensen wrote:
> Hi,
>
> It looks like several of us have been looking at how to use the PMU
> for virtualization. Rather than continuing to have discussions in
> smaller groups, I think it is a good idea we move it to the mailing
> lists to see what we can share and avoid duplicate efforts.
>
> There are really two separate things to handle:
>
> 1) Add support to perf to allow it to monitor a KVM guest from the
> host.
>
> 2) Allow guests access to the PMU (or an emulated PMU), making it
> possible to run perf on applications running within the guest.
>
> I know some of you have been looking at 1) and I am currently working
> on 2). I have been looking at various approaches, including whether it
> is feasible to share the PMU between the host and multiple guests. For
> now I am going to focus on allowing one guest to take control of the
> PMU, then later hopefully adding support for multiplexing it between
> multiple guests.
Given that perf can apply the PMU to individual host tasks, I don't see
fundamental problems multiplexing it between individual guests (which
can then internally multiplex it again).
Then the next challenge might be how to handle the case of both host and
guest trying to use PMU resources at the same time. For the sparse debug
register resources I simply disable the effect of guest-injected
breakpoints once the host wants to use them. The guest still sees its
programmed values, though. One could try to schedule free registers
between both, but given how rare such use cases are, I decided to go for
a simple approach. Probably the situation is not that different for the PMU.
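A minimal sketch of how that "host wins" policy could look when transferred
to a PMU counter (all names here are made up for illustration, this is not
existing KVM code):
        /* Host use of a counter overrides the guest's programming; the
         * guest keeps seeing the value it wrote via the shadowed MSR. */
        static u64 effective_evtsel(struct vcpu_pmu *pmu, int counter)
        {
                if (host_owns_counter(counter))
                        return 0;                       /* no effect for the guest */
                return pmu->guest_evtsel[counter];      /* guest value applies */
        }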
>
> Eventually we will see proper hardware PMU virtualization from Intel and
> AMD (admittedly I have only looked at the Intel specs so far), and by
> then we should be able to allow the host as well as the guests to share the PMU.
>
> If anybody else is working on this, I'd love to hear about it so we can
> coordinate our efforts. The main purpose of this mail was really to
> bring the discussion to the mailing list to avoid duplicated efforts.
I think I've seen quite some code for PMU virtualization in Xen's HVM
code. It might be worth studying what they already do and adopting/extending
it for KVM.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-25 16:26 UTC (permalink / raw)
To: Jan Kiszka
Cc: Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin
* Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Jes Sorensen wrote:
> > Hi,
> >
> > It looks like several of us have been looking at how to use the PMU
> > for virtualization. Rather than continuing to have discussions in
> > smaller groups, I think it is a good idea we move it to the mailing
> > lists to see what we can share and avoid duplicate efforts.
> >
> > There are really two separate things to handle:
> >
> > 1) Add support to perf to allow it to monitor a KVM guest from the
> > host.
> >
> > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > possible to run perf on applications running within the guest.
> >
> > I know some of you have been looking at 1) and I am currently working
> > on 2). I have been looking at various approaches, including whether it
> > is feasible to share the PMU between the host and multiple guests. For
> > now I am going to focus on allowing one guest to take control of the
> > PMU, then later hopefully adding support for multiplexing it between
> > multiple guests.
>
> Given that perf can apply the PMU to individual host tasks, I don't see
> fundamental problems multiplexing it between individual guests (which can
> then internally multiplex it again).
In terms of how to expose it to guests, a 'soft PMU' might be a usable
approach. Although to Linux guests you could expose much more functionality
and a non-PMU-limited number of instrumentation events, via a more
intelligent interface.
But note that in terms of handling it on the host side the PMU approach is not
acceptable: instead it should map to proper perf_events, not try to muck with
the PMU itself.
That, besides integrating properly with perf usage on the host, will also
allow interesting 'PMU' features on guests: you could set up the host side to
trace block IO requests (or VM exits) for example, and expose that as 'PMC
#0' on the guest side.
That's a neat feature: the guest profiling tools would immediately (and
transparently) be able to measure VM exits or IO heaviness, on a per guest
basis, as seen on the host side.
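As a sketch of how such a mapping could look on the host side (nothing here
is existing code, and the tracepoint id would have to be looked up at setup
time): the host event backing the guest's 'PMC #0' would simply be created
with a non-hardware attr:
        struct perf_event_attr attr = {
                .size          = sizeof(attr),
                .type          = PERF_TYPE_TRACEPOINT,  /* instead of PERF_TYPE_HARDWARE */
                .config        = block_rq_issue_id,     /* hypothetical tracepoint id */
                .sample_period = 1000,
        };
The guest keeps programming what it believes is a hardware counter; only the
host-side backing changes.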
More would be possible too.
Thanks,
Ingo
* Re: KVM PMU virtualization
From: Joerg Roedel @ 2010-02-25 17:34 UTC (permalink / raw)
To: Jes Sorensen
Cc: KVM General, Peter Zijlstra, Avi Kivity, Zachary Amsden,
Gleb Natapov, Ingo Molnar, ming.m.lin, Zhang, Yanmin
On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> 1) Add support to perf to allow it to monitor a KVM guest from the
> host.
This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
configured to count only when in guest mode. Perf needs to be aware of
that and fetch the rip from a different place when monitoring a guest.
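For illustration, the relevant Fam10 evtsel bits look roughly like this (bit
positions per the BKDG; 'event' stands for whatever event selector value is
being programmed):
        #define EVTSEL_GUESTONLY  (1ULL << 40)  /* count only in guest mode */
        #define EVTSEL_HOSTONLY   (1ULL << 41)  /* count only in host mode */
        /* program counter 0 to count the event only while a guest runs */
        wrmsrl(MSR_K7_EVNTSEL0, event | EVTSEL_GUESTONLY | (1ULL << 22) /* EN */);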
> 2) Allow guests access to the PMU (or an emulated PMU), making it
> possible to run perf on applications running within the guest.
The biggest problem I see here is teaching the guest about the available
events. The available event sets are dependent on the processor family
(at least on AMD).
A simple approach would be shadowing the perf MSRs, which is straightforward
to do. More problematic is the reinjection of performance
interrupts and performance NMIs.
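A minimal sketch of the shadowing itself (the field and function names are
illustrative only):
        /* guest WRMSR: record the value; a later guest RDMSR returns it
         * unchanged, so the guest always sees what it programmed */
        static int pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
        {
                vcpu->arch.pmu_shadow[msr - MSR_K7_EVNTSEL0] = data;
                /* translating the request to the host side and reinjecting
                   the resulting interrupts are the hard parts left out here */
                return 0;
        }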
I personally don't like a self-defined event set as the only solution
because that would probably only work with Linux and perf. I think we
should have a way (in addition to a soft-event interface) to expose
the host PMU events to the guest.
Joerg
* Re: KVM PMU virtualization
From: Zhang, Yanmin @ 2010-02-26 2:52 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jan Kiszka, Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
> * Jan Kiszka <jan.kiszka@siemens.com> wrote:
>
> > Jes Sorensen wrote:
> > > Hi,
> > >
> > > It looks like several of us have been looking at how to use the PMU
> > > for virtualization. Rather than continuing to have discussions in
> > > smaller groups, I think it is a good idea we move it to the mailing
> > > lists to see what we can share and avoid duplicate efforts.
> > >
> > > There are really two separate things to handle:
> > >
> > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > host.
> > >
> > > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > > possible to run perf on applications running within the guest.
> > >
> > > I know some of you have been looking at 1) and I am currently working
> > > on 2). I have been looking at various approaches, including whether it
> > > is feasible to share the PMU between the host and multiple guests. For
> > > now I am going to focus on allowing one guest to take control of the
> > > PMU, then later hopefully adding support for multiplexing it between
> > > multiple guests.
> >
> > Given that perf can apply the PMU to individual host tasks, I don't see
> > fundamental problems multiplexing it between individual guests (which can
> > then internally multiplex it again).
>
> In terms of how to expose it to guests, a 'soft PMU' might be a usable
> approach. Although to Linux guests you could expose much more functionality
> and a non-PMU-limited number of instrumentation events, via a more
> intelligent interface.
>
> But note that in terms of handling it on the host side the PMU approach is not
> acceptable: instead it should map to proper perf_events, not try to muck with
> the PMU itself.
>
> That, besides integrating properly with perf usage on the host, will also
> allow interesting 'PMU' features on guests: you could set up the host side to
> trace block IO requests (or VM exits) for example, and expose that as 'PMC
> #0' on the guest side.
So virtualization becomes non-transparent to the guest OS? I know virtio is an
optimization on the guest side.
* Re: KVM PMU virtualization
From: Zhang, Yanmin @ 2010-02-26 2:55 UTC (permalink / raw)
To: Joerg Roedel
Cc: Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, Ingo Molnar, ming.m.lin
On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
>
> > 1) Add support to perf to allow it to monitor a KVM guest from the
> > host.
>
> This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> configured to count only when in guest mode. Perf needs to be aware of
> that and fetch the rip from a different place when monitoring a guest.
The idea is we want to measure both host and guest at the same time, and
compare all the hot functions fairly.
>
> > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > possible to run perf on applications running within the guest.
>
> The biggest problem I see here is teaching the guest about the available
> events. The available event sets are dependent on the processor family
> (at least on AMD).
> A simple approach would be shadowing the perf MSRs, which is straightforward
> to do. More problematic is the reinjection of performance
> interrupts and performance NMIs.
>
> I personally don't like a self-defined event set as the only solution
> because that would probably only work with Linux and perf. I think we
> should have a way (in addition to a soft-event interface) to expose
> the host PMU events to the guest.
>
> Joerg
>
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 8:42 UTC (permalink / raw)
To: Joerg Roedel
Cc: Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> I personally don't like a self-defined event set as the only solution
> because that would probably only work with Linux and perf. [...]
The 'soft-PMU' I suggested is transparent on the guest side - if you want to
enable non-Linux and legacy-Linux guests.
It's basically a PMU interface provided to the guest by catching the right MSR
accesses, implemented via perf_event_create_kernel_counter()/etc. on the host
side.
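Roughly like this (a sketch only - error handling is elided, the vcpu field
is made up, and the perf_event_create_kernel_counter() signature is the one
from kernels of this vintage):
        /* a trapped guest write to an event-select MSR becomes a
         * kernel-owned perf event on the host */
        struct perf_event_attr attr = {
                .size   = sizeof(attr),
                .type   = PERF_TYPE_RAW,
                .config = guest_evtsel,  /* raw evtsel bits; sanitizing elided */
        };
        vcpu->soft_pmu[idx] = perf_event_create_kernel_counter(&attr, -1,
                                                current->pid, NULL);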
Note that the 'soft PMU' still sucks from a design POV as there's no generic
hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
Intel' PMU driver at minimum.
Far cleaner would be to expose it via hypercalls to guest OSs that are
interested in instrumentation. That way it could also transparently integrate
with tracing, probes, etc. It would also be wiser to first concentrate on
improving Linux<->Linux guest/host combos before gutting the design just to
fit Windows into the picture ...
Ingo
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 8:45 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Jan Kiszka, Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
> > * Jan Kiszka <jan.kiszka@siemens.com> wrote:
> >
> > > Jes Sorensen wrote:
> > > > Hi,
> > > >
> > > > It looks like several of us have been looking at how to use the PMU
> > > > for virtualization. Rather than continuing to have discussions in
> > > > smaller groups, I think it is a good idea we move it to the mailing
> > > > lists to see what we can share and avoid duplicate efforts.
> > > >
> > > > There are really two separate things to handle:
> > > >
> > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > host.
> > > >
> > > > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > > > possible to run perf on applications running within the guest.
> > > >
> > > > I know some of you have been looking at 1) and I am currently working
> > > > on 2). I have been looking at various approaches, including whether it
> > > > is feasible to share the PMU between the host and multiple guests. For
> > > > now I am going to focus on allowing one guest to take control of the
> > > > PMU, then later hopefully adding support for multiplexing it between
> > > > multiple guests.
> > >
> > > Given that perf can apply the PMU to individual host tasks, I don't see
> > > fundamental problems multiplexing it between individual guests (which can
> > > then internally multiplex it again).
> >
> > In terms of how to expose it to guests, a 'soft PMU' might be a usable
> > approach. Although to Linux guests you could expose much more
> > functionality and a non-PMU-limited number of instrumentation events, via
> > a more intelligent interface.
> >
> > But note that in terms of handling it on the host side the PMU approach is
> > not acceptable: instead it should map to proper perf_events, not try to
> > muck with the PMU itself.
>
>
> > That, besides integrating properly with perf usage on the host, will also
> > allow interesting 'PMU' features on guests: you could set up the host side
> > to trace block IO requests (or VM exits) for example, and expose that as
> > 'PMC
> > #0' on the guest side.
>
> So virtualization becomes non-transparent to the guest OS? I know virtio is an
> optimization on the guest side.
The 'soft PMU' is transparent. The 'count IO events' kind of feature could be
transparent too: you could re-configure (on the host) a given 'hardware' event
to really count some software event.
That would make it compatible with whatever guest side tooling (without having
to change that tooling) - while still allowing interesting new things to be
measured.
Thanks,
Ingo
* Re: KVM PMU virtualization
From: Joerg Roedel @ 2010-02-26 8:51 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Jes Sorensen, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, Ingo Molnar, ming.m.lin
On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> >
> > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > host.
> >
> > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > configured to count only when in guest mode. Perf needs to be aware of
> > that and fetch the rip from a different place when monitoring a guest.
> The idea is we want to measure both host and guest at the same time, and
> compare all the hot functions fairly.
So you want to measure the guest vcpu while it is running and the vmexit
path of that vcpu (including the qemu userspace part) together? The
challenge here is to find out whether a performance event originated in guest
mode or in host mode.
But we can check for that in the NMI-protected part of the vmexit path.
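For example (purely illustrative - a per-cpu flag set inside the NMI-safe
window around the world switch, which the NMI handler can then test):
        DEFINE_PER_CPU(int, cpu_in_guest);
        /* in the vcpu run loop, inside the nmi-protected region: */
        __get_cpu_var(cpu_in_guest) = 1;
        /* ... VMRUN / #VMEXIT ... */
        __get_cpu_var(cpu_in_guest) = 0;
The handler would then attribute a sample to the guest iff the flag is set.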
Joerg
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 9:17 UTC (permalink / raw)
To: Joerg Roedel
Cc: Zhang, Yanmin, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
* Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > >
> > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > host.
> > >
> > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > configured to count only when in guest mode. Perf needs to be aware of
> > > that and fetch the rip from a different place when monitoring a guest.
>
> > The idea is we want to measure both host and guest at the same time, and
> > compare all the hot functions fairly.
>
> So you want to measure while the guest vcpu is running and the vmexit
> path of that vcpu (including qemu userspace part) together? The
> challenge here is to find out if a performance event originated in guest
> mode or in host mode.
> But we can check for that in the nmi-protected part of the vmexit path.
As far as instrumentation goes, virtualization is simply another 'PID
dimension' of measurement.
Today we can isolate system performance measurements/events to the following
domains:
- per system
- per cpu
- per task
( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user'
domain separation, and we have some ABI details for all that but it's by no
means complete. Anton is using the PowerPC bits AFAIK, so it already works
to a certain degree. )
When extending measurements to KVM, we want two things:
- user friendliness: instead of having to check 'ps' and figure out which
Qemu thread is the KVM thread we want to profile, just give a convenience
namespace to access guest profiling info. -G ought to map to the first
currently running KVM guest it can find (which would match like 90% of the
cases) - etc. No ifs and whens. If 'perf kvm top' doesn't show something
useful by default the whole effort is for naught.
- Extend core facilities and enable the following measurement dimensions:
host-kernel-space
host-user-space
guest-kernel-space
guest-user-space
on a per guest basis. We want to be able to measure just what the guest
does, and we want to be able to measure just what the host does.
Some of this the hardware helps us with (say only measuring host kernel
events is possible), some has to be done by fiddling with event
enable/disable at vm-exit / vm-entry time - see the sketch below.
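A sketch of that fiddling, using the existing perf kernel API (deciding
which events count as 'host only' is the policy question left open here):
        /* before entering the guest: stop events that must not see guest time */
        perf_event_disable(host_only_event);
        /* ... VM entry, guest executes, VM exit ... */
        perf_event_enable(host_only_event);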
My suggestion, as always, would be to start very simple and very minimal:
Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image
both as a host and as guest (for testing), to not have to deal with the symbol
space transport problem initially. Enable 'perf kvm record' to only record
guest events by default. Etc.
This alone will be a quite useful result already - and gives a basis for
further work. No need to spend months to do the big grand design straight
away, all of this can be done gradually and in the order of usefulness - and
you'll always have something that actually works (and helps your other KVM
projects) along the way.
[ And, as so often, once you walk that path, that grand scheme you are
thinking about right now might easily become last year's really bad idea ;-) ]
So please start walking the path and experience the challenges first-hand.
Thanks,
Ingo
* Re: KVM PMU virtualization
From: Avi Kivity @ 2010-02-26 9:46 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> * Joerg Roedel<joro@8bytes.org> wrote:
>
>
>> I personally don't like a self-defined event set as the only solution
>> because that would probably only work with Linux and perf. [...]
>>
> The 'soft-PMU' I suggested is transparent on the guest side - if you want to
> enable non-Linux and legacy-Linux guests.
>
> It's basically a PMU interface provided to the guest by catching the right MSR
> accesses, implemented via perf_event_create_kernel_counter()/etc. on the host
> side.
>
That only works if the software interface is 100% lossless - we can
recreate every single hardware configuration through the API. Is this
the case?
> Note that the 'soft PMU' still sucks from a design POV as there's no generic
> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> Intel' PMU driver at minimum.
>
Right, this will severely limit migration domains to hosts of the same
vendor and processor generation. There is a middle ground, though:
Intel has recently moved to define an "architectural pmu" which is not
model specific. I don't know if AMD adopted it. We could offer both
options - native host capabilities, with a loss of compatibility, and
the architectural pmu, with loss of model specific counters.
> Far cleaner would be to expose it via hypercalls to guest OSs that are
> interested in instrumentation.
It's also slower - you can give the guest direct access to the various
counters so no exits are taken when reading the counters (though perhaps
many tools are only interested in the interrupts, not the counter values).
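For reference, such a read boils down to one unprivileged instruction once
rdpmc is enabled for the guest via CR4.PCE (a plain sketch):
        static inline u64 read_pmc(int counter)
        {
                u32 lo, hi;
                asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
                return lo | ((u64)hi << 32);
        }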
> That way it could also transparently integrate
> with tracing, probes, etc. It would also be wiser to first concentrate on
> improving Linux<->Linux guest/host combos before gutting the design just to
> fit Windows into the picture ...
>
"gutting the design"?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: KVM PMU virtualization
From: Joerg Roedel @ 2010-02-26 10:39 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> On 02/26/2010 10:42 AM, Ingo Molnar wrote:
>> Note that the 'soft PMU' still sucks from a design POV as there's no generic
>> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
>> Intel' PMU driver at minimum.
>>
>
> Right, this will severely limit migration domains to hosts of the same
> vendor and processor generation. There is a middle ground, though:
> Intel has recently moved to define an "architectural pmu" which is not
> model specific. I don't know if AMD adopted it. We could offer both
> options - native host capabilities, with a loss of compatibility, and
> the architectural pmu, with loss of model specific counters.
I have only had a quick look at the architectural PMU from Intel so far,
but it looks like it can be emulated for a guest on AMD using existing
features.
Joerg
* Re: KVM PMU virtualization
From: Joerg Roedel @ 2010-02-26 10:42 UTC (permalink / raw)
To: Ingo Molnar
Cc: Zhang, Yanmin, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
> My suggestion, as always, would be to start very simple and very minimal:
>
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image
> both as a host and as guest (for testing), to not have to deal with the symbol
> space transport problem initially. Enable 'perf kvm record' to only record
> guest events by default. Etc.
>
> This alone will be a quite useful result already - and gives a basis for
> further work. No need to spend months to do the big grand design straight
> away, all of this can be done gradually and in the order of usefulness - and
> you'll always have something that actually works (and helps your other KVM
> projects) along the way.
>
> [ And, as so often, once you walk that path, that grand scheme you are
> thinking about right now might easily become last year's really bad idea ;-) ]
>
> So please start walking the path and experience the challenges first-hand.
That sounds like a good approach for the 'measure-guest-from-host'
problem. It is also not very hard to implement. Where does perf fetch
the RIP of the NMI from - the stack only, or is this configurable?
Joerg
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 10:44 UTC (permalink / raw)
To: Avi Kivity
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> Right, this will severely limit migration domains to hosts of the same
> vendor and processor generation. There is a middle ground, though, Intel
> has recently moved to define an "architectural pmu" which is not model
> specific. I don't know if AMD adopted it. [...]
Nope. It's "architectural" in the following way: Intel won't change it with
future CPU models, outside of the definitions of the hw-ABI. PMUs were model
specific prior to that time.
I'd say there's near zero chance the MSR spaces will unify. All the 'advanced'
PMU features are wildly incompatible, and the gap is increasing, not
decreasing.
> > Far cleaner would be to expose it via hypercalls to guest OSs that are
> > interested in instrumentation.
>
> It's also slower - you can give the guest direct access to the various
> counters so no exits are taken when reading the counters (though perhaps
> many tools are only interested in the interrupts, not the counter values).
Direct access to counters is not something that is a big issue. [ Given that I
sometimes can see KVM redraw the screen of a guest OS in real time, I doubt
this is the biggest of performance challenges right now ;-) ]
By far the biggest instrumentation issue is:
- availability
- usability
- flexibility
Exposing the raw hw is a step backwards in many regards. The same way we don't
want to expose chipsets to the guest to allow them to do RAS. The same way we
don't want to expose most raw PCI devices to guests in general, but have all
these virt driver abstractions.
> > That way it could also transparently integrate with tracing, probes, etc.
> > It would also be wiser to first concentrate on improving Linux<->Linux
> > guest/host combos before gutting the design just to fit Windows into the
> > picture ...
>
> "gutting the design"?
Yes, gutting the design of a sane instrumentation API and moving it back 10-20
years by squeezing it through non-standardized and incompatible PMU drivers.
When it comes to design, my main interest is the Linux<->Linux combo.
Ingo
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 10:46 UTC (permalink / raw)
To: Joerg Roedel
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> >> Note that the 'soft PMU' still sucks from a design POV as there's no generic
> >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> >> Intel' PMU driver at minimum.
> >>
> >
> > Right, this will severely limit migration domains to hosts of the same
> > vendor and processor generation. There is a middle ground, though:
> > Intel has recently moved to define an "architectural pmu" which is not
> > model specific. I don't know if AMD adopted it. We could offer both
> > options - native host capabilities, with a loss of compatibility, and
> > the architectural pmu, with loss of model specific counters.
>
> I have only had a quick look at the architectural PMU from Intel so far, but
> it looks like it can be emulated for a guest on AMD using existing features.
AMD CPUs don't have enough counters for that; they cannot do the 3 fixed events
in addition to the 2 generic ones.
Nor do you really want to standardize KVM guests on returning
'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
drivers, right?
Ingo
* Re: KVM PMU virtualization
From: Avi Kivity @ 2010-02-26 10:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 12:46 PM, Ingo Molnar wrote:
>
>>> Right, this will severely limit migration domains to hosts of the same
>>> vendor and processor generation. There is a middle ground, though:
>>> Intel has recently moved to define an "architectural pmu" which is not
>>> model specific. I don't know if AMD adopted it. We could offer both
>>> options - native host capabilities, with a loss of compatibility, and
>>> the architectural pmu, with loss of model specific counters.
>>>
>> I have only had a quick look at the architectural PMU from Intel so far, but
>> it looks like it can be emulated for a guest on AMD using existing features.
>>
> AMD CPUs don't have enough counters for that; they cannot do the 3 fixed events
> in addition to the 2 generic ones.
>
> Nor do you really want to standardize KVM guests on returning
> 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
> drivers, right?
>
>
No - that would only work if AMD also adopted the architectural pmu.
Note that virtualization clusters are typically split into 'migration pools'
consisting of hosts with similar processor features, so that you can
expose those features and yet live migrate guests at will. It's likely
that all hosts have the same pmu anyway, so the only downside is that we
now have to expose the host's processor family and model.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 10:56 UTC (permalink / raw)
To: Joerg Roedel
Cc: Zhang, Yanmin, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
* Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
> > My suggestion, as always, would be to start very simple and very minimal:
> >
> > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image
> > both as a host and as guest (for testing), to not have to deal with the symbol
> > space transport problem initially. Enable 'perf kvm record' to only record
> > guest events by default. Etc.
> >
> > This alone will be a quite useful result already - and gives a basis for
> > further work. No need to spend months to do the big grand design straight
> > away, all of this can be done gradually and in the order of usefulness - and
> > you'll always have something that actually works (and helps your other KVM
> > projects) along the way.
> >
> > [ And, as so often, once you walk that path, that grand scheme you are
> > thinking about right now might easily become last year's really bad idea ;-) ]
> >
> > So please start walking the path and experience the challenges first-hand.
>
> That sounds like a good approach for the 'measure-guest-from-host'
> problem. It is also not very hard to implement. Where does perf fetch
> the RIP of the NMI from - the stack only, or is this configurable?
The host semantics are that it takes the stack from the regs, and with
call-graph recording (perf record -g) it will walk down the exception stack,
irq stack, kernel stack, and user-space stack as well. (up to the point the
pages are present - it stops on a non-present page. An app that is being
profiled has its stack present so it's not an issue in practice.)
I'd suggest to leave out call graph sampling initially, and just get 'perf kvm
top' to work with guest RIPs, simply sampled from the VM exit state.
See arch/x86/kernel/cpu/perf_event.c:
static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
        callchain_store(entry, PERF_CONTEXT_KERNEL);
        callchain_store(entry, regs->ip);
        dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}
If you have easy access to the VM state from NMI context right there then just
hack in the guest RIP and you should have some prototype that samples the
guest. (assuming you use the same kernel image for both the host and the guest)
This would be the easiest way to prototype it all.
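Something along these lines, right where perf_callchain_kernel() stores the
RIP (both helpers below are hypothetical, not existing interfaces):
        if (nmi_hit_in_guest_mode())                     /* hypothetical check */
                callchain_store(entry, kvm_guest_rip()); /* RIP saved at VM exit */
        else
                callchain_store(entry, regs->ip);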
Ingo
* Re: KVM PMU virtualization
From: Jes Sorensen @ 2010-02-26 11:01 UTC (permalink / raw)
To: Joerg Roedel
Cc: KVM General, Peter Zijlstra, Avi Kivity, Zachary Amsden,
Gleb Natapov, Ingo Molnar, ming.m.lin, Zhang, Yanmin
On 02/25/10 18:34, Joerg Roedel wrote:
> The biggest problem I see here is teaching the guest about the available
> events. The available event sets are dependent on the processor family
> (at least on AMD).
> A simple approach would be shadowing the perf MSRs, which is straightforward
> to do. More problematic is the reinjection of performance
> interrupts and performance NMIs.
IMHO the only real solution here is to map it to the host CPU, and
require -cpu host for PMU support. There is no point in trying to
emulate PMU features which we don't have in the hardware. I.e. you cannot
count cache misses if the hardware doesn't support it.
Cheers,
Jes
* Re: KVM PMU virtualization
From: Jes Sorensen @ 2010-02-26 11:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jan Kiszka, KVM General, Peter Zijlstra, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin
On 02/25/10 17:26, Ingo Molnar wrote:
>> Given that perf can apply the PMU to individual host tasks, I don't see
>> fundamental problems multiplexing it between individual guests (which can
>> then internally multiplex it again).
>
> In terms of how to expose it to guests, a 'soft PMU' might be a usable
> approach. Although to Linux guests you could expose much more functionality
> and a non-PMU-limited number of instrumentation events, via a more
> intelligent interface.
>
> But note that in terms of handling it on the host side the PMU approach is not
> acceptable: instead it should map to proper perf_events, not try to muck with
> the PMU itself.
I am not keen on emulating the PMU; if we do that we end up having to
emulate a large number of MSR accesses, which is really costly. It makes
a lot more sense to give the guest direct access to the PMU. The problem
here is how to manage it without too much overhead.
Cheers,
Jes
* Re: KVM PMU virtualization
From: Joerg Roedel @ 2010-02-26 11:06 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
>
> * Joerg Roedel <joro@8bytes.org> wrote:
>
> > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> > >> Note that the 'soft PMU' still sucks from a design POV as there's no generic
> > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> > >> Intel' PMU driver at minimum.
> > >>
> > >
> > > Right, this will severely limit migration domains to hosts of the same
> > > vendor and processor generation. There is a middle ground, though:
> > > Intel has recently moved to define an "architectural pmu" which is not
> > > model specific. I don't know if AMD adopted it. We could offer both
> > > options - native host capabilities, with a loss of compatibility, and
> > > the architectural pmu, with loss of model specific counters.
> >
> > I have only had a quick look at the architectural PMU from Intel so far, but
> > it looks like it can be emulated for a guest on AMD using existing features.
>
> AMD CPUs don't have enough counters for that; they cannot do the 3 fixed events
> in addition to the 2 generic ones.
Good point. Maybe we can emulate that with some counter round-robin
usage if the guest really uses all 5 counters.
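A toy version of such a round-robin (everything here is illustrative):
        /* rotate which window of the guest's 5 counters is backed by real
         * hardware; parked counters keep their last saved counts */
        static void vpmu_rotate(struct vpmu *pmu)
        {
                pmu->base = (pmu->base + 1) % 5;
                /* remap counters base..base+nr_host_counters-1 onto the
                   host counters, saving/restoring counts at the switch */
        }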
> Nor do you really want to standardize KVM guests on returning
> 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
> drivers, right?
Isn't there a cpuid bit indicating the availability of architectural
perfmon?
Joerg
* Re: KVM PMU virtualization
From: Avi Kivity @ 2010-02-26 11:16 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 12:44 PM, Ingo Molnar wrote:
>>> Far cleaner would be to expose it via hypercalls to guest OSs that are
>>> interested in instrumentation.
>>>
>> It's also slower - you can give the guest direct access to the various
>> counters so no exits are taken when reading the counters (though perhaps
>> many tools are only interested in the interrupts, not the counter values).
>>
> Direct access to counters is not something that is a big issue. [ Given that I
> sometimes can see KVM redraw the screen of a guest OS in real time, I doubt
> this is the biggest of performance challenges right now ;-) ]
>
Outside 4-bit vga mode, this shouldn't happen. Can you describe your
scenario?
> By far the biggest instrumentation issue is:
>
> - availability
> - usability
> - flexibility
>
> Exposing the raw hw is a step backwards in many regards.
In a way, virtualization as a whole is a step backwards. We take the
nice filesystem/timer/network/scheduler APIs, and expose them as raw
hardware. The pmu isn't any different.
> The same way we don't
> want to expose chipsets to the guest to allow them to do RAS. The same way we
> don't want to expose most raw PCI devices to guests in general, but have all
> these virt driver abstractions.
>
Whenever we have a choice, we expose raw hardware (usually emulated, but
in some cases real). Raw hardware has the huge advantage of being
already supported. Write a software abstraction, and you get to (a)
write and maintain the spec (b) write drivers for all guests (c) mumble
something to users of OSes to which you haven't ported your driver (d)
explain to users that they need to install those drivers.
For networking and block, it is simply impossible to obtain good
performance without introducing a new interface, but for other stuff,
that may not be the case.
>>> That way it could also transparently integrate with tracing, probes, etc.
>>> It would also be wiser to first concentrate on improving Linux<->Linux
>>> guest/host combos before gutting the design just to fit Windows into the
>>> picture ...
>>>
>> "gutting the design"?
>>
> Yes, gutting the design of a sane instrumentation API and moving it back 10-20
> years by squeezing it through non-standardized and incompatible PMU drivers.
>
Any new interface will be incompatible with all the existing guests out
there; and unlike networking, you can't retrofit a pmu interface to an
existing guest.
> When it comes to design, my main interest is the Linux<->Linux combo.
>
My main interest is the OSes that users actually install, and those are
Windows and non-bleeding-edge Linux.
Look at guests as you do at userspace: you don't want to inflict changes
upon them.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: KVM PMU virtualization
From: Jes Sorensen @ 2010-02-26 11:18 UTC (permalink / raw)
To: Joerg Roedel
Cc: Ingo Molnar, Avi Kivity, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 12:06, Joerg Roedel wrote:
> Isn't there a cpuid bit indicating the availability of architectural
> perfmon?
Nope, the perfmon flag is a fake Linux flag, set based on the contents
of CPUID leaf 0x0a.
Jes
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 11:20 UTC (permalink / raw)
To: Joerg Roedel
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
> >
> > * Joerg Roedel <joro@8bytes.org> wrote:
> >
> > > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > > > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> > > >> Note that the 'soft PMU' still sucks from a design POV as there's no generic
> > > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> > > >> Intel' PMU driver at minimum.
> > > >>
> > > >
> > > > Right, this will severely limit migration domains to hosts of the same
> > > > vendor and processor generation. There is a middle ground, though:
> > > > Intel has recently moved to define an "architectural pmu" which is not
> > > > model specific. I don't know if AMD adopted it. We could offer both
> > > > options - native host capabilities, with a loss of compatibility, and
> > > > the architectural pmu, with loss of model specific counters.
> > >
> > > I have only had a quick look at the architectural PMU from Intel so far, but
> > > it looks like it can be emulated for a guest on AMD using existing features.
> >
> > AMD CPUs don't have enough counters for that; they cannot do the 3 fixed events
> > in addition to the 2 generic ones.
>
> Good point. Maybe we can emulate that with some counter round-robin
> usage if the guest really uses all 5 counters.
>
> > Nor do you really want to standardize KVM guests on returning
> > 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
> > drivers, right?
>
> Isn't there a cpuid bit indicating the availability of architectural
> perfmon?
There is, but can you rely on all guest OSs keying their PMU drivers
purely off the CPUID bit and not off any other CPUID aspects?
Guest OSs like ... Linux v2.6.33:
void __init init_hw_perf_events(void)
{
        int err;
        pr_info("Performance Events: ");
        switch (boot_cpu_data.x86_vendor) {
        case X86_VENDOR_INTEL:
                err = intel_pmu_init();
                break;
        case X86_VENDOR_AMD:
                err = amd_pmu_init();
                break;
        default:
Really, if you want to emulate a single Intel PMU driver model you need to
pretend that you are an Intel CPU, throughout. This cannot be had both ways.
Ingo
* Re: KVM PMU virtualization
From: Jes Sorensen @ 2010-02-26 11:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 11:44, Ingo Molnar wrote:
> Direct access to counters is not something that is a big issue. [ Given that I
> sometimes can see KVM redraw the screen of a guest OS in real time, I doubt
> this is the biggest of performance challenges right now ;-) ]
>
> By far the biggest instrumentation issue is:
>
> - availability
> - usability
> - flexibility
>
> Exposing the raw hw is a step backwards in many regards. The same way we don't
> want to expose chipsets to the guest to allow them to do RAS. The same way we
> don't want to expose most raw PCI devices to guests in general, but have all
> these virt driver abstractions.
I have to say I disagree on that. When you run perfmon on a system, it
is normally to measure a specific application. You want to see accurate
numbers for cache misses, mul instructions or whatever else is selected.
Emulating the PMU rather than using the real one makes the numbers far
less useful. The most useful way to provide PMU support in a guest is
to expose the real PMU and let the guest OS program it.
We can do this in a reasonable way today, if we allow taking the PMU
away from the host and only let guests access it when it's in use.
Hopefully Intel and AMD will come up with proper hw PMU virtualization
support that allows us to do this 100% in both guest and host at some point.
Cheers,
Jes
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 11:24 UTC (permalink / raw)
To: Jes Sorensen
Cc: Joerg Roedel, Avi Kivity, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Jes Sorensen <Jes.Sorensen@redhat.com> wrote:
> On 02/26/10 12:06, Joerg Roedel wrote:
>
> > Isn't there a cpuid bit indicating the availability of architectural
> > perfmon?
>
> Nope, the perfmon flag is a fake Linux flag, set based on the contents
> of CPUID leaf 0x0a.
There is a way to query the CPU for 'architectural perfmon' though, via CPUID
alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic
is:
        if (c->cpuid_level > 9) {
                unsigned eax = cpuid_eax(10);
                /* Check for version and the number of counters */
                if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
                        set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
        }
But emulating that doesn't solve the problem, as OSs generally don't key their
PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but
off much higher level CPUID attributes (like Intel/AMD).
Ingo
* Re: KVM PMU virtualization
From: Jes Sorensen @ 2010-02-26 11:25 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Avi Kivity, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 12:24, Ingo Molnar wrote:
> There is a way to query the CPU for 'architectural perfmon' though, via CPUID
> alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic
> is:
>
>         if (c->cpuid_level > 9) {
>                 unsigned eax = cpuid_eax(10);
>                 /* Check for version and the number of counters */
>                 if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
>                         set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
>         }
>
> But emulating that doesn't solve the problem, as OSs generally don't key their
> PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but
> off much higher level CPUID attributes (like Intel/AMD).
Right, there is far more to it than just the arch-perfmon feature. They
still need to query cpuid 0x0a for counter size, number of counters and
stuff like that.
Cheers,
Jes
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 11:26 UTC (permalink / raw)
To: Avi Kivity
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> On 02/26/2010 12:44 PM, Ingo Molnar wrote:
> >>>Far cleaner would be to expose it via hypercalls to guest OSs that are
> >>>interested in instrumentation.
> >>It's also slower - you can give the guest direct access to the various
> >>counters so no exits are taken when reading the counters (though perhaps
> >>many tools are only interested in the interrupts, not the counter values).
> >Direct access to counters is not something that is a big issue. [ Given that I
> >sometimes can see KVM redraw the screen of a guest OS in real time, I doubt
> >this is the biggest of performance challenges right now ;-) ]
>
> Outside 4-bit vga mode, this shouldn't happen. Can you describe
> your scenario?
>
> >By far the biggest instrumentation issue is:
> >
> > - availability
> > - usability
> > - flexibility
> >
> >Exposing the raw hw is a step backwards in many regards.
>
>> In a way, virtualization as a whole is a step backwards. We take the nice
>> filesystem/timer/network/scheduler APIs, and expose them as raw hardware.
>> The pmu isn't any different.
Uhm, it's obviously very different. A fake NE2000 will work on both Intel and
AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor
though.
So there's no "generic hardware" to emulate.
Ingo
* Re: KVM PMU virtualization
From: Ingo Molnar @ 2010-02-26 11:42 UTC (permalink / raw)
To: Jes Sorensen
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Jes Sorensen <Jes.Sorensen@redhat.com> wrote:
> On 02/26/10 11:44, Ingo Molnar wrote:
> >Direct access to counters is not something that is a big issue. [ Given that I
> >sometimes can see KVM redraw the screen of a guest OS in real time, I doubt
> >this is the biggest of performance challenges right now ;-) ]
> >
> >By far the biggest instrumentation issue is:
> >
> > - availability
> > - usability
> > - flexibility
> >
> >Exposing the raw hw is a step backwards in many regards. The same way we don't
> >want to expose chipsets to the guest to allow them to do RAS. The same way we
> >don't want to expose most raw PCI devices to guests in general, but have all
> >these virt driver abstractions.
>
> I have to say I disagree on that. When you run perfmon on a system, it is
> normally to measure a specific application. You want to see accurate numbers
> for cache misses, mul instructions or whatever else is selected.
You can still get those. You can even enable RDPMC access and avoid VM exits.
What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
> Emulating the PMU rather than using the real one makes the numbers far less
> useful. The most useful way to provide PMU support in a guest is to expose
> the real PMU and let the guest OS program it.
Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.
Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
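(And the RDPMC fast path itself is trivial - roughly this, assuming the
host has set CR4.PCE so that ring-3 access is permitted:)
#include <stdint.h>
static inline uint64_t read_pmc(uint32_t counter)
{
        uint32_t lo, hi;
        /* ECX selects the counter, the value returns in EDX:EAX;
           this faults with #GP if CR4.PCE disallows ring-3 access */
        asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
        return ((uint64_t)hi << 32) | lo;
}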
> We can do this in a reasonable way today, if we allow to take the PMU away
> from the host, and only let guests access it when it's in use. [...]
You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.
You need to integrate it properly so that host PMU functionality still works
fine. (Within hardware constraints)
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:26 ` Ingo Molnar
@ 2010-02-26 11:47 ` Avi Kivity
0 siblings, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 11:47 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 01:26 PM, Ingo Molnar wrote:
>>
>>> By far the biggest instrumentation issue is:
>>>
>>> - availability
>>> - usability
>>> - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards.
>>>
>> In a way, virtualization as a whole is a step backwards. We take the nice
>> filesystem/timer/network/scheduler APIs, and expose them as raw hardware.
>> The pmu isn't any different.
>>
> Uhm, it's obviously very different. A fake NE2000 will work on both Intel and
> AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor
> though.
>
> So there's no "generic hardware" to emulate.
>
That's true, and it reduces the usability of the feature (you have to
restrict your migration pools or not expose the pmu), but the general
points still stand.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:42 ` Ingo Molnar
@ 2010-02-26 11:51 ` Avi Kivity
2010-02-26 12:07 ` Ingo Molnar
2010-02-26 13:28 ` Peter Zijlstra
2010-02-26 12:49 ` Jes Sorensen
2010-03-01 17:22 ` Zachary Amsden
2 siblings, 2 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 11:51 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 01:42 PM, Ingo Molnar wrote:
> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>
>
>> On 02/26/10 11:44, Ingo Molnar wrote:
>>
>>> Direct access to counters is not something that is a big issue. [ Given that i
>>> sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
>>> is the biggest of performance challenges right now ;-) ]
>>>
>>> By far the biggest instrumentation issue is:
>>>
>>> - availability
>>> - usability
>>> - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards. The same way we dont
>>> want to expose chipsets to the guest to allow them to do RAS. The same way we
>>> dont want to expose most raw PCI devices to guest in general, but have all
>>> these virt driver abstractions.
>>>
>> I have to say I disagree on that. When you run perfmon on a system, it is
>> normally to measure a specific application. You want to see accurate numbers
>> for cache misses, mul instructions or whatever else is selected.
>>
> You can still get those. You can even enable RDPMC access and avoid VM exits.
>
> What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
>
Agreed - if both the host and guest want the pmu, the host wins. This
is what we do with debug registers - if both the host and guest contend
for them, the host wins.
>> Emulating the PMU rather than using the real one, makes the numbers far less
>> useful. The most useful way to provide PMU support in a guest is to expose
>> the real PMU and let the guest OS program it.
>>
> Firstly, an emulated PMU was only the second-tier option i suggested. By far
> the best approach is native API to the host regarding performance events and
> good guest side integration.
>
A native API to the host will lock out 100% of the install base now, and
a large section of any future install base.
> Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> are privileged registers. They can expose sensitive host execution details,
> etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> anyway for a secure solution. (RDPMC can still be supported, but in close
> cooperation with the host)
>
No, stop and restart the counters on every exit/entry, so the guest
doesn't observe any host data.
>> We can do this in a reasonable way today, if we allow to take the PMU away
>> from the host, and only let guests access it when it's in use. [...]
>>
> You get my sure-fire NAK for that kind of crap though. Interfering with the
> host PMU and stealing it, is not a technical approach that has acceptable
> quality.
>
>
It would be the other way round - the host would steal the pmu from the
guest. Later we can try to time-slice and extrapolate, though that's
not going to be easy.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:51 ` Avi Kivity
@ 2010-02-26 12:07 ` Ingo Molnar
2010-02-26 12:20 ` Avi Kivity
2010-02-26 13:28 ` Peter Zijlstra
1 sibling, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 12:07 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> A native API to the host will lock out 100% of the install base now, and a
> large section of any future install base.
... which is why i suggested the soft-PMU approach.
And note that _any_ solution we offer locks out 100% of the installed base
right now, as no solution is in the kernel yet. The only question is what kind
of upgrade effort is needed for users to make use of the feature.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:07 ` Ingo Molnar
@ 2010-02-26 12:20 ` Avi Kivity
2010-02-26 12:38 ` Ingo Molnar
` (2 more replies)
0 siblings, 3 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 12:20 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 02:07 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>> A native API to the host will lock out 100% of the install base now, and a
>> large section of any future install base.
>>
> ... which is why i suggested the soft-PMU approach.
>
Not sure I understand it completely.
Do you mean to take the model specific host pmu events, and expose them
to the guest via trap'n'emulate? In that case we may as well assign the
host pmu to the guest if the host isn't using it, and avoid the traps.
Do you mean to choose some older pmu and emulate it using whatever pmu
model the host has? I haven't checked, but aren't there mutually
exclusive events in every model pair? The closest thing would be the
architectural pmu thing.
Or do you mean to define a new, kvm-specific pmu model and feed it off
the host pmu? In this case all the guests will need to be taught about
it, which raises the compatibility problem.
> And note that _any_ solution we offer locks out 100% of the installed base
> right now, as no solution is in the kernel yet. The only question is what kind
> of upgrade effort is needed for users to make use of the feature.
>
I meant the guest installed base. Hosts can be upgraded transparently
to the guests (not even a shutdown/reboot).
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:20 ` Avi Kivity
@ 2010-02-26 12:38 ` Ingo Molnar
2010-02-26 13:04 ` Avi Kivity
2010-02-26 12:56 ` Jes Sorensen
2010-02-26 13:31 ` Ingo Molnar
2 siblings, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 12:38 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> On 02/26/2010 02:07 PM, Ingo Molnar wrote:
> >* Avi Kivity<avi@redhat.com> wrote:
> >
> >>A native API to the host will lock out 100% of the install base now, and a
> >>large section of any future install base.
> >... which is why i suggested the soft-PMU approach.
>
> Not sure I understand it completely.
>
> Do you mean to take the model specific host pmu events, and expose them to
> the guest via trap'n'emulate? In that case we may as well assign the host
> pmu to the guest if the host isn't using it, and avoid the traps.
You are making the incorrect assumption that the emulated PMU uses up all host
PMU resources ...
> Do you mean to choose some older pmu and emulate it using whatever pmu model
> the host has? I haven't checked, but aren't there mutually exclusive events
> in every model pair? The closest thing would be the architectural pmu
> thing.
Yes, something like Core2 with 2 generic events.
That would leave 2 extra generic events on Nehalem and better. (which is
really the target CPU type for any new feature we are talking about right now.
Plus performance analysis tends to skew towards more modern CPU types as
well.)
Plus the emulation can be smart about it and only use up a given number. Most
guest OSs dont use the full PMU - they use a single counter.
Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates
events on an as-needed basis.
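(Sketching the guest half of such a driver - note that
KVM_HC_PMU_ALLOC_EVENT is a made-up hypercall number, nothing like it
exists today:)
#include <asm/kvm_para.h>
#define KVM_HC_PMU_ALLOC_EVENT  42      /* hypothetical, for illustration */
/* guest side: ask the host to allocate one perf event on our behalf,
   instead of programming raw PMU MSRs */
static long pv_pmu_alloc_event(unsigned long type, unsigned long config)
{
        return kvm_hypercall2(KVM_HC_PMU_ALLOC_EVENT, type, config);
}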
> Or do you mean to define a new, kvm-specific pmu model and feed it off the
> host pmu? In this case all the guests will need to be taught about it,
> which raises the compatibility problem.
>
> > And note that _any_ solution we offer locks out 100% of the installed base
> > right now, as no solution is in the kernel yet. The only question is what
> > kind of upgrade effort is needed for users to make use of the feature.
>
> I meant the guest installed base. Hosts can be upgraded transparently to
> the guests (not even a shutdown/reboot).
The irony: this time guest-transparent solutions that need no configuration
are good? ;-)
The very same argument holds for the file server thing: a guest transparent
solution is easier wrt. the upgrade path.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:42 ` Ingo Molnar
2010-02-26 11:51 ` Avi Kivity
@ 2010-02-26 12:49 ` Jes Sorensen
2010-02-26 13:06 ` Ingo Molnar
2010-03-01 17:22 ` Zachary Amsden
2 siblings, 1 reply; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 12:49 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 12:42, Ingo Molnar wrote:
>
> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>> I have to say I disagree on that. When you run perfmon on a system, it is
>> normally to measure a specific application. You want to see accurate numbers
>> for cache misses, mul instructions or whatever else is selected.
>
> You can still get those. You can even enable RDPMC access and avoid VM exits.
>
> What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
Well you cannot steal the PMU without collaborating with perf_event.c,
but that's quite feasible. Sharing the PMU between the guest and the host
is very costly and guarantees incorrect results in the host. Unless you
completely emulate the PMU by faking it and then allocating PMU counters
one by one at the host level. However that means trapping a lot of MSR
access.
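To give an idea of the surface involved: a Core2-style guest touches at
least these MSRs, every one of which a trap-and-emulate scheme has to
intercept (a sketch, using the msr-index.h constants):
#include <linux/types.h>
#include <asm/msr-index.h>
/* PMU MSRs a trap-and-emulate implementation would have to watch */
static const u32 pmu_msrs[] = {
        MSR_P6_EVNTSEL0, MSR_P6_EVNTSEL1,       /* event selects */
        MSR_P6_PERFCTR0, MSR_P6_PERFCTR1,       /* generic counters */
        MSR_CORE_PERF_FIXED_CTR0,               /* fixed counters 0-2 */
        MSR_CORE_PERF_FIXED_CTR1,
        MSR_CORE_PERF_FIXED_CTR2,
        MSR_CORE_PERF_FIXED_CTR_CTRL,           /* fixed counter control */
        MSR_CORE_PERF_GLOBAL_CTRL,              /* global enable bits */
        MSR_CORE_PERF_GLOBAL_STATUS,            /* overflow status */
        MSR_CORE_PERF_GLOBAL_OVF_CTRL,          /* overflow clear */
};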
> Firstly, an emulated PMU was only the second-tier option i suggested. By far
> the best approach is native API to the host regarding performance events and
> good guest side integration.
>
> Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> are privileged registers. They can expose sensitive host execution details,
> etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> anyway for a secure solution. (RDPMC can still be supported, but in close
> cooperation with the host)
There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.
>> We can do this in a reasonable way today, if we allow to take the PMU away
>> from the host, and only let guests access it when it's in use. [...]
>
> You get my sure-fire NAK for that kind of crap though. Interfering with the
> host PMU and stealing it, is not a technical approach that has acceptable
> quality.
Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest,
the host knows not to touch it until it's been released again.
> You need to integrate it properly so that host PMU functionality still works
> fine. (Within hardware constraints)
Well with the hardware currently available, there is no such thing as
clean sharing between the host and the guest. It cannot be done without
messing up the host measurements, which effectively renders measuring at
the host side useless while a guest is allowed access to the PMU.
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:20 ` Avi Kivity
2010-02-26 12:38 ` Ingo Molnar
@ 2010-02-26 12:56 ` Jes Sorensen
2010-02-26 13:31 ` Ingo Molnar
2 siblings, 0 replies; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 12:56 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 13:20, Avi Kivity wrote:
> On 02/26/2010 02:07 PM, Ingo Molnar wrote:
>> ... which is why i suggested the soft-PMU approach.
>
> Not sure I understand it completely.
>
> Do you mean to take the model specific host pmu events, and expose them
> to the guest via trap'n'emulate? In that case we may as well assign the
> host pmu to the guest if the host isn't using it, and avoid the traps.
>
> Do you mean to choose some older pmu and emulate it using whatever pmu
> model the host has? I haven't checked, but aren't there mutually
> exclusive events in every model pair? The closest thing would be the
> architectural pmu thing.
You cannot do this; as you say, there is no guarantee that there are no
overlaps, and the current host may have different counter sizes too,
which makes emulating it even more costly.
The cpuid bits basically tell you which version of the counters is
available, how many counters there are, the word size of the counters,
and I believe there are also bits stating which optional features are
available to be counted.
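(Concretely, that is CPUID leaf 0xA - decoding it is straightforward,
roughly:)
#include <stdint.h>
#include <stdio.h>
int main(void)
{
        uint32_t eax, ebx, ecx, edx;
        /* CPUID leaf 0xA: architectural performance monitoring */
        asm volatile("cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (0xa), "c" (0));
        printf("perfmon version:  %u\n", eax & 0xff);
        printf("generic counters: %u\n", (eax >> 8) & 0xff);
        printf("counter width:    %u bits\n", (eax >> 16) & 0xff);
        printf("fixed counters:   %u\n", edx & 0x1f);
        printf("fixed width:      %u bits\n", (edx >> 5) & 0xff);
        return 0;
}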
> Or do you mean to define a new, kvm-specific pmu model and feed it off
> the host pmu? In this case all the guests will need to be taught about
> it, which raises the compatibility problem.
Cannot be done in a reasonable manner due to the above.
The key to all of this is that guest OSes, including that other OS,
should be able to use the performance counters without needing special
paravirt drivers or other OS modifications. If we start requiring that
kind of stuff, the whole point of having the feature goes down the
toilet.
Cheers,
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:38 ` Ingo Molnar
@ 2010-02-26 13:04 ` Avi Kivity
2010-02-26 13:13 ` Jes Sorensen
2010-02-26 13:18 ` Ingo Molnar
0 siblings, 2 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 02:38 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>> On 02/26/2010 02:07 PM, Ingo Molnar wrote:
>>
>>> * Avi Kivity<avi@redhat.com> wrote:
>>>
>>>
>>>> A native API to the host will lock out 100% of the install base now, and a
>>>> large section of any future install base.
>>>>
>>> ... which is why i suggested the soft-PMU approach.
>>>
>> Not sure I understand it completely.
>>
>> Do you mean to take the model specific host pmu events, and expose them to
>> the guest via trap'n'emulate? In that case we may as well assign the host
>> pmu to the guest if the host isn't using it, and avoid the traps.
>>
> You are making the incorrect assumption that the emulated PMU uses up all host
> PMU resources ...
>
Well, in the general case, it may? If it doesn't, the host may use
them. We do a similar thing with debug breakpoints.
Sharing the pmu will mean trapping control msr writes at least, though.
>> Do you mean to choose some older pmu and emulate it using whatever pmu model
>> the host has? I haven't checked, but aren't there mutually exclusive events
>> in every model pair? The closest thing would be the architectural pmu
>> thing.
>>
> Yes, something like Core2 with 2 generic events.
>
> That would leave 2 extra generic events on Nehalem and better. (which is
> really the target CPU type for any new feature we are talking about right now.
> Plus performance analysis tends to skew towards more modern CPU types as
> well.)
>
Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very
different instruction caches so I imagine the events are very different
as well.
Agree about favouring modern processors.
> Plus the emulation can be smart about it and only use up a given number. Most
> guest OSs dont use the full PMU - they use a single counter.
>
But you have to expose all of the counters, no? Unless you go with a
kvm-specific pmu as described below.
> Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates
> events on an as-needed basis.
>
Or we could watch the control register and see how the guest programs
it, provided it doesn't do that a lot.
>> Or do you mean to define a new, kvm-specific pmu model and feed it off the
>> host pmu? In this case all the guests will need to be taught about it,
>> which raises the compatibility problem.
>>
>>
>>> And note that _any_ solution we offer locks out 100% of the installed base
>>> right now, as no solution is in the kernel yet. The only question is what
>>> kind of upgrade effort is needed for users to make use of the feature.
>>>
>> I meant the guest installed base. Hosts can be upgraded transparently to
>> the guests (not even a shutdown/reboot).
>>
> The irony: this time guest-transparent solutions that need no configuration
> are good? ;-)
>
> The very same argument holds for the file server thing: a guest transparent
> solution is easier wrt. the upgrade path.
>
If we add pmu support, guests can begin to use it immediately. If we
add the file server support, guests need to install drivers before they
can use it, while guest admins have no motivation to do so (it helps the
host, not the guest).
Is something wrong with just using sshfs? Seems a lot less hassle to me.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:49 ` Jes Sorensen
@ 2010-02-26 13:06 ` Ingo Molnar
2010-02-26 13:30 ` Avi Kivity
2010-02-26 13:31 ` Jes Sorensen
0 siblings, 2 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 13:06 UTC (permalink / raw)
To: Jes Sorensen
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Jes Sorensen <Jes.Sorensen@redhat.com> wrote:
> On 02/26/10 12:42, Ingo Molnar wrote:
> >
> >* Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
> >>
> >> I have to say I disagree on that. When you run perfmon on a system, it is
> >> normally to measure a specific application. You want to see accurate
> >> numbers for cache misses, mul instructions or whatever else is selected.
> >
> > You can still get those. You can even enable RDPMC access and avoid VM
> > exits.
> >
> > What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
>
> Well you cannot steal the PMU without collaborating with perf_event.c, but
> that's quite feasible. Sharing the PMU between the guest and the host is very
> costly and guarantees incorrect results in the host. Unless you completely
> emulate the PMU by faking it and then allocating PMU counters one by one at
> the host level. However that means trapping a lot of MSR access.
It's not that many MSR accesses.
> >Firstly, an emulated PMU was only the second-tier option i suggested. By far
> >the best approach is native API to the host regarding performance events and
> >good guest side integration.
> >
> >Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> >are privileged registers. They can expose sensitive host execution details,
> >etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> >anyway for a secure solution. (RDPMC can still be supported, but in close
> >cooperation with the host)
>
> There is nothing secret in the host PMU, and it's easy to clear out the
> counters before passing them off to the guest.
That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.
There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.
> >>We can do this in a reasonable way today, if we allow to take the PMU away
> >>from the host, and only let guests access it when it's in use. [...]
> >
> >You get my sure-fire NAK for that kind of crap though. Interfering with the
> >host PMU and stealing it, is not a technical approach that has acceptable
> >quality.
>
> Having an allocation scheme and sharing it with the host, is a perfectly
> legitimate and very clean way to do it. Once it's given to the guest, the
> host knows not to touch it until it's been released again.
'Full PMU' is not the granularity i find acceptable though: please do what i
suggested, event granularity allocation and scheduling.
We are rehashing the whole 'perfmon versus perf events/counters' design
arguments again here really.
> > You need to integrate it properly so that host PMU functionality still
> > works fine. (Within hardware constraints)
>
> Well with the hardware currently available, there is no such thing as clean
> sharing between the host and the guest. It cannot be done without messing up
> the host measurements, which effectively renders measuring at the host side
> useless while a guest is allowed access to the PMU.
That's precisely my point: the guest should obviously not get raw access to
the PMU. (except where it might matter to performance, such as RDPMC)
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:04 ` Avi Kivity
@ 2010-02-26 13:13 ` Jes Sorensen
2010-02-26 13:27 ` Ingo Molnar
2010-02-26 13:18 ` Ingo Molnar
1 sibling, 1 reply; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:13 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:04, Avi Kivity wrote:
> On 02/26/2010 02:38 PM, Ingo Molnar wrote:
>> Yes, something like Core2 with 2 generic events.
>>
>> That would leave 2 extra generic events on Nehalem and better. (which is
>> really the target CPU type for any new feature we are talking about
>> right now.
>> Plus performance analysis tends to skew towards more modern CPU types as
>> well.)
>
> Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very
> different instruction caches so I imagine the events are very different
> as well.
>
> Agree about favouring modern processors.
You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
whereas Nehalem and Atom are v3 if I remember correctly. I am not even
100% sure a v3 is capable of emulating a v2, though I expect v3 to have
bigger counters than v2, but I don't think that is guaranteed. I can
only handle so many hours of reading Intel manuals per day, before I end
up in a padded cell, so I could be wrong on some of this.
>> Plus the emulation can be smart about it and only use up a given
>> number. Most
>> guest OSs dont use the full PMU - they use a single counter.
>
> But you have to expose all of the counters, no? Unless you go with a
> kvm-specific pmu as described below.
You have to, at least all the fixed ones (3 on Core2) and the two arch
ones. That's the minimum and any guest being told it's running on a Core2
will expect to find those.
Cheers,
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:04 ` Avi Kivity
2010-02-26 13:13 ` Jes Sorensen
@ 2010-02-26 13:18 ` Ingo Molnar
2010-02-26 13:34 ` Jes Sorensen
1 sibling, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 13:18 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> Can you emulate the Core 2 pmu on, say, a P4? [...]
How about the Pentium? Or the i486?
As long as there's perf events support, the CPU can be supported in a soft
PMU. You can even cross-map exotic hw events if need be - but most of the
tooling (in just about any OS) uses just a handful of core events ...
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:13 ` Jes Sorensen
@ 2010-02-26 13:27 ` Ingo Molnar
2010-02-26 13:33 ` Avi Kivity
2010-02-26 14:07 ` Jes Sorensen
0 siblings, 2 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 13:27 UTC (permalink / raw)
To: Jes Sorensen
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Jes Sorensen <Jes.Sorensen@redhat.com> wrote:
> > Agree about favouring modern processors.
>
> You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
> whereas Nehalem and Atom are v3 if I remember correctly. [...]
Of course you can emulate a good portion of it, as long as there's perf
support on the host side for P4.
If the guest programs a cachemiss event, you program a cachemiss perf event on
the host and feed its values to the emulated MSR state. You _dont_ program the
raw PMU on the host side - just use the API i outlined to get struct
perf_event.
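I.e. roughly this on the host side (a sketch only - the exact
perf_event_create_kernel_counter() signature may differ between kernel
versions):
#include <linux/perf_event.h>
/* host side: back one emulated guest counter with a real perf event */
static struct perf_event *back_guest_counter(int cpu)
{
        struct perf_event_attr attr = {
                .size   = sizeof(attr),
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CACHE_MISSES,
        };
        return perf_event_create_kernel_counter(&attr, cpu, -1, NULL);
}
The guest-visible MSR read handler then reports the event's count back
(e.g. via perf_event_read_value()) instead of touching the hardware.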
The emulation wont be perfect: not all events will count and not all events
will be available in a P4 (and some Core2 events might not even make sense in
a P4), but that is reality as well: often documented events dont count, and
often non-documented events count.
What matters to 99.9% of people who actually use this stuff is a few core sets
of events - which are available in P4s and in Core2 as well. Cycles,
instructions, branches, maybe cache-misses. Sometimes FPU stuff.
For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open()
on the guest side over to the host, transparently, via a paravirt driver.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:51 ` Avi Kivity
2010-02-26 12:07 ` Ingo Molnar
@ 2010-02-26 13:28 ` Peter Zijlstra
2010-02-26 13:44 ` Avi Kivity
2010-02-26 13:51 ` Jes Sorensen
1 sibling, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 13:28 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Jes Sorensen, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:
> It would be the other way round - the host would steal the pmu from the
> guest. Later we can try to time-slice and extrapolate, though that's
> not going to be easy.
Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.
Anyway, this discussion seems somewhat in a stalemate position.
The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification.
My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.
So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.
Furthermore, when KVM doesn't virtualize the physical system topology,
some PMU features cannot even be sanely used from a vcpu.
So while currently a root user can already tie up all of the pmu using
perf, simply using that to hand the full pmu off to the guest still
leaves lots of issues.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:06 ` Ingo Molnar
@ 2010-02-26 13:30 ` Avi Kivity
2010-02-26 13:32 ` Jes Sorensen
` (3 more replies)
2010-02-26 13:31 ` Jes Sorensen
1 sibling, 4 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:06 PM, Ingo Molnar wrote:
>
>>> Firstly, an emulated PMU was only the second-tier option i suggested. By far
>>> the best approach is native API to the host regarding performance events and
>>> good guest side integration.
>>>
>>> Secondly, the PMU cannot be 'given' to the guest in the general case. Those
>>> are privileged registers. They can expose sensitive host execution details,
>>> etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
>>> anyway for a secure solution. (RDPMC can still be supported, but in close
>>> cooperation with the host)
>>>
>> There is nothing secret in the host PMU, and it's easy to clear out the
>> counters before passing them off to the guest.
>>
> That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
> another CPU, allowing statistical attacks to recover crypto keys. It can be
> used to sample memory access patterns of another node.
>
> There's a good reason PMU configuration registers are privileged and there's
> good value in only giving a certain sub-set to less privileged entities by
> default.
>
Even if there were no security considerations, if the guest can observe
host data in the pmu, it means the pmu is inaccurate. We should expose
guest data only in the guest pmu. That's not difficult to do, you stop
the pmu on exit and swap the counters on context switches.
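Roughly like this - a sketch that ignores the fixed counters, assumes a
two-counter Core2-style pmu, and where struct guest_pmu is hypothetical
bookkeeping:
#include <linux/types.h>
#include <asm/msr.h>
struct guest_pmu {
        u64 host_ctr[2];
        u64 guest_ctr[2];
        u64 guest_global_ctrl;
};
static void pmu_guest_enter(struct guest_pmu *g)
{
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);           /* stop counting */
        rdmsrl(MSR_P6_PERFCTR0, g->host_ctr[0]);        /* stash host state */
        rdmsrl(MSR_P6_PERFCTR1, g->host_ctr[1]);
        wrmsrl(MSR_P6_PERFCTR0, g->guest_ctr[0]);       /* load guest state */
        wrmsrl(MSR_P6_PERFCTR1, g->guest_ctr[1]);
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, g->guest_global_ctrl);
}
(pmu_guest_exit() does the mirror image.)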
>> Having an allocation scheme and sharing it with the host, is a perfectly
>> legitimate and very clean way to do it. Once it's given to the guest, the
>> host knows not to touch it until it's been released again.
>>
> 'Full PMU' is not the granularity i find acceptable though: please do what i
> suggested, event granularity allocation and scheduling.
>
> We are rehashing the whole 'perfmon versus perf events/counters' design
> arguments again here really.
>
Scheduling at event granularity would be a good thing. However we need
to be able to handle the guest using the full pmu.
Note that scheduling is only needed if both the guest and host want the
pmu at the same time - and that should be a rare case and not the one to
optimize for.
>>> You need to integrate it properly so that host PMU functionality still
>>> works fine. (Within hardware constraints)
>>>
>> Well with the hardware currently available, there is no such thing as clean
>> sharing between the host and the guest. It cannot be done without messing up
>> the host measurements, which effectively renders measuring at the host side
>> useless while a guest is allowed access to the PMU.
>>
> That's precisely my point: the guest should obviously not get raw access to
> the PMU. (except where it might matter to performance, such as RDPMC)
>
That's doable if all counters are steerable. IIRC some counters are
fixed function, but I'm not certain about that.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:06 ` Ingo Molnar
2010-02-26 13:30 ` Avi Kivity
@ 2010-02-26 13:31 ` Jes Sorensen
1 sibling, 0 replies; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:06, Ingo Molnar wrote:
>
> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>> Well you cannot steal the PMU without collaborating with perf_event.c, but
>> thats quite feasible. Sharing the PMU between the guest and the host is very
>> costly and guarantees incorrect results in the host. Unless you completely
>> emulate the PMU by faking it and then allocating PMU counters one by one at
>> the host level. However that means trapping a lot of MSR access.
>
> It's not that many MSR accesses.
Well it's more than enough to double the number of MSRs KVM has to track
on switches.
>> There is nothing secret in the host PMU, and it's easy to clear out the
>> counters before passing them off to the guest.
>
> That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
> another CPU, allowing statistical attacks to recover crypto keys. It can be
> used to sample memory access patterns of another node.
>
> There's a good reason PMU configuration registers are privileged and there's
> good value in only giving a certain sub-set to less privileged entities by
> default.
If a PMU can really count stuff on another CPU, then we shouldn't allow
PMU access to any application at all. It's more than just a KVM guest vs
a KVM guest issue then, but also a thread-to-thread issue.
My idea was obviously not to expose host timings to a guest. Save the
counters when a guest exits, and reload them when it's restarted. Not
just when switching to another task, but also when entering KVM, to
avoid the guest seeing overhead spent within KVM.
>> Having an allocation scheme and sharing it with the host, is a perfectly
>> legitimate and very clean way to do it. Once it's given to the guest, the
>> host knows not to touch it until it's been released again.
>
> 'Full PMU' is not the granularity i find acceptable though: please do what i
> suggested, event granularity allocation and scheduling.
As I wrote earlier, at that level we have to do it all emulated. In
this case, providing any of this to a guest seems to be a waste of time
since the interface will cost way too much in trapping back and forth
and you have contention with the very limited resources in the PMU with
just 5 counters to pick from on Core2.
The guest PMU will think it's running on top of real hardware, and will
scale/estimate numbers like the perf_event.c code does today, except
that it will be using already scaled and estimated numbers for its
calculations. Application users will have little use for this.
>> Well with the hardware currently available, there is no such thing as clean
>> sharing between the host and the guest. It cannot be done without messing up
>> the host measurements, which effectively renders measuring at the host side
>> useless while a guest is allowed access to the PMU.
>
> That's precisely my point: the guest should obviously not get raw access to
> the PMU. (except where it might matter to performance, such as RDPMC)
Well either you allow access to the PMU or you don't. If you allow
direct access to the PMU counters, but not the control registers, you
have to specify the counter sizes to match that of the host, making it
impossible to really emulate Core2 on a non-Core2 architecture, etc.
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 12:20 ` Avi Kivity
2010-02-26 12:38 ` Ingo Molnar
2010-02-26 12:56 ` Jes Sorensen
@ 2010-02-26 13:31 ` Ingo Molnar
2010-02-26 13:37 ` Jes Sorensen
2010-02-26 13:40 ` Avi Kivity
2 siblings, 2 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 13:31 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> Or do you mean to define a new, kvm-specific pmu model and feed it off the
> host pmu? In this case all the guests will need to be taught about it,
> which raises the compatibility problem.
You are missing two big things wrt. compatibility here:
1) The first upgrade overhead is a one-time overhead only.
2) Once a Linux guest has upgraded, it will work in the future, with _any_
future CPU - _without_ having to upgrade the guest!
Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.
With the 'steal the PMU' messy approach the guest OS has to be upgraded to the
new CPU type all the time. Ad infinitum.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:30 ` Avi Kivity
@ 2010-02-26 13:32 ` Jes Sorensen
2010-02-26 13:44 ` Ingo Molnar
` (2 subsequent siblings)
3 siblings, 0 replies; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:32 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:30, Avi Kivity wrote:
> On 02/26/2010 03:06 PM, Ingo Molnar wrote:
>> That's precisely my point: the guest should obviously not get raw
>> access to
>> the PMU. (except where it might matter to performance, such as RDPMC)
>
> That's doable if all counters are steerable. IIRC some counters are
> fixed function, but I'm not certain about that.
I am not an expert, but from what I learned from Peter, there are
constraints on some of the counters. Ie. certain types of events can
only be counted on certain counters, which limits the already very
limited number of counters even further.
Cheers,
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:27 ` Ingo Molnar
@ 2010-02-26 13:33 ` Avi Kivity
2010-02-26 14:07 ` Jes Sorensen
1 sibling, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:33 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:27 PM, Ingo Molnar wrote:
>
> For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open()
> on the guest side over to the host, transparently, via a paravirt driver.
>
Let us for the purpose of this discussion assume that we are also
interested in supporting Windows and older Linux. Paravirt
optimizations can be added after we have the basic functionality, if
they prove necessary.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:18 ` Ingo Molnar
@ 2010-02-26 13:34 ` Jes Sorensen
0 siblings, 0 replies; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:34 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:18, Ingo Molnar wrote:
>
> * Avi Kivity<avi@redhat.com> wrote:
>
>> Can you emulate the Core 2 pmu on, say, a P4? [...]
>
> How about the Pentium? Or the i486?
>
> As long as there's perf events support, the CPU can be supported in a soft
> PMU. You can even cross-map exotic hw events if need to be - but most of the
> tooling (in just about any OS) uses just a handful of core events ...
This is only possible if all future CPU perfmon events are guaranteed
to be a superset of previous versions. Otherwise you end up emulating
events and providing randomly generated numbers back.
The perfmon revision and size we present to a guest has to match the
current host.
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:31 ` Ingo Molnar
@ 2010-02-26 13:37 ` Jes Sorensen
2010-02-26 13:55 ` Avi Kivity
2010-03-01 18:54 ` Zachary Amsden
2010-02-26 13:40 ` Avi Kivity
1 sibling, 2 replies; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:37 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:31, Ingo Molnar wrote:
> You are missing two big things wrt. compatibility here:
>
> 1) The first upgrade overhead is a one-time overhead only.
>
> 2) Once a Linux guest has upgraded, it will work in the future, with _any_
> future CPU - _without_ having to upgrade the guest!
>
> Dont you see the advantage of that? You can instrument an old system on new
> hardware, without having to upgrade that guest for the new CPU support.
That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.
> With the 'steal the PMU' messy approach the guest OS has to be upgraded to the
> new CPU type all the time. Ad infinitum.
The way the Perfmon architecture is specified by Intel, that is what we
are stuck with. It's not going to be possible via software emulation to
count cache misses, unless you run it in a microarchitecture emulator.
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:31 ` Ingo Molnar
2010-02-26 13:37 ` Jes Sorensen
@ 2010-02-26 13:40 ` Avi Kivity
2010-02-26 14:01 ` Ingo Molnar
1 sibling, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:40 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:31 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>> Or do you mean to define a new, kvm-specific pmu model and feed it off the
>> host pmu? In this case all the guests will need to be taught about it,
>> which raises the compatibility problem.
>>
> You are missing two big things wrt. compatibility here:
>
> 1) The first upgrade overhead is a one-time overhead only.
>
May be one too many, for certain guests. Of course it may be argued
that if the guest wants performance monitoring that much, they will upgrade.
Certainly guests that we don't port won't be able to use this. I doubt
we'll be able to make Windows work with this - the only performance tool
I'm familiar with on Windows is Intel's VTune, and that's proprietary.
> 2) Once a Linux guest has upgraded, it will work in the future, with _any_
> future CPU - _without_ having to upgrade the guest!
>
> Dont you see the advantage of that? You can instrument an old system on new
> hardware, without having to upgrade that guest for the new CPU support.
>
That also works for the architectural pmu, of course that's Intel only.
And there you don't need to upgrade the guest even once.
The arch pmu seems nicely done - there's a bit for every counter that
can be enabled and disabled at will, and the number of counters is also
determined from cpuid.
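(For reference, those enable bits live in IA32_PERF_GLOBAL_CTRL - a
kernel-side sketch:)
#include <linux/types.h>
#include <asm/msr.h>
/* enable generic counter i and fixed counter j, everything else off */
static void arch_pmu_enable(int i, int j)
{
        u64 ctrl = (1ULL << i)          /* generic counters: bits 0..n-1 */
                 | (1ULL << (32 + j));  /* fixed counters: bits 32+ */
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
}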
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:30 ` Avi Kivity
2010-02-26 13:32 ` Jes Sorensen
@ 2010-02-26 13:44 ` Ingo Molnar
2010-02-26 13:53 ` Avi Kivity
2010-02-28 16:11 ` Joerg Roedel
2010-02-26 14:49 ` Peter Zijlstra
2010-02-26 14:50 ` Peter Zijlstra
3 siblings, 2 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 13:44 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> >
> >>>Firstly, an emulated PMU was only the second-tier option i suggested. By far
> >>>the best approach is native API to the host regarding performance events and
> >>>good guest side integration.
> >>>
> >>>Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> >>>are privileged registers. They can expose sensitive host execution details,
> >>>etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> >>>anyway for a secure solution. (RDPMC can still be supported, but in close
> >>>cooperation with the host)
> >>There is nothing secret in the host PMU, and it's easy to clear out the
> >>counters before passing them off to the guest.
> >That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
> >another CPU, allowing statistical attacks to recover crypto keys. It can be
> >used to sample memory access patterns of another node.
> >
> >There's a good reason PMU configuration registers are privileged and there's
> >good value in only giving a certain sub-set to less privileged entities by
> >default.
>
> Even if there were no security considerations, if the guest can observe host
> data in the pmu, it means the pmu is inaccurate. We should expose guest
> data only in the guest pmu. That's not difficult to do, you stop the pmu on
> exit and swap the counters on context switches.
Again you are making an incorrect assumption: that information leakage via the
PMU only occurs while the host is running on that CPU. It does not - the PMU
can leak general system details _while the guest is running_.
So for this and for the many other reasons we dont want to give a raw PMU to
guests:
- A paravirt event driver is more compatible and more transparent in the long
run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
without having to upgrade the guest OS. Via that a guest OS could even be
live-migrated to a different PMU, without noticing anything about it.
In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
cannot be live-migrated. (save/restore doesnt help)
- It's far cleaner on the host side as well: more granular, per event usage
is possible. The guest can use portion of the PMU (managed by the host),
and the host can use a portion too.
In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
precludes the host OS from running some different piece of instrumentation
at the same time.
- It's more secure: the host can have a finegrained policy about what kinds of
events it exposes to the guest. It might chose to only expose software
events for example.
In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
an all-or-nothing policy affair: either you fully allow the guest (and live
with whatever consequences the piece of hardware that takes up a fair chunk
on the CPU die causes), or you allow none of it.
- A proper paravirt event driver gives more features as well: it can expose
host software events and tracepoints, probes - not restricting itself to
the 'hardware PMU' abstraction.
- There's proper event scheduling and event allocation. Time-slicing, etc.
The thing is, we made quite similar arguments in the past, during the perfmon
vs. perfcounters discussions. There's really a big advantage to proper
abstractions, both on the host and on the guest side.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:28 ` Peter Zijlstra
@ 2010-02-26 13:44 ` Avi Kivity
2010-02-26 13:51 ` Jes Sorensen
1 sibling, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Jes Sorensen, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:28 PM, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:
>
>
>> It would be the other way round - the host would steal the pmu from the
>> guest. Later we can try to time-slice and extrapolate, though that's
>> not going to be easy.
>>
> Right, so perf already does the time slicing and interpolating thing, so
> a soft-pmu gets that for free.
>
True.
> Anyway, this discussion seems somewhat in a stale-mate position.
>
> The KVM folks basically demand a full PMU MSR shadow with PMI
> passthrough so that their $legacy shit works without modification.
>
> My question with that is how $legacy muck can ever know how the current
> PMU works, you can't even properly emulate a core2 pmu on a nehalem
> because intel keeps messing with the event codes for every new model.
>
Right, this is pretty bad. For Windows it's probably acceptable to
upgrade your performance tools (since that's separate from the OS). In
Linux it is integrated into the kernel, and it's fairly unacceptable to
demand a kernel upgrade when your host is upgraded underneath you.
> So basically for this to work means the guest can't run legacy stuff
> anyway, but needs to run very up-to-date software, so we might as well
> create a soft-pmu/paravirt interface now and have all up-to-date
> software support that for the next generation.
>
Still that leaves us with no Windows / non-Linux solution.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:28 ` Peter Zijlstra
2010-02-26 13:44 ` Avi Kivity
@ 2010-02-26 13:51 ` Jes Sorensen
2010-02-26 14:42 ` Peter Zijlstra
1 sibling, 1 reply; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 13:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:28, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:
>
>> It would be the other way round - the host would steal the pmu from the
>> guest. Later we can try to time-slice and extrapolate, though that's
>> not going to be easy.
>
> Right, so perf already does the time slicing and interpolating thing, so
> a soft-pmu gets that for free.
What I don't like here is that without rewriting the guest OS, there
will be two layers of time-slicing and extrapolation. That is going to
make the reported numbers close to useless.
> Anyway, this discussion seems somewhat in a stale-mate position.
>
> The KVM folks basically demand a full PMU MSR shadow with PMI
> passthrough so that their $legacy shit works without modification.
>
> My question with that is how $legacy muck can ever know how the current
> PMU works, you can't even properly emulate a core2 pmu on a nehalem
> because intel keeps messing with the event codes for every new model.
>
> So basically for this to work means the guest can't run legacy stuff
> anyway, but needs to run very up-to-date software, so we might as well
> create a soft-pmu/paravirt interface now and have all up-to-date
> software support that for the next generation.
That is the problem. Today there is a large install base out there of
core2 users who wish to measure their stuff on the hardware they have.
The same will be true for Nehalem based stuff, when whatever replaces
Nehalem comes out and makes that incompatible.
Since we are unable to emulate Core2 on Nehalem, and almost certainly
will be unable to emulate Nehalem on its successor, we are stuck with
this.
A para-virt interface is a nice idea, but since we cannot emulate an
old CPU properly, there isn't much we can do; we are stuck with the same
limitations. I simply fail to see the value of introducing a para-virt
interface for this.
> Furthermore, when KVM doesn't virtualize the physical system topology,
> some PMU features cannot even be sanely used from a vcpu.
That is definitely an issue, and there is nothing we can really do about
that. Having two guests running in parallel under KVM means that they
are going to see more cache misses than they would if they ran barebone
on the hardware.
However even with all of this, we have to keep in mind who is going to
use the performance monitoring in a guest. It is going to be application
writers, mostly people writing analytical/scientific applications. They
rarely have control over the OS they are running on, but are given
systems and told to work on what they are given. Driver upgrades and
things like that don't come quickly. However they also tend to
understand limitations like these and will be able to still benefit from
perf on a system like that.
> So while currently a root user can already tie up all of the pmu using
> perf, simply using that to hand the full pmu off to the guest still
> leaves lots of issues.
Well isn't that the case with the current setup anyway? If enough user
apps start requesting PMU resources, the hw is going to run out of
counters very quickly anyway.
The real issue here IMHO is whether or not it is possible to use a PMU
to count anything on a different CPU. If that is really possible, sharing
the PMU is not an option :(
All that said, what we really want is for Intel+AMD to come up with
proper hw PMU virtualization support that makes it easy to rotate the
full PMU in and out for a guest. Then this whole discussion will become
a non-issue.
Cheers,
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:44 ` Ingo Molnar
@ 2010-02-26 13:53 ` Avi Kivity
2010-02-26 14:12 ` Ingo Molnar
2010-02-28 16:11 ` Joerg Roedel
1 sibling, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:53 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:44 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>> On 02/26/2010 03:06 PM, Ingo Molnar wrote:
>>
>>>
>>>>> Firstly, an emulated PMU was only the second-tier option i suggested. By far
>>>>> the best approach is native API to the host regarding performance events and
>>>>> good guest side integration.
>>>>>
>>>>> Secondly, the PMU cannot be 'given' to the guest in the general case. Those
>>>>> are privileged registers. They can expose sensitive host execution details,
>>>>> etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
>>>>> anyway for a secure solution. (RDPMC can still be supported, but in close
>>>>> cooperation with the host)
>>>>>
>>>> There is nothing secret in the host PMU, and it's easy to clear out the
>>>> counters before passing them off to the guest.
>>>>
>>> That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
>>> another CPU, allowing statistical attacks to recover crypto keys. It can be
>>> used to sample memory access patterns of another node.
>>>
>>> There's a good reason PMU configuration registers are privileged and there's
>>> good value in only giving a certain sub-set to less privileged entities by
>>> default.
>>>
>> Even if there were no security considerations, if the guest can observe host
>> data in the pmu, it means the pmu is inaccurate. We should expose guest
>> data only in the guest pmu. That's not difficult to do, you stop the pmu on
>> exit and swap the counters on context switches.
>>
> Again you are making an incorrect assumption: that information leakage via the
> PMU only occurs while the host is running on that CPU. It does not - the PMU
> can leak general system details _while the guest is running_.
>
You mean like bus transactions on a multicore? Well, we're already
exposed to cache timing attacks.
> So for this and for the many other reasons we dont want to give a raw PMU to
> guests:
>
> - A paravirt event driver is more compatible and more transparent in the long
> run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
> without having to upgrade the guest OS. Via that a guest OS could even be
> live-migrated to a different PMU, without noticing anything about it.
>
What about Windows?
> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
> cannot be live-migrated. (save/restore doesnt help)
>
Why not? So long as the source and destination are compatible?
> - It's far cleaner on the host side as well: more granular, per event usage
> is possible. The guest can use a portion of the PMU (managed by the host),
> and the host can use a portion too.
>
> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> precludes the host OS from running some different piece of instrumentation
> at the same time.
>
Right, time slicing is something we want.
> - It's more secure: the host can have a finegrained policy about what kinds of
> events it exposes to the guest. It might choose to only expose software
> events for example.
>
> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
> an all-or-nothing policy affair: either you fully allow the guest (and live
> with whatever consequences the piece of hardware that takes up a fair chunk
> on the CPU die causes), or you allow none of it.
>
No, we can hide insecure events with a full pmu. Trap the control
register and don't pass it on to the hardware.
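A minimal sketch of that trap-and-filter, with an illustrative allowlist
(bit positions are the architectural PERFEVTSEL layout from the Intel
SDM; the helper names are hypothetical, not KVM's):

/* Sketch: sanitizing a guest write to an IA32_PERFEVTSELx MSR before
 * it reaches real hardware.  Bit positions per the Intel SDM. */
#include <stdint.h>
#include <stdbool.h>

#define EVTSEL_EVENT(v)  ((v) & 0xff)
#define EVTSEL_INT       (1ULL << 20)   /* PMI on overflow        */
#define EVTSEL_ANY       (1ULL << 21)   /* count both SMT threads */

/* Hypothetical policy: events the host is willing to expose. */
static bool event_allowed(uint8_t event)
{
    switch (event) {
    case 0x3c:  /* unhalted core cycles */
    case 0xc0:  /* instructions retired */
    case 0xc4:  /* branches retired     */
        return true;
    default:
        return false;
    }
}

/* Returns the value to program into hardware; 0 disables the counter. */
uint64_t sanitize_guest_evtsel(uint64_t guest_val)
{
    if (!event_allowed(EVTSEL_EVENT(guest_val)))
        return 0;                  /* hide insecure events          */
    guest_val &= ~EVTSEL_ANY;      /* never observe the SMT sibling */
    guest_val &= ~EVTSEL_INT;      /* the host injects PMIs itself  */
    return guest_val;
}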
> - A proper paravirt event driver gives more features as well: it can expose
> host software events and tracepoints, probes - not restricting itself to
> the 'hardware PMU' abstraction.
>
But it is limited to whatever the host stack supports. At least that's
under our control, but things like PEBS will take a ton of work.
> - There's proper event scheduling and event allocation. Time-slicing, etc.
>
>
> The thing is, we made quite similar arguments in the past, during the perfmon
> vs. perfcounters discussions. There's really a big advantage to proper
> abstractions, both on the host and on the guest side.
>
We only control half of the equation. That's very different compared to
tools/perf.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:37 ` Jes Sorensen
@ 2010-02-26 13:55 ` Avi Kivity
2010-02-26 14:27 ` Peter Zijlstra
2010-03-01 18:54 ` Zachary Amsden
1 sibling, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 13:55 UTC (permalink / raw)
To: Jes Sorensen
Cc: Ingo Molnar, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:37 PM, Jes Sorensen wrote:
> On 02/26/10 14:31, Ingo Molnar wrote:
>> You are missing two big things wrt. compatibility here:
>>
>> 1) The first upgrade overhead is a one time overhead only.
>>
>> 2) Once a Linux guest has upgraded, it will work in the future,
>> with _any_
>> future CPU - _without_ having to upgrade the guest!
>>
>> Dont you see the advantage of that? You can instrument an old system
>> on new
>> hardware, without having to upgrade that guest for the new CPU support.
>
> That would only work if you are guaranteed to be able to emulate old
> hardware on new hardware. Not going to be feasible, so then we are in a
> real mess.
>
That actually works on the Intel-only architectural pmu. I'm beginning
to like it more and more.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:40 ` Avi Kivity
@ 2010-02-26 14:01 ` Ingo Molnar
2010-02-26 14:22 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 14:01 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> On 02/26/2010 03:31 PM, Ingo Molnar wrote:
> >* Avi Kivity<avi@redhat.com> wrote:
> >
> >>Or do you mean to define a new, kvm-specific pmu model and feed it off the
> >>host pmu? In this case all the guests will need to be taught about it,
> >>which raises the compatibility problem.
> >You are missing two big things wrt. compatibility here:
> >
> > 1) The first upgrade overhead is a one time overhead only.
>
> May be one too many, for certain guests. Of course it may be argued
> that if the guest wants performance monitoring that much, they will
> upgrade.
Yes, that can certainly be argued.
Note another logical inconsistency: you are assuming reluctance to upgrade for
a set of users who are doing _performance analysis_.
In fact those types of users are amongst the most upgrade-happy. Often they'll
run modern hardware and modern software. Most of the time they are developers
themselves who try to make sure their stuff works on the latest & greatest
hardware _and_ software.
So people running P4's trying to tune their stuff under Red Hat Linux 9 and
trying to use the PMU under KVM is not really a concern rooted overly deeply in
reality.
> Certainly guests that we don't port won't be able to use this. I doubt
> we'll be able to make Windows work with this - the only performance tool I'm
> familiar with on Windows is Intel's VTune, and that's proprietary.
Dont you see the extreme irony of your wish to limit Linux kernel design
decisions and features based on ... Windows and other proprietary software?
> > 2) Once a Linux guest has upgraded, it will work in the future, with _any_
> > future CPU - _without_ having to upgrade the guest!
> >
> >Dont you see the advantage of that? You can instrument an old system on new
> >hardware, without having to upgrade that guest for the new CPU support.
>
> That also works for the architectural pmu, of course that's Intel
> only. And there you don't need to upgrade the guest even once.
Besides being Intel only, it only exposes a limited sub-set of hw events. (far
fewer than the generic ones offered by perf events)
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:27 ` Ingo Molnar
2010-02-26 13:33 ` Avi Kivity
@ 2010-02-26 14:07 ` Jes Sorensen
2010-02-26 14:11 ` Avi Kivity
1 sibling, 1 reply; 99+ messages in thread
From: Jes Sorensen @ 2010-02-26 14:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/10 14:27, Ingo Molnar wrote:
>
> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>> You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
>> whereas Nehalem and Atom are v3 if I remember correctly. [...]
>
> Of course you can emulate a good portion of it, as long as there's perf
> support on the host side for P4.
Actually P4 is pretty uninteresting in this discussion due to the lack
of VMX support; it's the same issue for Nehalem vs Core2. The problem
is the same though: we cannot tell the guest "yes, P4 has this event"
while actually feeding it bogus data.
> If the guest programs a cachemiss event, you program a cachemiss perf event on
> the host and feed its values to the emulated MSR state. You _dont_ program the
> raw PMU on the host side - just use the API i outlined to get struct
> perf_event.
>
> The emulation wont be perfect: not all events will count and not all events
> will be available in a P4 (and some Core2 events might not even make sense in
> a P4), but that is reality as well: often documented events dont count, and
> often non-documented events count.
>
> What matters to 99.9% of people who actually use this stuff is a few core sets
> of events - which are available in P4s and in Core2 as well. Cycles,
> instructions, branches, maybe cache-misses. Sometimes FPU stuff.
I really do not like to make guesses about how people use this stuff.
The things you and I look for as kernel hackers are often very different
from what application authors look for and use. That is one thing I
learned from being exposed to strange Fortran programmers at SGI.
It makes me very uncomfortable telling a guest OS that we offer features
X, Y, Z and then start lying, feeding back numbers that do not match what
was requested, with no way to tell the guest that.
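For concreteness, the mapping quoted above would look roughly like the
following on the host side - a kernel-side sketch in which the vcpu
plumbing is hypothetical, and perf_event_create_kernel_counter() is
shown with its signature as in recent kernels (it has varied over time):

/* Sketch: back a guest counter MSR with a host perf event instead of
 * programming raw hardware.  Only illustrative translations shown. */
#include <linux/perf_event.h>
#include <linux/sched.h>

static struct perf_event *emulate_guest_counter(u64 guest_evtsel)
{
    struct perf_event_attr attr = {
        .size = sizeof(attr),
    };

    switch (guest_evtsel & 0xff) {
    case 0x3c:                  /* unhalted core cycles */
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        break;
    case 0xc0:                  /* instructions retired */
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        break;
    default:
        return NULL;            /* not translatable: count nothing */
    }

    /* The event feeds the emulated counter MSR; its value is read
     * back on guest RDMSR/RDPMC.  Returns ERR_PTR() on failure. */
    return perf_event_create_kernel_counter(&attr, -1 /* any cpu */,
                                            current, NULL, NULL);
}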
> For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open()
> on the guest side over to the host, transparently, via a paravirt driver.
Paravirt is a nice optimization, but is and will always be an
optimization. The fact of the matter is that the bulk of usage of
virtualization is for running distributions with slow kernel
upgrade rates, like SLES and RHEL, and other proprietary operating
systems which we have no control over. Para-virt will do little good for
either of these groups.
Cheers,
Jes
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:07 ` Jes Sorensen
@ 2010-02-26 14:11 ` Avi Kivity
0 siblings, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 14:11 UTC (permalink / raw)
To: Jes Sorensen
Cc: Ingo Molnar, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:07 PM, Jes Sorensen wrote:
> On 02/26/10 14:27, Ingo Molnar wrote:
>>
>> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>>> You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon
>>> v2,
>>> whereas Nehalem and Atom are v3 if I remember correctly. [...]
>>
>> Of course you can emulate a good portion of it, as long as there's perf
>> support on the host side for P4.
>
> Actually P4 is pretty uninteresting in this discussion due to the lack
> of VMX support; it's the same issue for Nehalem vs Core2. The problem
> is the same though: we cannot tell the guest "yes, P4 has this event"
> while actually feeding it bogus data.
The Pentium D, which is a P4 derivative, has vmx support. However it is
so slow that I'm fine with ignoring it for this feature.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:53 ` Avi Kivity
@ 2010-02-26 14:12 ` Ingo Molnar
2010-02-26 14:53 ` Avi Kivity
2010-02-28 16:31 ` Joerg Roedel
0 siblings, 2 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 14:12 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> On 02/26/2010 03:44 PM, Ingo Molnar wrote:
> >* Avi Kivity<avi@redhat.com> wrote:
> >
> >>On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> >>>>>Firstly, an emulated PMU was only the second-tier option i suggested. By far
> >>>>>the best approach is native API to the host regarding performance events and
> >>>>>good guest side integration.
> >>>>>
> >>>>>Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> >>>>>are privileged registers. They can expose sensitive host execution details,
> >>>>>etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> >>>>>anyway for a secure solution. (RDPMC can still be supported, but in close
> >>>>>cooperation with the host)
> >>>>There is nothing secret in the host PMU, and it's easy to clear out the
> >>>>counters before passing them off to the guest.
> >>>That's wrong. On some CPUs the host PMU can be used to, say, sample aspects of
> >>>another CPU, allowing statistical attacks to recover crypto keys. It can be
> >>>used to sample memory access patterns of another node.
> >>>
> >>>There's a good reason PMU configuration registers are privileged and there's
> >>>good value in only giving a certain sub-set to less privileged entities by
> >>>default.
> >>Even if there were no security considerations, if the guest can observe host
> >>data in the pmu, it means the pmu is inaccurate. We should expose guest
> >>data only in the guest pmu. That's not difficult to do, you stop the pmu on
> >>exit and swap the counters on context switches.
> >Again you are making an incorrect assumption: that information leakage via the
> >PMU only occurs while the host is running on that CPU. It does not - the PMU
> >can leak general system details _while the guest is running_.
>
> You mean like bus transactions on a multicore? Well, we're already
> exposed to cache timing attacks.
If you give a full PMU to a guest it's a whole different dimension and quality
of information. Literally hundreds of different events about all sorts of
aspects of the CPU and the hardware in general.
> >So for this and for the many other reasons we dont want to give a raw PMU to
> >guests:
> >
> > - A paravirt event driver is more compatible and more transparent in the long
> > run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
> > without having to upgrade the guest OS. Via that a guest OS could even be
> > live-migrated to a different PMU, without noticing anything about it.
>
> What about Windows?
What is your question? Why should i limit Linux kernel design decisions based
on any aspect of Windows? You might want to support it, but _please_ dont let
the design be dictated by it ...
> > In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> > always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
> > cannot be live-migrated. (save/restore doesnt help)
>
> Why not? So long as the source and destination are compatible?
'As long as it works' is certainly a good enough filter for quality ;-)
> > - It's far cleaner on the host side as well: more granular, per event usage
> > is possible. The guest can use a portion of the PMU (managed by the host),
> > and the host can use a portion too.
> >
> > In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> > precludes the host OS from running some different piece of instrumentation
> > at the same time.
>
> Right, time slicing is something we want.
>
> > - It's more secure: the host can have a finegrained policy about what kinds of
> > events it exposes to the guest. It might choose to only expose software
> > events for example.
> >
> > In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
> > an all-or-nothing policy affair: either you fully allow the guest (and live
> > with whatever consequences the piece of hardware that takes up a fair chunk
> > on the CPU die causes), or you allow none of it.
>
> No, we can hide insecure events with a full pmu. Trap the control register
> and don't pass it on to the hardware.
So you basically concede partial emulation ...
> > - A proper paravirt event driver gives more features as well: it can expose
> > host software events and tracepoints, probes - not restricting itself to
> > the 'hardware PMU' abstraction.
>
> But it is limited to whatever the host stack supports. At least
> that's under our control, but things like PEBS will take a ton of work.
PEBS support is being implemented for perf, as a transparent feature. So once
it's available, PEBS support will magically improve the quality of guest OS
samples, if a paravirt driver approach is used and if sys_perf_event_open() is
taught about that driver. Without any other change needed on the guest side.
> > - There's proper event scheduling and event allocation. Time-slicing, etc.
> >
> >
> > The thing is, we made quite similar arguments in the past, during the
> > perfmon vs. perfcounters discussions. There's really a big advantage to
> > proper abstractions, both on the host and on the guest side.
>
> We only control half of the equation. That's very different compared to
> tools/perf.
You mean Windows?
For heaven's sake, why dont you think like Linus thought 20 years ago. To the
hell with Windows suckiness and lets make sure our stuff works well. Then the
users will come, developers will come, and people will profile Linux under
Linux and maybe the tools will be so good that they'll profile under Linux
using Wine just to be able to use those good tools...
If you gut Linux capabilities like that to accommodate the suckiness of
Windows, without giving a technological edge to Linux, then we are bound
to fail in the long run ...
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:01 ` Ingo Molnar
@ 2010-02-26 14:22 ` Avi Kivity
2010-02-26 14:37 ` Ingo Molnar
0 siblings, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 14:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:01 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>> On 02/26/2010 03:31 PM, Ingo Molnar wrote:
>>
>>> * Avi Kivity<avi@redhat.com> wrote:
>>>
>>>
>>>> Or do you mean to define a new, kvm-specific pmu model and feed it off the
>>>> host pmu? In this case all the guests will need to be taught about it,
>>>> which raises the compatibility problem.
>>>>
>>> You are missing two big things wrt. compatibility here:
>>>
>>> 1) The first upgrade overhead is a one time overhead only.
>>>
>> May be one too many, for certain guests. Of course it may be argued
>> that if the guest wants performance monitoring that much, they will
>> upgrade.
>>
> Yes, that can certainly be argued.
>
> Note another logical inconsistency: you are assuming reluctance to upgrade for
> a set of users who are doing _performance analysis_.
>
> In fact those types of users are amongst the most upgrade-happy. Often they'll
> run modern hardware and modern software. Most of the time they are developers
> themselves who try to make sure their stuff works on the latest & greatest
> hardware _and_ software.
>
I wouldn't go that far, but I agree there is less resistance to change
here. A Windows user certainly ought to be willing to install a new
VTune release, and a RHEL user can be convinced to upgrade from (say)
5.4 to 5.6 with new backported paravirt pmu support.
I wouldn't like to force them to upgrade to 2.6.3x though. Many of
those users will be developers of in-house applications who are trying
to understand their applications under production loads.
>> Certainly guests that we don't port won't be able to use this. I doubt
>> we'll be able to make Windows work with this - the only performance tool I'm
>> familiar with on Windows is Intel's VTune, and that's proprietary.
>>
> Dont you see the extreme irony of your wish to limit Linux kernel design
> decisions and features based on ... Windows and other proprietary software?
>
Not at all. Virtualization is a hardware compatibility game. To see
what happens if you don't play it, see Xen. Eventually they too
implemented hardware support even though the pv approach is so wonderful.
If we go the pv route, we'll limit the usefulness of Linux in this
scenario to a subset of guests. Users will simply walk away and choose
a hypervisor whose authors have less interest in irony and more in
providing the features they want.
A pv approach can come after we have a baseline that is useful to all users.
>>> 2) Once a Linux guest has upgraded, it will work in the future, with _any_
>>> future CPU - _without_ having to upgrade the guest!
>>>
>>> Dont you see the advantage of that? You can instrument an old system on new
>>> hardware, without having to upgrade that guest for the new CPU support.
>>>
>> That also works for the architectural pmu, of course that's Intel
>> only. And there you don't need to upgrade the guest even once.
>>
> Besides being Intel only, it only exposes a limited sub-set of hw events. (far
> fewer than the generic ones offered by perf events)
>
>
Things aren't mutually exclusive. Offer the arch pmu for maximum future
compatibility (Intel only, alas), the full pmu for maximum features, and
the pv pmu for flexibility.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:55 ` Avi Kivity
@ 2010-02-26 14:27 ` Peter Zijlstra
2010-02-26 14:54 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 14:27 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
>
> That actually works on the Intel-only architectural pmu. I'm beginning
> to like it more and more.
Only for the arch defined events, all _7_ of them.
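For reference, the seven architectural events (event select / umask
pairs per the Intel SDM's architectural perfmon chapter):

/* The seven Intel architectural perfmon events; per-event availability
 * is advertised in the CPUID.0AH:EBX bit vector. */
enum arch_perfmon_event {
    ARCH_UNHALTED_CORE_CYCLES,   /* evsel 0x3c, umask 0x00 */
    ARCH_INSTRUCTIONS_RETIRED,   /* evsel 0xc0, umask 0x00 */
    ARCH_UNHALTED_REF_CYCLES,    /* evsel 0x3c, umask 0x01 */
    ARCH_LLC_REFERENCES,         /* evsel 0x2e, umask 0x4f */
    ARCH_LLC_MISSES,             /* evsel 0x2e, umask 0x41 */
    ARCH_BRANCHES_RETIRED,       /* evsel 0xc4, umask 0x00 */
    ARCH_BRANCH_MISSES_RETIRED,  /* evsel 0xc5, umask 0x00 */
};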
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:22 ` Avi Kivity
@ 2010-02-26 14:37 ` Ingo Molnar
2010-02-26 16:03 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-02-26 14:37 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Fr??d??ric Weisbecker, Arnaldo Carvalho de Melo
* Avi Kivity <avi@redhat.com> wrote:
> >> Certainly guests that we don't port won't be able to use this. I doubt
> >> we'll be able to make Windows work with this - the only performance tool I'm
> >> familiar with on Windows is Intel's VTune, and that's proprietary.
> >
> > Dont you see the extreme irony of your wish to limit Linux kernel design
> > decisions and features based on ... Windows and other proprietary
> > software?
>
> Not at all. Virtualization is a hardware compatibility game. To see what
> happens if you don't play it, see Xen. Eventually they too implemented
> hardware support even though the pv approach is so wonderful.
That's not quite equivalent though.
KVM used to be the clean, integrate-code-with-Linux virtualization approach,
designed specifically for CPUs that can be virtualized properly. (VMX support
first, then SVM, etc.)
KVM virtualized ages-old concepts with relatively straightforward hardware
ABIs: x86 execution, IRQ abstractions, device abstractions, etc.
Now you are in essence turning that all around:
- the PMU is by no means properly virtualized nor really virtualizable by
direct access. There's no virtual PMU that ticks independently of the host
PMU.
- the PMU hardware itself is not a well standardized piece of hardware. It's
very vendor dependent and very limiting.
So to some degree you are playing the role of Xen in this specific affair. You
are pushing for something that shouldnt be done in that form. You want to
interfere with the host PMU by going via the fast & easy short-term hack to
just let the guest OS have the PMU, without any regard to how this impacts
long-term feasible solutions.
I.e. you are a bit like the guy who would have told Linus in 1994:
" Dude, why dont you use the Windows APIs? It's far more compatible and
that's the only way you could run any serious apps. Besides, it requires
no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our
installed base after all. "
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:51 ` Jes Sorensen
@ 2010-02-26 14:42 ` Peter Zijlstra
2010-03-08 18:14 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 14:42 UTC (permalink / raw)
To: Jes Sorensen
Cc: Avi Kivity, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 14:51 +0100, Jes Sorensen wrote:
>
> > Furthermore, when KVM doesn't virtualize the physical system topology,
> > some PMU features cannot even be sanely used from a vcpu.
>
> That is definitely an issue, and there is nothing we can really do about
> that. Having two guests running in parallel under KVM means that they
> are going to see more cache misses than they would if they ran directly
> on the hardware.
>
> However even with all of this, we have to keep in mind who is going to
> use the performance monitoring in a guest. It is going to be application
> writers, mostly people writing analytical/scientific applications. They
> rarely have control over the OS they are running on, but are given
> systems and told to work on what they are given. Driver upgrades and
> things like that don't come quickly. However they also tend to
> understand limitations like these and will still be able to benefit from
> perf on a system like that.
What I meant was things like memory controller bound counters: intel
uncore and amd northbridge. Without knowing what node the vcpu got
scheduled to, there is no way the guest can program the raw hardware in
a meaningful way. The amd nb case is particularly awkward: you could
choose not to offer the intel uncore msrs, but the amd nb events are
shadowed over the generic pmcs, so you have no way to filter those out.
The same goes for stuff like the intel ANY flag, LBR filter control and
similar muck; a vcpu can't make use of those things in a meaningful
manner.
Also, the intel debug-store stuff requires a host linear address; again,
not something a vcpu can easily provide (although that might be worked
around with an msr trap, but that still limits you to 1-page data sizes,
a limitation not all software will respect).
> All that said, what we really want is for Intel+AMD to come up with
> proper hw PMU virtualization support that makes it easy to rotate the
> full PMU in and out for a guest. Then this whole discussion will become
> a non-issue.
As it stands there are simply a number of PMU features that defy being
virtualized, because the virt stuff doesn't do system topology.
So even if the vendors were to support a virtualized pmu, it would
likely be a different beast than the native hardware, and it will be
several hardware models in the future; coming up with a paravirt
interface and getting !linux hosts to adapt and !linux guests to use it
is probably just as 'easy'.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:30 ` Avi Kivity
2010-02-26 13:32 ` Jes Sorensen
2010-02-26 13:44 ` Ingo Molnar
@ 2010-02-26 14:49 ` Peter Zijlstra
2010-02-26 14:50 ` Peter Zijlstra
3 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 14:49 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Jes Sorensen, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
>
> Even if there were no security considerations, if the guest can observe
> host data in the pmu, it means the pmu is inaccurate. We should expose
> guest data only in the guest pmu. That's not difficult to do, you stop
> the pmu on exit and swap the counters on context switches.
That's not enough; memory-node-wide counters are impossible to isolate
like that, and the same goes for core-wide (ANY flag) counters.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:30 ` Avi Kivity
` (2 preceding siblings ...)
2010-02-26 14:49 ` Peter Zijlstra
@ 2010-02-26 14:50 ` Peter Zijlstra
3 siblings, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 14:50 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Jes Sorensen, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
>
> Scheduling at event granularity would be a good thing. However we need
> to be able to handle the guest using the full pmu.
Does the full PMU include things like LBR, PEBS and uncore? In that
case, there is no way you're going to get that properly and securely
virtualized by using raw access.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:12 ` Ingo Molnar
@ 2010-02-26 14:53 ` Avi Kivity
2010-02-26 15:14 ` Peter Zijlstra
2010-02-28 16:31 ` Joerg Roedel
1 sibling, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 14:53 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:12 PM, Ingo Molnar wrote:
>>>
>>> Again you are making an incorrect assumption: that information leakage via the
>>> PMU only occurs while the host is running on that CPU. It does not - the PMU
>>> can leak general system details _while the guest is running_.
>>>
>> You mean like bus transactions on a multicore? Well, we're already
>> exposed to cache timing attacks.
>>
> If you give a full PMU to a guest it's a whole different dimension and quality
> of information. Literally hundreds of different events about all sorts of
> aspects of the CPU and the hardware in general.
>
Well, we filter out the bad events then.
>>> So for this and for the many other reasons we dont want to give a raw PMU to
>>> guests:
>>>
>>> - A paravirt event driver is more compatible and more transparent in the long
>>> run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
>>> without having to upgrade the guest OS. Via that a guest OS could even be
>>> live-migrated to a different PMU, without noticing anything about it.
>>>
>> What about Windows?
>>
> What is your question? Why should i limit Linux kernel design decisions based
> on any aspect of Windows? You might want to support it, but _please_ dont let
> the design be dictated by it ...
>
In our case the quality of implementation is judged by how well we
support workloads that users run, and that means we have to support
Windows well. And that more or less means we can't have a pv-only pmu.
Which part of this do you disagree with?
>>> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
>>> always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
>>> cannot be live-migrated. (save/restore doesnt help)
>>>
>> Why not? So long as the source and destination are compatible?
>>
> 'As long as it works' is certainly a good enough filter for quality ;-)
>
We already have this. If you expose sse4.2 to the guest, you can't
migrate to a host which doesn't support it. If you expose a Nehalem pmu
to the guest, you can't migrate to a host which doesn't support it. Users and
tools already understand this.
It's true that the pmu case is more difficult since you can't migrate
forwards as well as backwards, but that's life.
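The sse4.2 case is handled today by clamping guest CPUID to the pool's
lowest common denominator; a minimal sketch (the SSE4.2 bit position is
per the Intel SDM, the policy mask is hypothetical):

/* Sketch: clamping guest-visible CPUID.1:ECX features so the guest can
 * migrate to any host in a pool.  The pool mask is illustrative. */
#include <stdint.h>

#define CPUID1_ECX_SSE42  (1u << 20)

static uint32_t clamp_guest_cpuid1_ecx(uint32_t host_ecx,
                                       uint32_t pool_mask)
{
    /* Advertise only what every host in the migration pool supports. */
    return host_ecx & pool_mask;
}

/* e.g. one pre-Nehalem box in the pool: hide SSE4.2 everywhere.
 *   guest_ecx = clamp_guest_cpuid1_ecx(host_ecx, ~CPUID1_ECX_SSE42);
 */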
>> No, we can hide insecure events with a full pmu. Trap the control register
>> and don't pass it on to the hardware.
>>
> So you basically concede partial emulation ...
>
Yes. Still appears to follow the spec to the guest, though. And with
the option of full emulation for those who need it and sign on the
dotted line.
>>> - There's proper event scheduling and event allocation. Time-slicing, etc.
>>>
>>>
>>> The thing is, we made quite similar arguments in the past, during the
>>> perfmon vs. perfcounters discussions. There's really a big advantage to
>>> proper abstractions, both on the host and on the guest side.
>>>
>> We only control half of the equation. That's very different compared to
>> tools/perf.
>>
> You mean Windows?
>
> For heaven's sake, why dont you think like Linus thought 20 years ago. To the
> hell with Windows suckiness and lets make sure our stuff works well.
In our case, making our stuff work well means making sure guests of the
user's choice run well. Not ours. Currently users mostly choose
Windows and Linux, so we have to make them both work.
(btw, the analogy would be, 'To hell with Unix suckiness, let's make
sure our stuff works well'; where Linux reimplemented the Unix APIs,
ensuring source compatibility with applications, kvm reimplements the
hardware interface, ensuring binary compatibility with guests).
> Then the
> users will come, developers will come, and people will profile Linux under
> Linux and maybe the tools will be so good that they'll profile under Linux
> using Wine just to be able to use those good tools...
>
If we don't support Windows well, users will walk away, followed by
starving developers.
> If you gut Linux capabilities like that to accommodate the suckiness of
> Windows, without giving a technological edge to Linux, then we are bound
> to fail in the long run ...
>
I'm all for abusing the tight relationship between Linux-as-a-host and
Linux-as-a-guest to gain an advantage for both. One fruitful area would
be asynchronous page faults, which has the potential to increase memory
overcommit, for example. But first of all we need to make sure that
there is a baseline of support for all commonly used guests.
I think of it this way: once kvm deployment becomes widespread,
Linux-as-a-guest gains an advantage. But in order for kvm deployment to
become widespread, it needs excellent support for all guests users
actually use.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:27 ` Peter Zijlstra
@ 2010-02-26 14:54 ` Avi Kivity
2010-02-26 15:08 ` Peter Zijlstra
0 siblings, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 14:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
>
>> That actually works on the Intel-only architectural pmu. I'm beginning
>> to like it more and more.
>>
> Only for the arch defined events, all _7_ of them.
>
That's 7 more than what we support now, and 7 more than what we can
guarantee without it.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:54 ` Avi Kivity
@ 2010-02-26 15:08 ` Peter Zijlstra
2010-02-26 15:11 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 15:08 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 16:54 +0200, Avi Kivity wrote:
> On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
> > On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
> >
> >> That actually works on the Intel-only architectural pmu. I'm beginning
> >> to like it more and more.
> >>
> > Only for the arch defined events, all _7_ of them.
> >
>
> That's 7 more than what we support now, and 7 more than what we can
> guarantee without it.
Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:08 ` Peter Zijlstra
@ 2010-02-26 15:11 ` Avi Kivity
2010-02-26 15:18 ` Peter Zijlstra
2010-02-26 15:55 ` Peter Zijlstra
0 siblings, 2 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 15:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
>> That's 7 more than what we support now, and 7 more than what we can
>> guarantee without it.
>>
> Again, what windows software uses only those 7? Does it pay to only have
> access to those 7 or does it limit the usability to exactly the same
> subset a paravirt interface would?
>
Good question. Would be interesting to try out VTune with the non-arch
pmu masked out.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:53 ` Avi Kivity
@ 2010-02-26 15:14 ` Peter Zijlstra
2010-02-28 16:34 ` Joerg Roedel
0 siblings, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 15:14 UTC (permalink / raw)
To: Avi Kivity
Cc: Ingo Molnar, Jes Sorensen, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
> > If you give a full PMU to a guest it's a whole different dimension and quality
> > of information. Literally hundreds of different events about all sorts of
> > aspects of the CPU and the hardware in general.
> >
>
> Well, we filter out the bad events then.
Which requires trapping the MSR access, at which point a soft-PMU is
almost there, right?
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:11 ` Avi Kivity
@ 2010-02-26 15:18 ` Peter Zijlstra
2010-02-26 15:55 ` Peter Zijlstra
1 sibling, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 15:18 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
> >> That's 7 more than what we support now, and 7 more than what we can
> >> guarantee without it.
> >>
> > Again, what windows software uses only those 7? Does it pay to only have
> > access to those 7 or does it limit the usability to exactly the same
> > subset a paravirt interface would?
> >
>
> Good question. Would be interesting to try out VTune with the non-arch
> pmu masked out.
From what I understood VTune uses PEBS+LBR, although I suppose they have
simple PMU modes too; I've never actually seen the software.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:11 ` Avi Kivity
2010-02-26 15:18 ` Peter Zijlstra
@ 2010-02-26 15:55 ` Peter Zijlstra
2010-02-26 16:06 ` Avi Kivity
2010-03-01 19:03 ` Zachary Amsden
1 sibling, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-02-26 15:55 UTC (permalink / raw)
To: Avi Kivity
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
> >> That's 7 more than what we support now, and 7 more than what we can
> >> guarantee without it.
> >>
> > Again, what windows software uses only those 7? Does it pay to only have
> > access to those 7 or does it limit the usability to exactly the same
> > subset a paravirt interface would?
> >
>
> Good question. Would be interesting to try out VTune with the non-arch
> pmu masked out.
Also, the ANY bit is part of the intel arch pmu, but you still have to
mask it out.
BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that Windows-oriented would simply
run windows on his desktop and VTune there.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:37 ` Ingo Molnar
@ 2010-02-26 16:03 ` Avi Kivity
2010-02-26 16:07 ` Avi Kivity
0 siblings, 1 reply; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 16:03 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:37 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>
>>>> Certainly guests that we don't port won't be able to use this. I doubt
>>>> we'll be able to make Windows work with this - the only performance tool I'm
>>>> familiar with on Windows is Intel's VTune, and that's proprietary.
>>>>
>>> Dont you see the extreme irony of your wish to limit Linux kernel design
>>> decisions and features based on ... Windows and other proprietary
>>> software?
>>>
>> Not at all. Virtualization is a hardware compatibility game. To see what
>> happens if you don't play it, see Xen. Eventually they to implemented
>> hardware support even though the pv approach is so wonderful.
>>
> That's not quite equivalent though.
>
> KVM used to be the clean, integrate-code-with-Linux virtualization approach,
> designed specifically for CPUs that can be virtualized properly. (VMX support
> first, then SVM, etc.)
>
> KVM virtualized ages-old concepts with relatively straightforward hardware
> ABIs: x86 execution, IRQ abstractions, device abstractions, etc.
>
> Now you are in essence turning that all around:
>
> - the PMU is by no means properly virtualized nor really virtualizable by
> direct access. There's no virtual PMU that ticks independently of the host
> PMU.
>
There are no guest debug registers that can be programmed independently of
the host debug registers, but we manage somehow. It's not perfect, but
better than nothing.
For the common case of host-only or guest-only monitoring, things will
work, perhaps without socket-wide counters in security-conscious
environments. When both are used at the same time, something will have
to give.
> - the PMU hardware itself is not a well standardized piece of hardware. It's
> very vendor dependent and very limiting.
>
That's life. If we force standardization by having a soft pmu, we'll be
very limited as well. If we don't, we reduce hardware independence
which is a strong point of virtualization. Clearly we need to make a
trade-off here.
In favour of hardware dependence is that tools and users are already
used to it. There is also the architectural pmu that can provide a
limited form of hardware independence.
Going pv trades off hardware dependence for software dependence.
Suddenly only guests that you have control over can use the pmu.
> So to some degree you are playing the role of Xen in this specific affair. You
> are pushing for something that shouldnt be done in that form. You want to
> interfere with the host PMU by going via the fast & easy short-term hack to
> just let the guest OS have the PMU, without any regard to how this impacts
> long-term feasible solutions.
>
Maybe. And maybe the vendors will improve virtualization support for
the pmu, rendering the pv approach obsolete on new hardware.
> I.e. you are a bit like the guy who would have told Linus in 1994:
>
> " Dude, why dont you use the Windows APIs? It's far more compatible and
> that's the only way you could run any serious apps. Besides, it requires
> no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our
> installed base after all. "
>
Hey, maybe we'd have significant desktop market share if he'd done this
(though a replay of the wine history is much more likely).
But what are you suggesting? That we make Windows a second class
guest? Most users run a mix of workloads, that will not go down well
with them. The choice is between first-class Windows support vs
becoming a hobby hypervisor.
Let's make a kernel/user analogy again. Would you be in favour of
GPL-only-ing new syscalls, to give open source applications an edge over
proprietary apps (technically known as "crap" among some)?
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:55 ` Peter Zijlstra
@ 2010-02-26 16:06 ` Avi Kivity
2010-03-01 19:03 ` Zachary Amsden
1 sibling, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 16:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 05:55 PM, Peter Zijlstra wrote:
> BTW, just wondering, why would a developer be running VTune in a guest
> anyway? I'd think that a developer that windows oriented would simply
> run windows on his desktop and VTune there.
>
Cloud.
You have an app running somewhere on a cloud, internally or externally
(you may not even know). It's running a production workload and it
isn't doing well. You can't reproduce it on your desktop ("works for
me, now go away"). So you rdesktop to your guest and monitor it.
You can't run anything on the host - you don't have access to it, you
don't know who admins it (it's a program anyway), "the host" doesn't
even exist, the guest moves around whenever the cloud feels like it.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 16:03 ` Avi Kivity
@ 2010-02-26 16:07 ` Avi Kivity
0 siblings, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-02-26 16:07 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Joerg Roedel, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 06:03 PM, Avi Kivity wrote:
Note, I'll be away for a week, so I will not be responsive for a while.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:44 ` Ingo Molnar
2010-02-26 13:53 ` Avi Kivity
@ 2010-02-28 16:11 ` Joerg Roedel
2010-03-01 8:39 ` Ingo Molnar
2010-03-01 8:44 ` Ingo Molnar
1 sibling, 2 replies; 99+ messages in thread
From: Joerg Roedel @ 2010-02-28 16:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
> - A paravirt event driver is more compatible and more transparent in the long
> run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
> without having to upgrade the guest OS. Via that a guest OS could even be
> live-migrated to a different PMU, without noticing anything about it.
>
> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
> cannot be live-migrated. (save/restore doesnt help)
I agree with your arguments; having this soft-pmu for the guest has some
advantages over raw pmu access. It has a lot of advantages if the guest
is migrated between hosts with different hardware.
But I still think we should have both, a soft-pmu and a pmu-emulation
(which is a more accurate term than 'raw guest pmu access') that looks
to the guest like a real hardware pmu would. On a linux host
that is dedicated to executing virtual kvm machines there is little
point in sharing the pmu between guest and host because the host will
probably never use it.
This pmu-emulation will still use the perf-infrastructure for scheduling
and programming the pmu registers and things like that. This could be
used, for example, to emulate 48-bit counters for the guest even if the
host only supports 32-bit counters. We even need the perf
infrastructure when we need to reinject pmu events into the guest.
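A minimal sketch of that width emulation, with hypothetical helper
names: accumulate host counter deltas into a wide software counter and
synthesize the guest overflow from it:

/* Sketch: presenting a 48-bit counter to the guest on top of a
 * narrower host counter.  All helper names are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define GUEST_WIDTH  48
#define GUEST_MASK   ((1ULL << GUEST_WIDTH) - 1)

struct vpmc {
    uint64_t value;       /* emulated guest counter, 48 bits used */
    uint64_t last_host;   /* host counter value at last sync      */
};

/* Call from the host overflow handler or on guest RDPMC/RDMSR exits.
 * Returns true when the emulated counter wrapped, i.e. when a PMI
 * should be injected into the guest. */
bool vpmc_sync(struct vpmc *c, uint64_t host_now, unsigned host_width)
{
    uint64_t host_mask = (host_width < 64) ?
                         ((1ULL << host_width) - 1) : ~0ULL;
    uint64_t delta = (host_now - c->last_host) & host_mask;

    c->last_host = host_now;
    c->value = (c->value + delta) & GUEST_MASK;

    /* Wrap detection: the sum exceeded 48 bits iff the new value is
     * smaller than the delta just added. */
    return delta && c->value < delta;
}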
> - It's more secure: the host can have a finegrained policy about what kinds of
> events it exposes to the guest. It might choose to only expose software
> events for example.
What do you mean by software events?
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:12 ` Ingo Molnar
2010-02-26 14:53 ` Avi Kivity
@ 2010-02-28 16:31 ` Joerg Roedel
1 sibling, 0 replies; 99+ messages in thread
From: Joerg Roedel @ 2010-02-28 16:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, Feb 26, 2010 at 03:12:29PM +0100, Ingo Molnar wrote:
> You mean Windows?
>
> For heaven's sake, why dont you think like Linus thought 20 years ago. To the
> hell with Windows suckiness and lets make sure our stuff works well. Then the
> users will come, developers will come, and people will profile Linux under
> Linux and maybe the tools will be so good that they'll profile under Linux
> using Wine just to be able to use those good tools...
That's not a good comparison. Linux is nothing completely new; it was,
and still is, a new implementation of an existing operating system
concept, and thus at least mostly source-compatible with other operating
systems implementing this concept.
Linux would never have had this success if it were not posix compliant
and able to run applications like X or gcc which were written for other
operating systems.
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:14 ` Peter Zijlstra
@ 2010-02-28 16:34 ` Joerg Roedel
0 siblings, 0 replies; 99+ messages in thread
From: Joerg Roedel @ 2010-02-28 16:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, Ingo Molnar, Jes Sorensen, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Fri, Feb 26, 2010 at 04:14:08PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
> > > If you give a full PMU to a guest it's a whole different dimension and quality
> > > of information. Literally hundreds of different events about all sorts of
> > > aspects of the CPU and the hardware in general.
> > >
> >
> > Well, we filter out the bad events then.
>
> Which requires trapping the MSR access, at which point a soft-PMU is
> almost there, right?
The perfctl msrs need to be trapped anyway. Otherwise the guest could
generate NMIs in host context. But access to the perfctr registers could
be given to the guest.
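A minimal sketch of that split, with hypothetical policy plumbing (MSR
numbers are the architectural Intel ones; KVM's real intercept interface
looks different):

/* Sketch: intercept policy for the two PMU MSR ranges.  MSR numbers
 * are the architectural Intel ones; the msr_policy plumbing is
 * hypothetical, not KVM's actual interface. */
#include <stdint.h>
#include <stdbool.h>

#define MSR_IA32_PERFEVTSEL0  0x186   /* control registers */
#define MSR_IA32_PMC0         0x0c1   /* counter registers */
#define NR_GP_COUNTERS        4

struct msr_policy { uint32_t msr; bool trap_read, trap_write; };

/* p must have room for 2 * NR_GP_COUNTERS entries. */
void init_pmu_msr_policy(struct msr_policy *p)
{
    for (int i = 0; i < NR_GP_COUNTERS; i++) {
        /* Control: trap writes, so the guest can never set the INT
         * bit and fire NMIs while the host still owns the vector. */
        p[i] = (struct msr_policy){
            .msr = MSR_IA32_PERFEVTSEL0 + i,
            .trap_read = false, .trap_write = true,
        };
        /* Counters: reads may go straight to hardware once the
         * counter only ever carries guest-attributed counts. */
        p[NR_GP_COUNTERS + i] = (struct msr_policy){
            .msr = MSR_IA32_PMC0 + i,
            .trap_read = false, .trap_write = true,
        };
    }
}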
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-28 16:11 ` Joerg Roedel
@ 2010-03-01 8:39 ` Ingo Molnar
2010-03-01 8:58 ` Joerg Roedel
2010-03-01 8:44 ` Ingo Molnar
1 sibling, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-03-01 8:39 UTC (permalink / raw)
To: Joerg Roedel
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> > - It's more secure: the host can have a finegrained policy about what kinds of
> > events it exposes to the guest. It might choose to only expose software
> > events for example.
>
> What do you mean by software events?
Things like:
aldebaran:~> perf stat -a sleep 1
Performance counter stats for 'sleep 1':
15995.719133 task-clock-msecs # 15.981 CPUs
5787 context-switches # 0.000 M/sec
210 CPU-migrations # 0.000 M/sec
193909 page-faults # 0.012 M/sec
28704833507 cycles # 1794.532 M/sec (scaled from 78.69%)
14387445668 instructions # 0.501 IPC (scaled from 90.71%)
736644616 branches # 46.053 M/sec (scaled from 90.52%)
695884659 branch-misses # 94.467 % (scaled from 90.70%)
727070678 cache-references # 45.454 M/sec (scaled from 88.11%)
1305560420 cache-misses # 81.619 M/sec (scaled from 52.00%)
1.000942399 seconds time elapsed
These lines:
15995.719133 task-clock-msecs # 15.981 CPUs
5787 context-switches # 0.000 M/sec
210 CPU-migrations # 0.000 M/sec
193909 page-faults # 0.012 M/sec
Are software events of the host - a subset of which could be transparently
exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed
by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt
channel to perf events) we gain a lot more than just raw PMU functionality.
'performance events' are about a lot more than just the PMU, it's a coherent
system health / system events / structured logging framework.
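In perf_event_open(2) terms these are PERF_TYPE_SOFTWARE events; a
minimal sketch of counting one without touching any PMU register:

/* Sketch: counting a software event (context switches) via
 * perf_event_open(2); no PMU hardware is involved at all. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    sleep(1);
    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("context switches: %llu\n", (unsigned long long)count);
    return 0;
}

A paravirt channel could hand exactly this class of event to a guest
even on hardware whose PMU is fully booked.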
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-28 16:11 ` Joerg Roedel
2010-03-01 8:39 ` Ingo Molnar
@ 2010-03-01 8:44 ` Ingo Molnar
2010-03-01 11:11 ` Joerg Roedel
1 sibling, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-03-01 8:44 UTC (permalink / raw)
To: Joerg Roedel
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
> > - A paravirt event driver is more compatible and more transparent in the long
> > run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
> > without having to upgrade the guest OS. Via that a guest OS could even be
> > live-migrated to a different PMU, without noticing anything about it.
> >
> > In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> > always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
> cannot be live-migrated. (save/restore doesn't help)
>
> I agree with your arguments, having this soft-pmu for the guest has some
> advantages over raw pmu access. It has a lot of advantages if the guest
> is migrated between hosts with different hardware.
>
> But I still think we should have both, a soft-pmu and a pmu-emulation (which
> is a more accurate term than 'raw guest pmu access') that looks to the guest
> like a real hardware pmu would. On a linux host that is dedicated
> to executing virtual kvm machines there is little point in sharing the pmu
> between guest and host because the host will probably never use it.
There's a world of difference between "will not use in certain usecases" and
"cannot use at all because we've designed it so". By doing the latter we
guarantee that sane shared usage of the PMU will never occur - which is bad.
Really, similar arguments have been made in the past about different domains
of system usage: "one profiling session per system is more than enough, who
needs transparent, per user profilers", etc. Such restrictions have been
broken through again and again.
Think about it: this whole Linux thing is about 'sharing' resources. That
concept really works and permeates everything we do in the kernel. Yes, it's
somewhat hard for the PMU but we've done it on the host side via perf events
and we really don't want to look back ...
My experience is that once the right profiling/tracing tools are there, people
will use them in every which way. The bigger a box is, the more likely shared
usage will occur - just statistically. Which coincides with KVM's "the bigger
the box, the better for virtualization" general mantra.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 8:39 ` Ingo Molnar
@ 2010-03-01 8:58 ` Joerg Roedel
2010-03-01 9:04 ` Ingo Molnar
0 siblings, 1 reply; 99+ messages in thread
From: Joerg Roedel @ 2010-03-01 8:58 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
> > What do you mean by software events?
>
> Things like:
>
> aldebaran:~> perf stat -a sleep 1
>
> Performance counter stats for 'sleep 1':
>
> 15995.719133 task-clock-msecs # 15.981 CPUs
> 5787 context-switches # 0.000 M/sec
> 210 CPU-migrations # 0.000 M/sec
> 193909 page-faults # 0.012 M/sec
> 28704833507 cycles # 1794.532 M/sec (scaled from 78.69%)
> 14387445668 instructions # 0.501 IPC (scaled from 90.71%)
> 736644616 branches # 46.053 M/sec (scaled from 90.52%)
> 695884659 branch-misses # 94.467 % (scaled from 90.70%)
> 727070678 cache-references # 45.454 M/sec (scaled from 88.11%)
> 1305560420 cache-misses # 81.619 M/sec (scaled from 52.00%)
>
> 1.000942399 seconds time elapsed
>
> These lines:
>
> 15995.719133 task-clock-msecs # 15.981 CPUs
> 5787 context-switches # 0.000 M/sec
> 210 CPU-migrations # 0.000 M/sec
> 193909 page-faults # 0.012 M/sec
>
> Are software events of the host - a subset of which could be transparently
> exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed
> by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt
> channel to perf events) we gain a lot more than just raw PMU functionality.
>
> 'performance events' are about a lot more than just the PMU, it's a coherent
> system health / system events / structured logging framework.
Yeah I know. But these events should be available in the guest already,
no? They don't need any kind of hardware support from the pmu.
A paravirt perf channel from the guest to the host would definitely be a
win. It would be a powerful tool for kvm/linux-guest analysis (e.g.
trace host-kvm and guest-events together on the host).
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 8:58 ` Joerg Roedel
@ 2010-03-01 9:04 ` Ingo Molnar
0 siblings, 0 replies; 99+ messages in thread
From: Ingo Molnar @ 2010-03-01 9:04 UTC (permalink / raw)
To: Joerg Roedel
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
* Joerg Roedel <joro@8bytes.org> wrote:
> On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
> > > What do you mean by software events?
> >
> > Things like:
> >
> > aldebaran:~> perf stat -a sleep 1
> >
> > Performance counter stats for 'sleep 1':
> >
> > 15995.719133 task-clock-msecs # 15.981 CPUs
> > 5787 context-switches # 0.000 M/sec
> > 210 CPU-migrations # 0.000 M/sec
> > 193909 page-faults # 0.012 M/sec
> > 28704833507 cycles # 1794.532 M/sec (scaled from 78.69%)
> > 14387445668 instructions # 0.501 IPC (scaled from 90.71%)
> > 736644616 branches # 46.053 M/sec (scaled from 90.52%)
> > 695884659 branch-misses # 94.467 % (scaled from 90.70%)
> > 727070678 cache-references # 45.454 M/sec (scaled from 88.11%)
> > 1305560420 cache-misses # 81.619 M/sec (scaled from 52.00%)
> >
> > 1.000942399 seconds time elapsed
> >
> > These lines:
> >
> > 15995.719133 task-clock-msecs # 15.981 CPUs
> > 5787 context-switches # 0.000 M/sec
> > 210 CPU-migrations # 0.000 M/sec
> > 193909 page-faults # 0.012 M/sec
> >
> > Are software events of the host - a subset of which could be transparently
> > exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed
> > by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt
> > channel to perf events) we gain a lot more than just raw PMU functionality.
> >
> > 'performance events' are about a lot more than just the PMU, it's a coherent
> > system health / system events / structured logging framework.
>
> Yeah I know. But these events should be available in the guest already, no?
> [...]
How would an old Linux or Windows guest know about them?
Also, even for new-Linux guests, they'd only have access to their own internal
events - not to any host events.
My suggestion (admittedly not explained in any detail) was to allow guest
access to certain _host_ events. I.e. a guest could profile its own impact on
the host (such as VM exits, IO done on the host side, scheduling, etc.),
without it having any (other) privileged access to the host.
This would be a powerful concept: you could profile your guest for host
efficiency, _without_ having access to the host - beyond those events
themselves. (which would be set up in a carefully filtered-to-guest manner.)
> [...] They don't need any kind of hardware support from the pmu. A paravirt
> perf channel from the guest to the host would definitely be a win. It would
> be a powerful tool for kvm/linux-guest analysis (e.g. trace host-kvm and
> guest-events together on the host).
Yeah.
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 8:44 ` Ingo Molnar
@ 2010-03-01 11:11 ` Joerg Roedel
2010-03-01 17:17 ` Peter Zijlstra
0 siblings, 1 reply; 99+ messages in thread
From: Joerg Roedel @ 2010-03-01 11:11 UTC (permalink / raw)
To: Ingo Molnar
Cc: Avi Kivity, Jes Sorensen, KVM General, Peter Zijlstra,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Mon, Mar 01, 2010 at 09:44:50AM +0100, Ingo Molnar wrote:
> There's a world of difference between "will not use in certain usecases" and
> "cannot use at all because we've designed it so". By doing the latter we
> guarantee that sane shared usage of the PMU will never occur - which is bad.
I think we can emulate a real hardware pmu for guests using the perf
infrastructure. The emulation will not be complete but powerful enough
for most usecases. Some steps towards this might be:
1. Enhance perf to count pmu events only when cpu is in guest mode.
2. For every emulated performance counter the guest activates, kvm
allocates a perf_event and configures it for the guest (we may allow
kvm to specify the counter index; the guest would be able to use
rdpmc unintercepted then). Event filtering is also done in this step.
3. Before vmrun the guest activates all its counters; this can fail if
the host uses them or the requested pmc index is not available for some
reason.
4. Some additional magic to reinject pmu events into the guest
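A rough sketch of what step 2 could look like; perf_event_create_kernel_counter()
is the real in-kernel API (signature roughly as of this era), while the kvm_*
names, the overflow callback and the eventsel bit names are assumptions for
illustration:

static int kvm_vpmu_program_counter(struct kvm_pmc *pmc, u64 eventsel)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size   = sizeof(attr);
	attr.type   = PERF_TYPE_RAW;
	/* step 2's event filtering: strip the bits the host manages,
	 * e.g. the enable and interrupt-enable bits of the eventsel
	 * (EVENTSEL_*_BIT are placeholders here) */
	attr.config = eventsel & ~(EVENTSEL_ENABLE_BIT | EVENTSEL_INT_BIT);

	/* step 1 (guest-only counting) would hang off an additional
	 * attribute bit that perf does not have yet */
	pmc->event = perf_event_create_kernel_counter(&attr, -1 /* any cpu */,
						      current->pid,
						      kvm_vpmu_overflow);
	return IS_ERR(pmc->event) ? PTR_ERR(pmc->event) : 0;
}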
> Think about it: this whole Linux thing is about 'sharing' resources. That
> concept really works and permeates everything we do in the kernel. Yes, it's
> somewhat hard for the PMU but we've done it on the host side via perf events
> and we really don't want to look back ...
As I learnt at university, this whole operating system thing is about
managing resource sharing and hardware abstraction ;-)
> My experience is that once the right profiling/tracing tools are there, people
> will use them in every which way. The bigger a box is, the more likely shared
> usage will occur - just statistically. Which coincides with KVM's "the bigger
> the box, the better for virtualization" general mantra.
With the above approach the only point of conflict occurs when the host
wants to monitor the qemu-processes executing the vcpus which want to do
performance monitoring of their own or cpu-wide counting.
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 11:11 ` Joerg Roedel
@ 2010-03-01 17:17 ` Peter Zijlstra
2010-03-01 18:36 ` Joerg Roedel
2010-03-08 10:15 ` Avi Kivity
0 siblings, 2 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-03-01 17:17 UTC (permalink / raw)
To: Joerg Roedel
Cc: Ingo Molnar, Avi Kivity, Jes Sorensen, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
>
> 1. Enhance perf to count pmu events only when cpu is in guest mode.
No enhancements needed, only hardware support - which Intel doesn't
provide, iirc.
> 2. For every emulated performance counter the guest activates, kvm
> allocates a perf_event and configures it for the guest (we may allow
> kvm to specify the counter index; the guest would be able to use
> rdpmc unintercepted then). Event filtering is also done in this step.
rdpmc can never be used unintercepted, for perf might be multiplexing
the actual hw.
> 3. Before vmrun the guest activates all its counters,
Right, this could be used to approximate guest-only counting. I'm
not sure how the OS and USR bits interact with guest stuff - if the PMU
isn't aware of the virtualized priv levels then those will not work as
expected.
> this can fail if
> the host uses them or the requested pmc index is not available for some
> reason.
perf doesn't know about pmc indexes at the interface level, nor is that
needed I think.
> 4. Some additional magic to reinject pmu events into the guest
Right, that is needed, and might be 'interesting' since we get them from
NMI context.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 11:42 ` Ingo Molnar
2010-02-26 11:51 ` Avi Kivity
2010-02-26 12:49 ` Jes Sorensen
@ 2010-03-01 17:22 ` Zachary Amsden
2 siblings, 0 replies; 99+ messages in thread
From: Zachary Amsden @ 2010-03-01 17:22 UTC (permalink / raw)
To: Ingo Molnar
Cc: Jes Sorensen, Avi Kivity, Joerg Roedel, KVM General,
Peter Zijlstra, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 01:42 AM, Ingo Molnar wrote:
> * Jes Sorensen<Jes.Sorensen@redhat.com> wrote:
>
>
>> On 02/26/10 11:44, Ingo Molnar wrote:
>>
>>> Direct access to counters is not something that is a big issue. [ Given that i
>>> sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
>>> is the biggest of performance challenges right now ;-) ]
>>>
>>> By far the biggest instrumentation issue is:
>>>
>>> - availability
>>> - usability
>>> - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards. The same way we dont
>>> want to expose chipsets to the guest to allow them to do RAS. The same way we
>>> dont want to expose most raw PCI devices to guest in general, but have all
>>> these virt driver abstractions.
>>>
>> I have to say I disagree on that. When you run perfmon on a system, it is
>> normally to measure a specific application. You want to see accurate numbers
>> for cache misses, mul instructions or whatever else is selected.
>>
> You can still get those. You can even enable RDPMC access and avoid VM exits.
>
> What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
>
>
>> Emulating the PMU rather than using the real one, makes the numbers far less
>> useful. The most useful way to provide PMU support in a guest is to expose
>> the real PMU and let the guest OS program it.
>>
> Firstly, an emulated PMU was only the second-tier option i suggested. By far
> the best approach is native API to the host regarding performance events and
> good guest side integration.
>
> Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> are privileged registers. They can expose sensitive host execution details,
> etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> anyway for a secure solution. (RDPMC can still be supported, but in close
> cooperation with the host)
>
>
>> We can do this in a reasonable way today, if we allow to take the PMU away
>> from the host, and only let guests access it when it's in use. [...]
>>
> You get my sure-fire NAK for that kind of crap though. Interfering with the
> host PMU and stealing it is not a technical approach that has acceptable
> quality.
>
> You need to integrate it properly so that host PMU functionality still works
> fine. (Within hardware constraints)
>
I have to agree strongly with Ingo here.
If you can't reset, restore or offset the perf counters in hardware,
then you can't expose them to the guest. There is too much rich
information about host state that can be derived and considered an
information leak or covert channel, and you can't allow the guest to
trample host PMU state.
On some architectures, bank-switching these perf counters is possible
since you can read and write the full-size counter MSRs. However, it is
a cumbersome task that must be done at every preemption point. There
are many ways to do it as lazily as possible so that the overhead only
happens in a guest which actively uses the PMU. With careful
bookkeeping, you can even compound the guest PMU counters back into the
host counters if the host is using the PMU.
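One lazy variant, as a sketch with entirely hypothetical names:

struct pmu_bank {
	u64 ctl[NUM_COUNTERS];	/* event-select MSR images */
	u64 ctr[NUM_COUNTERS];	/* full-width counter values */
};

static void pmu_switch_to_guest(struct vcpu *v, struct pmu_bank *host)
{
	if (!v->pmu_in_use)
		return;			/* guest never touched the PMU */
	pmu_save_bank(host);		/* rdmsr the host's live state */
	pmu_load_bank(&v->pmu_bank);	/* wrmsr the guest's values */
}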
Sorting out the details, however, of whom to deliver the PMU exception
to (the host or the guest) when an overflow occurs is a nasty,
ugly dilemma, as is properly programming the counters so that overflow
happens in a controlled fashion when both the host and the guest are
attempting to use this feature. So supporting "step ahead 13
instructions and then give me an interrupt so I can signal my debugger"
simultaneously and correctly in both the host and guest is a very hard
task, perhaps untenable.
Zach
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 17:17 ` Peter Zijlstra
@ 2010-03-01 18:36 ` Joerg Roedel
2010-03-08 10:15 ` Avi Kivity
1 sibling, 0 replies; 99+ messages in thread
From: Joerg Roedel @ 2010-03-01 18:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Avi Kivity, Jes Sorensen, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Fr??d??ric Weisbecker, Arnaldo Carvalho de Melo
On Mon, Mar 01, 2010 at 06:17:40PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
> >
> > 1. Enhance perf to count pmu events only when cpu is in guest mode.
>
> No enhancements needed, only hardware support for Intel doesn't provide
> this iirc.
At least the guest-bit for AMD perfctl registers is not supported yet
;-) Implementing this eliminates the requirement to write the perfctl
msrs on every vmrun and every vmexit. But that's a minor change.
> > 4. Some additional magic to reinject pmu events into the guest
>
> Right, that is needed, and might be 'interesting' since we get them from
> NMI context.
I imagine some kind of callback which sets a flag in the kvm vcpu
structure. Since the NMI already triggered a vmexit the kvm code checks
for this bit on its path to re-entry.
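In code the idea might look roughly like this; the request bit and the
injection helper are made up for illustration:

/* perf overflow callback, runs in NMI context: only set a flag */
static void vpmu_overflow_cb(struct kvm_vcpu *vcpu)
{
	set_bit(KVM_REQ_VPMU_PMI, &vcpu->requests);	/* NMI-safe */
}

/* on the normal re-entry path, after the NMI-induced vmexit */
static void vpmu_flush_pmi(struct kvm_vcpu *vcpu)
{
	if (test_and_clear_bit(KVM_REQ_VPMU_PMI, &vcpu->requests))
		kvm_inject_guest_pmi(vcpu);	/* hypothetical helper */
}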
Joerg
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 13:37 ` Jes Sorensen
2010-02-26 13:55 ` Avi Kivity
@ 2010-03-01 18:54 ` Zachary Amsden
1 sibling, 0 replies; 99+ messages in thread
From: Zachary Amsden @ 2010-03-01 18:54 UTC (permalink / raw)
To: Jes Sorensen
Cc: Ingo Molnar, Avi Kivity, Joerg Roedel, KVM General,
Peter Zijlstra, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Peter Zijlstra, Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 03:37 AM, Jes Sorensen wrote:
> On 02/26/10 14:31, Ingo Molnar wrote:
>> You are missing two big things wrt. compatibility here:
>>
>> 1) The first upgrade overhead a one time overhead only.
>>
>> 2) Once a Linux guest has upgraded, it will work in the future,
>> with _any_
>> future CPU - _without_ having to upgrade the guest!
>>
>> Dont you see the advantage of that? You can instrument an old system
>> on new
>> hardware, without having to upgrade that guest for the new CPU support.
>
> That would only work if you are guaranteed to be able to emulate old
> hardware on new hardware. Not going to be feasible, so then we are in a
> real mess.
>
>> With the 'steal the PMU' messy approach the guest OS has to be
>> upgraded to the
>> new CPU type all the time. Ad infinitum.
>
> The way the Perfmon architecture is specified by Intel, that is what we
> are stuck with. It's not going to be possible via software emulation to
> count cache misses, unless you run it in a micro architecture emulator.
Sure you can count cache misses.
Step 1. Declare KVM to possess a virtual cache heretofore unseen by guest VCPUs.
Step 2. Use micro architecture rules to add to cache misses in an
undefined micro-architecture specific way
Step 3. <censored>
Step 4. PROFIT!
The point being, there are no rules one is required to follow for
architecturally unspecified events. 'Instructions issued' is well defined
architecturally, one of very few such counters, while things like cache
strides and organization are deliberately left to the implementation.
So returning zero is a perfectly valid choice for emulating cache misses.
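As a sketch, an emulated counter read along these lines would be
conforming; the predicate here is hypothetical:

static u64 vpmu_read_counter(struct kvm_pmc *pmc)
{
	/* architecturally unspecified events may legally read as zero
	 * for KVM's never-before-seen virtual cache */
	if (!event_architecturally_defined(pmc->eventsel))
		return 0;
	return pmc->counter;
}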
Zach
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 15:55 ` Peter Zijlstra
2010-02-26 16:06 ` Avi Kivity
@ 2010-03-01 19:03 ` Zachary Amsden
1 sibling, 0 replies; 99+ messages in thread
From: Zachary Amsden @ 2010-03-01 19:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Avi Kivity, Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Gleb Natapov, ming.m.lin, Zhang, Yanmin, Thomas Gleixner,
H. Peter Anvin, Arjan van de Ven, Frédéric Weisbecker,
Arnaldo Carvalho de Melo
On 02/26/2010 05:55 AM, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
>
>> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
>>
>>>> That's 7 more than what we support now, and 7 more than what we can
>>>> guarantee without it.
>>>>
>>>>
>>> Again, what windows software uses only those 7? Does it pay to only have
>>> access to those 7 or does it limit the usability to exactly the same
>>> subset a paravirt interface would?
>>>
>>>
>> Good question. Would be interesting to try out VTune with the non-arch
>> pmu masked out.
>>
> Also, the ANY bit is part of the intel arch pmu, but you still have to
> mask it out.
>
> BTW, just wondering, why would a developer be running VTune in a guest
> anyway? I'd think that a developer that windows oriented would simply
> run windows on his desktop and VTune there.
>
What if you want to run on 10 different variations of Windows 32 / 64 /
server / desktop configurations? Do you maintain 10 installed pieces of
hardware?
A virtual machine is a better solution. And you might want to
performance tune all 10 of those configurations as well. Be nice if it
were possible.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 9:17 ` Ingo Molnar
2010-02-26 10:42 ` Joerg Roedel
@ 2010-03-02 7:09 ` Zhang, Yanmin
2010-03-02 9:36 ` Ingo Molnar
2010-03-02 9:57 ` Peter Zijlstra
1 sibling, 2 replies; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-02 7:09 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel <joro@8bytes.org> wrote:
>
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > >
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > > host.
> > > >
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > > configured to count only when in guest mode. Perf needs to be aware of
> > > > that and fetch the rip from a different place when monitoring a guest.
> >
> > > The idea is we want to measure both host and guest at the same time, and
> > > compare all the hot functions fairly.
> >
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in guest
> > mode or in host mode.
> > But we can check for that in the nmi-protected part of the vmexit path.
>
> As far as instrumentation goes, virtualization is simply another 'PID
> dimension' of measurement.
>
> Today we can isolate system performance measurements/events to the following
> domains:
>
> - per system
> - per cpu
> - per task
>
> ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user'
> domain separation, and we have some ABI details for all that but it's by no
> means complete. Anton is using the PowerPC bits AFAIK, so it already works
> to a certain degree. )
>
> When extending measurements to KVM, we want two things:
>
> - user friendliness: instead of having to check 'ps' and figure out which
> Qemu thread is the KVM thread we want to profile, just give a convenience
> namespace to access guest profiling info. -G ought to map to the first
> currently running KVM guest it can find. (which would match like 90% of the
> cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something
> useful by default the whole effort is for naught.
>
> - Extend core facilities and enable the following measurement dimensions:
>
> host-kernel-space
> host-user-space
> guest-kernel-space
> guest-user-space
>
> on a per guest basis. We want to be able to measure just what the guest
> does, and we want to be able to measure just what the host does.
>
> Some of this the hardware helps us with (say only measuring host kernel
> events is possible), some has to be done by fiddling with event
> enable/disable at vm-exit / vm-entry time.
>
> My suggestion, as always, would be to start very simple and very minimal:
>
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image
> both as a host and as guest (for testing), to not have to deal with the symbol
> space transport problem initially. Enable 'perf kvm record' to only record
> guest events by default. Etc.
>
> This alone will be a quite useful result already - and gives a basis for
> further work. No need to spend months to do the big grand design straight
> away, all of this can be done gradually and in the order of usefulness - and
> you'll always have something that actually works (and helps your other KVM
> projects) along the way.
It took me a couple of hours to read the emails on the topic.
Based on the above idea, I worked out a prototype which is ugly, but does work
with top/record when both guest side and host side use the same kernel image,
with most needed modules compiled directly into the kernel.
The commands are:
perf kvm top
perf kvm record
perf kvm report
They just collect guest kernel hot functions.
>
> [ And, as so often, once you walk that path, that grand scheme you are
> thinking about right now might easily become last year's really bad idea ;-) ]
>
> So please start walking the path and experience the challenges first-hand.
With my patch, I collected dbench data on Nehalem machine (2*4*2 logical cpu).
1) Vanilla host kernel (6G memory):
------------------------------------------------------------------------------------------------------------------------
PerfTop: 15491 irqs/sec kernel:93.6% [1000Hz cycles], (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _______________________________ ________________________________________
99376.00 40.5% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
7019.00 2.9% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
5350.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
5208.00 2.1% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
4484.00 1.8% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
4078.00 1.7% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
3856.00 1.6% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
3485.00 1.4% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
2803.00 1.1% ext3_try_to_allocate /lib/modules/2.6.33-kvmymz/build/vmlinux
2241.00 0.9% __find_get_block /lib/modules/2.6.33-kvmymz/build/vmlinux
1957.00 0.8% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
2) guest os: start one guest os with 4GB memory.
------------------------------------------------------------------------------------------------------------------------
PerfTop: 827 irqs/sec kernel: 0.0% [1000Hz cycles], (all, 16 CPUs)
------------------------------------------------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ _______________________________ ________________________________________
41701.00 28.1% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
33843.00 22.8% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
3278.00 2.2% native_flush_tlb_others /lib/modules/2.6.33-kvmymz/build/vmlinux
3200.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
3009.00 2.0% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
2834.00 1.9% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
1965.00 1.3% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
1907.00 1.3% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
1790.00 1.2% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
1741.00 1.2% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
With vanilla host kernel, perf top data is stable and spinlock doesn't take too much cpu time.
With guest os, __ticket_spin_lock consumes 28% cpu time, and sometimes it fluctuates between 9%~28%.
Another interesting finding is aim7. If I start aim7 on tmpfs testing in guest os with 1GB memory,
the login hangs and cpu is busy. With the new patch, I could check what happens in guest os, where
spinlock is busy and kernel is shrinking memory mostly from slab.
--- linux-2.6.33/arch/x86/kernel/cpu/perf_event.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kernel/cpu/perf_event.c 2010-03-01 15:57:51.672990615 +0800
@@ -1621,6 +1621,7 @@ static void intel_pmu_drain_bts_buffer(s
struct perf_event_header header;
struct perf_sample_data data;
struct pt_regs regs;
+ int ret;
if (!event)
return;
@@ -1647,7 +1648,9 @@ static void intel_pmu_drain_bts_buffer(s
* We will overwrite the from and to address before we output
* the sample.
*/
- perf_prepare_sample(&header, &data, event, &regs);
+ ret = perf_prepare_sample(&header, &data, event, &regs);
+ if (ret)
+ return;
if (perf_output_begin(&handle, event,
header.size * (top - at), 1, 1))
--- linux-2.6.33/arch/x86/kvm/vmx.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c 2010-03-02 10:21:57.588586179 +0800
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/moduleparam.h>
#include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
#include "kvm_cache_regs.h"
#include "x86.h"
@@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
- (exit_intr_info & INTR_INFO_VALID_MASK))
+ (exit_intr_info & INTR_INFO_VALID_MASK)) {
+ u64 rip = vmcs_readl(GUEST_RIP);
+ int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
+
+#ifdef CONFIG_X86_32
+ user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+ user_mode = !!(user_mode & 3);
+#endif
+ perf_save_virt_ip(user_mode, rip);
asm("int $2");
+ perf_reset_virt_ip();
+ }
idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
--- linux-2.6.33/include/linux/perf_event.h 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/include/linux/perf_event.h 2010-03-02 12:26:15.050947780 +0800
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_PERIOD = 1U << 8,
PERF_SAMPLE_STREAM_ID = 1U << 9,
PERF_SAMPLE_RAW = 1U << 10,
+ PERF_SAMPLE_KVM = 1U << 11,
- PERF_SAMPLE_MAX = 1U << 11, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 12, /* non-ABI */
};
/*
@@ -798,7 +799,7 @@ extern void perf_output_sample(struct pe
struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event);
-extern void perf_prepare_sample(struct perf_event_header *header,
+extern int perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
struct pt_regs *regs);
@@ -858,7 +859,6 @@ extern void perf_bp_event(struct perf_ev
#ifndef perf_misc_flags
#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs) instruction_pointer(regs)
#endif
extern int perf_output_begin(struct perf_output_handle *handle,
@@ -905,6 +905,34 @@ static inline void perf_event_enable(str
static inline void perf_event_disable(struct perf_event *event) { }
#endif
+//#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_PERF_HAS_VIRT_IP)
+#if defined(CONFIG_PERF_EVENTS)
+struct virt_ip_info {
+ int user_mode;
+ u64 ip;
+};
+
+DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
+extern void perf_save_virt_ip(int user_mode, u64 ip);
+extern void perf_reset_virt_ip(void);
+extern int perf_get_virt_user_mode(void);
+static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
+{
+ u64 ip;
+ if (event->attr.sample_type & PERF_SAMPLE_KVM)
+ ip = percpu_read(perf_virt_ip.ip);
+ else
+ ip = instruction_pointer(regs);
+ return ip;
+}
+#else
+static inline void perf_save_virt_ip(int user_mode, u64 ip) { }
+static inline void perf_reset_virt_ip(void) { }
+static inline int perf_get_virt_user_mode(void) { return -1; }
+#define perf_instruction_pointer(event, regs) instruction_pointer(regs)
+#endif
+
+
#define perf_output_put(handle, x) \
perf_output_copy((handle), &(x), sizeof(x))
--- linux-2.6.33/kernel/perf_event.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/kernel/perf_event.c 2010-03-02 12:30:41.236003180 +0800
@@ -3077,7 +3077,38 @@ void perf_output_sample(struct perf_outp
}
}
-void perf_prepare_sample(struct perf_event_header *header,
+//#ifdef CONFIG_PERF_VIRT_IP
+DEFINE_PER_CPU(struct virt_ip_info, perf_virt_ip) = {0,0};
+EXPORT_PER_CPU_SYMBOL(perf_virt_ip);
+
+void perf_save_virt_ip(int user_mode, u64 ip)
+{
+ if (!atomic_read(&nr_events))
+ return;
+ percpu_write(perf_virt_ip.user_mode, user_mode);
+ percpu_write(perf_virt_ip.ip, ip);
+}
+EXPORT_SYMBOL_GPL(perf_save_virt_ip);
+
+void perf_reset_virt_ip(void)
+{
+ if (!percpu_read(perf_virt_ip.ip))
+ return;
+ percpu_write(perf_virt_ip.user_mode, 0);
+ percpu_write(perf_virt_ip.ip, 0);
+}
+EXPORT_SYMBOL_GPL(perf_reset_virt_ip);
+
+int perf_get_virt_user_mode(void)
+{
+ if (!percpu_read(perf_virt_ip.ip))
+ return -1;
+ return percpu_read(perf_virt_ip.user_mode);
+}
+
+//#endif
+
+int perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
struct pt_regs *regs)
@@ -3090,10 +3121,15 @@ void perf_prepare_sample(struct perf_eve
header->size = sizeof(*header);
header->misc = 0;
- header->misc |= perf_misc_flags(regs);
+ if (event->attr.sample_type & PERF_SAMPLE_KVM)
+ header->misc |= percpu_read(perf_virt_ip.user_mode) ? PERF_RECORD_MISC_USER : PERF_RECORD_MISC_KERNEL;
+ else
+ header->misc |= perf_misc_flags(regs);
if (sample_type & PERF_SAMPLE_IP) {
- data->ip = perf_instruction_pointer(regs);
+ data->ip = perf_instruction_pointer(event, regs);
+ if (!data->ip)
+ return -1;
header->size += sizeof(data->ip);
}
@@ -3162,6 +3198,8 @@ void perf_prepare_sample(struct perf_eve
WARN_ON_ONCE(size & (sizeof(u64)-1));
header->size += size;
}
+
+ return 0;
}
static void perf_event_output(struct perf_event *event, int nmi,
@@ -3170,8 +3208,11 @@ static void perf_event_output(struct per
{
struct perf_output_handle handle;
struct perf_event_header header;
+ int ret;
- perf_prepare_sample(&header, data, event, regs);
+ ret = perf_prepare_sample(&header, data, event, regs);
+ if (ret)
+ return;
if (perf_output_begin(&handle, event, header.size, nmi, 1))
return;
--- linux-2.6.33/tools/perf/builtin-record.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-record.c 2010-03-02 13:19:53.564376291 +0800
@@ -251,6 +251,8 @@ static void create_counter(int counter,
PERF_FORMAT_ID;
attr->sample_type |= PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+ if (sample_kvm)
+ attr->sample_type |= PERF_SAMPLE_KVM;
if (freq) {
attr->sample_type |= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/builtin-top.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/builtin-top.c 2010-03-01 16:35:41.972067501 +0800
@@ -1091,6 +1091,8 @@ static void start_counter(int i, int cou
attr = attrs + counter;
attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
+ if (sample_kvm)
+ attr->sample_type |= PERF_SAMPLE_KVM;
if (freq) {
attr->sample_type |= PERF_SAMPLE_PERIOD;
--- linux-2.6.33/tools/perf/perf.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.c 2010-03-02 09:57:03.164001069 +0800
@@ -28,6 +28,8 @@ struct pager_config {
int val;
};
+int sample_kvm = 0;
+
static char debugfs_mntpt[MAXPATHLEN];
static int pager_command_config(const char *var, const char *value, void *data)
@@ -320,6 +322,13 @@ static void handle_internal_command(int
argv[0] = cmd = "help";
}
+ if (argc > 1 && !strcmp(argv[0], "kvm")) {
+ sample_kvm = 1;
+ argv++;
+ argc--;
+ cmd = argv[0];
+ }
+
for (i = 0; i < ARRAY_SIZE(commands); i++) {
struct cmd_struct *p = commands+i;
if (strcmp(p->cmd, cmd))
--- linux-2.6.33/tools/perf/perf.h 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/tools/perf/perf.h 2010-03-01 16:12:42.470082418 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
u64 ips[0];
};
+extern int sample_kvm;
+
#endif
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-02 7:09 ` Zhang, Yanmin
@ 2010-03-02 9:36 ` Ingo Molnar
2010-03-03 3:32 ` Zhang, Yanmin
2010-03-02 9:57 ` Peter Zijlstra
1 sibling, 1 reply; 99+ messages in thread
From: Ingo Molnar @ 2010-03-02 9:36 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
* Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> > My suggestion, as always, would be to start very simple and very minimal:
> >
> > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel
> > image both as a host and as guest (for testing), to not have to deal with
> > the symbol space transport problem initially. Enable 'perf kvm record' to
> > only record guest events by default. Etc.
> >
> > This alone will be a quite useful result already - and gives a basis for
> > further work. No need to spend months to do the big grand design straight
> > away, all of this can be done gradually and in the order of usefulness -
> > and you'll always have something that actually works (and helps your other
> > KVM projects) along the way.
>
> It took me a couple of hours to read the emails on the topic. Based on
> the above idea, I worked out a prototype which is ugly, but does work with
> top/record when both guest side and host side use the same kernel image,
> with most needed modules compiled directly into the kernel.
>
> The commands are:
> perf kvm top
> perf kvm record
> perf kvm report
>
> They just collect guest kernel hot functions.
Fantastic, and there's some really interesting KVM guest/host comparison
profiles you've done with this prototype!
> With my patch, I collected dbench data on Nehalem machine (2*4*2 logical
> cpu).
>
> 1) Vanilla host kernel (6G memory):
> ------------------------------------------------------------------------------------------------------------------------
> PerfTop: 15491 irqs/sec kernel:93.6% [1000Hz cycles], (all, 16 CPUs)
> ------------------------------------------------------------------------------------------------------------------------
>
> samples pcnt function DSO
> _______ _____ _______________________________ ________________________________________
>
> 99376.00 40.5% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> 41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> 7019.00 2.9% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> 5350.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> 5208.00 2.1% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> 4484.00 1.8% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> 4078.00 1.7% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> 3856.00 1.6% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> 3485.00 1.4% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> 2803.00 1.1% ext3_try_to_allocate /lib/modules/2.6.33-kvmymz/build/vmlinux
> 2241.00 0.9% __find_get_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> 1957.00 0.8% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
>
> 2) guest os: start one guest os with 4GB memory.
> ------------------------------------------------------------------------------------------------------------------------
> PerfTop: 827 irqs/sec kernel: 0.0% [1000Hz cycles], (all, 16 CPUs)
> ------------------------------------------------------------------------------------------------------------------------
>
> samples pcnt function DSO
> _______ _____ _______________________________ ________________________________________
>
> 41701.00 28.1% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> 33843.00 22.8% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> 16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> 3278.00 2.2% native_flush_tlb_others /lib/modules/2.6.33-kvmymz/build/vmlinux
> 3200.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> 3009.00 2.0% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> 2834.00 1.9% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> 1965.00 1.3% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> 1907.00 1.3% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> 1790.00 1.2% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> 1741.00 1.2% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
>
>
> With vanilla host kernel, perf top data is stable and spinlock doesn't take
> too much cpu time. With guest os, __ticket_spin_lock consumes 28% cpu time,
> and sometimes it fluctuates between 9%~28%.
Looks quite convenient to be able to profile guest and host from the same
space, right?
Btw, another, convenient way to compare profiles is 'perf diff':
$ perf diff
# Baseline Delta Shared Object Symbol
# ........ .......... ................. ......
#
5.45% +4.31% [kernel.kallsyms] [k] _raw_spin_lock
3.52% +3.74% [kernel.kallsyms] [k] copy_user_generic_string
3.11% +4.08% [kernel.kallsyms] [k] sock_alloc_send_pskb
4.32% +2.62% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
3.34% +2.31% [kernel.kallsyms] [k] __cache_free
1.49% +3.49% [kernel.kallsyms] [k] _raw_read_lock
7.44% -3.06% [kernel.kallsyms] [k] avc_has_perm_noaudit
0.14% +2.49% [kernel.kallsyms] [k] skb_release_head_state
0.22% +2.29% [kernel.kallsyms] [k] vfs_read
1.67% +0.75% [kernel.kallsyms] [k] file_has_perm
0.09% +2.31% [kernel.kallsyms] [k] rw_verify_area
By default it compares the last two profiles done in the current directory, if
you have two separate data files, say perf.data.host and perf.data.guest, you
can do:
perf diff perf.data.host perf.data.guest
To get a host -> guest slowdown comparison.
Another suggestion: you could add --guest / --host convenience flags to 'perf
kvm', to allow for an easy host/guest comparison workflow:
perf kvm record --guest # creates perf.data.guest
perf kvm record --host # creates perf.data.host
perf kvm diff # shortcut for: 'perf diff perf.data.host perf.data.guest'
> Another interesting finding is aim7. If I start aim7 on tmpfs testing in
> guest os with 1GB memory, the login hangs and cpu is busy. With the new
> patch, I could check what happens in guest os, where spinlock is busy and
> kernel is shrinking memory mostly from slab.
This is exactly the kind of usage proper perf events integration would allow!
Your 'perf kvm' looks very powerful, even in its early prototype.
Now, regarding the technical details of your patch:
> +++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c 2010-03-02 10:21:57.588586179 +0800
> @@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
>
> /* We need to handle NMIs before interrupts are enabled */
> if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
> - (exit_intr_info & INTR_INFO_VALID_MASK))
> + (exit_intr_info & INTR_INFO_VALID_MASK)) {
> + u64 rip = vmcs_readl(GUEST_RIP);
> + int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
> +
> +#ifdef CONFIG_X86_32
> + user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
> +#else
> + user_mode = !!(user_mode & 3);
> +#endif
This test could use a helper i guess, to remove the #ifdef?
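Something like this, perhaps; SEGMENT_RPL_MASK and USER_RPL are the
existing x86 constants, though whether one expression covers both
bitnesses cleanly would need checking:

static inline int vmx_guest_user_mode(u16 cs_selector)
{
	/* the RPL sits in the low two bits of CS on 32- and 64-bit alike */
	return (cs_selector & SEGMENT_RPL_MASK) == USER_RPL;
}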
> + perf_save_virt_ip(user_mode, rip);
> asm("int $2");
> + perf_reset_virt_ip();
> + }
> --- linux-2.6.33/include/linux/perf_event.h 2010-02-25 02:52:17.000000000 +0800
> +++ linux-2.6.33_perfkvm/include/linux/perf_event.h 2010-03-02 12:26:15.050947780 +0800
> @@ -125,8 +125,9 @@ enum perf_event_sample_format {
> PERF_SAMPLE_PERIOD = 1U << 8,
> PERF_SAMPLE_STREAM_ID = 1U << 9,
> PERF_SAMPLE_RAW = 1U << 10,
> + PERF_SAMPLE_KVM = 1U << 11,
yep, we can extend it like this, but maybe there's another method:
> +//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
> +#if defined(CONFIG_PERF_EVENTS)
> +struct virt_ip_info {
> + int user_mode;
> + u64 ip;
> +};
basically what we want is a 'this is guest user mode' differentiator in the
call frame stream data, right?
We already have such separators (they're just not used very much):
enum perf_callchain_context {
PERF_CONTEXT_HV = (__u64)-32,
PERF_CONTEXT_KERNEL = (__u64)-128,
PERF_CONTEXT_USER = (__u64)-512,
PERF_CONTEXT_GUEST = (__u64)-2048,
PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
PERF_CONTEXT_GUEST_USER = (__u64)-2560,
PERF_CONTEXT_MAX = (__u64)-4095,
};
Basically KVM's guest context could be expressed as a PERF_CONTEXT_GUEST
separator pushed into the stream (and then recognized by 'perf kvm' - and
generally by 'perf report' et al), followed by the virtual RIP as a regular
IP.
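In the x86 callchain code that could amount to roughly the following
fragment; callchain_store() is the existing helper in
arch/x86/kernel/cpu/perf_event.c, the guest hook around it is the
assumed part:

	/* emit a guest sample into the callchain stream */
	callchain_store(entry, PERF_CONTEXT_GUEST_KERNEL);	/* separator */
	callchain_store(entry, guest_rip);			/* virtual RIP */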
That way we don't need PERF_SAMPLE_KVM. Note that the tooling doesn't know about
such separators very well yet, so there might be some gotchas along the way.
Please let us know if you run into any problems here.
Peter, what's your preference for this KVM profiling ABI detail?
> +DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
> +extern void perf_save_virt_ip(int user_mode, u64 ip);
> +extern void perf_reset_virt_ip(void);
> +extern int perf_get_virt_user_mode(void);
> +static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
> +{
> + u64 ip;
> + if (event->attr.sample_type & PERF_SAMPLE_KVM)
> + ip = percpu_read(perf_virt_ip.ip);
> + else
> + ip = instruction_pointer(regs);
> + return ip;
And this complication we can perhaps avoid by extending the stack-trace engine
(arch/x86/kernel/stacktrace.c) to 'know' about virtual guest RIPs
automatically?
( Note that this would be a bit of an advantage for regular oops printing as
well: if a KVM thread crashes or generates a stack-dump it could
automatically print the guest virtual RIP as well. )
Walking into the guest context is more complex, but not impossible either -
and that bit definitely has to be done not in an NMI context but in the KVM
thread context.
So maybe we should not extend dump_stack_trace() after all (it cannot really
work from NMI or from oops context), but add the KVM variant and let it
directly inject into the perf data stream? (like your patch does it pretty
much)
> +++ linux-2.6.33_perfkvm/tools/perf/perf.c 2010-03-02 09:57:03.164001069 +0800
> + if (argc > 1 && !strcmp(argv[0], "kvm")) {
> + sample_kvm = 1;
> + argv++;
> + argc--;
> + cmd = argv[0];
> + }
this is fine as a quick hack. For the real thing i suspect we want to add
'perf kvm' as a real builtin-kvm.c command - see builtin-sched.c and
builtin-lock.c about how to create such 'subsystem commands'. builtin-record.c
can be extended with various host/guest recording detail switches (off by
default), and builtin-kvm.c can use those. For example builtin-sched.c
implements 'perf sched record' the following way:
static const char *record_args[] = {
"record",
"-a",
"-R",
"-M",
"-f",
"-m", "1024",
"-c", "1",
"-e", "sched:sched_switch:r",
"-e", "sched:sched_stat_wait:r",
"-e", "sched:sched_stat_sleep:r",
"-e", "sched:sched_stat_iowait:r",
"-e", "sched:sched_stat_runtime:r",
"-e", "sched:sched_process_exit:r",
"-e", "sched:sched_process_fork:r",
"-e", "sched:sched_wakeup:r",
"-e", "sched:sched_migrate_task:r",
};
So it simply passes these arguments to perf record. 'perf kvm record' could do
something similar.
Anyway, your patch already shows great progress and it's the kind of direction
for enhanced performance analysis of KVM that i think would be very fruitful
for KVM developers to pursue.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-02 7:09 ` Zhang, Yanmin
2010-03-02 9:36 ` Ingo Molnar
@ 2010-03-02 9:57 ` Peter Zijlstra
1 sibling, 0 replies; 99+ messages in thread
From: Peter Zijlstra @ 2010-03-02 9:57 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Tue, 2010-03-02 at 15:09 +0800, Zhang, Yanmin wrote:
> With vanilla host kernel, perf top data is stable and spinlock doesn't take too much cpu time.
> With guest os, __ticket_spin_lock consumes 28% cpu time, and sometimes it fluctuates between 9%~28%.
>
> Another interesting finding is aim7. If I start aim7 on tmpfs testing in guest os with 1GB memory,
> the login hangs and cpu is busy. With the new patch, I could check what happens in guest os, where
> spinlock is busy and kernel is shrinking memory mostly from slab.
Hehe, you've just discovered the reason for paravirt spinlocks ;-)
But neat stuff, although I don't think you need PERF_SAMPLE_KVM; it
should simply always report the guest sample if it came from the guest.
You can extend PERF_RECORD_MISC_CPUMODE_MASK to add guest states.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-02 9:36 ` Ingo Molnar
@ 2010-03-03 3:32 ` Zhang, Yanmin
2010-03-03 9:27 ` Zhang, Yanmin
0 siblings, 1 reply; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-03 3:32 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
>
> > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
>
> > > My suggestion, as always, would be to start very simple and very minimal:
> > >
> > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel
> > > image both as a host and as guest (for testing), to not have to deal with
> > > the symbol space transport problem initially. Enable 'perf kvm record' to
> > > only record guest events by default. Etc.
> > >
> > > This alone will be a quite useful result already - and gives a basis for
> > > further work. No need to spend months to do the big grand design straight
> > > away, all of this can be done gradually and in the order of usefulness -
> > > and you'll always have something that actually works (and helps your other
> > > KVM projects) along the way.
> >
> > It took me a couple of hours to read the emails on the topic. Based on
> > the above idea, I worked out a prototype which is ugly, but does work with
> > top/record when both guest side and host side use the same kernel image,
> > with most needed modules compiled directly into the kernel.
> >
> > The commands are:
> > perf kvm top
> > perf kvm record
> > perf kvm report
> >
> > They just collect guest kernel hot functions.
>
> Fantastic, and there's some really interesting KVM guest/host comparison
> profiles you've done with this prototype!
>
> > With my patch, I collected dbench data on Nehalem machine (2*4*2 logical
> > cpu).
> >
> > 1) Vanilla host kernel (6G memory):
> > ------------------------------------------------------------------------------------------------------------------------
> > PerfTop: 15491 irqs/sec kernel:93.6% [1000Hz cycles], (all, 16 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > samples pcnt function DSO
> > _______ _____ _______________________________ ________________________________________
> >
> > 99376.00 40.5% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 7019.00 2.9% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 5350.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 5208.00 2.1% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 4484.00 1.8% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 4078.00 1.7% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 3856.00 1.6% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 3485.00 1.4% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 2803.00 1.1% ext3_try_to_allocate /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 2241.00 0.9% __find_get_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 1957.00 0.8% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
> >
> > 2) guest os: start one guest os with 4GB memory.
> > ------------------------------------------------------------------------------------------------------------------------
> > PerfTop: 827 irqs/sec kernel: 0.0% [1000Hz cycles], (all, 16 CPUs)
> > ------------------------------------------------------------------------------------------------------------------------
> >
> > samples pcnt function DSO
> > _______ _____ _______________________________ ________________________________________
> >
> > 41701.00 28.1% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 33843.00 22.8% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 3278.00 2.2% native_flush_tlb_others /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 3200.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 3009.00 2.0% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 2834.00 1.9% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 1965.00 1.3% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 1907.00 1.3% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 1790.00 1.2% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 1741.00 1.2% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
> >
> >
> > With vanilla host kernel, perf top data is stable and spinlock doesn't take
> > too much cpu time. With guest os, __ticket_spin_lock consumes 28% cpu time,
> > and sometimes it fluctuates between 9%~28%.
>
> Looks quite convenient to be able to profile guest and host from the same
> space, right?
>
> Btw, another, convenient way to compare profiles is 'perf diff':
>
> $ perf diff
>
> # Baseline Delta Shared Object Symbol
> # ........ .......... ................. ......
> #
> 5.45% +4.31% [kernel.kallsyms] [k] _raw_spin_lock
> 3.52% +3.74% [kernel.kallsyms] [k] copy_user_generic_string
> 3.11% +4.08% [kernel.kallsyms] [k] sock_alloc_send_pskb
> 4.32% +2.62% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> 3.34% +2.31% [kernel.kallsyms] [k] __cache_free
> 1.49% +3.49% [kernel.kallsyms] [k] _raw_read_lock
> 7.44% -3.06% [kernel.kallsyms] [k] avc_has_perm_noaudit
> 0.14% +2.49% [kernel.kallsyms] [k] skb_release_head_state
> 0.22% +2.29% [kernel.kallsyms] [k] vfs_read
> 1.67% +0.75% [kernel.kallsyms] [k] file_has_perm
> 0.09% +2.31% [kernel.kallsyms] [k] rw_verify_area
>
> By default it compares the last two profiles done in the current directory, if
> you have two separate data files, say perf.data.host and perf.data.guest, you
> can do:
>
> perf diff perf.data.host perf.data.guest
>
> To get a host -> guest slowdown comparison.
Thanks for the good pointer. That would be more user-friendly.
>
> Another suggestion: you could add --guest / --host convenience flag to 'perf
> kvm', to allow for an easy host/guest comparison workflow:
>
> perf kvm record --guest # creates perf.data.guest
> perf kvm record --host # creates perf.data.host
So here we would have a new meaning: --guest collects just guest OS
data and --host just host data.
> perf kvm diff # shortcut for: 'perf diff perf.data.host perf.data.guest'
>
> > Another interesting finding is aim7. If I start aim7 on tmpfs testing in
> > guest os with 1GB memory, the login hangs and cpu is busy. With the new
> > patch, I could check what happens in guest os, where spinlock is busy and
> > kernel is shrinking memory mostly from slab.
>
> This is exactly the kind of usage proper perf events integration would allow!
> Your 'perf kvm' looks very powerful, even in its early prototype.
>
> Now, regarding the technical details of your patch:
>
> > +++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c 2010-03-02 10:21:57.588586179 +0800
>
> > @@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
> >
> > /* We need to handle NMIs before interrupts are enabled */
> > if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
> > - (exit_intr_info & INTR_INFO_VALID_MASK))
> > + (exit_intr_info & INTR_INFO_VALID_MASK)) {
> > + u64 rip = vmcs_readl(GUEST_RIP);
> > + int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
> > +
> > +#ifdef CONFIG_X86_32
> > + user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
> > +#else
> > + user_mode = !!(user_mode & 3);
> > +#endif
>
> This test could use a helper I guess, to remove the #ifdef?
Right.
>
> > + perf_save_virt_ip(user_mode, rip);
> > asm("int $2");
> > + perf_reset_virt_ip();
> > + }
>
> > --- linux-2.6.33/include/linux/perf_event.h 2010-02-25 02:52:17.000000000 +0800
> > +++ linux-2.6.33_perfkvm/include/linux/perf_event.h 2010-03-02 12:26:15.050947780 +0800
> > @@ -125,8 +125,9 @@ enum perf_event_sample_format {
> > PERF_SAMPLE_PERIOD = 1U << 8,
> > PERF_SAMPLE_STREAM_ID = 1U << 9,
> > PERF_SAMPLE_RAW = 1U << 10,
> > + PERF_SAMPLE_KVM = 1U << 11,
>
> yep, we can extend it like this, but maybe there's another method:
Here PERF_SAMPLE_KVM is just used by the perf tool to notify the kernel that we
want to collect KVM guest os data instead of whole-system data. It isn't
used when parsing data. The sys_perf_event_open interface has no flag to indicate
that we want to collect guest os events. Another solution is to
add a new member to perf_event_attr, such as:
typedef enum {
PERF_SAMPLE_HOST,
PERF_SAMPLE_GUEST
} perf_os;
struct perf_event_attr {
...
perf_os os;
...
};
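For illustration, opening a guest-scoped cycles counter could then look
roughly like this, using the perf tool's sys_perf_event_open() wrapper
(just a sketch of the proposal above; the 'os' member does not exist yet):

	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type          = PERF_TYPE_HARDWARE;
	attr.config        = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.os            = PERF_SAMPLE_GUEST;	/* proposed member, not in mainline */

	fd = sys_perf_event_open(&attr, -1 /* pid: any */, 0 /* cpu 0 */, -1, 0);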
>
> > +//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
> > +#if defined(CONFIG_PERF_EVENTS)
> > +struct virt_ip_info {
> > + int user_mode;
> > + u64 ip;
> > +};
>
> basically what we want is a 'this is guest user mode' differentiator in the
> call frame stream data, right?
Here user mode is meant to match PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL.
If we check the current code of the perf tool and the kernel, we
find they use both perf_callchain_context and
PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL. The 1st is used for callchains
and the 2nd for non-callchain samples. As for the callchain, we could use
perf_callchain_context=>PERF_CONTEXT_GUEST_KERNEL.
As for non-callchain samples, the doable solution is just what Peter suggested:
we need to extend PERF_RECORD_MISC_CPUMODE_MASK with new flags such as
PERF_RECORD_MISC_GUEST_KERNEL and PERF_RECORD_MISC_GUEST_USER. The new flags
will be used in perf's function thread__find_addr_location to find the
right map for the guest os.
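Roughly, the lookup would then dispatch on the new cpumode bits, something
like this (untested sketch; guest_kmaps would be a new map_groups added to
the perf session):

	switch (cpumode & PERF_RECORD_MISC_CPUMODE_MASK) {
	case PERF_RECORD_MISC_KERNEL:
		mg = &session->kmaps;		/* host kernel maps */
		break;
	case PERF_RECORD_MISC_GUEST_KERNEL:
		mg = &session->guest_kmaps;	/* guest kernel maps */
		break;
	default:
		break;				/* user/unknown: existing paths */
	}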
>
> We already have such separators (it's just not used very much):
>
> enum perf_callchain_context {
> PERF_CONTEXT_HV = (__u64)-32,
> PERF_CONTEXT_KERNEL = (__u64)-128,
> PERF_CONTEXT_USER = (__u64)-512,
>
> PERF_CONTEXT_GUEST = (__u64)-2048,
> PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
> PERF_CONTEXT_GUEST_USER = (__u64)-2560,
>
> PERF_CONTEXT_MAX = (__u64)-4095,
> };
>
> Basically KVM's guest context could be expressed as a PERF_CONTEXT_GUEST
> separator pushed into the stream (and then recognized by 'perf kvm' - and
> generally by 'perf report' et al), followed by the virtual RIP as a regular
> IP.
Right, when we use it for the callchain.
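Conceptually, the sample stream would then carry something like this
(sketch; callchain_store() is the existing x86 helper, and guest_rip
would be read from the VMCS):

	/* context separator, recognized by the tooling */
	callchain_store(entry, PERF_CONTEXT_GUEST_KERNEL);
	/* the guest virtual RIP, recorded as a regular IP */
	callchain_store(entry, guest_rip);
	/* ...guest stack frames would follow, once we can walk them */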
>
> That way we don't need PERF_SAMPLE_KVM. Note that the tooling doesn't know about
> such separators very well yet, so there might be some gotchas along the way.
> Please let us know if you run into any problems here.
>
> Peter, what's your preference for this KVM profiling ABI detail?
>
> > +DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
> > +extern void perf_save_virt_ip(int user_mode, u64 ip);
> > +extern void perf_reset_virt_ip(void);
> > +extern int perf_get_virt_user_mode(void);
> > +static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
> > +{
> > + u64 ip;
> > + if (event->attr.sample_type & PERF_SAMPLE_KVM)
> > + ip = percpu_read(perf_virt_ip.ip);
> > + else
> > + ip = instruction_pointer(regs);
> > + return ip;
>
> And this complication we can perhaps avoid by extending the stack-trace engine
> (arch/x86/kernel/stacktrace.c) to 'know' about virtual guest RIPs
> automatically?
>
> ( Note that this would be a bit of an advantage for regular oops printing as
> well: if a KVM thread crashes or generates a stack-dump it could
> automatically print the guest virtual RIP as well. )
>
> Walking into the guest context is more complex, but not impossible either -
> and that bit definitely has to be done not in an NMI context but in the KVM
> thread context.
>
> So maybe we should not extend dump_stack_trace() after all (it cannot really
> work from NMI or from oops context), but add the KVM variant and let it
> directly inject into the perf data stream? (like your patch does it pretty
> much)
>
> > +++ linux-2.6.33_perfkvm/tools/perf/perf.c 2010-03-02 09:57:03.164001069 +0800
>
> > + if (argc > 1 && !strcmp(argv[0], "kvm")) {
> > + sample_kvm = 1;
> > + argv++;
> > + argc--;
> > + cmd = argv[0];
> > + }
>
> this is fine as a quick hack. For the real thing I suspect we want to add
> 'perf kvm' as a real builtin-kvm.c command - see builtin-sched.c and
> builtin-lock.c about how to create such 'subsystem commands'. builtin-record.c
> can be extended with various host/guest recording detail switches (off by
> default), and builtin-kvm.c can use those. For example builtin-sched.c
> implements 'perf sched record' the following way:
>
> static const char *record_args[] = {
> "record",
> "-a",
> "-R",
> "-M",
> "-f",
> "-m", "1024",
> "-c", "1",
> "-e", "sched:sched_switch:r",
> "-e", "sched:sched_stat_wait:r",
> "-e", "sched:sched_stat_sleep:r",
> "-e", "sched:sched_stat_iowait:r",
> "-e", "sched:sched_stat_runtime:r",
> "-e", "sched:sched_process_exit:r",
> "-e", "sched:sched_process_fork:r",
> "-e", "sched:sched_wakeup:r",
> "-e", "sched:sched_migrate_task:r",
> };
>
> So it simply passes these arguments to perf record. 'perf kvm record' could do
> something similar.
That's a good pointer. I will try it later.
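E.g. a first cut of 'perf kvm record' could pass something like this
(illustrative only; the final argument set is still to be decided):

	static const char *record_args[] = {
		"record",
		"-a",				/* system-wide collection */
		"-f",				/* overwrite the existing data file */
		"-o", "perf.data.guest",	/* separate output for guest samples */
	};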
>
> Anyway, your patch already shows great progress and it's the kind of direction
> for enhanced performance analysis of KVM that I think would be very fruitful
> to KVM developers to pursue.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-03 3:32 ` Zhang, Yanmin
@ 2010-03-03 9:27 ` Zhang, Yanmin
2010-03-03 10:13 ` Peter Zijlstra
2010-03-03 10:15 ` Peter Zijlstra
0 siblings, 2 replies; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-03 9:27 UTC (permalink / raw)
To: Ingo Molnar
Cc: Joerg Roedel, Jes Sorensen, KVM General, Peter Zijlstra,
Avi Kivity, Zachary Amsden, Gleb Natapov, ming.m.lin
On Wed, 2010-03-03 at 11:32 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> >
> > > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> >
> > > > My suggestion, as always, would be to start very simple and very minimal:
> > > >
> > > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel
> > > > image both as a host and as guest (for testing), to not have to deal with
> > > > the symbol space transport problem initially. Enable 'perf kvm record' to
> > > > only record guest events by default. Etc.
> > > >
> > > > This alone will be a quite useful result already - and gives a basis for
> > > > further work. No need to spend months to do the big grand design straight
> > > > away; all of this can be done gradually and in the order of usefulness -
> > > > and you'll always have something that actually works (and helps your other
> > > > KVM projects) along the way.
> > >
> > > It took me a couple of hours to read the emails on the topic. Based on the
> > > above idea, I worked out a prototype which is ugly, but does work with
> > > top/record when both the guest side and host side use the same kernel image,
> > > with most needed modules compiled directly into the kernel.
> > >
> > > The commands are:
> > > perf kvm top
> > > perf kvm record
> > > perf kvm report
> > >
> > > They just collect guest kernel hot functions.
> >
> > Fantastic, and there's some really interesting KVM guest/host comparison
> > profiles you've done with this prototype!
> >
> > > With my patch, I collected dbench data on Nehalem machine (2*4*2 logical
> > > cpu).
> > >
> > > 1) Vanilla host kernel (6G memory):
> > > ------------------------------------------------------------------------------------------------------------------------
> > > PerfTop: 15491 irqs/sec kernel:93.6% [1000Hz cycles], (all, 16 CPUs)
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > samples pcnt function DSO
> > > _______ _____ _______________________________ ________________________________________
> > >
> > > 99376.00 40.5% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 41239.00 16.8% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 7019.00 2.9% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 5350.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 5208.00 2.1% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 4484.00 1.8% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 4078.00 1.7% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 3856.00 1.6% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 3485.00 1.4% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 2803.00 1.1% ext3_try_to_allocate /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 2241.00 0.9% __find_get_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 1957.00 0.8% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >
> > > 2) guest os: start one guest os with 4GB memory.
> > > ------------------------------------------------------------------------------------------------------------------------
> > > PerfTop: 827 irqs/sec kernel: 0.0% [1000Hz cycles], (all, 16 CPUs)
> > > ------------------------------------------------------------------------------------------------------------------------
> > >
> > > samples pcnt function DSO
> > > _______ _____ _______________________________ ________________________________________
> > >
> > > 41701.00 28.1% __ticket_spin_lock /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 33843.00 22.8% ext3_test_allocatable /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 16862.00 11.4% bitmap_search_next_usable_block /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 3278.00 2.2% native_flush_tlb_others /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 3200.00 2.2% copy_user_generic_string /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 3009.00 2.0% do_get_write_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 2834.00 1.9% journal_dirty_metadata /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 1965.00 1.3% journal_get_undo_access /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 1907.00 1.3% ext3_new_blocks /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 1790.00 1.2% ext3_free_blocks_sb /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 1741.00 1.2% find_revoke_record /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >
> > >
> > > With the vanilla host kernel, perf top data is stable and spinlock doesn't take
> > > too much cpu time. With the guest os, __ticket_spin_lock consumes 28% of cpu time,
> > > and sometimes it fluctuates between 9% and 28%.
> >
> > Looks quite convenient to be able to profile guest and host from the same
> > space, right?
> >
> > Btw, another convenient way to compare profiles is 'perf diff':
> >
> > $ perf diff
> >
> > # Baseline Delta Shared Object Symbol
> > # ........ .......... ................. ......
> > #
> > 5.45% +4.31% [kernel.kallsyms] [k] _raw_spin_lock
> > 3.52% +3.74% [kernel.kallsyms] [k] copy_user_generic_string
> > 3.11% +4.08% [kernel.kallsyms] [k] sock_alloc_send_pskb
> > 4.32% +2.62% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
> > 3.34% +2.31% [kernel.kallsyms] [k] __cache_free
> > 1.49% +3.49% [kernel.kallsyms] [k] _raw_read_lock
> > 7.44% -3.06% [kernel.kallsyms] [k] avc_has_perm_noaudit
> > 0.14% +2.49% [kernel.kallsyms] [k] skb_release_head_state
> > 0.22% +2.29% [kernel.kallsyms] [k] vfs_read
> > 1.67% +0.75% [kernel.kallsyms] [k] file_has_perm
> > 0.09% +2.31% [kernel.kallsyms] [k] rw_verify_area
> >
> > By default it compares the last two profiles done in the current directory. If
> > you have two separate data files, say perf.data.host and perf.data.guest, you
> > can do:
> >
> > perf diff perf.data.host perf.data.guest
> >
> > To get a host -> guest slowdown comparison.
> Thanks for the good pointer. That would be more user-friendly.
>
> >
> > Another suggestion: you could add --guest / --host convenience flags to 'perf
> > kvm', to allow for an easy host/guest comparison workflow:
> >
>
> > perf kvm record --guest # creates perf.data.guest
> > perf kvm record --host # creates perf.data.host
> So here the flags would take on a new meaning: --guest collects just guest os
> data and --host collects just host data.
>
>
> > perf kvm diff # shortcut for: 'perf diff perf.data.host perf.data.guest'
> >
> > > Another interesting finding is aim7. If I start aim7 on tmpfs testing in
> > > guest os with 1GB memory, the login hangs and cpu is busy. With the new
> > > patch, I could check what happens in guest os, where spinlock is busy and
> > > kernel is shrinking memory mostly from slab.
> >
> > This is exactly the kind of usage proper perf events integration would allow!
> > Your 'perf kvm' looks very powerful, even in its early prototype.
> >
> > Now, regarding the technical details of your patch:
> >
> > > +++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c 2010-03-02 10:21:57.588586179 +0800
> >
> > > @@ -3553,8 +3554,19 @@ static void vmx_complete_interrupts(stru
> > >
> > > /* We need to handle NMIs before interrupts are enabled */
> > > if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
> > > - (exit_intr_info & INTR_INFO_VALID_MASK))
> > > + (exit_intr_info & INTR_INFO_VALID_MASK)) {
> > > + u64 rip = vmcs_readl(GUEST_RIP);
> > > + int user_mode = vmcs_read16(GUEST_CS_SELECTOR);
> > > +
> > > +#ifdef CONFIG_X86_32
> > > + user_mode = (user_mode & SEGMENT_RPL_MASK) == USER_RPL;
> > > +#else
> > > + user_mode = !!(user_mode & 3);
> > > +#endif
> >
> > This test could use a helper I guess, to remove the #ifdef?
> Right.
>
> >
> > > + perf_save_virt_ip(user_mode, rip);
> > > asm("int $2");
> > > + perf_reset_virt_ip();
> > > + }
> >
> > > --- linux-2.6.33/include/linux/perf_event.h 2010-02-25 02:52:17.000000000 +0800
> > > +++ linux-2.6.33_perfkvm/include/linux/perf_event.h 2010-03-02 12:26:15.050947780 +0800
> > > @@ -125,8 +125,9 @@ enum perf_event_sample_format {
> > > PERF_SAMPLE_PERIOD = 1U << 8,
> > > PERF_SAMPLE_STREAM_ID = 1U << 9,
> > > PERF_SAMPLE_RAW = 1U << 10,
> > > + PERF_SAMPLE_KVM = 1U << 11,
> >
> > yep, we can extend it like this, but maybe there's another method:
> Here PERF_SAMPLE_KVM is just used by the perf tool to notify the kernel that we
> want to collect KVM guest os data instead of whole-system data. It isn't
> used when parsing data. The sys_perf_event_open interface has no flag to indicate
> that we want to collect guest os events. Another solution is to
> add a new member to perf_event_attr, such as:
> typedef enum {
> PERF_SAMPLE_HOST,
> PERF_SAMPLE_GUEST
> } perf_os;
> struct perf_event_attr {
> ...
> perf_os os;
> ...
> };
>
>
> >
> > > +//#if defined(CONFIG_PERF_EVENTS && CONFIG_PERF_HAS_VIRT_IP)
> > > +#if defined(CONFIG_PERF_EVENTS)
> > > +struct virt_ip_info {
> > > + int user_mode;
> > > + u64 ip;
> > > +};
> >
> > basically what we want is a 'this is guest user mode' differentiator in the
> > call frame stream data, right?
> Here user mode is meant to match PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL.
> If we check the current code of the perf tool and the kernel, we
> find they use both perf_callchain_context and
> PERF_RECORD_MISC_USER/PERF_RECORD_MISC_KERNEL. The 1st is used for callchains
> and the 2nd for non-callchain samples. As for the callchain, we could use
> perf_callchain_context=>PERF_CONTEXT_GUEST_KERNEL.
>
> As for non-callchain samples, the doable solution is just what Peter suggested:
> we need to extend PERF_RECORD_MISC_CPUMODE_MASK with new flags such as
> PERF_RECORD_MISC_GUEST_KERNEL and PERF_RECORD_MISC_GUEST_USER. The new flags
> will be used in perf's function thread__find_addr_location to find the
> right map for the guest os.
>
> >
> > We already have such separators (it's just not used very much):
> >
> > enum perf_callchain_context {
> > PERF_CONTEXT_HV = (__u64)-32,
> > PERF_CONTEXT_KERNEL = (__u64)-128,
> > PERF_CONTEXT_USER = (__u64)-512,
> >
> > PERF_CONTEXT_GUEST = (__u64)-2048,
> > PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
> > PERF_CONTEXT_GUEST_USER = (__u64)-2560,
> >
> > PERF_CONTEXT_MAX = (__u64)-4095,
> > };
> >
> > Basically KVM's guest context could be expressed as a PERF_CONTEXT_GUEST
> > separator pushed into the stream (and then recognized by 'perf kvm' - and
> > generally by 'perf report' et al), followed by the virtual RIP as a regular
> > IP.
> Right, when we use it for the callchain.
>
> >
> > That way we don't need PERF_SAMPLE_KVM. Note that the tooling doesn't know about
> > such separators very well yet, so there might be some gotchas along the way.
> > Please let us know if you run into any problems here.
> >
> > Peter, what's your preference for this KVM profiling ABI detail?
> >
> > > +DECLARE_PER_CPU(struct virt_ip_info, perf_virt_ip);
> > > +extern void perf_save_virt_ip(int user_mode, u64 ip);
> > > +extern void perf_reset_virt_ip(void);
> > > +extern int perf_get_virt_user_mode(void);
> > > +static inline u64 perf_instruction_pointer(struct perf_event *event, struct pt_regs *regs)
> > > +{
> > > + u64 ip;
> > > + if (event->attr.sample_type & PERF_SAMPLE_KVM)
> > > + ip = percpu_read(perf_virt_ip.ip);
> > > + else
> > > + ip = instruction_pointer(regs);
> > > + return ip;
> >
> > And this complication we can perhaps avoid by extending the stack-trace engine
> > (arch/x86/kernel/stacktrace.c) to 'know' about virtual guest RIPs
> > automatically?
> >
> > ( Note that this would be a bit of an advantage for regular oops printing as
> > well: if a KVM thread crashes or generates a stack-dump it could
> > automatically print the guest virtual RIP as well. )
> >
> > Walking into the guest context is more complex, but not impossible either -
> > and that bit definitely has to be done not in an NMI context but in the KVM
> > thread context.
> >
> > So maybe we should not extend dump_stack_trace() after all (it cannot really
> > work from NMI or from oops context), but add the KVM variant and let it
> > directly inject into the perf data stream? (like your patch does it pretty
> > much)
> >
> > > +++ linux-2.6.33_perfkvm/tools/perf/perf.c 2010-03-02 09:57:03.164001069 +0800
> >
> > > + if (argc > 1 && !strcmp(argv[0], "kvm")) {
> > > + sample_kvm = 1;
> > > + argv++;
> > > + argc--;
> > > + cmd = argv[0];
> > > + }
> >
> > this is fine as a quick hack. For the real thing I suspect we want to add
> > 'perf kvm' as a real builtin-kvm.c command - see builtin-sched.c and
> > builtin-lock.c about how to create such 'subsystem commands'. builtin-record.c
> > can be extended with various host/guest recording detail switches (off by
> > default), and builtin-kvm.c can use those. For example builtin-sched.c
> > implements 'perf sched record' the following way:
> >
> > static const char *record_args[] = {
> > "record",
> > "-a",
> > "-R",
> > "-M",
> > "-f",
> > "-m", "1024",
> > "-c", "1",
> > "-e", "sched:sched_switch:r",
> > "-e", "sched:sched_stat_wait:r",
> > "-e", "sched:sched_stat_sleep:r",
> > "-e", "sched:sched_stat_iowait:r",
> > "-e", "sched:sched_stat_runtime:r",
> > "-e", "sched:sched_process_exit:r",
> > "-e", "sched:sched_process_fork:r",
> > "-e", "sched:sched_wakeup:r",
> > "-e", "sched:sched_migrate_task:r",
> > };
> >
> > So it simply passes these arguments to perf record. 'perf kvm record' could do
> > something similar.
> That's a good pointer. I will try it later.
>
> >
> > Anyway, your patch already shows great progress and it's the kind of direction
> > for enhanced performance analysis of KVM that I think would be very fruitful
> > to KVM developers to pursue.
Below is the kernel patch. I'm still working on the userspace perf tool.
The new patch always tries to collect the kvm guest os rip. If it finds the guest os
rip is 0, it uses the interrupted rip instead. We would then filter records out in
user space with the perf tool.
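For example, the filter in the perf tool could be as simple as this
(sketch; not part of the kernel patch below):

	switch (event->header.misc & PERF_RECORD_MISC_CPUMODE_MASK) {
	case PERF_RECORD_MISC_GUEST_KERNEL:
	case PERF_RECORD_MISC_GUEST_USER:
		break;		/* keep guest samples */
	default:
		return 0;	/* drop host samples when only guest data is wanted */
	}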
---
diff -Nraup linux-2.6.33/arch/x86/include/asm/ptrace.h linux-2.6.33_perfkvm/arch/x86/include/asm/ptrace.h
--- linux-2.6.33/arch/x86/include/asm/ptrace.h 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/include/asm/ptrace.h 2010-03-03 14:57:03.792070616 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
#endif
}
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+ return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+ return !!(cs & 3);
+#endif
+}
+
static inline int user_mode_vm(struct pt_regs *regs)
{
#ifdef CONFIG_X86_32
diff -Nraup linux-2.6.33/arch/x86/kvm/vmx.c linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c
--- linux-2.6.33/arch/x86/kvm/vmx.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/arch/x86/kvm/vmx.c 2010-03-03 15:06:01.660057862 +0800
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/moduleparam.h>
#include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
#include "kvm_cache_regs.h"
#include "x86.h"
@@ -3553,8 +3554,14 @@ static void vmx_complete_interrupts(stru
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
- (exit_intr_info & INTR_INFO_VALID_MASK))
+ (exit_intr_info & INTR_INFO_VALID_MASK)) {
+ u64 rip = vmcs_readl(GUEST_RIP);
+ int user_mode;
+ user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+ perf_save_virt_ip(user_mode, rip);
asm("int $2");
+ perf_reset_virt_ip();
+ }
idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
diff -Nraup linux-2.6.33/include/linux/perf_event.h linux-2.6.33_perfkvm/include/linux/perf_event.h
--- linux-2.6.33/include/linux/perf_event.h 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/include/linux/perf_event.h 2010-03-03 15:22:05.325064001 +0800
@@ -287,11 +287,13 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
};
-#define PERF_RECORD_MISC_CPUMODE_MASK (3 << 0)
-#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
+#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
+#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (2 << 0)
#define PERF_RECORD_MISC_HYPERVISOR (3 << 0)
+#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0)
+#define PERF_RECORD_MISC_GUEST_USER (5 << 0)
struct perf_event_header {
__u32 type;
@@ -841,6 +843,37 @@ static inline void perf_event_mmap(struc
__perf_event_mmap(vma);
}
+struct perf_virt_ip_info {
+ int user_mode;
+ u64 ip;
+};
+
+DECLARE_PER_CPU(struct perf_virt_ip_info, perf_virt_ip);
+extern void perf_save_virt_ip(int user_mode, u64 ip);
+extern void perf_reset_virt_ip(void);
+
+static inline u64 perf_instruction_pointer(struct pt_regs *regs)
+{
+ u64 ip;
+ ip = percpu_read(perf_virt_ip.ip);
+ if (!ip)
+ ip = instruction_pointer(regs);
+ else
+ perf_reset_virt_ip();
+ return ip;
+}
+
+static inline unsigned int perf_misc_flags(struct pt_regs *regs)
+{
+ if (percpu_read(perf_virt_ip.ip)) {
+ return percpu_read(perf_virt_ip.user_mode) ?
+ PERF_RECORD_MISC_GUEST_USER :
+ PERF_RECORD_MISC_GUEST_KERNEL;
+ } else
+ return user_mode(regs) ? PERF_RECORD_MISC_USER :
+ PERF_RECORD_MISC_KERNEL;
+}
+
extern void perf_event_comm(struct task_struct *tsk);
extern void perf_event_fork(struct task_struct *tsk);
@@ -855,12 +888,6 @@ extern void perf_tp_event(int event_id,
void *record, int entry_size);
extern void perf_bp_event(struct perf_event *event, void *data);
-#ifndef perf_misc_flags
-#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
- PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs) instruction_pointer(regs)
-#endif
-
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size,
int nmi, int sample);
@@ -895,6 +922,10 @@ perf_sw_event(u32 event_id, u64 nr, int
static inline void
perf_bp_event(struct perf_event *event, void *data) { }
+static inline void perf_save_virt_ip(int user_mode, u64 ip) { }
+static inline void perf_reset_virt_ip(void) { }
+#define perf_instruction_pointer(regs) instruction_pointer(regs)
+
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
static inline void perf_event_comm(struct task_struct *tsk) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
diff -Nraup linux-2.6.33/kernel/perf_event.c linux-2.6.33_perfkvm/kernel/perf_event.c
--- linux-2.6.33/kernel/perf_event.c 2010-02-25 02:52:17.000000000 +0800
+++ linux-2.6.33_perfkvm/kernel/perf_event.c 2010-03-03 15:16:25.521592849 +0800
@@ -3077,6 +3077,26 @@ void perf_output_sample(struct perf_outp
}
}
+DEFINE_PER_CPU(struct perf_virt_ip_info, perf_virt_ip) = {0,0};
+EXPORT_PER_CPU_SYMBOL(perf_virt_ip);
+
+void perf_save_virt_ip(int user_mode, u64 ip)
+{
+ if (!atomic_read(&nr_events))
+ return;
+ percpu_write(perf_virt_ip.user_mode, user_mode);
+ percpu_write(perf_virt_ip.ip, ip);
+}
+EXPORT_SYMBOL_GPL(perf_save_virt_ip);
+
+void perf_reset_virt_ip(void)
+{
+ if (!percpu_read(perf_virt_ip.ip))
+ return;
+ percpu_write(perf_virt_ip.ip, 0);
+}
+EXPORT_SYMBOL_GPL(perf_reset_virt_ip);
+
void perf_prepare_sample(struct perf_event_header *header,
struct perf_sample_data *data,
struct perf_event *event,
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-03 9:27 ` Zhang, Yanmin
@ 2010-03-03 10:13 ` Peter Zijlstra
2010-03-04 0:52 ` Zhang, Yanmin
2010-03-03 10:15 ` Peter Zijlstra
1 sibling, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-03-03 10:13 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> +{
> + u64 ip;
> + ip = percpu_read(perf_virt_ip.ip);
> + if (!ip)
> + ip = instruction_pointer(regs);
> + else
> + perf_reset_virt_ip();
> + return ip;
> +}
> +
> +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> +{
> + if (percpu_read(perf_virt_ip.ip)) {
> + return percpu_read(perf_virt_ip.user_mode) ?
> + PERF_RECORD_MISC_GUEST_USER :
> + PERF_RECORD_MISC_GUEST_KERNEL;
> + } else
> + return user_mode(regs) ? PERF_RECORD_MISC_USER :
> + PERF_RECORD_MISC_KERNEL;
> +}
This encodes the assumption that perf_misc_flags() must only be called
before perf_instruction_pointer(), which is currently true, but you
might want to put a comment nearby to remind us of this.
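Something like this above perf_misc_flags() would do (sketch):

	/*
	 * NOTE: must be called before perf_instruction_pointer(), which
	 * consumes and clears the per-cpu perf_virt_ip state.
	 */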
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-03 9:27 ` Zhang, Yanmin
2010-03-03 10:13 ` Peter Zijlstra
@ 2010-03-03 10:15 ` Peter Zijlstra
2010-03-04 1:00 ` Zhang, Yanmin
1 sibling, 1 reply; 99+ messages in thread
From: Peter Zijlstra @ 2010-03-03 10:15 UTC (permalink / raw)
To: Zhang, Yanmin
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> -#ifndef perf_misc_flags
> -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> - PERF_RECORD_MISC_KERNEL)
> -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> -#endif
Ah, that #ifndef is for powerpc, which I think you just broke.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-03 10:13 ` Peter Zijlstra
@ 2010-03-04 0:52 ` Zhang, Yanmin
0 siblings, 0 replies; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-04 0:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Wed, 2010-03-03 at 11:13 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> > +{
> > + u64 ip;
> > + ip = percpu_read(perf_virt_ip.ip);
> > + if (!ip)
> > + ip = instruction_pointer(regs);
> > + else
> > + perf_reset_virt_ip();
> > + return ip;
> > +}
> > +
> > +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> > +{
> > + if (percpu_read(perf_virt_ip.ip)) {
> > + return percpu_read(perf_virt_ip.user_mode) ?
> > + PERF_RECORD_MISC_GUEST_USER :
> > + PERF_RECORD_MISC_GUEST_KERNEL;
> > + } else
> > + return user_mode(regs) ? PERF_RECORD_MISC_USER :
> > + PERF_RECORD_MISC_KERNEL;
> > +}
>
> This codes in the assumption that perf_misc_flags() must only be called
> before perf_instruction_pointer(), which is currently true, but you
> might want to put a comment near to remind us of this.
I will change the logic to use an explicit reset operation in the caller.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-03 10:15 ` Peter Zijlstra
@ 2010-03-04 1:00 ` Zhang, Yanmin
2010-03-10 9:29 ` Zhang, Yanmin
0 siblings, 1 reply; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-04 1:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin
On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > -#ifndef perf_misc_flags
> > -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > - PERF_RECORD_MISC_KERNEL)
> > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > -#endif
>
> Ah, that #ifndef is for powerpc, which I think you just broke.
Thanks for the reminder. I had deleted the powerpc code when building the
cscope lib.
It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
change them to a struct of callback functions, and kvm registers its version with perf.
Such like:
struct perf_guest_info_callbacks {
int (*is_in_guest)();
u64 (*get_guest_ip)();
int (*copy_guest_stack)();
int (*reset_in_guest)();
...
};
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
It's more scalable and neater.
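KVM would then register its implementation once at init time, e.g.
(sketch; the exact callback names are still tentative):

	static struct perf_guest_info_callbacks kvm_guest_cbs = {
		.is_in_guest	= kvm_is_in_guest,
		.get_guest_ip	= kvm_get_guest_ip,
		/* ... */
	};

	perf_register_guest_info_callbacks(&kvm_guest_cbs);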
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-01 17:17 ` Peter Zijlstra
2010-03-01 18:36 ` Joerg Roedel
@ 2010-03-08 10:15 ` Avi Kivity
1 sibling, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-03-08 10:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Joerg Roedel, Ingo Molnar, Jes Sorensen, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 03/01/2010 07:17 PM, Peter Zijlstra wrote:
>
>
>> 2. For every emulated performance counter the guest activates kvm
>> allocates a perf_event and configures it for the guest (we may allow
>> kvm to specify the counter index, the guest would be able to use
>> rdpmc unintercepted then). Event filtering is also done in this step.
>>
> rdpmc can never be used unintercepted, for perf might be multiplexing
> the actual hw.
>
How often is rdpmc used? If it is invoked on high frequency
software-only events (like context switches), then this may be a
performance issue. If it is only issued on perf interrupts, we may be
able to live with it (since we already took an exit for the interrupt).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-02-26 14:42 ` Peter Zijlstra
@ 2010-03-08 18:14 ` Avi Kivity
0 siblings, 0 replies; 99+ messages in thread
From: Avi Kivity @ 2010-03-08 18:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jes Sorensen, Ingo Molnar, Joerg Roedel, KVM General,
Zachary Amsden, Gleb Natapov, ming.m.lin, Zhang, Yanmin,
Thomas Gleixner, H. Peter Anvin, Arjan van de Ven,
Frédéric Weisbecker, Arnaldo Carvalho de Melo
On 02/26/2010 04:42 PM, Peter Zijlstra wrote:
>
> Also, the intel debugstore thing requires a host linear address,
It requires a linear address, not a host linear address. Of course, it
might not like the linear address mappings changing under its feet. If
it has a private tlb, then this won't work.
> again, not
> something a vcpu can easily provide (although that might be worked
> around with an msr trap, but that still limits you to 1 page data sizes,
> not a limitation all software will respect).
>
If you're willing to pin pages, you can map the guest's buffer. That
won't work if BTS can happen in parallel with a #VMEXIT, or if there are
interactions with npt/ept. Will have to ask the vendors.
>> All that said, what we really want is for Intel+AMD to come up with
>> proper hw PMU virtualization support that makes it easy to rotate the
>> full PMU in and out for a guest. Then this whole discussion will become
>> a non issue.
>>
> As it stands there simply are a number of PMU features that defy being
> virtualized, simply because the virt stuff doesn't do system topology.
> So even if they were to support a virtualized pmu, it would likely be a
> different beast than the native hardware is, and it will be several
> hardware models in the future; coming up with a paravirt interface and
> getting !linux hosts to adapt and !linux guests to use is probably as
> 'easy'.
>
!linux hosts are someone else's problem, but how would we get !linux
guests to use a soft pmu?
The only way I see that happening is if a soft pmu is standardized
across hypervisors, which is unfortunately unlikely.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: KVM PMU virtualization
2010-03-04 1:00 ` Zhang, Yanmin
@ 2010-03-10 9:29 ` Zhang, Yanmin
0 siblings, 0 replies; 99+ messages in thread
From: Zhang, Yanmin @ 2010-03-10 9:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Joerg Roedel, Jes Sorensen, KVM General, Avi Kivity,
Zachary Amsden, Gleb Natapov, ming.m.lin, sheng.yang
On Thu, 2010-03-04 at 09:00 +0800, Zhang, Yanmin wrote:
> On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > > -#ifndef perf_misc_flags
> > > -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > > - PERF_RECORD_MISC_KERNEL)
> > > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > > -#endif
> >
> > Ah, that #ifndef is for powerpc, which I think you just broke.
> Thanks for the reminder. I had deleted the powerpc code when building the
> cscope lib.
>
> It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
> change them to a struct of callback functions, and kvm registers its version with perf.
>
> Such like:
> struct perf_guest_info_callbacks {
> int (*is_in_guest)();
> u64 (*get_guest_ip)();
> int (*copy_guest_stack)();
> int (*reset_in_guest)();
> ...
> };
> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
> int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
>
> It's more scalable and neater.
In case you guys are losing patience, I worked out a new patch against 2.6.34-rc1.
It could work with:
#perf kvm --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules
/guest/os/proc/modules top
It also supports collecting both the host side and guest side at the same time:
#perf kvm --host --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules
/guest/os/proc/modules top
The first output line of top shows the guest kernel/user space percentages.
Or just host side:
#perf kvm --host
As the perf tool source code has changed a lot, I am still working on perf kvm record
and report.
---
diff -Nraup linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h
--- linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h 2010-03-09 13:04:20.730596079 +0800
+++ linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h 2010-03-10 17:06:34.228953260 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
#endif
}
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+ return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+ return !!(cs & 3);
+#endif
+}
+
static inline int user_mode_vm(struct pt_regs *regs)
{
#ifdef CONFIG_X86_32
diff -Nraup linux-2.6.34-rc1/arch/x86/kvm/vmx.c linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c
--- linux-2.6.34-rc1/arch/x86/kvm/vmx.c 2010-03-09 13:04:20.758593132 +0800
+++ linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c 2010-03-10 17:11:49.709019136 +0800
@@ -26,6 +26,7 @@
#include <linux/sched.h>
#include <linux/moduleparam.h>
#include <linux/ftrace_event.h>
+#include <linux/perf_event.h>
#include "kvm_cache_regs.h"
#include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
vmcs_write32(TPR_THRESHOLD, irr);
}
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+ percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+ return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+ int user_mode;
+ user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+ return user_mode;
+}
+
+static u64 kvm_get_guest_ip(void)
+{
+ return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+ if (percpu_read(kvm_in_guest))
+ percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+ .is_in_guest = kvm_is_in_guest,
+ .is_user_mode = kvm_is_user_mode,
+ .get_guest_ip = kvm_get_guest_ip,
+ .reset_in_guest = kvm_reset_in_guest
+};
+
static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
{
u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
- (exit_intr_info & INTR_INFO_VALID_MASK))
+ (exit_intr_info & INTR_INFO_VALID_MASK)) {
+ kvm_set_in_guest();
asm("int $2");
+ kvm_reset_in_guest();
+ }
idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
if (bypass_guest_pf)
kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
+ perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
return 0;
out3:
@@ -4266,6 +4309,8 @@ out:
static void __exit vmx_exit(void)
{
+ perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
free_page((unsigned long)vmx_msr_bitmap_legacy);
free_page((unsigned long)vmx_msr_bitmap_longmode);
free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6.34-rc1/include/linux/perf_event.h linux-2.6.34-rc1_work/include/linux/perf_event.h
--- linux-2.6.34-rc1/include/linux/perf_event.h 2010-03-09 13:04:28.905944253 +0800
+++ linux-2.6.34-rc1_work/include/linux/perf_event.h 2010-03-10 17:06:34.228953260 +0800
@@ -287,11 +287,13 @@ struct perf_event_mmap_page {
__u64 data_tail; /* user-space written tail */
};
-#define PERF_RECORD_MISC_CPUMODE_MASK (3 << 0)
+#define PERF_RECORD_MISC_CPUMODE_MASK (7 << 0)
#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (2 << 0)
#define PERF_RECORD_MISC_HYPERVISOR (3 << 0)
+#define PERF_RECORD_MISC_GUEST_KERNEL (4 << 0)
+#define PERF_RECORD_MISC_GUEST_USER (5 << 0)
struct perf_event_header {
__u32 type;
@@ -439,6 +441,13 @@ enum perf_callchain_context {
# include <asm/perf_event.h>
#endif
+struct perf_guest_info_callbacks {
+ int (*is_in_guest) (void);
+ int (*is_user_mode) (void);
+ u64 (*get_guest_ip) (void);
+ void (*reset_in_guest) (void);
+};
+
#ifdef CONFIG_HAVE_HW_BREAKPOINT
#include <asm/hw_breakpoint.h>
#endif
@@ -849,6 +858,10 @@ static inline void perf_event_mmap(struc
__perf_event_mmap(vma);
}
+extern u64 perf_instruction_pointer(struct pt_regs *regs);
+int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
+int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
+
extern void perf_event_comm(struct task_struct *tsk);
extern void perf_event_fork(struct task_struct *tsk);
@@ -862,12 +875,6 @@ extern void perf_event_init(void);
extern void perf_tp_event(int event_id, u64 addr, u64 count, void *record, int entry_size);
extern void perf_bp_event(struct perf_event *event, void *data);
-#ifndef perf_misc_flags
-#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
- PERF_RECORD_MISC_KERNEL)
-#define perf_instruction_pointer(regs) instruction_pointer(regs)
-#endif
-
extern int perf_output_begin(struct perf_output_handle *handle,
struct perf_event *event, unsigned int size,
int nmi, int sample);
@@ -902,6 +909,13 @@ perf_sw_event(u32 event_id, u64 nr, int
static inline void
perf_bp_event(struct perf_event *event, void *data) { }
+static inline int perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs) { return 0; }
+static inline int perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs) { return 0; }
+
+#define perf_instruction_pointer(regs) instruction_pointer(regs)
+
static inline void perf_event_mmap(struct vm_area_struct *vma) { }
static inline void perf_event_comm(struct task_struct *tsk) { }
static inline void perf_event_fork(struct task_struct *tsk) { }
diff -Nraup linux-2.6.34-rc1/kernel/perf_event.c linux-2.6.34-rc1_work/kernel/perf_event.c
--- linux-2.6.34-rc1/kernel/perf_event.c 2010-03-09 13:04:30.085942017 +0800
+++ linux-2.6.34-rc1_work/kernel/perf_event.c 2010-03-10 17:06:34.232905199 +0800
@@ -2807,6 +2807,50 @@ __weak struct perf_callchain_entry *perf
}
/*
+ * We assume there is only KVM supporting the callbacks.
+ * Later on, we might change it to a list if there is
+ * another virtualization implementation supporting the callbacks.
+ */
+static struct perf_guest_info_callbacks *perf_guest_cbs;
+
+int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+ perf_guest_cbs = cbs;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
+
+int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+{
+ perf_guest_cbs = NULL;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+u64 perf_instruction_pointer(struct pt_regs *regs)
+{
+ u64 ip;
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ ip = perf_guest_cbs->get_guest_ip();
+ } else
+ ip = instruction_pointer(regs);
+ return ip;
+}
+
+#ifndef perf_misc_flags
+static inline unsigned int perf_misc_flags(struct pt_regs *regs)
+{
+ if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ return perf_guest_cbs->is_user_mode() ?
+ PERF_RECORD_MISC_GUEST_USER :
+ PERF_RECORD_MISC_GUEST_KERNEL;
+ } else
+ return user_mode(regs) ? PERF_RECORD_MISC_USER :
+ PERF_RECORD_MISC_KERNEL;
+}
+#endif
+
+/*
* Output
*/
static bool perf_output_space(struct perf_mmap_data *data, unsigned long tail,
diff -Nraup linux-2.6.34-rc1/tools/perf/builtin-diff.c linux-2.6.34-rc1_work/tools/perf/builtin-diff.c
--- linux-2.6.34-rc1/tools/perf/builtin-diff.c 2010-03-09 13:04:31.373942905 +0800
+++ linux-2.6.34-rc1_work/tools/perf/builtin-diff.c 2010-03-10 17:06:34.232905199 +0800
@@ -222,6 +222,9 @@ int cmd_diff(int argc, const char **argv
input_new = argv[1];
} else
input_new = argv[0];
+ } else if (symbol_conf.guest_vmlinux_name || symbol_conf.guest_kallsyms) {
+ input_old = "perf.data.host";
+ input_new = "perf.data.guest";
}
symbol_conf.exclude_other = false;
diff -Nraup linux-2.6.34-rc1/tools/perf/builtin.h linux-2.6.34-rc1_work/tools/perf/builtin.h
--- linux-2.6.34-rc1/tools/perf/builtin.h 2010-03-09 13:04:31.377861392 +0800
+++ linux-2.6.34-rc1_work/tools/perf/builtin.h 2010-03-10 17:06:34.232905199 +0800
@@ -32,5 +32,6 @@ extern int cmd_version(int argc, const c
extern int cmd_probe(int argc, const char **argv, const char *prefix);
extern int cmd_kmem(int argc, const char **argv, const char *prefix);
extern int cmd_lock(int argc, const char **argv, const char *prefix);
+extern int cmd_kvm(int argc, const char **argv, const char *prefix);
#endif
diff -Nraup linux-2.6.34-rc1/tools/perf/builtin-kvm.c linux-2.6.34-rc1_work/tools/perf/builtin-kvm.c
--- linux-2.6.34-rc1/tools/perf/builtin-kvm.c 1970-01-01 08:00:00.000000000 +0800
+++ linux-2.6.34-rc1_work/tools/perf/builtin-kvm.c 2010-03-10 17:06:34.232905199 +0800
@@ -0,0 +1,123 @@
+#include "builtin.h"
+#include "perf.h"
+
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+#include "util/session.h"
+
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+
+#include "util/debug.h"
+
+#include <sys/prctl.h>
+
+#include <semaphore.h>
+#include <pthread.h>
+#include <math.h>
+
+static char *file_name = NULL;
+static char name_buffer[256];
+
+int perf_host = 1;
+int perf_guest = 0;
+
+static const char * const kvm_usage[] = {
+ "perf kvm [<options>] {top|record|report|diff}",
+ NULL
+};
+
+static const struct option kvm_options[] = {
+ OPT_STRING('i', "input", &file_name, "file",
+ "Input file name"),
+ OPT_STRING('o', "output", &file_name, "file",
+ "Output file name"),
+ OPT_BOOLEAN(0, "guest", &perf_guest,
+ "Collect guest os data"),
+ OPT_BOOLEAN(0, "host", &perf_host,
+ "Collect guest os data"),
+ OPT_STRING(0, "guestvmlinux", &symbol_conf.guest_vmlinux_name, "file",
+ "file saving guest os vmlinux"),
+ OPT_STRING(0, "guestkallsyms", &symbol_conf.guest_kallsyms, "file",
+ "file saving guest os /proc/kallsyms"),
+ OPT_STRING(0, "guestmodules", &symbol_conf.guest_modules, "file",
+ "file saving guest os /proc/modules"),
+ OPT_END()
+};
+
+static int __cmd_record(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("record");
+ rec_argv[i++] = strdup("-o");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_record(i, rec_argv, NULL);
+}
+
+static int __cmd_report(int argc, const char **argv)
+{
+ int rec_argc, i = 0, j;
+ const char **rec_argv;
+
+ rec_argc = argc + 2;
+ rec_argv = calloc(rec_argc + 1, sizeof(char *));
+ rec_argv[i++] = strdup("report");
+ rec_argv[i++] = strdup("-i");
+ rec_argv[i++] = strdup(file_name);
+ for (j = 1; j < argc; j++, i++)
+ rec_argv[i] = argv[j];
+
+ BUG_ON(i != rec_argc);
+
+ return cmd_report(i, rec_argv, NULL);
+}
+
+int cmd_kvm(int argc, const char **argv, const char *prefix __used)
+{
+ perf_host = perf_guest = 0;
+
+ argc = parse_options(argc, argv, kvm_options, kvm_usage,
+ PARSE_OPT_STOP_AT_NON_OPTION);
+ if (!argc)
+ usage_with_options(kvm_usage, kvm_options);
+
+ if (!perf_host)
+ perf_guest = 1;
+
+ if (!file_name) {
+ if (perf_host && !perf_guest)
+ sprintf(name_buffer, "perf.data.host");
+ else if (!perf_host && perf_guest)
+ sprintf(name_buffer, "perf.data.guest");
+ else
+ sprintf(name_buffer, "perf.data.kvm");
+ file_name = name_buffer;
+ }
+
+ if (!strncmp(argv[0], "rec", 3)) {
+ return __cmd_record(argc, argv);
+ } else if (!strncmp(argv[0], "rep", 3)) {
+ return __cmd_report(argc, argv);
+ } else if (!strncmp(argv[0], "diff", 4)) {
+ return cmd_diff(argc, argv, NULL);
+ } else if (!strncmp(argv[0], "top", 3)) {
+ return cmd_top(argc, argv, NULL);
+ } else {
+ usage_with_options(kvm_usage, kvm_options);
+ }
+
+ return 0;
+}
+
diff -Nraup linux-2.6.34-rc1/tools/perf/builtin-top.c linux-2.6.34-rc1_work/tools/perf/builtin-top.c
--- linux-2.6.34-rc1/tools/perf/builtin-top.c 2010-03-09 13:04:31.377861392 +0800
+++ linux-2.6.34-rc1_work/tools/perf/builtin-top.c 2010-03-10 17:06:34.232905199 +0800
@@ -409,7 +409,8 @@ static double sym_weight(const struct sy
}
static long samples;
-static long userspace_samples;
+static long kernel_samples, userspace_samples;
+static long guest_us_samples, guest_kernel_samples;
static const char CONSOLE_CLEAR[] = "^[[H^[[2J";
static void __list_insert_active_sym(struct sym_entry *syme)
@@ -449,7 +450,10 @@ static void print_sym_table(void)
int printed = 0, j;
int counter, snap = !display_weighted ? sym_counter : 0;
float samples_per_sec = samples/delay_secs;
- float ksamples_per_sec = (samples-userspace_samples)/delay_secs;
+ float ksamples_per_sec = (kernel_samples)/delay_secs;
+ float userspace_samples_per_sec = (userspace_samples)/delay_secs;
+ float guest_kernel_samples_per_sec = (guest_kernel_samples)/delay_secs;
+ float guest_us_samples_per_sec = (guest_us_samples)/delay_secs;
float sum_ksamples = 0.0;
struct sym_entry *syme, *n;
struct rb_root tmp = RB_ROOT;
@@ -457,7 +461,8 @@ static void print_sym_table(void)
int sym_width = 0, dso_width = 0, max_dso_width;
const int win_width = winsize.ws_col - 1;
- samples = userspace_samples = 0;
+ samples = kernel_samples = userspace_samples = 0;
+ guest_kernel_samples = guest_us_samples = 0;
/* Sort the active symbols */
pthread_mutex_lock(&active_symbols_lock);
@@ -488,9 +493,19 @@ static void print_sym_table(void)
puts(CONSOLE_CLEAR);
printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
- printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% [",
- samples_per_sec,
- 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)));
+ if (!perf_guest) {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)));
+ } else {
+ printf( " PerfTop:%8.0f irqs/sec kernel:%4.1f%% user:%4.1f%% guest kernel:%4.1f%% guest user:%4.1f%% [",
+ samples_per_sec,
+ 100.0 - (100.0*((samples_per_sec-ksamples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-userspace_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_kernel_samples_per_sec)/samples_per_sec)),
+ 100.0 - (100.0*((samples_per_sec-guest_us_samples_per_sec)/samples_per_sec))
+ );
+ }
if (nr_counters == 1 || !display_weighted) {
printf("%Ld", (u64)attrs[0].sample_period);
@@ -947,9 +962,17 @@ static void event__process_sample(const
return;
break;
case PERF_RECORD_MISC_KERNEL:
+ ++kernel_samples;
if (hide_kernel_symbols)
return;
break;
+ case PERF_RECORD_MISC_GUEST_KERNEL:
+ ++guest_kernel_samples;
+ break;
+ case PERF_RECORD_MISC_GUEST_USER:
+ ++guest_us_samples;
+ /* TODO: we don't process guest user from host side. */
+ return;
default:
return;
}
diff -Nraup linux-2.6.34-rc1/tools/perf/Makefile linux-2.6.34-rc1_work/tools/perf/Makefile
--- linux-2.6.34-rc1/tools/perf/Makefile 2010-03-09 13:04:31.341942020 +0800
+++ linux-2.6.34-rc1_work/tools/perf/Makefile 2010-03-10 17:06:34.232905199 +0800
@@ -458,6 +458,7 @@ BUILTIN_OBJS += builtin-trace.o
BUILTIN_OBJS += builtin-probe.o
BUILTIN_OBJS += builtin-kmem.o
BUILTIN_OBJS += builtin-lock.o
+BUILTIN_OBJS += builtin-kvm.o
PERFLIBS = $(LIB_FILE)
diff -Nraup linux-2.6.34-rc1/tools/perf/perf.c linux-2.6.34-rc1_work/tools/perf/perf.c
--- linux-2.6.34-rc1/tools/perf/perf.c 2010-03-09 13:04:31.377861392 +0800
+++ linux-2.6.34-rc1_work/tools/perf/perf.c 2010-03-10 17:06:34.232905199 +0800
@@ -304,6 +304,7 @@ static void handle_internal_command(int
{ "probe", cmd_probe, 0 },
{ "kmem", cmd_kmem, 0 },
{ "lock", cmd_lock, 0 },
+ { "kvm", cmd_kvm, 0 },
};
unsigned int i;
static const char ext[] = STRIP_EXTENSION;
diff -Nraup linux-2.6.34-rc1/tools/perf/perf.h linux-2.6.34-rc1_work/tools/perf/perf.h
--- linux-2.6.34-rc1/tools/perf/perf.h 2010-03-09 13:04:16.357945701 +0800
+++ linux-2.6.34-rc1_work/tools/perf/perf.h 2010-03-10 17:06:34.236904596 +0800
@@ -131,4 +131,6 @@ struct ip_callchain {
u64 ips[0];
};
+extern int perf_host, perf_guest;
+
#endif
diff -Nraup linux-2.6.34-rc1/tools/perf/util/event.c linux-2.6.34-rc1_work/tools/perf/util/event.c
--- linux-2.6.34-rc1/tools/perf/util/event.c 2010-03-09 13:04:31.381941876 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/event.c 2010-03-10 17:06:34.236904596 +0800
@@ -442,12 +442,16 @@ void thread__find_addr_map(struct thread
al->thread = self;
al->addr = addr;
- if (cpumode == PERF_RECORD_MISC_KERNEL) {
+ if (cpumode == PERF_RECORD_MISC_KERNEL && perf_host) {
al->level = 'k';
mg = &session->kmaps;
- } else if (cpumode == PERF_RECORD_MISC_USER)
+ } else if (cpumode == PERF_RECORD_MISC_USER && perf_host) {
al->level = '.';
- else {
+ } else if (cpumode == PERF_RECORD_MISC_GUEST_KERNEL && perf_guest) {
+ al->level = 'g';
+ mg = &session->guest_kmaps;
+ } else {
+ /* TODO: We don't support guest user space. Might support it later. */
al->level = 'H';
al->map = NULL;
return;
@@ -464,10 +468,18 @@ try_again:
* "[vdso]" dso, but for now lets use the old trick of looking
* in the whole kernel symbol list.
*/
- if ((long long)al->addr < 0 && mg != &session->kmaps) {
+ if ((long long)al->addr < 0 &&
+ mg != &session->kmaps &&
+ cpumode == PERF_RECORD_MISC_KERNEL) {
mg = &session->kmaps;
goto try_again;
}
+ if ((long long)al->addr < 0 &&
+ mg != &session->guest_kmaps &&
+ cpumode == PERF_RECORD_MISC_GUEST_KERNEL) {
+ mg = &session->guest_kmaps;
+ goto try_again;
+ }
} else
al->addr = al->map->map_ip(al->map, al->addr);
}
diff -Nraup linux-2.6.34-rc1/tools/perf/util/session.c linux-2.6.34-rc1_work/tools/perf/util/session.c
--- linux-2.6.34-rc1/tools/perf/util/session.c 2010-03-09 13:04:31.385942104 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/session.c 2010-03-10 17:06:34.236904596 +0800
@@ -54,7 +54,12 @@ out_close:
static inline int perf_session__create_kernel_maps(struct perf_session *self)
{
- return map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
+ int ret;
+ ret = map_groups__create_kernel_maps(&self->kmaps, self->vmlinux_maps);
+ if (ret >= 0)
+ ret = map_groups__create_guest_kernel_maps(&self->guest_kmaps,
+ self->guest_vmlinux_maps);
+ return ret;
}
struct perf_session *perf_session__new(const char *filename, int mode, bool force)
@@ -76,6 +81,7 @@ struct perf_session *perf_session__new(c
self->cwdlen = 0;
self->unknown_events = 0;
map_groups__init(&self->kmaps);
+ map_groups__init(&self->guest_kmaps);
if (mode == O_RDONLY) {
if (perf_session__open(self, force) < 0)
diff -Nraup linux-2.6.34-rc1/tools/perf/util/session.h linux-2.6.34-rc1_work/tools/perf/util/session.h
--- linux-2.6.34-rc1/tools/perf/util/session.h 2010-03-09 13:04:31.385942104 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/session.h 2010-03-10 17:06:34.236904596 +0800
@@ -16,9 +16,11 @@ struct perf_session {
unsigned long size;
unsigned long mmap_window;
struct map_groups kmaps;
+ struct map_groups guest_kmaps;
struct rb_root threads;
struct thread *last_match;
struct map *vmlinux_maps[MAP__NR_TYPES];
+ struct map *guest_vmlinux_maps[MAP__NR_TYPES];
struct events_stats events_stats;
unsigned long event_total[PERF_RECORD_MAX];
unsigned long unknown_events;
@@ -83,6 +85,6 @@ static inline struct map *
perf_session__new_module_map(struct perf_session *self,
u64 start, const char *filename)
{
- return map_groups__new_module(&self->kmaps, start, filename);
+ return map_groups__new_module(&self->kmaps, start, filename, 0);
}
#endif /* __PERF_SESSION_H */
diff -Nraup linux-2.6.34-rc1/tools/perf/util/symbol.c linux-2.6.34-rc1_work/tools/perf/util/symbol.c
--- linux-2.6.34-rc1/tools/perf/util/symbol.c 2010-03-09 13:04:31.385942104 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/symbol.c 2010-03-10 17:06:34.236904596 +0800
@@ -27,6 +27,8 @@ enum dso_origin {
DSO__ORIG_BUILDID,
DSO__ORIG_DSO,
DSO__ORIG_KMODULE,
+ DSO__ORIG_GUEST_KERNEL,
+ DSO__ORIG_GUEST_KMODULE,
DSO__ORIG_NOT_FOUND,
};
@@ -34,6 +36,8 @@ static void dsos__add(struct list_head *
static struct map *map__new2(u64 start, struct dso *dso, enum map_type type);
static int dso__load_kernel_sym(struct dso *self, struct map *map,
symbol_filter_t filter);
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter);
static int vmlinux_path__nr_entries;
static char **vmlinux_path;
@@ -184,6 +188,7 @@ struct dso *dso__new(const char *name)
self->loaded = 0;
self->sorted_by_name = 0;
self->has_build_id = 0;
+ self->kernel = DSO_TYPE_USER;
}
return self;
@@ -523,13 +528,19 @@ static int dso__split_kallsyms(struct ds
char dso_name[PATH_MAX];
struct dso *dso;
- snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
- kernel_range++);
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ snprintf(dso_name, sizeof(dso_name), "[guest.kernel].%d",
+ kernel_range++);
+ else
+ snprintf(dso_name, sizeof(dso_name), "[kernel].%d",
+ kernel_range++);
dso = dso__new(dso_name);
if (dso == NULL)
return -1;
+ dso->kernel = self->kernel;
+
curr_map = map__new2(pos->start, dso, map->type);
if (curr_map == NULL) {
dso__delete(dso);
@@ -563,7 +574,10 @@ int dso__load_kallsyms(struct dso *self,
return -1;
symbols__fixup_end(&self->symbols[map->type]);
- self->origin = DSO__ORIG_KERNEL;
+ if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ self->origin = DSO__ORIG_GUEST_KERNEL;
+ else
+ self->origin = DSO__ORIG_KERNEL;
return dso__split_kallsyms(self, map, filter);
}
@@ -951,7 +965,7 @@ static int dso__load_sym(struct dso *sel
nr_syms = shdr.sh_size / shdr.sh_entsize;
memset(&sym, 0, sizeof(sym));
- if (!self->kernel) {
+ if (self->kernel == DSO_TYPE_USER) {
self->adjust_symbols = (ehdr.e_type == ET_EXEC ||
elf_section_by_name(elf, &ehdr, &shdr,
".gnu.prelink_undo",
@@ -983,7 +997,7 @@ static int dso__load_sym(struct dso *sel
section_name = elf_sec__name(&shdr, secstrs);
- if (self->kernel || kmodule) {
+ if (self->kernel != DSO_TYPE_USER || kmodule) {
char dso_name[PATH_MAX];
if (strcmp(section_name,
@@ -1009,6 +1023,7 @@ static int dso__load_sym(struct dso *sel
curr_dso = dso__new(dso_name);
if (curr_dso == NULL)
goto out_elf_end;
+ curr_dso->kernel = self->kernel;
curr_map = map__new2(start, curr_dso,
map->type);
if (curr_map == NULL) {
@@ -1017,9 +1032,15 @@ static int dso__load_sym(struct dso *sel
}
curr_map->map_ip = identity__map_ip;
curr_map->unmap_ip = identity__map_ip;
- curr_dso->origin = DSO__ORIG_KERNEL;
+ if (curr_dso->kernel == DSO_TYPE_GUEST_KERNEL) {
+ curr_dso->origin = DSO__ORIG_GUEST_KERNEL;
+ dsos__add(&dsos__guest_kernel, curr_dso);
+ } else {
+ curr_dso->origin = DSO__ORIG_KERNEL;
+ dsos__add(&dsos__kernel, curr_dso);
+ }
+
map_groups__insert(kmap->kmaps, curr_map);
- dsos__add(&dsos__kernel, curr_dso);
dso__set_loaded(curr_dso, map->type);
} else
curr_dso = curr_map->dso;
@@ -1240,6 +1261,8 @@ char dso__symtab_origin(const struct dso
[DSO__ORIG_BUILDID] = 'b',
[DSO__ORIG_DSO] = 'd',
[DSO__ORIG_KMODULE] = 'K',
+ [DSO__ORIG_GUEST_KERNEL] = 'g',
+ [DSO__ORIG_GUEST_KMODULE] = 'G',
};
if (self == NULL || self->origin == DSO__ORIG_NOT_FOUND)
@@ -1258,8 +1281,10 @@ int dso__load(struct dso *self, struct m
dso__set_loaded(self, map->type);
- if (self->kernel)
+ if (self->kernel == DSO_TYPE_KERNEL)
return dso__load_kernel_sym(self, map, filter);
+ else if (self->kernel == DSO_TYPE_GUEST_KERNEL)
+ return dso__load_guest_kernel_sym(self, map, filter);
name = malloc(size);
if (!name)
@@ -1463,7 +1488,7 @@ static int map_groups__set_modules_path(
static struct map *map__new2(u64 start, struct dso *dso, enum map_type type)
{
struct map *self = zalloc(sizeof(*self) +
- (dso->kernel ? sizeof(struct kmap) : 0));
+ (dso->kernel != DSO_TYPE_USER ? sizeof(struct kmap) : 0));
if (self != NULL) {
/*
* ->end will be filled after we load all the symbols
@@ -1475,11 +1500,15 @@ static struct map *map__new2(u64 start,
}
struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename)
+ const char *filename, int guest)
{
struct map *map;
-	struct dso *dso = __dsos__findnew(&dsos__kernel, filename);
+	struct dso *dso;
+
+	if (!guest)
+		dso = __dsos__findnew(&dsos__kernel, filename);
+	else
+		dso = __dsos__findnew(&dsos__guest_kernel, filename);
if (dso == NULL)
return NULL;
@@ -1487,16 +1516,20 @@ struct map *map_groups__new_module(struc
if (map == NULL)
return NULL;
- dso->origin = DSO__ORIG_KMODULE;
+ if (guest)
+ dso->origin = DSO__ORIG_GUEST_KMODULE;
+ else
+ dso->origin = DSO__ORIG_KMODULE;
map_groups__insert(self, map);
return map;
}
-static int map_groups__create_modules(struct map_groups *self)
+static int __map_groups__create_modules(struct map_groups *self,
+				const char *filename, int guest)
{
char *line = NULL;
size_t n;
- FILE *file = fopen("/proc/modules", "r");
+ FILE *file = fopen(filename, "r");
struct map *map;
if (file == NULL)
@@ -1530,16 +1563,17 @@ static int map_groups__create_modules(st
*sep = '\0';
snprintf(name, sizeof(name), "[%s]", line);
- map = map_groups__new_module(self, start, name);
+ map = map_groups__new_module(self, start, name, guest);
if (map == NULL)
goto out_delete_line;
- dso__kernel_module_get_build_id(map->dso);
+ if (!guest)
+ dso__kernel_module_get_build_id(map->dso);
}
free(line);
fclose(file);
- return map_groups__set_modules_path(self);
+ return 0;
out_delete_line:
free(line);
@@ -1547,6 +1581,21 @@ out_failure:
return -1;
}
+static int map_groups__create_modules(struct map_groups *self)
+{
+ int ret;
+
+ ret = __map_groups__create_modules(self, "/proc/modules", 0);
+ if (ret >= 0)
+ ret = map_groups__set_modules_path(self);
+ return ret;
+}
+
+static int map_groups__create_guest_modules(struct map_groups *self)
+{
+	if (symbol_conf.guest_modules == NULL)
+		return -1;
+	return __map_groups__create_modules(self, symbol_conf.guest_modules, 1);
+}
+
static int dso__load_vmlinux(struct dso *self, struct map *map,
const char *vmlinux, symbol_filter_t filter)
{
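
Since __map_groups__create_modules() now takes the file to parse, the guest
path can feed it a saved copy of the guest's /proc/modules instead of the
host's. The expected input is the usual /proc/modules layout, for example
(address made up):

	kvm_intel 54285 0 - Live 0xffffffffa0150000

Only the module name and the load address are consumed, and the name becomes
a "[kvm_intel]"-style map exactly as on the host.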
@@ -1706,8 +1755,44 @@ out_fixup:
return err;
}
+static int dso__load_guest_kernel_sym(struct dso *self, struct map *map,
+ symbol_filter_t filter)
+{
+ int err;
+	const char *kallsyms_filename = NULL;
+ /*
+	 * If the user specified a guest vmlinux filename, use it and
+	 * only it, reporting errors to the user if it cannot be used.
+	 * Otherwise fall back to the guest kallsyms file supplied on
+	 * the command line.
+ */
+ if (symbol_conf.guest_vmlinux_name != NULL) {
+ err = dso__load_vmlinux(self, map,
+ symbol_conf.guest_vmlinux_name, filter);
+ goto out_try_fixup;
+ }
+
+ kallsyms_filename = symbol_conf.guest_kallsyms;
+ if (!kallsyms_filename)
+ return -1;
+ err = dso__load_kallsyms(self, kallsyms_filename, map, filter);
+ if (err > 0)
+ pr_debug("Using %s for symbols\n", kallsyms_filename);
+
+out_try_fixup:
+ if (err > 0) {
+ if (kallsyms_filename != NULL)
+ dso__set_long_name(self, strdup("[guest.kernel.kallsyms]"));
+ map__fixup_start(map);
+ map__fixup_end(map);
+ }
+
+ return err;
+}
+
LIST_HEAD(dsos__user);
LIST_HEAD(dsos__kernel);
+LIST_HEAD(dsos__guest_user);
+LIST_HEAD(dsos__guest_kernel);
static void dsos__add(struct list_head *head, struct dso *dso)
{
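
dso__load_guest_kernel_sym() needs symbol files describing the guest, not the
host. Assuming shell access to a Linux guest, one way to capture them (cat
rather than scp, since procfs files report a zero size):

	ssh guest cat /proc/kallsyms > guest_kallsyms
	ssh guest cat /proc/modules  > guest_modules

Pointing symbol_conf.guest_vmlinux_name at a guest vmlinux with symbols works
as well.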
@@ -1754,6 +1839,8 @@ void dsos__fprintf(FILE *fp)
{
__dsos__fprintf(&dsos__kernel, fp);
__dsos__fprintf(&dsos__user, fp);
+ __dsos__fprintf(&dsos__guest_kernel, fp);
+ __dsos__fprintf(&dsos__guest_user, fp);
}
static size_t __dsos__fprintf_buildid(struct list_head *head, FILE *fp,
@@ -1783,7 +1870,19 @@ struct dso *dso__new_kernel(const char *
if (self != NULL) {
self->short_name = "[kernel]";
- self->kernel = 1;
+ self->kernel = DSO_TYPE_KERNEL;
+ }
+
+ return self;
+}
+
+struct dso *dso__new_guest_kernel(const char *name)
+{
+ struct dso *self = dso__new(name ?: "[guest.kernel.kallsyms]");
+
+ if (self != NULL) {
+ self->short_name = "[guest.kernel]";
+ self->kernel = DSO_TYPE_GUEST_KERNEL;
}
return self;
@@ -1808,6 +1907,16 @@ static struct dso *dsos__create_kernel(c
return kernel;
}
+static struct dso *dsos__create_guest_kernel(const char *vmlinux)
+{
+ struct dso *kernel = dso__new_guest_kernel(vmlinux);
+
+	if (kernel != NULL)
+ dsos__add(&dsos__guest_kernel, kernel);
+ return kernel;
+}
+
int __map_groups__create_kernel_maps(struct map_groups *self,
struct map *vmlinux_maps[MAP__NR_TYPES],
struct dso *kernel)
@@ -1956,3 +2065,24 @@ int map_groups__create_kernel_maps(struc
map_groups__fixup_end(self);
return 0;
}
+
+int map_groups__create_guest_kernel_maps(struct map_groups *self,
+ struct map *vmlinux_maps[MAP__NR_TYPES])
+{
+ struct dso *kernel = dsos__create_guest_kernel(symbol_conf.guest_vmlinux_name);
+
+ if (kernel == NULL)
+ return -1;
+
+ if (__map_groups__create_kernel_maps(self, vmlinux_maps, kernel) < 0)
+ return -1;
+
+ if (symbol_conf.use_modules && map_groups__create_guest_modules(self) < 0)
+ pr_debug("Problems creating module maps, continuing anyway...\n");
+ /*
+ * Now that we have all the maps created, just set the ->end of them:
+ */
+ map_groups__fixup_end(self);
+ return 0;
+}
+
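
The new entry point takes everything from symbol_conf, so a front end only has
to populate those fields before creating the session. A minimal sketch (the
option names in the comments are hypothetical, not part of this patch):

	symbol_conf.guest_kallsyms = guest_kallsyms_path; /* e.g. --guestkallsyms */
	symbol_conf.guest_modules  = guest_modules_path;  /* e.g. --guestmodules */
	/* or: symbol_conf.guest_vmlinux_name = guest_vmlinux_path; */

	session = perf_session__new(input_name, O_RDONLY, false);

perf_session__create_kernel_maps() then builds the guest maps next to the host
ones, as in the session.c hunk above.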
diff -Nraup linux-2.6.34-rc1/tools/perf/util/symbol.h linux-2.6.34-rc1_work/tools/perf/util/symbol.h
--- linux-2.6.34-rc1/tools/perf/util/symbol.h 2010-03-09 13:04:31.385942104 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/symbol.h 2010-03-10 17:06:34.236904596 +0800
@@ -66,7 +66,10 @@ struct symbol_conf {
full_paths;
const char *vmlinux_name,
*field_sep;
- char *dso_list_str,
+ const char *guest_vmlinux_name,
+ *guest_kallsyms,
+ *guest_modules;
+ char *dso_list_str,
*comm_list_str,
*sym_list_str,
*col_width_list_str;
@@ -97,6 +100,12 @@ struct addr_location {
bool filtered;
};
+enum dso_kernel_type {
+ DSO_TYPE_USER = 0,
+ DSO_TYPE_KERNEL,
+ DSO_TYPE_GUEST_KERNEL
+};
+
struct dso {
struct list_head node;
struct rb_root symbols[MAP__NR_TYPES];
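
Widening the old u8 kernel:1 bitfield into an enum is what makes the three-way
split possible, since a one-bit field can only encode kernel vs. user. Call
sites then choose how wide a check they need, as the symbol.c hunks above
already do:

	if (dso->kernel != DSO_TYPE_USER)		/* any kernel DSO, host or guest */
	if (dso->kernel == DSO_TYPE_GUEST_KERNEL)	/* guest kernel only */

Keeping DSO_TYPE_USER == 0 also preserves the old zero/non-zero semantics of
the field.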
@@ -104,7 +113,7 @@ struct dso {
u8 adjust_symbols:1;
u8 slen_calculated:1;
u8 has_build_id:1;
- u8 kernel:1;
+ enum dso_kernel_type kernel;
u8 hit:1;
unsigned char origin;
u8 sorted_by_name;
@@ -118,6 +127,7 @@ struct dso {
struct dso *dso__new(const char *name);
struct dso *dso__new_kernel(const char *name);
+struct dso *dso__new_guest_kernel(const char *name);
void dso__delete(struct dso *self);
bool dso__loaded(const struct dso *self, enum map_type type);
@@ -130,7 +140,7 @@ static inline void dso__set_loaded(struc
void dso__sort_by_name(struct dso *self, enum map_type type);
-extern struct list_head dsos__user, dsos__kernel;
+extern struct list_head dsos__user, dsos__kernel, dsos__guest_user, dsos__guest_kernel;
struct dso *__dsos__findnew(struct list_head *head, const char *name);
diff -Nraup linux-2.6.34-rc1/tools/perf/util/thread.h linux-2.6.34-rc1_work/tools/perf/util/thread.h
--- linux-2.6.34-rc1/tools/perf/util/thread.h 2010-03-09 13:04:31.385942104 +0800
+++ linux-2.6.34-rc1_work/tools/perf/util/thread.h 2010-03-10 17:06:34.236904596 +0800
@@ -79,6 +79,9 @@ int __map_groups__create_kernel_maps(str
int map_groups__create_kernel_maps(struct map_groups *self,
struct map *vmlinux_maps[MAP__NR_TYPES]);
+int map_groups__create_guest_kernel_maps(struct map_groups *self,
+ struct map *vmlinux_maps[MAP__NR_TYPES]);
+
struct map *map_groups__new_module(struct map_groups *self, u64 start,
- const char *filename);
+ const char *filename, int guest);
#endif /* __PERF_THREAD_H */
^ permalink raw reply [flat|nested] 99+ messages in thread