From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jes Sorensen
Subject: Re: KVM PMU virtualization
Date: Fri, 26 Feb 2010 14:51:04 +0100
Message-ID: <4B87D1C8.5090901@redhat.com>
References: <4B86917C.4070102@redhat.com> <20100225173423.GB4246@8bytes.org> <20100226084241.GF15885@elte.hu> <4B87987A.2020302@redhat.com> <20100226104437.GB7463@elte.hu> <4B87AF44.9090702@redhat.com> <20100226114217.GI7463@elte.hu> <4B87B5DE.30503@redhat.com> <1267190907.22519.601.camel@laptop>
To: Peter Zijlstra
Cc: Avi Kivity, Ingo Molnar, Joerg Roedel, KVM General, Zachary Amsden, Gleb Natapov, ming.m.lin@intel.com, "Zhang, Yanmin", Thomas Gleixner, "H. Peter Anvin", Arjan van de Ven, Frédéric Weisbecker, Arnaldo Carvalho de Melo
In-Reply-To: <1267190907.22519.601.camel@laptop>

On 02/26/10 14:28, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:
>
>> It would be the other way round - the host would steal the pmu from the
>> guest. Later we can try to time-slice and extrapolate, though that's
>> not going to be easy.
>
> Right, so perf already does the time slicing and interpolating thing, so
> a soft-pmu gets that for free.

What I don't like here is that without rewriting the guest OS, there
will be two layers of time-slicing and extrapolation. That is going to
make the reported numbers close to useless.

> Anyway, this discussion seems somewhat in a stale-mate position.
>
> The KVM folks basically demand a full PMU MSR shadow with PMI
> passthrough so that their $legacy shit works without modification.
> My question with that is how $legacy muck can ever know how the current
> PMU works, you can't even properly emulate a core2 pmu on a nehalem
> because intel keeps messing with the event codes for every new model.
>
> So basically for this to work means the guest can't run legacy stuff
> anyway, but needs to run very up-to-date software, so we might as well
> create a soft-pmu/paravirt interface now and have all up-to-date
> software support that for the next generation.

That is the problem. Today there is a large installed base of Core2
users who wish to measure their stuff on the hardware they have. The
same will be true for Nehalem based systems once whatever replaces
Nehalem comes out and breaks compatibility again. Since we are unable
to emulate Core2 on Nehalem, and almost certainly will be unable to
emulate Nehalem on its successor, we are stuck with this.

A para-virt interface is a nice idea, but since we cannot emulate an
old CPU properly, it still means there isn't much we can do; we are
stuck with the same limitations. I simply don't see the value of
introducing a para-virt interface for this.

> Furthermore, when KVM doesn't virtualize the physical system topology,
> some PMU features cannot even be sanely used from a vcpu.

That is definitely an issue, and there is nothing we can really do
about it. Having two guests running in parallel under KVM means they
are going to see more cache misses than they would if they ran bare
metal on the hardware.

However, even with all of this, we have to keep in mind who is going
to use performance monitoring in a guest. It is going to be
application writers, mostly people writing analytical/scientific
applications. They rarely have control over the OS they are running
on, but are given systems and told to work with what they are given.
Driver upgrades and things like that don't come quickly.
However, they also tend to understand limitations like these and will
still be able to benefit from perf on a system like that.

> So while currently a root user can already tie up all of the pmu using
> perf, simply using that to hand the full pmu off to the guest still
> leaves lots of issues.

Well, isn't that the case with the current setup anyway? If enough
user apps start requesting PMU resources, the hardware is going to run
out of counters very quickly. The real issue here, IMHO, is whether or
not it is possible to use the PMU to count events on a different CPU.
If that is really possible, sharing the PMU is not an option :(

All that said, what we really want is for Intel and AMD to come up
with proper hardware PMU virtualization support that makes it easy to
rotate the full PMU in and out for a guest. Then this whole discussion
becomes a non-issue.

Cheers,
Jes