From: Avi Kivity <avi@redhat.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@au1.ibm.com>,
kvm@vger.kernel.org, Paul Mackerras <pmac@au1.ibm.com>,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
David Gibson <dwg@au1.ibm.com>,
Alex Williamson <alex.williamson@redhat.com>,
Anthony Liguori <anthony@codemonkey.ws>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: kvm PCI assignment & VFIO ramblings
Date: Tue, 02 Aug 2011 12:12:02 +0300
Message-ID: <4E37BF62.2060809@redhat.com>
In-Reply-To: <1312248479.8793.827.camel@pasglop>
On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> > I have a feeling you'll be getting the same capabilities sooner or
> > later, or you won't be able to make use of SR-IOV VFs.
>
> I'm not sure what you mean. We can do SR-IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).
Don't those limitations include "all VFs must be assigned to the same
guest"?
PCI on x86 has function granularity, SR-IOV reduces this to VF
granularity, but I thought POWER has partition or group granularity,
which is much coarser?
> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on the number of bits to match in
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical, so for now we stick it at 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as a kind of "accordion": they are evenly divided
> into segments in different PE#s, and there are several of them which we
> can "move around" and typically use to map VF BARs.
So, SR-IOV VFs *don't* have the group limitation? Sorry, I'm deluged by
technical details with no ppc background to fit them into, so I can't
say I'm making sense of this.
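To pin down the one piece that is concrete, Ben's RID filtering reduces
to a prefix match, something like this sketch (the struct and all names
are invented; only the match rule itself, high-order bus bits plus an
optional dev/fn match, comes from his description):

    #include <stdbool.h>
    #include <stdint.h>

    /* PCI requester ID layout: bus[15:8], dev[7:3], fn[2:0]. */
    struct rid_filter {
            uint8_t bus;          /* bus number to compare against */
            uint8_t bus_bits;     /* high-order bus bits that must match */
            bool    match_devfn;  /* also require an exact dev/fn match? */
            uint8_t devfn;
    };

    static bool rid_matches(const struct rid_filter *f, uint16_t rid)
    {
            uint8_t bus = rid >> 8, devfn = rid & 0xff;
            uint8_t mask = f->bus_bits ? 0xff << (8 - f->bus_bits) : 0;

            /* Not an arbitrary filter: a bus prefix plus an optional
             * exact dev/fn, which is enough for SR-IOV-style grouping. */
            if ((bus & mask) != (f->bus & mask))
                    return false;
            return !f->match_devfn || devfn == f->devfn;
    }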
> > >
> > > VFIO here is basically designed for one and only one thing: expose the
> > > entire guest physical address space to the device more/less 1:1.
> >
> > A single-level iommu cannot be exposed to guests. Well, it can be
> > exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#)
> table.
>
> > A two-level iommu can be emulated and exposed to the guest. See
> > http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean by 2-level is two passes through two trees (i.e. 6 or 8
> levels, right?).
(16 or 25)
> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.
Well, then, I guess we need an additional interface to expose that to
the guest.
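Conceptually the emulation is just two walks chained together, as in
the sketch below (both walk functions stand in for real radix-table
walks; in hardware each guest-table access is itself host-translated,
which is where the 16-or-25 access count above comes from):

    #include <stdint.h>

    uint64_t guest_iommu_walk(uint64_t iova); /* guest tables: IOVA -> GPA */
    uint64_t host_iommu_walk(uint64_t gpa);   /* host tables:  GPA -> HPA */

    /* Two-level translation: the device's DMA address goes through the
     * guest-managed tables first, then the result goes through the
     * host-managed tables. */
    static uint64_t translate_dma(uint64_t iova)
    {
            uint64_t gpa = guest_iommu_walk(iova); /* first level */
            return host_iommu_walk(gpa);           /* second level */
    }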
> > > This means:
> > >
> > > - It only works with iommus that provide complete DMA address spaces
> > > to devices. Won't work with a single 'segmented' address space like we
> > > have on POWER.
> > >
> > > - It requires the guest to be pinned. Pass-through -> no more swap
> >
> > Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs that they're not usable. Some never will. It's a mess, really, and I
> wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the foundation on which the whole API is designed.
The API is not designed around pinning. It's a side effect of how the
IOMMU works. If your IOMMU only maps pages which are under active DMA,
then it would only pin those pages.
But I see what you mean, the API is designed around up-front
specification of all guest memory.
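In code, that up-front specification looks roughly like the sketch
below, using the type1 map ioctl as it eventually landed in
<linux/vfio.h>; the fd and addresses are placeholders, and this is a
sketch of the idea, not the final API:

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Map (and thereby pin) one contiguous chunk of guest RAM so the
     * device sees guest-physical addresses 1:1. */
    static int map_guest_ram(int container_fd, void *host_vaddr,
                             uint64_t guest_phys, uint64_t size)
    {
            struct vfio_iommu_type1_dma_map map;

            memset(&map, 0, sizeof(map));
            map.argsz = sizeof(map);
            map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
            map.vaddr = (uintptr_t)host_vaddr; /* userspace address of RAM */
            map.iova  = guest_phys;            /* device-visible address */
            map.size  = size;

            /* The kernel pins the whole range up front; with I/O page
             * faults the pinning could instead be demand-driven. */
            return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }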
> > > - It doesn't work for POWER server anyways because of our need to
> > > provide a paravirt iommu interface to the guest since that's how pHyp
> > > works today and how existing OSes expect to operate.
> >
> > Then you need to provide that same interface, and implement it using the
> > real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.
The original kvm device assignment code was (and is) part of kvm
itself. We're trying to move to vfio to allow sharing with non-kvm
users, but it does reduce flexibility. We can have an internal vfio-kvm
interface to update mappings in real time.
> > > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > > and has quite a bit of overhead. However we'll want to do the paravirt
> > > call directly in the kernel eventually ...
> >
> > Does the guest iomap each request? Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.
I see. x86 traditionally doesn't do it for every request. We had some
proposals to do a pviommu that does map every request, but none reached
maturity.
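For a sense of scale, the per-page h-call is tiny, which is why an
in-kernel (and eventually real-mode) handler is attractive; a sketch
with invented structure, not the real implementation:

    #include <stdint.h>

    #define TCE_SHIFT    12              /* 4K iommu pages */
    #define H_SUCCESS    0
    #define H_PARAMETER  (-4)

    struct tce_table {
            uint64_t  window_size;       /* bytes covered by this table */
            uint64_t *entries;
    };

    /* H_PUT_TCE: validate the I/O bus address, write one 64-bit TCE. */
    static long h_put_tce(struct tce_table *tbl, uint64_t ioba, uint64_t tce)
    {
            if (ioba >= tbl->window_size)
                    return H_PARAMETER;  /* outside the DMA window */

            /* tce == 0 unmaps the page; otherwise it carries the real
             * page address plus read/write permission bits. */
            tbl->entries[ioba >> TCE_SHIFT] = tce;
            return H_SUCCESS;
    }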
> >
> > So, you have interrupt redirection? That is, MSI-X table values encode
> > the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt API handles the configuration and the HW ensures that
> the guest cannot do anything other than what it's allowed to.
Okay, this is something that x86 doesn't have. Strange that it can
filter DMA at a fine granularity but not MSI, which is practically the
same thing.
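For reference, the two checks Ben describes amount to something like
this sketch (all names invented; the RID match is the same prefix match
as in the earlier sketch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Same bus-prefix/dev/fn match used for PE# assignment (see the
     * rid_matches() sketch earlier). */
    bool rid_allowed_for_pe(uint16_t rid, int pe);

    /* Check 1: the MSI port only accepts writes from RIDs that match
     * its PE#. */
    static bool msi_write_ok(int port_pe, uint16_t rid)
    {
            return rid_allowed_for_pe(rid, port_pe);
    }

    /* Check 2: the PIC only delivers an interrupt whose recorded PE#
     * matches the PE# of the device that triggered it. */
    static bool irq_fire_ok(int irq_pe, int src_pe)
    {
            return irq_pe == src_pe;
    }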
> >
> > Does the BAR value contain the segment base address? Or is that added
> > later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided into 128
> segments of equal size which can separately be assigned to PE#s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
Okay, and config space virtualization ensures that the guest can't remap?
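Putting numbers on the p7ioc example: 1G of MMIO at PCI address 3G
divided into 128 equal segments gives 8M per segment, and the
allocation rule is simply that BARs of devices in different PEs must
never land in the same segment. A sketch (the constants come from the
example above, the code is invented):

    #include <stdint.h>

    #define MMIO_BASE 0xC0000000ULL           /* 3G, PCI-side address */
    #define MMIO_SIZE 0x40000000ULL           /* 1G window */
    #define NUM_SEGS  128
    #define SEG_SIZE  (MMIO_SIZE / NUM_SEGS)  /* 8M per segment */

    /* Segment a PCI-side address falls into; each segment is
     * separately assignable to a PE#. */
    static unsigned addr_to_segment(uint64_t pci_addr)
    {
            return (pci_addr - MMIO_BASE) / SEG_SIZE;
    }

    /* Firmware's allocation rule: BARs of devices in different PEs
     * must not overlap in segment space. */
    static int bars_share_segment(uint64_t a_base, uint64_t a_size,
                                  uint64_t b_base, uint64_t b_size)
    {
            unsigned a_first = addr_to_segment(a_base);
            unsigned a_last  = addr_to_segment(a_base + a_size - 1);
            unsigned b_first = addr_to_segment(b_base);
            unsigned b_last  = addr_to_segment(b_base + b_size - 1);

            return a_first <= b_last && b_first <= a_last;
    }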
--
error compiling committee.c: too many arguments to function