From: Avi Kivity <avi@redhat.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alexey Kardashevskiy <aik@au1.ibm.com>,
kvm@vger.kernel.org, Paul Mackerras <pmac@au1.ibm.com>,
"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
David Gibson <dwg@au1.ibm.com>,
Alex Williamson <alex.williamson@redhat.com>,
Anthony Liguori <anthony@codemonkey.ws>,
linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: kvm PCI assignment & VFIO ramblings
Date: Tue, 02 Aug 2011 12:12:02 +0300
Message-ID: <4E37BF62.2060809@redhat.com>
In-Reply-To: <1312248479.8793.827.camel@pasglop>
On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> > I have a feeling you'll be getting the same capabilities sooner or
> > later, or you won't be able to make use of SR-IOV VFs.
>
> I'm not sure what you mean. We can do SR-IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).
Don't those limitations include "all VFs must be assigned to the same
guest"?
PCI on x86 has function granularity, SR-IOV reduces this to VF
granularity, but I thought POWER has partition or group granularity,
which is much coarser?
> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on the number of bits to match in
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical, so for now we stick it at 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as a kind of "accordion": they are evenly divided
> into segments in different PE#s, and there are several of them which we
> can "move around" and typically use to map VF BARs.
So, SR-IOV VFs *don't* have the group limitation? Sorry, I'm deluged by
technical details with no ppc background to fit them into, so I can't
say I'm making sense of this.
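To pin down the one piece that is concrete, Ben's RID filtering reduces
to a prefix match, something like this sketch (the struct and all names
are invented; only the match rule itself, high-order bus bits plus an
optional dev/fn match, comes from his description):

    #include <stdbool.h>
    #include <stdint.h>

    /* PCI requester ID layout: bus[15:8], dev[7:3], fn[2:0]. */
    struct rid_filter {
            uint8_t bus;          /* bus number to compare against */
            uint8_t bus_bits;     /* high-order bus bits that must match */
            bool    match_devfn;  /* also require an exact dev/fn match? */
            uint8_t devfn;
    };

    static bool rid_matches(const struct rid_filter *f, uint16_t rid)
    {
            uint8_t bus = rid >> 8, devfn = rid & 0xff;
            uint8_t mask = f->bus_bits ? 0xff << (8 - f->bus_bits) : 0;

            /* Not an arbitrary filter: a bus prefix plus an optional
             * exact dev/fn, which is enough for SR-IOV-style grouping. */
            if ((bus & mask) != (f->bus & mask))
                    return false;
            return !f->match_devfn || devfn == f->devfn;
    }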
> > >
> > > VFIO here is basically designed for one and only one thing: expose the
> > > entire guest physical address space to the device more/less 1:1.
> >
> > A single-level iommu cannot be exposed to guests. Well, it can be
> > exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#)
> table.
>
> > A two-level iommu can be emulated and exposed to the guest. See
> > http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean by 2-level is two passes through two trees (i.e. 6 or 8
> levels, right?).
(16 or 25)
> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.
Well, then, I guess we need an additional interface to expose that to
the guest.
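Conceptually the emulation is just two walks chained together, as in
the sketch below (both walk functions stand in for real radix-table
walks; in hardware each guest-table access is itself host-translated,
which is where the 16-or-25 access count above comes from):

    #include <stdint.h>

    uint64_t guest_iommu_walk(uint64_t iova); /* guest tables: IOVA -> GPA */
    uint64_t host_iommu_walk(uint64_t gpa);   /* host tables:  GPA -> HPA */

    /* Two-level translation: the device's DMA address goes through the
     * guest-managed tables first, then the result goes through the
     * host-managed tables. */
    static uint64_t translate_dma(uint64_t iova)
    {
            uint64_t gpa = guest_iommu_walk(iova); /* first level */
            return host_iommu_walk(gpa);           /* second level */
    }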
> > > This means:
> > >
> > > - It only works with iommus that provide complete DMA address spaces
> > > to devices. Won't work with a single 'segmented' address space like we
> > > have on POWER.
> > >
> > > - It requires the guest to be pinned. Pass-through -> no more swap
> >
> > Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs that they're not usable. Some never will. It's a mess, really, and I
> wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the foundation on which the whole API is designed.
The API is not designed around pinning. It's a side effect of how the
IOMMU works. If your IOMMU only maps pages which are under active DMA,
then it would only pin those pages.
But I see what you mean, the API is designed around up-front
specification of all guest memory.
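In code, that up-front specification looks roughly like the sketch
below, using the type1 map ioctl as it eventually landed in
<linux/vfio.h>; the fd and addresses are placeholders, and this is a
sketch of the idea, not the final API:

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* Map (and thereby pin) one contiguous chunk of guest RAM so the
     * device sees guest-physical addresses 1:1. */
    static int map_guest_ram(int container_fd, void *host_vaddr,
                             uint64_t guest_phys, uint64_t size)
    {
            struct vfio_iommu_type1_dma_map map;

            memset(&map, 0, sizeof(map));
            map.argsz = sizeof(map);
            map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
            map.vaddr = (uintptr_t)host_vaddr; /* userspace address of RAM */
            map.iova  = guest_phys;            /* device-visible address */
            map.size  = size;

            /* The kernel pins the whole range up front; with I/O page
             * faults the pinning could instead be demand-driven. */
            return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }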
> > > - It doesn't work for POWER server anyways because of our need to
> > > provide a paravirt iommu interface to the guest since that's how pHyp
> > > works today and how existing OSes expect to operate.
> >
> > Then you need to provide that same interface, and implement it using the
> > real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.
The original kvm device assignment code was (and is) part of kvm
itself. We're trying to move to vfio to allow sharing with non-kvm
users, but it does reduce flexibility. We can have an internal vfio-kvm
interface to update mappings in real time.
> > > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > > and has quite a bit of overhead. However we'll want to do the paravirt
> > > call directly in the kernel eventually ...
> >
> > Does the guest iomap each request? Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.
I see. x86 traditionally doesn't do it for every request. We had some
proposals to do a pviommu that does map every request, but none reached
maturity.
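For a sense of scale, the per-page h-call is tiny, which is why an
in-kernel (and eventually real-mode) handler is attractive; a sketch
with invented structure, not the real implementation:

    #include <stdint.h>

    #define TCE_SHIFT    12              /* 4K iommu pages */
    #define H_SUCCESS    0
    #define H_PARAMETER  (-4)

    struct tce_table {
            uint64_t  window_size;       /* bytes covered by this table */
            uint64_t *entries;
    };

    /* H_PUT_TCE: validate the I/O bus address, write one 64-bit TCE. */
    static long h_put_tce(struct tce_table *tbl, uint64_t ioba, uint64_t tce)
    {
            if (ioba >= tbl->window_size)
                    return H_PARAMETER;  /* outside the DMA window */

            /* tce == 0 unmaps the page; otherwise it carries the real
             * page address plus read/write permission bits. */
            tbl->entries[ioba >> TCE_SHIFT] = tce;
            return H_SUCCESS;
    }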
> >
> > So, you have interrupt redirection? That is, MSI-X table values encode
> > the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt API handles the configuration and the HW ensures that
> the guest cannot do anything other than what it's allowed to.
Okay, this is something that x86 doesn't have. Strange that it can
filter DMA at a fine granularity but not MSI, which is practically the
same thing.
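For reference, the two checks Ben describes amount to something like
this sketch (all names invented; the RID match is the same prefix match
as in the earlier sketch):

    #include <stdbool.h>
    #include <stdint.h>

    /* Same bus-prefix/dev/fn match used for PE# assignment (see the
     * rid_matches() sketch earlier). */
    bool rid_allowed_for_pe(uint16_t rid, int pe);

    /* Check 1: the MSI port only accepts writes from RIDs that match
     * its PE#. */
    static bool msi_write_ok(int port_pe, uint16_t rid)
    {
            return rid_allowed_for_pe(rid, port_pe);
    }

    /* Check 2: the PIC only delivers an interrupt whose recorded PE#
     * matches the PE# of the device that triggered it. */
    static bool irq_fire_ok(int irq_pe, int src_pe)
    {
            return irq_pe == src_pe;
    }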
> >
> > Does the BAR value contain the segment base address? Or is that added
> > later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided into 128
> segments of equal size which can separately be assigned to PE#s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
Okay, and config space virtualization ensures that the guest can't remap?
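Putting numbers on the p7ioc example: 1G of MMIO at PCI address 3G
divided into 128 equal segments gives 8M per segment, and the
allocation rule is simply that BARs of devices in different PEs must
never land in the same segment. A sketch (the constants come from the
example above, the code is invented):

    #include <stdint.h>

    #define MMIO_BASE 0xC0000000ULL           /* 3G, PCI-side address */
    #define MMIO_SIZE 0x40000000ULL           /* 1G window */
    #define NUM_SEGS  128
    #define SEG_SIZE  (MMIO_SIZE / NUM_SEGS)  /* 8M per segment */

    /* Segment a PCI-side address falls into; each segment is
     * separately assignable to a PE#. */
    static unsigned addr_to_segment(uint64_t pci_addr)
    {
            return (pci_addr - MMIO_BASE) / SEG_SIZE;
    }

    /* Firmware's allocation rule: BARs of devices in different PEs
     * must not overlap in segment space. */
    static int bars_share_segment(uint64_t a_base, uint64_t a_size,
                                  uint64_t b_base, uint64_t b_size)
    {
            unsigned a_first = addr_to_segment(a_base);
            unsigned a_last  = addr_to_segment(a_base + a_size - 1);
            unsigned b_first = addr_to_segment(b_base);
            unsigned b_last  = addr_to_segment(b_base + b_size - 1);

            return a_first <= b_last && b_first <= a_last;
    }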
--
error compiling committee.c: too many arguments to function