From: Avi Kivity <avi@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: aik@ozlabs.ru, qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu
Date: Tue, 31 Jul 2012 15:34:38 +0300 [thread overview]
Message-ID: <5017D0DE.7090004@redhat.com> (raw)
In-Reply-To: <1343687399.8073.213.camel@ul30vt>
On 07/31/2012 01:29 AM, Alex Williamson wrote:
>>
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing. So you could
>> call this unconditionally.
>
> I suppose parenting them is the key. I'm counting on memory_region_size
> of zero for an uninitialized, g_malloc0() MemoryRegion.
That's a no-no. We have APIs for a reason. Maybe I'll start encrypting
the contents by xoring with a private variable.
> Initializing
> them just to have a parent so we can unconditionally remove them here
> seems like it's just shifting complexity from one function to another.
> The majority of BARs aren't even implemented, so we'd actually be
> setting up a lot of dummy infrastructure for a slightly cleaner unmap
> function. I'll keep looking at this, but I'm not optimistic there's an
> overall simplification here.
Ok. But use your own bool, don't overload an something from MemoryRegion.
>
>> >> > +
>> >> > + if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > + size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > + if (size) {
>> >> > + memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> >> > + munmap(vdev->msix->mmap, size);
>> >> > + }
>> >> > + }
>> >
>> > And this one potentially unmaps the overlap after the vector table if
>> > there's any space for one.
>> >
>> >> Are the three size checks needed? Everything should work without them
>> >> from the memory core point of view.
>> >
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>>
>> NULL isn't the problem (well some kernels protect against mmaping NULL
>> to avoid kernel exploits), but it seems the kernel doesn't like a zero
>> length.
>
> in mm/mmap.c:do_munmap() I see:
>
> if ((len = PAGE_ALIGN(len)) == 0)
> return -EINVAL;
>
> Before anything scary happens, so that should be ok. It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests because the subregions are unparented though.
Yeah.
>
>> >> > +
>> >> > + /*
>> >> > + * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > + * potentially insert a direct-mapped subregion before and after it.
>> >> > + */
>> >>
>> >> This splitting is what the memory core really enjoys. You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> >
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
>> > reasons. It might be worthwhile to someday make VFIO insert an
>> > anonymous page over the MSI-X table to allow this, but it didn't look
>> > trivial for my novice mm abilities. Easy to add a flag from the VFIO
>> > kernel structure where we learn about this BAR if we add it in the
>> > future.
>>
>> I meant due it purely in qemu. Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region. The regions seen by the vfio listener will be the same.
>
> Sure, that's what KVM device assignment does, but it requires being able
> to mmap the whole BAR, including an MSI-X table. The VFIO kernel side
> can't assume userspace isn't malicious so it has to prevent this.
I wonder whether it should prevent the mmap(), or let it though and just
SIGBUS on accesses.
>> >
>> > This is actually kind of complicated. Opening /dev/vfio/vfio gives us
>> > an instance of a container in the kernel. A group can only be attached
>> > to one container. So whoever calls us with passed fds needs to track
>> > this very carefully. This is also why I've dropped any kind of shared
>> > IOMMU option to give us a hint whether to try to cram everything in the
>> > same container (~= iommu domain). It's too easy to pass conflicting
>> > info to share a container for one device, but not another... yet they
>> > may be in the same group. I'll work on the fd passing though and try to
>> > come up with a reasonable model.
>>
>> I didn't really follow the container stuff so I can't comment here. But
>> suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu group
>> fd in the managment tool)?
>
> Nope.
>
> containerfd = open(/dev/vfio/vfio)
> groupfd = open(/dev/vfio/$GROUPID)
> devicefd = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
>
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and device cannot be accessed without iommu
> protection. Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it, that let's us initialize the
> iommu, which allows us to get the device fd. At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize. Doing that
> makes adding another device tenuous. It's not impossible, but VFIO is
> design such that /dev/vfio/vfio is completely harmless on it's own, safe
> for mode 0666 access, just like /dev/kvm. The groupfd is the important
> access point, so maybe it's sufficient that libvirt could pass only that
> and let qemu open /dev/vfio/vfio on it's own. The only problem then is
> that libvirt needs to pass the same groupfd for each device that gets
> assigned within a group.
What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu. I don't
see an issue with libvirt keeping tracks of groups; libvirt is supposed
to be doing the host-side management anyway. But I'm not familiar with
the API so I guess it can't be done. Maybe an extension?
>
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > + EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > + struct VFIODevice *vdev; /* back pointer to device */
>> >> > + int vector; /* the vector number for this element */
>> >> > + int virq; /* KVM irqchip route for Qemu bypass */
>> >>
>> >> This calls for an abstraction (don't we have a cache where we look those
>> >> up?)
>> >
>> > I haven't see one, pointer? I tried to follow vhost's lead here.
>>
>> See kvm_irqchip_send_msi(). But this isn't integrated with irqfd yet.
>
> Right, the irqfd is what we're really after.
Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm) itself.
>
>> >> > + bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > + int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > + struct {
>> >> > + /* enable abstraction to support various iommu backends */
>> >> > + union {
>> >> > + MemoryListener listener; /* Used by type1 iommu */
>> >> > + };
>> >>
>> >> The usual was is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> >
>> > Yep, that would work too. It gets a bit more complicated that way
>> > though because we have to know when the container is allocated what type
>> > it's going to be. This way we can step though possible iommu types and
>> > support the right one. Eventually there may be more than one type
>> > supported on the same platform (ex. one that enables PRI). Do-able, but
>> > I'm not sure it's worth it at this point.
>>
>> An alternative alternative is to put a pointer to an abstract type here,
>> then you can defer the decision on the concrete type later. But I agree
>> it's not worth it at this point. Maybe just drop the union and decide
>> later when a second iommu type is added.
>
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer. Alexey's proposed POWER
> support for VFIO already makes use of the union, so it seems like a
> sufficient solution for now. We'll have to re-evaluate if it's getting
> unwieldy after we get a few though.
Ok.
--
error compiling committee.c: too many arguments to function
next prev parent reply other threads:[~2012-07-31 12:34 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-07-25 17:03 [Qemu-devel] [RFC PATCH] vfio: VFIO PCI driver for Qemu Alex Williamson
2012-07-25 19:30 ` Avi Kivity
2012-07-25 19:53 ` Alex Williamson
2012-07-26 8:35 ` Avi Kivity
2012-07-26 14:56 ` Alex Williamson
2012-07-26 15:59 ` Avi Kivity
2012-07-26 16:33 ` Alex Williamson
2012-07-26 16:40 ` Avi Kivity
2012-07-26 19:11 ` Alex Williamson
2012-07-26 16:06 ` Avi Kivity
2012-07-26 16:40 ` Alex Williamson
2012-07-26 16:47 ` Avi Kivity
2012-07-26 15:09 ` Alex Williamson
2012-07-26 16:34 ` Avi Kivity
2012-07-26 17:40 ` Alex Williamson
2012-07-29 13:47 ` Avi Kivity
2012-07-30 22:29 ` Alex Williamson
2012-07-31 12:34 ` Avi Kivity [this message]
2012-07-31 16:56 ` Alex Williamson
2012-07-27 19:22 ` Blue Swirl
2012-07-27 20:28 ` Alex Williamson
2012-07-28 2:55 ` Alexey Kardashevskiy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5017D0DE.7090004@redhat.com \
--to=avi@redhat.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=kvm@vger.kernel.org \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).