From: Alex Williamson <alex.williamson@redhat.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: aafabbri@cisco.com, aik@au1.ibm.com, kvm@vger.kernel.org,
pmac@au1.ibm.com, qemu-devel@nongnu.org, joerg.roedel@amd.com,
agraf@suse.de, dwg@au1.ibm.com, chrisw@sous-sol.org,
B08248@freescale.com, iommu@lists.linux-foundation.org,
avi@redhat.com, linux-pci@vger.kernel.org, B07421@freescale.com,
benve@cisco.com
Subject: Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework
Date: Thu, 17 Nov 2011 13:22:17 -0700 [thread overview]
Message-ID: <1321561337.2633.40.camel@x201.home> (raw)
In-Reply-To: <20111116165227.GB2793@phenom.dumpdata.com>
On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
<snip>
> > > > +
> > > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > > +the desired region (0 based index). Note that devices may implement zero
> > > >
> > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > > +mapping).
> > >
> > > Huh?
> >
> > PCI has the following static mapping:
> >
> > enum {
> > VFIO_PCI_BAR0_REGION_INDEX,
> > VFIO_PCI_BAR1_REGION_INDEX,
> > VFIO_PCI_BAR2_REGION_INDEX,
> > VFIO_PCI_BAR3_REGION_INDEX,
> > VFIO_PCI_BAR4_REGION_INDEX,
> > VFIO_PCI_BAR5_REGION_INDEX,
> > VFIO_PCI_ROM_REGION_INDEX,
> > VFIO_PCI_CONFIG_REGION_INDEX,
> > VFIO_PCI_NUM_REGIONS
> > };
> >
> > So 8 regions are always reported regardless of whether the device
> > implements all the BARs and the ROM. Then we have a fixed bar:index
> > mapping so we don't have to create a region_info field to describe the
> > bar number for the index.
>
> OK. Is that a problem if the real device actually has a zero sized BAR?
> Or is zero sized BAR in PCI spec equal to "disabled, not in use" ? Just
> wondering whether (-1ULL) should be used instead? (Which seems the case
> in QEMU code).
Yes, PCI spec defines that unimplemented BARs are hardwired to zero, so
the sizing operation returns zero for the size.
<snip>
> > > > +struct vfio_irq_info {
> > > > + __u32 len; /* length of structure */
> > > > + __u32 index; /* IRQ number */
> > > > + __u32 count; /* number of individual IRQs */
> > > > + __u64 flags;
> > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)
> > > > +};
> > > > +
> > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > > +type to index mapping).
> > >
> > > I am not really sure what that means.
> >
> > This is so PCI can expose:
> >
> > enum {
> > VFIO_PCI_INTX_IRQ_INDEX,
> > VFIO_PCI_MSI_IRQ_INDEX,
> > VFIO_PCI_MSIX_IRQ_INDEX,
> > VFIO_PCI_NUM_IRQS
> > };
> >
> > So like regions it always exposes 3 IRQ indexes where count=0 if the
> > device doesn't actually support that type of interrupt. I just want to
> > spell out that bus drivers have this kind of flexibility.
>
> I think you should change the comment that says 'IRQ number', as the
> first thing that comes in my mind is 'GSI' or MSI/MSI-x vector.
> Perhaps '/* index to be used with return value from GET_NUM_IRQS ioctl.
> Order of structures can be unsorted. */
Ah, yes. I see the confusion. They can't really be unsorted though,
the user needs some point of reference. For PCI they will be strictly
ordered. For Device Tree, I assume there will be a path referencing the
index. I'll update the doc to clarify.
<snip>
> > > > +
> > > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > > +on the host. This prevents an unresponsive userspace driver from
> > > > +continuing to interrupt the host system. After servicing the interrupt,
> > > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level
> > > > +triggered interrupts implicitly have a count of 1 per index.
> > >
> > > So they are enabled automatically? Meaning you don't even hav to do
> > > SET_IRQ_EVENTFDS b/c the count is set to 1?
> >
> > I suppose that should be "no more than 1 per index" (ie. PCI would
> > report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
> > support INTx). I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
> > which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
> > which does the enabling/disabling. All interrupts are disabled by
> > default because userspace needs to give us a way to signal them via
> > eventfds. It will be device dependent whether multiple index can be
> > enabled simultaneously. Hmm, is that another flag on the irq_info
> > struct or do we expect drivers to implicitly have that kind of
> > knowledge?
>
> Right, that was what I was wondering. Not sure how the PowerPC
> world works with this.
On second thought, I think an exclusive flag isn't appropriate. VFIO is
not meant to abstract the device to the level that a user could write a
generic "vfio driver". The user will always need to understand the type
of device, VFIO just provides the conduit to make use of it. There's
too much left undefined with a simplistic exclusive flag.
<snip>
> > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > >
> > > So only level triggered? Hmm, how do I know whether the device is
> > > level or edge? Or is that edge (MSI) can also be unmaked using the
> > > eventfs
> >
> > Yes, only for level. Isn't a device going to know what type of
> > interrupt it uses? MSI masking is PCI specific, not handled by this.
>
> I certainly hope it knows, but you know buggy drivers do exist.
>
> What would be the return value if somebody tried to unmask an edge one?
> Should that be documented here? -ENOSPEC?
I would assume EINVAL or EFAULT since the user is providing an invalid
argument/bad address.
> > > > +
> > > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int)
> > > > +
> > > > +When supported, as indicated by the device flags, reset the device.
> > > > +
> > > > +#define VFIO_DEVICE_RESET _IO(';', 116)
> > >
> > > Does it disable the 'count'? Err, does it disable the IRQ on the
> > > device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> > > to set new eventfds? Or does it re-use the eventfds and the device
> > > is enabled after this?
> >
> > It doesn't affect the interrupt programming. Should it?
>
> I would hope not, but I am trying to think of ways one could screw this up.
> Perhaps just saying that - "No need to call VFIO_DEVICE_SET_IRQ_EVENTFDS
> as the kernel (and the device) will retain the interrupt.".
Ok, I added some words around this in the doc.
> .. snip..
> > > I am not really sure what this section purpose is? Could this be part
> > > of the header file or the code? It does not look to be part of the
> > > ioctl API?
> >
> > We've passed into the "VFIO bus driver API" section of the document, to
> > explain the interaction between vfio-core and vfio bus drivers.
>
> Perhaps a different file?
The entire file is ~300 lines. Seems excessive to split.
<snip>
> > > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > > > + struct vfio_device *device)
> > > > +{
> > > > + BUG_ON(!iommu->domain && device->attached);
> > >
> > > Whoa. Heavy hammer there.
> > >
> > > Perhaps WARN_ON as you do check it later on.
> >
> > I think it's warranted, internal consistency is broken if we have a
> > device that thinks it's attached to an iommu domain that doesn't exist.
> > It should, of course, never happen and this isn't a performance path.
> >
> > > > +
> > > > + if (!iommu->domain || !device->attached)
> > > > + return;
>
> Well, the deal is that you BUG_ON earlier, but you check for it here.
> But the BUG_ON will stop execution , so the check 'if ..' is actually
> not needed.
The BUG_ON is a subtly different check:
domain | attached
-------+---------
0 | 0 Nothing to do
0 | 1 <--- BUG_ON, we're broken
1 | 0 Nothing to do
1 | 1 Do stuff
Writing out the truth table, I see now I could just make this:
if (!attached) {return;}
since the BUG_ON takes care of the other case.
The reason for the laziness of allowing this to simply return is that if
we hit an error attaching an individual device within a group, we just
push the whole group back through __vfio_iommu_detach_group(), so some
devices may have never been attached.
> > > > +
> > > > + iommu_detach_device(iommu->domain, device->dev);
> > > > + device->attached = false;
> > > > +}
> > > > +
> > > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > > > + struct vfio_group *group)
> > > > +{
> > > > + struct list_head *pos;
> > > > +
> > > > + list_for_each(pos, &group->device_list) {
> > > > + struct vfio_device *device;
> > > > +
> > > > + device = list_entry(pos, struct vfio_device, device_next);
> > > > + __vfio_iommu_detach_dev(iommu, device);
> > > > + }
> > > > +}
> > > > +
> > > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > > > + struct vfio_device *device)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + BUG_ON(device->attached);
> > >
> > > How about:
> > >
> > > WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> > > the device again! Tell him/her to stop please.\n");
> >
> > I would almost demote this one to a WARN_ON, but userspace isn't in
> > control of attaching and detaching devices from the iommu. That's a
> > side effect of getting the iommu or device file descriptor. So again,
> > this is an internal consistency check and it should never happen,
> > regardless of userspace.
> >
>
> Ok, then you might want to expand it to
>
> BUG_ON(!device || device->attached);
>
> In case something has gone horribly wrong.
Impressive, that exceeds even my paranoia ;) For that we would have had
to walk the group->device_list and come up with a NULL device pointer.
I think we can assume that won't happen. I've also got this though:
if (!iommu || !iommu->domain)
return -EINVAL;
Which is effectively just being lazy without a good excuse like above.
That could probably be folded into the BUG_ON.
>
> .. snip..
> > > > + group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > > > + device_create(vfio.class, NULL, group->devt,
> > > > + group, "%u", groupid);
> > > > +
> > > > + group->bus = dev->bus;
> > >
> > >
> > > Oh, so that is how the IOMMU iommu_ops get copied! You might
> > > want to mention that - I was not sure where the 'handoff' is
> > > was done to insert a device so that it can do iommu_ops properly.
> > >
> > > Ok, so the time when a device is detected whether it can do
> > > IOMMU is when we try to open it - as that is when iommu_domain_alloc
> > > is called which can return NULL if the iommu_ops is not set.
> > >
> > > So what about devices that don't have an iommu_ops? Say they
> > > are using SWIOTLB? (like the AMD-Vi sometimes does if the
> > > device is not on its list).
> > >
> > > Can we use iommu_present?
> >
> > I'm not sure I'm following your revelation ;) Take a look at the
>
> I am trying to figure out who sets the iommu_ops call on the devices.
The iommu driver registers ops with bus_set_iommu, so then we just need
to pass the bus pointer and iommu_ops figures out the rest. If there's
no iommu_ops for a device or the iommu_ops doesn't implement the
device_group callback, it gets skipped by vfio and therefore won't be
usable by this interface.
> > pointer to iommu_device_group I pasted above, or these:
> >
> > https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
> > https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
> > https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e
> >
> > That call includes an iommu_present() check, so if there's no iommu or
> > the iommu can't provide a groupid, the device is skipped over from vfio
> > (can't be used).
> >
> > So the ordering is:
> >
> > - bus driver registers device
> > - if it has an iommu group, add it to the vfio device/group tracking
> >
> > - group gets opened
> > - user gets iommu or device fd results in iommu_domain_alloc
> >
> > Devices without iommu_ops don't get to play in the vfio world.
>
> Right, and I think the answer of which devices get iommu_ops is done via
> bus_set_iommu.
Exactly.
> (Thinking in long-term of what would be required to make this work
> with Xen and it sounds like I will need to implement a Xen IOMMU driver)
Yeah, that would make sense. Thanks!
Alex
next prev parent reply other threads:[~2011-11-17 20:22 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-11-03 20:12 [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework Alex Williamson
2011-11-09 4:17 ` Aaron Fabbri
2011-11-09 4:41 ` Alex Williamson
2011-11-09 8:11 ` Christian Benvenuti (benve)
2011-11-09 18:02 ` Alex Williamson
2011-11-09 21:08 ` Christian Benvenuti (benve)
2011-11-09 23:40 ` Alex Williamson
2011-11-10 0:57 ` Christian Benvenuti (benve)
2011-11-11 18:04 ` Alex Williamson
2011-11-11 22:22 ` Christian Benvenuti (benve)
2011-11-14 22:59 ` Alex Williamson
2011-11-15 0:05 ` David Gibson
2011-11-15 0:49 ` Benjamin Herrenschmidt
2011-11-11 17:51 ` Konrad Rzeszutek Wilk
2011-11-11 22:10 ` Alex Williamson
2011-11-15 0:00 ` David Gibson
2011-11-16 16:52 ` Konrad Rzeszutek Wilk
2011-11-17 20:22 ` Alex Williamson [this message]
2011-11-17 20:56 ` Scott Wood
2011-11-16 17:47 ` Scott Wood
2011-11-17 20:52 ` Alex Williamson
2011-11-12 0:14 ` Scott Wood
2011-11-14 20:54 ` Alex Williamson
2011-11-14 21:46 ` Alex Williamson
2011-11-14 22:26 ` Scott Wood
2011-11-14 22:48 ` Alexander Graf
2011-11-15 2:29 ` Alex Williamson
2011-11-15 6:34 ` David Gibson
2011-11-15 18:01 ` Alex Williamson
2011-11-17 0:02 ` David Gibson
2011-11-18 20:32 ` Alex Williamson
2011-11-18 21:09 ` Scott Wood
2011-11-22 19:16 ` Alex Williamson
2011-11-22 20:00 ` Scott Wood
2011-11-22 21:28 ` Alex Williamson
2011-11-21 2:47 ` David Gibson
2011-11-22 18:22 ` Alex Williamson
2011-11-15 20:10 ` Scott Wood
2011-11-15 21:40 ` Aaron Fabbri
2011-11-15 22:29 ` Scott Wood
2011-11-16 23:34 ` Alex Williamson
2011-11-29 1:52 ` Alexey Kardashevskiy
2011-11-29 2:01 ` Alexey Kardashevskiy
2011-11-29 2:11 ` Alexey Kardashevskiy
2011-11-29 3:54 ` Alex Williamson
2011-11-29 19:26 ` Alex Williamson
2011-11-29 23:20 ` Stuart Yoder
2011-11-29 23:44 ` Alex Williamson
2011-11-30 15:41 ` Stuart Yoder
2011-11-30 16:58 ` Alex Williamson
2011-12-01 20:58 ` Stuart Yoder
2011-12-01 21:25 ` Alex Williamson
2011-12-02 14:40 ` Stuart Yoder
2011-12-02 18:11 ` Bhushan Bharat-R65777
2011-12-02 18:27 ` Scott Wood
2011-12-02 18:35 ` Bhushan Bharat-R65777
2011-12-02 18:45 ` Bhushan Bharat-R65777
2011-12-02 18:52 ` Scott Wood
2011-12-02 18:21 ` Scott Wood
2011-11-29 3:46 ` Alex Williamson
2011-11-29 4:34 ` Alexey Kardashevskiy
2011-11-29 5:48 ` Alex Williamson
2011-12-02 5:06 ` Alexey Kardashevskiy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1321561337.2633.40.camel@x201.home \
--to=alex.williamson@redhat.com \
--cc=B07421@freescale.com \
--cc=B08248@freescale.com \
--cc=aafabbri@cisco.com \
--cc=agraf@suse.de \
--cc=aik@au1.ibm.com \
--cc=avi@redhat.com \
--cc=benve@cisco.com \
--cc=chrisw@sous-sol.org \
--cc=dwg@au1.ibm.com \
--cc=iommu@lists.linux-foundation.org \
--cc=joerg.roedel@amd.com \
--cc=konrad.wilk@oracle.com \
--cc=kvm@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=pmac@au1.ibm.com \
--cc=qemu-devel@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).