* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-26 18:34 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-26 18:34 UTC (permalink / raw)
To: Alexander Graf
Cc: kvm@vger.kernel.org, Stuart Yoder, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, avi@redhat.com, Scott Wood, David Gibson
On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
>
> > On Fri, Sep 09, 2011 at 08:11:54AM -0500, Stuart Yoder wrote:
> >> Based on the discussions over the last couple of weeks
> >> I have updated the device fd file layout proposal and
> >> tried to specify it a bit more formally.
> >>
> >> ===============================================================
> >>
> >> 1. Overview
> >>
> >> This specification describes the layout of device files
> >> used in the context of vfio, which gives user space
> >> direct access to I/O devices that have been bound to
> >> vfio.
> >>
> >> When a device fd is opened and read, offset 0x0 contains
> >> a fixed sized header followed by a number of variable length
> >> records that describe different characteristics
> >> of the device-- addressable regions, interrupts, etc.
> >>
> >> 0x0 +-------------+-------------+
> >> | magic | u32 // identifies this as a vfio
> >> device file
> >> +---------------------------+ and identifies the type of bus
> >> | version | u32 // specifies the version of this
> >> +---------------------------+
> >> | flags | u32 // encodes any flags
> >> +---------------------------+
> >> | dev info record 0 |
> >> | type | u32 // type of record
> >> | rec_len | u32 // length in bytes of record
> >> | | (including record header)
> >> | flags | u32 // type specific flags
> >> | ...content... | // record content, which could
> >> +---------------------------+ // include sub-records
> >> | dev info record 1 |
> >> +---------------------------+
> >> | dev info record N |
> >> +---------------------------+
> >
> > I really should have chimed in on this earlier, but I've been very
> > busy.
> >
> > Um, not to put too fine a point on it, this is madness.
> >
> > Yes, it's very flexible and can thereby cover a very wide range of
> > cases. But it's much, much too complex. Userspace has to parse a
> > complex, multilayered data structure, with variable length elements
> > just to get an address at which to do IO. I can pretty much guarantee
> > that if we went with this, most userspace programs using this
> > interface would just ignore this metadata and directly map the
> > offsets at which they happen to know the kernel will put things for
> > the type of device they care about.
> >
> > _At least_ for PCI, I think the original VFIO layout of each BAR at a
> > fixed, well known offset is much better. Despite its limitations,
> > just advertising a "device type" ID which describes one of a few fixed
> > layouts would be preferable to this. I'm still hoping, that we can do
> > a bit better than that. But we should try really hard to at the very
> > least force the metadata into a simple array of resources each with a
> > fixed size record describing it, even if it means some space wastage
> > with occasionally-used fields. Anything more complex than that and
> > userspace is just never going to use it properly.
>
> We will have 2 different types of user space. One wants to be as
> generic as possible and needs all this dynamic information. QEMU would
> fall into this category.
>
> The other one is device specific drivers in user space. Here
> hardcoding might make sense.
>
> For the generic interface, we need something that us as verbose as
> possible and lets us enumerate all the device properties, so we can
> properly map and forward them to the guest.
>
> However, nothing keeps us from mapping BARs always at static offsets
> into the file. If you don't need the generic info, then you don't need
> it.
>
> Also, if you can come up with an interface that does not have variable
> length descriptors but is still able to export all the required
> generic information, please send a proposal to the list :)
>
Hi,
The other obvious possibility is a pure ioctl interface. To match what
this proposal is trying to describe, plus the runtime interfaces, we'd
need something like:
/* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
#define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
/* Return number of mmio/iop/config regions.
* For PCI this is always 8 (BAR0-5 + ROM + Config) */
#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
/* Return length for region index (may be zero) */
#define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
/* Return flags for region index
* :0 - mmap'able, :1 - read-only, 63:2 - reserved */
#define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
/* Return file offset for region index */
#define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
/* Return physical address for region index - not implemented for PCI */
#define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
/* Return number of IRQs (Not including MSI/MSI-X for PCI) */
#define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
/* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
/* Unmask IRQ index */
#define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
/* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
/* Return the device tree path for type/index into the user
* allocated buffer */
struct dtpath {
u32 type; (0 = region, 1 = IRQ)
u32 index;
u32 buf_len;
char *buf;
};
#define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
/* Return the device tree index for type/index */
struct dtindex {
u32 type; (0 = region, 1 = IRQ)
u32 index;
u32 prop_type;
u32 prop_index;
};
#define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
/* Reset the device */
#define VFIO_DEVICE_RESET _IO(, ,)
/* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
#define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
#define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
Hope that covers it. Something I prefer about this interface is that
everything can easily be generated on the fly, whereas reading out a
table from the device means we really need to have that table somewhere
in kernel memory to easily support reading random offsets. Thoughts?
Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 18:34 ` Alex Williamson
@ 2011-09-26 20:03 ` Stuart Yoder
-1 siblings, 0 replies; 49+ messages in thread
From: Stuart Yoder @ 2011-09-26 20:03 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexander Graf, David Gibson, Scott Wood, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
>
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
>
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> /* Return length for region index (may be zero) */
> #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
>
> /* Return flags for region index
> * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
>
> /* Return file offset for region index */
> #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
>
> /* Return physical address for region index - not implemented for PCI */
> #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
>
>
>
> /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
>
> /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
>
> /* Unmask IRQ index */
> #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
>
> /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
>
>
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> /* Return the device tree index for type/index */
> struct dtindex {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 prop_type;
> u32 prop_index;
> };
> #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
>
>
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
>
>
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Hope that covers it. Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets. Thoughts?
I think this could work, but I'm not sure it makes the problem David
had any better-- you substitute the complexity of parsing the
variable length regions with invoking a set of APIs.
Stuart
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-26 20:03 ` Stuart Yoder
0 siblings, 0 replies; 49+ messages in thread
From: Stuart Yoder @ 2011-09-26 20:03 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com, Scott Wood,
David Gibson
>
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
>
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> /* Return length for region index (may be zero) */
> #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
>
> /* Return flags for region index
> * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
>
> /* Return file offset for region index */
> #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
>
> /* Return physical address for region index - not implemented for PCI */
> #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
>
>
>
> /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
>
> /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
>
> /* Unmask IRQ index */
> #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
>
> /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
>
>
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> /* Return the device tree index for type/index */
> struct dtindex {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 prop_type;
> u32 prop_index;
> };
> #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
>
>
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
>
>
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Hope that covers it. Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets. Thoughts?
I think this could work, but I'm not sure it makes the problem David
had any better-- you substitute the complexity of parsing the
variable length regions with invoking a set of APIs.
Stuart
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 20:03 ` Stuart Yoder
@ 2011-09-26 20:42 ` Alex Williamson
-1 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-26 20:42 UTC (permalink / raw)
To: Stuart Yoder
Cc: Alexander Graf, David Gibson, Scott Wood, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On Mon, 2011-09-26 at 15:03 -0500, Stuart Yoder wrote:
> >
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> >
> > /* Return length for region index (may be zero) */
> > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> >
> > /* Return flags for region index
> > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> >
> > /* Return file offset for region index */
> > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
> >
> > /* Return physical address for region index - not implemented for PCI */
> > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> >
> >
> >
> > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> >
> > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> >
> > /* Unmask IRQ index */
> > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> >
> > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> >
> >
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> >
> > /* Return the device tree index for type/index */
> > struct dtindex {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 prop_type;
> > u32 prop_index;
> > };
> > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
> >
> >
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
> >
> >
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it. Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> I think this could work, but I'm not sure it makes the problem David
> had any better-- you substitute the complexity of parsing the
> variable length regions with invoking a set of APIs.
I read it as mostly the complexity problem, which I think this makes
fairly trivial. It also eliminates a lot of complexity on the kernel
side of supporting the table driven interface. Thanks,
Alex
if (!(GET_FLAGS & PCI))
return error;
if (GET_NUM_REGIONS < 8)
return error;
GET_REGION_LEN(7)
GET_REGION_OFFSET(7) // setup config access
for (i = 0; i < 6; i++) {
if (GET_REGION_LEN(i)) {
GET_REGION_OFFSET(i)
setup mmap/rw
}
}
if (GET_REGION_LEN(6)) {
GET_REGION_OFFSET(6)
setup ROM access
}
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-26 20:42 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-26 20:42 UTC (permalink / raw)
To: Stuart Yoder
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com, Scott Wood,
David Gibson
On Mon, 2011-09-26 at 15:03 -0500, Stuart Yoder wrote:
> >
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> >
> > /* Return length for region index (may be zero) */
> > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> >
> > /* Return flags for region index
> > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> >
> > /* Return file offset for region index */
> > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
> >
> > /* Return physical address for region index - not implemented for PCI */
> > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> >
> >
> >
> > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> >
> > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> >
> > /* Unmask IRQ index */
> > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> >
> > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> >
> >
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> >
> > /* Return the device tree index for type/index */
> > struct dtindex {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 prop_type;
> > u32 prop_index;
> > };
> > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
> >
> >
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
> >
> >
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it. Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> I think this could work, but I'm not sure it makes the problem David
> had any better-- you substitute the complexity of parsing the
> variable length regions with invoking a set of APIs.
I read it as mostly the complexity problem, which I think this makes
fairly trivial. It also eliminates a lot of complexity on the kernel
side of supporting the table driven interface. Thanks,
Alex
if (!(GET_FLAGS & PCI))
return error;
if (GET_NUM_REGIONS < 8)
return error;
GET_REGION_LEN(7)
GET_REGION_OFFSET(7) // setup config access
for (i = 0; i < 6; i++) {
if (GET_REGION_LEN(i)) {
GET_REGION_OFFSET(i)
setup mmap/rw
}
}
if (GET_REGION_LEN(6)) {
GET_REGION_OFFSET(6)
setup ROM access
}
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 18:34 ` Alex Williamson
@ 2011-09-26 23:59 ` Scott Wood
-1 siblings, 0 replies; 49+ messages in thread
From: Scott Wood @ 2011-09-26 23:59 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexander Graf, David Gibson, Stuart Yoder, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On 09/26/2011 01:34 PM, Alex Williamson wrote:
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
>
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
How do you handle BARs that a particular device doesn't use? Zero-length?
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
So now the user needs to guess a buffer length in advance... and what
happens if it's too small?
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
What generic way do we have to do this? We should probably have a way
to determine whether it's possible, without actually asking to do it.
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Hope that covers it.
It could be done this way, but I predict that the code (both kernel and
user side) will be larger. Maybe not much more complex, but more
boilerplate.
How will you manage extensions to the interface? With the table it's
simple, you see a new (sub)record type and you either understand it or
you skip it. With ioctls you need to call every information-gathering
ioctl you know and care about (or are told is present via some feature
advertisement), and see if there's anything there.
> Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets. Thoughts?
The table should not be particularly large, and you'll need to keep the
information around in some form regardless. Maybe in the PCI case you
could produce it dynamically (though I probably wouldn't), but it really
wouldn't make sense in the device tree case.
You also lose the ability to easily have a human look at the hexdump for
debugging; you'll need a special "lsvfio" tool. You might want one
anyway to pretty-print the info, but with ioctls it's mandatory.
-Scott
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-26 23:59 ` Scott Wood
0 siblings, 0 replies; 49+ messages in thread
From: Scott Wood @ 2011-09-26 23:59 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, Stuart Yoder, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com,
David Gibson
On 09/26/2011 01:34 PM, Alex Williamson wrote:
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
>
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
How do you handle BARs that a particular device doesn't use? Zero-length?
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
So now the user needs to guess a buffer length in advance... and what
happens if it's too small?
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
What generic way do we have to do this? We should probably have a way
to determine whether it's possible, without actually asking to do it.
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Hope that covers it.
It could be done this way, but I predict that the code (both kernel and
user side) will be larger. Maybe not much more complex, but more
boilerplate.
How will you manage extensions to the interface? With the table it's
simple, you see a new (sub)record type and you either understand it or
you skip it. With ioctls you need to call every information-gathering
ioctl you know and care about (or are told is present via some feature
advertisement), and see if there's anything there.
> Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets. Thoughts?
The table should not be particularly large, and you'll need to keep the
information around in some form regardless. Maybe in the PCI case you
could produce it dynamically (though I probably wouldn't), but it really
wouldn't make sense in the device tree case.
You also lose the ability to easily have a human look at the hexdump for
debugging; you'll need a special "lsvfio" tool. You might want one
anyway to pretty-print the info, but with ioctls it's mandatory.
-Scott
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 23:59 ` Scott Wood
@ 2011-09-27 0:45 ` Alex Williamson
-1 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-27 0:45 UTC (permalink / raw)
To: Scott Wood
Cc: Alexander Graf, David Gibson, Stuart Yoder, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> How do you handle BARs that a particular device doesn't use? Zero-length?
Yep
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> So now the user needs to guess a buffer length in advance... and what
> happens if it's too small?
-ENOSPC. Call with buf_len = 0 and it could indicate the size.
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
>
> What generic way do we have to do this? We should probably have a way
> to determine whether it's possible, without actually asking to do it.
It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
bit to the device flags to indicate if it's available or we could add a
"probe" arg to the ioctl to either check for existence or do it.
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it.
>
> It could be done this way, but I predict that the code (both kernel and
> user side) will be larger. Maybe not much more complex, but more
> boilerplate.
>
> How will you manage extensions to the interface?
I would assume we'd do something similar to the kvm capabilities checks.
I don't know if that's just bits of GET_FLAGS or a different ioctl.
> With the table it's
> simple, you see a new (sub)record type and you either understand it or
> you skip it. With ioctls you need to call every information-gathering
> ioctl you know and care about (or are told is present via some feature
> advertisement), and see if there's anything there.
I don't really see much difference between the interfaces here. You'd
pick and choose which table entries you care about and pick and choose
ioctls. For one you see it in the table, for the other there's a bit
indicating the capability exists.
> > Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> The table should not be particularly large, and you'll need to keep the
> information around in some form regardless. Maybe in the PCI case you
> could produce it dynamically (though I probably wouldn't), but it really
> wouldn't make sense in the device tree case.
It would be entirely dynamic for PCI, there's no advantage to caching
it. Even for device tree, if you can't fetch it dynamically, you'd have
to duplicate it between an internal data structure and a buffer reading
the table.
> You also lose the ability to easily have a human look at the hexdump for
> debugging; you'll need a special "lsvfio" tool. You might want one
> anyway to pretty-print the info, but with ioctls it's mandatory.
I don't think this alone justifies duplicating the data and making it
difficult to parse on both ends. Chances are we won't need such a tool
for the ioctl interface because it's easier to get it right the first
time ;)
Note that I'm not stuck on this interface, I was just thinking about how
to generate the table last week, it seemed like a pain so I thought I'd
spend a few minutes outlining an ioctl interface... turns out it's not
so bad. Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-27 0:45 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-27 0:45 UTC (permalink / raw)
To: Scott Wood
Cc: kvm@vger.kernel.org, Stuart Yoder, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com,
David Gibson
On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> How do you handle BARs that a particular device doesn't use? Zero-length?
Yep
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> So now the user needs to guess a buffer length in advance... and what
> happens if it's too small?
-ENOSPC. Call with buf_len = 0 and it could indicate the size.
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
>
> What generic way do we have to do this? We should probably have a way
> to determine whether it's possible, without actually asking to do it.
It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
bit to the device flags to indicate if it's available or we could add a
"probe" arg to the ioctl to either check for existence or do it.
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it.
>
> It could be done this way, but I predict that the code (both kernel and
> user side) will be larger. Maybe not much more complex, but more
> boilerplate.
>
> How will you manage extensions to the interface?
I would assume we'd do something similar to the kvm capabilities checks.
I don't know if that's just bits of GET_FLAGS or a different ioctl.
> With the table it's
> simple, you see a new (sub)record type and you either understand it or
> you skip it. With ioctls you need to call every information-gathering
> ioctl you know and care about (or are told is present via some feature
> advertisement), and see if there's anything there.
I don't really see much difference between the interfaces here. You'd
pick and choose which table entries you care about and pick and choose
ioctls. For one you see it in the table, for the other there's a bit
indicating the capability exists.
> > Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> The table should not be particularly large, and you'll need to keep the
> information around in some form regardless. Maybe in the PCI case you
> could produce it dynamically (though I probably wouldn't), but it really
> wouldn't make sense in the device tree case.
It would be entirely dynamic for PCI, there's no advantage to caching
it. Even for device tree, if you can't fetch it dynamically, you'd have
to duplicate it between an internal data structure and a buffer reading
the table.
> You also lose the ability to easily have a human look at the hexdump for
> debugging; you'll need a special "lsvfio" tool. You might want one
> anyway to pretty-print the info, but with ioctls it's mandatory.
I don't think this alone justifies duplicating the data and making it
difficult to parse on both ends. Chances are we won't need such a tool
for the ioctl interface because it's easier to get it right the first
time ;)
Note that I'm not stuck on this interface, I was just thinking about how
to generate the table last week, it seemed like a pain so I thought I'd
spend a few minutes outlining an ioctl interface... turns out it's not
so bad. Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-27 0:45 ` Alex Williamson
@ 2011-09-27 21:28 ` Scott Wood
-1 siblings, 0 replies; 49+ messages in thread
From: Scott Wood @ 2011-09-27 21:28 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, Stuart Yoder, Benjamin Herrenschmidt,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com,
David Gibson
On 09/26/2011 07:45 PM, Alex Williamson wrote:
> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
>> On 09/26/2011 01:34 PM, Alex Williamson wrote:
>>> /* Reset the device */
>>> #define VFIO_DEVICE_RESET _IO(, ,)
>>
>> What generic way do we have to do this? We should probably have a way
>> to determine whether it's possible, without actually asking to do it.
>
> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
> bit to the device flags to indicate if it's available or we could add a
> "probe" arg to the ioctl to either check for existence or do it.
Even with PCI, isn't this only possible if function-level reset is
supported? I think we need a flag.
For devices that can't be reset by the kernel, we'll want the ability to
stop/start DMA acccess through the IOMMU (or other bus-specific means),
separate from whether the fd is open. If a device is assigned to a
partition and that partition gets reset, we'll want to disable DMA
before we re-use the memory, and enable it after the partition has reset
or quiesced the device (which requires the fd to be open).
>>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
>>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
>>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>>>
>>> Hope that covers it.
>>
>> It could be done this way, but I predict that the code (both kernel and
>> user side) will be larger. Maybe not much more complex, but more
>> boilerplate.
>>
>> How will you manage extensions to the interface?
>
> I would assume we'd do something similar to the kvm capabilities checks.
This information is already built into the data-structure approach.
>> The table should not be particularly large, and you'll need to keep the
>> information around in some form regardless. Maybe in the PCI case you
>> could produce it dynamically (though I probably wouldn't), but it really
>> wouldn't make sense in the device tree case.
>
> It would be entirely dynamic for PCI, there's no advantage to caching
> it. Even for device tree, if you can't fetch it dynamically, you'd have
> to duplicate it between an internal data structure and a buffer reading
> the table.
I don't think we'd need to keep the device tree path/index info around
for anything but the table -- but really, this is a minor consideration.
>> You also lose the ability to easily have a human look at the hexdump for
>> debugging; you'll need a special "lsvfio" tool. You might want one
>> anyway to pretty-print the info, but with ioctls it's mandatory.
>
> I don't think this alone justifies duplicating the data and making it
> difficult to parse on both ends. Chances are we won't need such a tool
> for the ioctl interface because it's easier to get it right the first
> time ;)
It's not just useful for getting the code right, but for e.g. sanity
checking that the devices were bound properly. I think such a tool
would be generally useful, no matter what the kernel interface ends up
being. I don't just use lspci to debug the PCI subsystem. :-)
> Note that I'm not stuck on this interface, I was just thinking about how
> to generate the table last week, it seemed like a pain so I thought I'd
> spend a few minutes outlining an ioctl interface... turns out it's not
> so bad. Thanks,
Yeah, it can work either way, as long as the information's there and
there's a way to add new bits of information, or new bus types, down the
road. Mainly a matter of aesthetics between the two.
-Scott
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-27 21:28 ` Scott Wood
0 siblings, 0 replies; 49+ messages in thread
From: Scott Wood @ 2011-09-27 21:28 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt, Stuart Yoder,
Alexander Graf, qemu-devel@nongnu.org, avi@redhat.com,
David Gibson
On 09/26/2011 07:45 PM, Alex Williamson wrote:
> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
>> On 09/26/2011 01:34 PM, Alex Williamson wrote:
>>> /* Reset the device */
>>> #define VFIO_DEVICE_RESET _IO(, ,)
>>
>> What generic way do we have to do this? We should probably have a way
>> to determine whether it's possible, without actually asking to do it.
>
> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
> bit to the device flags to indicate if it's available or we could add a
> "probe" arg to the ioctl to either check for existence or do it.
Even with PCI, isn't this only possible if function-level reset is
supported? I think we need a flag.
For devices that can't be reset by the kernel, we'll want the ability to
stop/start DMA acccess through the IOMMU (or other bus-specific means),
separate from whether the fd is open. If a device is assigned to a
partition and that partition gets reset, we'll want to disable DMA
before we re-use the memory, and enable it after the partition has reset
or quiesced the device (which requires the fd to be open).
>>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
>>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
>>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>>>
>>> Hope that covers it.
>>
>> It could be done this way, but I predict that the code (both kernel and
>> user side) will be larger. Maybe not much more complex, but more
>> boilerplate.
>>
>> How will you manage extensions to the interface?
>
> I would assume we'd do something similar to the kvm capabilities checks.
This information is already built into the data-structure approach.
>> The table should not be particularly large, and you'll need to keep the
>> information around in some form regardless. Maybe in the PCI case you
>> could produce it dynamically (though I probably wouldn't), but it really
>> wouldn't make sense in the device tree case.
>
> It would be entirely dynamic for PCI, there's no advantage to caching
> it. Even for device tree, if you can't fetch it dynamically, you'd have
> to duplicate it between an internal data structure and a buffer reading
> the table.
I don't think we'd need to keep the device tree path/index info around
for anything but the table -- but really, this is a minor consideration.
>> You also lose the ability to easily have a human look at the hexdump for
>> debugging; you'll need a special "lsvfio" tool. You might want one
>> anyway to pretty-print the info, but with ioctls it's mandatory.
>
> I don't think this alone justifies duplicating the data and making it
> difficult to parse on both ends. Chances are we won't need such a tool
> for the ioctl interface because it's easier to get it right the first
> time ;)
It's not just useful for getting the code right, but for e.g. sanity
checking that the devices were bound properly. I think such a tool
would be generally useful, no matter what the kernel interface ends up
being. I don't just use lspci to debug the PCI subsystem. :-)
> Note that I'm not stuck on this interface, I was just thinking about how
> to generate the table last week, it seemed like a pain so I thought I'd
> spend a few minutes outlining an ioctl interface... turns out it's not
> so bad. Thanks,
Yeah, it can work either way, as long as the information's there and
there's a way to add new bits of information, or new bus types, down the
road. Mainly a matter of aesthetics between the two.
-Scott
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-27 21:28 ` Scott Wood
@ 2011-09-28 2:40 ` Alex Williamson
-1 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-28 2:40 UTC (permalink / raw)
To: Scott Wood
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt, Stuart Yoder,
Alexander Graf, qemu-devel@nongnu.org, avi@redhat.com,
David Gibson
On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
> On 09/26/2011 07:45 PM, Alex Williamson wrote:
> > On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
> >> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> >>> /* Reset the device */
> >>> #define VFIO_DEVICE_RESET _IO(, ,)
> >>
> >> What generic way do we have to do this? We should probably have a way
> >> to determine whether it's possible, without actually asking to do it.
> >
> > It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
> > bit to the device flags to indicate if it's available or we could add a
> > "probe" arg to the ioctl to either check for existence or do it.
>
> Even with PCI, isn't this only possible if function-level reset is
> supported?
There are a couple other things we can do if FLR isn't present (D3hot
transition, secondary bus reset, device specific resets are possible).
> I think we need a flag.
Ok, PCI has a pci_probe_reset_function() and pci_reset_function(). I'd
probably mimic those in the vfio device ops. Common vfio code can probe
the reset and set the flag appropriately and we can have a common
VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
> For devices that can't be reset by the kernel, we'll want the ability to
> stop/start DMA acccess through the IOMMU (or other bus-specific means),
> separate from whether the fd is open. If a device is assigned to a
> partition and that partition gets reset, we'll want to disable DMA
> before we re-use the memory, and enable it after the partition has reset
> or quiesced the device (which requires the fd to be open).
Maybe this can be accomplished via an iommu_detach_device() to
temporarily disassociate it from the domain. We could also unmap all
the DMA. Anyway, a few possibilities.
> >>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> >>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> >>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >>>
> >>> Hope that covers it.
> >>
> >> It could be done this way, but I predict that the code (both kernel and
> >> user side) will be larger. Maybe not much more complex, but more
> >> boilerplate.
> >>
> >> How will you manage extensions to the interface?
> >
> > I would assume we'd do something similar to the kvm capabilities checks.
>
> This information is already built into the data-structure approach.
If we define it to be part of the flags, then it's built-in to the ioctl
approach too...
> >> The table should not be particularly large, and you'll need to keep the
> >> information around in some form regardless. Maybe in the PCI case you
> >> could produce it dynamically (though I probably wouldn't), but it really
> >> wouldn't make sense in the device tree case.
> >
> > It would be entirely dynamic for PCI, there's no advantage to caching
> > it. Even for device tree, if you can't fetch it dynamically, you'd have
> > to duplicate it between an internal data structure and a buffer reading
> > the table.
>
> I don't think we'd need to keep the device tree path/index info around
> for anything but the table -- but really, this is a minor consideration.
>
> >> You also lose the ability to easily have a human look at the hexdump for
> >> debugging; you'll need a special "lsvfio" tool. You might want one
> >> anyway to pretty-print the info, but with ioctls it's mandatory.
> >
> > I don't think this alone justifies duplicating the data and making it
> > difficult to parse on both ends. Chances are we won't need such a tool
> > for the ioctl interface because it's easier to get it right the first
> > time ;)
>
> It's not just useful for getting the code right, but for e.g. sanity
> checking that the devices were bound properly. I think such a tool
> would be generally useful, no matter what the kernel interface ends up
> being. I don't just use lspci to debug the PCI subsystem. :-)
This is also a minor consideration. Looking at hexdumps isn't much to
rely on for debugging and if we take the step of writing a tool, it's
not much harder to write for either interface. The table is more akin
to dumping the data, but I feel the ioctl is easier for how a driver
would probably make use of the data (linear vs random access).
> > Note that I'm not stuck on this interface, I was just thinking about how
> > to generate the table last week, it seemed like a pain so I thought I'd
> > spend a few minutes outlining an ioctl interface... turns out it's not
> > so bad. Thanks,
>
> Yeah, it can work either way, as long as the information's there and
> there's a way to add new bits of information, or new bus types, down the
> road. Mainly a matter of aesthetics between the two.
It'd be nice if David would chime back in since he objected to the
table. Does an ioctl interface look better? Alex Graf, any opinions?
Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-28 2:40 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-28 2:40 UTC (permalink / raw)
To: Scott Wood
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt, Stuart Yoder,
Alexander Graf, qemu-devel@nongnu.org, avi@redhat.com,
David Gibson
On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
> On 09/26/2011 07:45 PM, Alex Williamson wrote:
> > On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
> >> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> >>> /* Reset the device */
> >>> #define VFIO_DEVICE_RESET _IO(, ,)
> >>
> >> What generic way do we have to do this? We should probably have a way
> >> to determine whether it's possible, without actually asking to do it.
> >
> > It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
> > bit to the device flags to indicate if it's available or we could add a
> > "probe" arg to the ioctl to either check for existence or do it.
>
> Even with PCI, isn't this only possible if function-level reset is
> supported?
There are a couple other things we can do if FLR isn't present (D3hot
transition, secondary bus reset, device specific resets are possible).
> I think we need a flag.
Ok, PCI has a pci_probe_reset_function() and pci_reset_function(). I'd
probably mimic those in the vfio device ops. Common vfio code can probe
the reset and set the flag appropriately and we can have a common
VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
> For devices that can't be reset by the kernel, we'll want the ability to
> stop/start DMA acccess through the IOMMU (or other bus-specific means),
> separate from whether the fd is open. If a device is assigned to a
> partition and that partition gets reset, we'll want to disable DMA
> before we re-use the memory, and enable it after the partition has reset
> or quiesced the device (which requires the fd to be open).
Maybe this can be accomplished via an iommu_detach_device() to
temporarily disassociate it from the domain. We could also unmap all
the DMA. Anyway, a few possibilities.
> >>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> >>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> >>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >>>
> >>> Hope that covers it.
> >>
> >> It could be done this way, but I predict that the code (both kernel and
> >> user side) will be larger. Maybe not much more complex, but more
> >> boilerplate.
> >>
> >> How will you manage extensions to the interface?
> >
> > I would assume we'd do something similar to the kvm capabilities checks.
>
> This information is already built into the data-structure approach.
If we define it to be part of the flags, then it's built-in to the ioctl
approach too...
> >> The table should not be particularly large, and you'll need to keep the
> >> information around in some form regardless. Maybe in the PCI case you
> >> could produce it dynamically (though I probably wouldn't), but it really
> >> wouldn't make sense in the device tree case.
> >
> > It would be entirely dynamic for PCI, there's no advantage to caching
> > it. Even for device tree, if you can't fetch it dynamically, you'd have
> > to duplicate it between an internal data structure and a buffer reading
> > the table.
>
> I don't think we'd need to keep the device tree path/index info around
> for anything but the table -- but really, this is a minor consideration.
>
> >> You also lose the ability to easily have a human look at the hexdump for
> >> debugging; you'll need a special "lsvfio" tool. You might want one
> >> anyway to pretty-print the info, but with ioctls it's mandatory.
> >
> > I don't think this alone justifies duplicating the data and making it
> > difficult to parse on both ends. Chances are we won't need such a tool
> > for the ioctl interface because it's easier to get it right the first
> > time ;)
>
> It's not just useful for getting the code right, but for e.g. sanity
> checking that the devices were bound properly. I think such a tool
> would be generally useful, no matter what the kernel interface ends up
> being. I don't just use lspci to debug the PCI subsystem. :-)
This is also a minor consideration. Looking at hexdumps isn't much to
rely on for debugging and if we take the step of writing a tool, it's
not much harder to write for either interface. The table is more akin
to dumping the data, but I feel the ioctl is easier for how a driver
would probably make use of the data (linear vs random access).
> > Note that I'm not stuck on this interface, I was just thinking about how
> > to generate the table last week, it seemed like a pain so I thought I'd
> > spend a few minutes outlining an ioctl interface... turns out it's not
> > so bad. Thanks,
>
> Yeah, it can work either way, as long as the information's there and
> there's a way to add new bits of information, or new bus types, down the
> road. Mainly a matter of aesthetics between the two.
It'd be nice if David would chime back in since he objected to the
table. Does an ioctl interface look better? Alex Graf, any opinions?
Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-28 2:40 ` [Qemu-devel] " Alex Williamson
@ 2011-09-28 8:58 ` Alexander Graf
-1 siblings, 0 replies; 49+ messages in thread
From: Alexander Graf @ 2011-09-28 8:58 UTC (permalink / raw)
To: Alex Williamson
Cc: Scott Wood, kvm@vger.kernel.org, Stuart Yoder,
Benjamin Herrenschmidt, qemu-devel@nongnu.org, avi@redhat.com,
David Gibson
On 28.09.2011, at 04:40, Alex Williamson wrote:
> On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
>> On 09/26/2011 07:45 PM, Alex Williamson wrote:
>>> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
>>>> On 09/26/2011 01:34 PM, Alex Williamson wrote:
>>>>> /* Reset the device */
>>>>> #define VFIO_DEVICE_RESET _IO(, ,)
>>>>
>>>> What generic way do we have to do this? We should probably have a way
>>>> to determine whether it's possible, without actually asking to do it.
>>>
>>> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
>>> bit to the device flags to indicate if it's available or we could add a
>>> "probe" arg to the ioctl to either check for existence or do it.
>>
>> Even with PCI, isn't this only possible if function-level reset is
>> supported?
>
> There are a couple other things we can do if FLR isn't present (D3hot
> transition, secondary bus reset, device specific resets are possible).
>
>> I think we need a flag.
>
> Ok, PCI has a pci_probe_reset_function() and pci_reset_function(). I'd
> probably mimic those in the vfio device ops. Common vfio code can probe
> the reset and set the flag appropriately and we can have a common
> VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
>
>> For devices that can't be reset by the kernel, we'll want the ability to
>> stop/start DMA acccess through the IOMMU (or other bus-specific means),
>> separate from whether the fd is open. If a device is assigned to a
>> partition and that partition gets reset, we'll want to disable DMA
>> before we re-use the memory, and enable it after the partition has reset
>> or quiesced the device (which requires the fd to be open).
>
> Maybe this can be accomplished via an iommu_detach_device() to
> temporarily disassociate it from the domain. We could also unmap all
> the DMA. Anyway, a few possibilities.
>
>>>>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
>>>>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
>>>>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>>>>>
>>>>> Hope that covers it.
>>>>
>>>> It could be done this way, but I predict that the code (both kernel and
>>>> user side) will be larger. Maybe not much more complex, but more
>>>> boilerplate.
>>>>
>>>> How will you manage extensions to the interface?
>>>
>>> I would assume we'd do something similar to the kvm capabilities checks.
>>
>> This information is already built into the data-structure approach.
>
> If we define it to be part of the flags, then it's built-in to the ioctl
> approach too...
>
>>>> The table should not be particularly large, and you'll need to keep the
>>>> information around in some form regardless. Maybe in the PCI case you
>>>> could produce it dynamically (though I probably wouldn't), but it really
>>>> wouldn't make sense in the device tree case.
>>>
>>> It would be entirely dynamic for PCI, there's no advantage to caching
>>> it. Even for device tree, if you can't fetch it dynamically, you'd have
>>> to duplicate it between an internal data structure and a buffer reading
>>> the table.
>>
>> I don't think we'd need to keep the device tree path/index info around
>> for anything but the table -- but really, this is a minor consideration.
>>
>>>> You also lose the ability to easily have a human look at the hexdump for
>>>> debugging; you'll need a special "lsvfio" tool. You might want one
>>>> anyway to pretty-print the info, but with ioctls it's mandatory.
>>>
>>> I don't think this alone justifies duplicating the data and making it
>>> difficult to parse on both ends. Chances are we won't need such a tool
>>> for the ioctl interface because it's easier to get it right the first
>>> time ;)
>>
>> It's not just useful for getting the code right, but for e.g. sanity
>> checking that the devices were bound properly. I think such a tool
>> would be generally useful, no matter what the kernel interface ends up
>> being. I don't just use lspci to debug the PCI subsystem. :-)
>
> This is also a minor consideration. Looking at hexdumps isn't much to
> rely on for debugging and if we take the step of writing a tool, it's
> not much harder to write for either interface. The table is more akin
> to dumping the data, but I feel the ioctl is easier for how a driver
> would probably make use of the data (linear vs random access).
>
>>> Note that I'm not stuck on this interface, I was just thinking about how
>>> to generate the table last week, it seemed like a pain so I thought I'd
>>> spend a few minutes outlining an ioctl interface... turns out it's not
>>> so bad. Thanks,
>>
>> Yeah, it can work either way, as long as the information's there and
>> there's a way to add new bits of information, or new bus types, down the
>> road. Mainly a matter of aesthetics between the two.
>
> It'd be nice if David would chime back in since he objected to the
> table. Does an ioctl interface look better? Alex Graf, any opinions?
I'm honestly pretty indifferent on ioctl vs. linear read. I got the impression that people dislike ioctls for whatever reason, so we went ahead and did the design based on read(). With KVM, ioctls are a constant pain to extend, but so are the constant sized fields here.
Whatever you do, please introduce a "flags" field to every struct you use and add some padding at the end, so it can possibly be extended.
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-28 8:58 ` Alexander Graf
0 siblings, 0 replies; 49+ messages in thread
From: Alexander Graf @ 2011-09-28 8:58 UTC (permalink / raw)
To: Alex Williamson
Cc: kvm@vger.kernel.org, Benjamin Herrenschmidt, Stuart Yoder,
qemu-devel@nongnu.org, avi@redhat.com, Scott Wood, David Gibson
On 28.09.2011, at 04:40, Alex Williamson wrote:
> On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
>> On 09/26/2011 07:45 PM, Alex Williamson wrote:
>>> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
>>>> On 09/26/2011 01:34 PM, Alex Williamson wrote:
>>>>> /* Reset the device */
>>>>> #define VFIO_DEVICE_RESET _IO(, ,)
>>>>
>>>> What generic way do we have to do this? We should probably have a way
>>>> to determine whether it's possible, without actually asking to do it.
>>>
>>> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
>>> bit to the device flags to indicate if it's available or we could add a
>>> "probe" arg to the ioctl to either check for existence or do it.
>>
>> Even with PCI, isn't this only possible if function-level reset is
>> supported?
>
> There are a couple other things we can do if FLR isn't present (D3hot
> transition, secondary bus reset, device specific resets are possible).
>
>> I think we need a flag.
>
> Ok, PCI has a pci_probe_reset_function() and pci_reset_function(). I'd
> probably mimic those in the vfio device ops. Common vfio code can probe
> the reset and set the flag appropriately and we can have a common
> VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
>
>> For devices that can't be reset by the kernel, we'll want the ability to
>> stop/start DMA acccess through the IOMMU (or other bus-specific means),
>> separate from whether the fd is open. If a device is assigned to a
>> partition and that partition gets reset, we'll want to disable DMA
>> before we re-use the memory, and enable it after the partition has reset
>> or quiesced the device (which requires the fd to be open).
>
> Maybe this can be accomplished via an iommu_detach_device() to
> temporarily disassociate it from the domain. We could also unmap all
> the DMA. Anyway, a few possibilities.
>
>>>>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
>>>>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
>>>>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>>>>>
>>>>> Hope that covers it.
>>>>
>>>> It could be done this way, but I predict that the code (both kernel and
>>>> user side) will be larger. Maybe not much more complex, but more
>>>> boilerplate.
>>>>
>>>> How will you manage extensions to the interface?
>>>
>>> I would assume we'd do something similar to the kvm capabilities checks.
>>
>> This information is already built into the data-structure approach.
>
> If we define it to be part of the flags, then it's built-in to the ioctl
> approach too...
>
>>>> The table should not be particularly large, and you'll need to keep the
>>>> information around in some form regardless. Maybe in the PCI case you
>>>> could produce it dynamically (though I probably wouldn't), but it really
>>>> wouldn't make sense in the device tree case.
>>>
>>> It would be entirely dynamic for PCI, there's no advantage to caching
>>> it. Even for device tree, if you can't fetch it dynamically, you'd have
>>> to duplicate it between an internal data structure and a buffer reading
>>> the table.
>>
>> I don't think we'd need to keep the device tree path/index info around
>> for anything but the table -- but really, this is a minor consideration.
>>
>>>> You also lose the ability to easily have a human look at the hexdump for
>>>> debugging; you'll need a special "lsvfio" tool. You might want one
>>>> anyway to pretty-print the info, but with ioctls it's mandatory.
>>>
>>> I don't think this alone justifies duplicating the data and making it
>>> difficult to parse on both ends. Chances are we won't need such a tool
>>> for the ioctl interface because it's easier to get it right the first
>>> time ;)
>>
>> It's not just useful for getting the code right, but for e.g. sanity
>> checking that the devices were bound properly. I think such a tool
>> would be generally useful, no matter what the kernel interface ends up
>> being. I don't just use lspci to debug the PCI subsystem. :-)
>
> This is also a minor consideration. Looking at hexdumps isn't much to
> rely on for debugging and if we take the step of writing a tool, it's
> not much harder to write for either interface. The table is more akin
> to dumping the data, but I feel the ioctl is easier for how a driver
> would probably make use of the data (linear vs random access).
>
>>> Note that I'm not stuck on this interface, I was just thinking about how
>>> to generate the table last week, it seemed like a pain so I thought I'd
>>> spend a few minutes outlining an ioctl interface... turns out it's not
>>> so bad. Thanks,
>>
>> Yeah, it can work either way, as long as the information's there and
>> there's a way to add new bits of information, or new bus types, down the
>> road. Mainly a matter of aesthetics between the two.
>
> It'd be nice if David would chime back in since he objected to the
> table. Does an ioctl interface look better? Alex Graf, any opinions?
I'm honestly pretty indifferent on ioctl vs. linear read. I got the impression that people dislike ioctls for whatever reason, so we went ahead and did the design based on read(). With KVM, ioctls are a constant pain to extend, but so are the constant sized fields here.
Whatever you do, please introduce a "flags" field to every struct you use and add some padding at the end, so it can possibly be extended.
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-28 8:58 ` Alexander Graf
@ 2011-09-30 8:55 ` David Gibson
-1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:55 UTC (permalink / raw)
To: Alexander Graf
Cc: Alex Williamson, Scott Wood, kvm@vger.kernel.org, Stuart Yoder,
Benjamin Herrenschmidt, qemu-devel@nongnu.org, avi@redhat.com
On Wed, Sep 28, 2011 at 10:58:26AM +0200, Alexander Graf wrote:
> On 28.09.2011, at 04:40, Alex Williamson wrote:
> > On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
[snip]
> I'm honestly pretty indifferent on ioctl vs. linear read. I got the
> impression that people dislike ioctls for whatever reason, so we
> went ahead and did the design based on read(). With KVM, ioctls are
> a constant pain to extend, but so are the constant sized fields
> here.
>
> Whatever you do, please introduce a "flags" field to every struct
> you use and add some padding at the end, so it can possibly be
> extended.
Yes, those are good ideas. Plus whenever creating an ioctl() that has
a structure "return value", it's a good idea for the size of the
structure to be specified as an input. Again this allows later
extension. Older userspace which only knows the old structure won't
get the new fields, but presumably it wouldn't know or care what to do
with them anyway.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-30 8:55 ` David Gibson
0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:55 UTC (permalink / raw)
To: Alexander Graf
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Stuart Yoder,
qemu-devel@nongnu.org, Alex Williamson, avi@redhat.com,
Scott Wood
On Wed, Sep 28, 2011 at 10:58:26AM +0200, Alexander Graf wrote:
> On 28.09.2011, at 04:40, Alex Williamson wrote:
> > On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
[snip]
> I'm honestly pretty indifferent on ioctl vs. linear read. I got the
> impression that people dislike ioctls for whatever reason, so we
> went ahead and did the design based on read(). With KVM, ioctls are
> a constant pain to extend, but so are the constant sized fields
> here.
>
> Whatever you do, please introduce a "flags" field to every struct
> you use and add some padding at the end, so it can possibly be
> extended.
Yes, those are good ideas. Plus whenever creating an ioctl() that has
a structure "return value", it's a good idea for the size of the
structure to be specified as an input. Again this allows later
extension. Older userspace which only knows the old structure won't
get the new fields, but presumably it wouldn't know or care what to do
with them anyway.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 23:59 ` Scott Wood
@ 2011-09-30 8:50 ` David Gibson
-1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:50 UTC (permalink / raw)
To: Scott Wood
Cc: Alex Williamson, Alexander Graf, Stuart Yoder,
qemu-devel@nongnu.org, kvm@vger.kernel.org, avi@redhat.com,
anthony@codemonkey.ws, Benjamin Herrenschmidt
On Mon, Sep 26, 2011 at 06:59:33PM -0500, Scott Wood wrote:
> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> How do you handle BARs that a particular device doesn't use? Zero-length?
>
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> So now the user needs to guess a buffer length in advance... and what
> happens if it's too small?
>
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
>
> What generic way do we have to do this? We should probably have a way
> to determine whether it's possible, without actually asking to do it.
That's a good point. PCI devices have a standardized reset, but
embedded devices often won't. Mind you we could just fail the call in
that cse.
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it.
>
> It could be done this way, but I predict that the code (both kernel and
> user side) will be larger. Maybe not much more complex, but more
> boilerplate.
>
> How will you manage extensions to the interface? With the table it's
> simple, you see a new (sub)record type and you either understand it or
> you skip it. With ioctls you need to call every information-gathering
> ioctl you know and care about (or are told is present via some feature
> advertisement), and see if there's anything there.
No.. quite the opposite. With ioctl()s you call the ones your
userspace program cares about / can implement. When an extended
interface is added, they keep working as is. Newer userspace which
uses the new features will call the new ioctls() if it cares about
them.
> > Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> The table should not be particularly large, and you'll need to keep the
> information around in some form regardless. Maybe in the PCI case you
> could produce it dynamically (though I probably wouldn't), but it really
> wouldn't make sense in the device tree case.
>
> You also lose the ability to easily have a human look at the hexdump for
> debugging; you'll need a special "lsvfio" tool. You might want one
> anyway to pretty-print the info, but with ioctls it's mandatory.
>
> -Scott
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-30 8:50 ` David Gibson
0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:50 UTC (permalink / raw)
To: Scott Wood
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Stuart Yoder,
qemu-devel@nongnu.org, Alexander Graf, Alex Williamson,
avi@redhat.com
On Mon, Sep 26, 2011 at 06:59:33PM -0500, Scott Wood wrote:
> On 09/26/2011 01:34 PM, Alex Williamson wrote:
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
> >
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> How do you handle BARs that a particular device doesn't use? Zero-length?
>
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> So now the user needs to guess a buffer length in advance... and what
> happens if it's too small?
>
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
>
> What generic way do we have to do this? We should probably have a way
> to determine whether it's possible, without actually asking to do it.
That's a good point. PCI devices have a standardized reset, but
embedded devices often won't. Mind you we could just fail the call in
that cse.
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Hope that covers it.
>
> It could be done this way, but I predict that the code (both kernel and
> user side) will be larger. Maybe not much more complex, but more
> boilerplate.
>
> How will you manage extensions to the interface? With the table it's
> simple, you see a new (sub)record type and you either understand it or
> you skip it. With ioctls you need to call every information-gathering
> ioctl you know and care about (or are told is present via some feature
> advertisement), and see if there's anything there.
No.. quite the opposite. With ioctl()s you call the ones your
userspace program cares about / can implement. When an extended
interface is added, they keep working as is. Newer userspace which
uses the new features will call the new ioctls() if it cares about
them.
> > Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets. Thoughts?
>
> The table should not be particularly large, and you'll need to keep the
> information around in some form regardless. Maybe in the PCI case you
> could produce it dynamically (though I probably wouldn't), but it really
> wouldn't make sense in the device tree case.
>
> You also lose the ability to easily have a human look at the hexdump for
> debugging; you'll need a special "lsvfio" tool. You might want one
> anyway to pretty-print the info, but with ioctls it's mandatory.
>
> -Scott
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-26 18:34 ` Alex Williamson
@ 2011-09-30 8:46 ` David Gibson
-1 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:46 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexander Graf, Stuart Yoder, Scott Wood, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
[snip]
> > Also, if you can come up with an interface that does not have variable
> > length descriptors but is still able to export all the required
> > generic information, please send a proposal to the list :)
> >
>
> Hi,
>
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
Right, this also seems a reasonable possibility to me, depending on
the details.
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> /* Return length for region index (may be zero) */
> #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
>
> /* Return flags for region index
> * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
>
> /* Return file offset for region index */
> #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
The above 3 can be be folded into one "getregioninfo" call.
> /* Return physical address for region index - not implemented for PCI */
> #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
>
>
>
> /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
>
> /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
>
> /* Unmask IRQ index */
> #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
>
> /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
>
>
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> /* Return the device tree index for type/index */
> struct dtindex {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 prop_type;
> u32 prop_index;
> };
> #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
I think those need some work, but that doesn't impinge on the core
semantics.
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
>
>
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
Why does this need seperate controls, rather than just treating MSIs
as interrupts beyond the first for PCI devices?
> Hope that covers it. Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets.
> Thoughts?
I certainly prefer it to the previous proposal.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-30 8:46 ` David Gibson
0 siblings, 0 replies; 49+ messages in thread
From: David Gibson @ 2011-09-30 8:46 UTC (permalink / raw)
To: Alex Williamson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Stuart Yoder,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com, Scott Wood
On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
[snip]
> > Also, if you can come up with an interface that does not have variable
> > length descriptors but is still able to export all the required
> > generic information, please send a proposal to the list :)
> >
>
> Hi,
>
> The other obvious possibility is a pure ioctl interface. To match what
> this proposal is trying to describe, plus the runtime interfaces, we'd
> need something like:
Right, this also seems a reasonable possibility to me, depending on
the details.
> /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
>
>
> /* Return number of mmio/iop/config regions.
> * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
>
> /* Return length for region index (may be zero) */
> #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
>
> /* Return flags for region index
> * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
>
> /* Return file offset for region index */
> #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
The above 3 can be be folded into one "getregioninfo" call.
> /* Return physical address for region index - not implemented for PCI */
> #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
>
>
>
> /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
>
> /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
>
> /* Unmask IRQ index */
> #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
>
> /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
>
>
> /* Return the device tree path for type/index into the user
> * allocated buffer */
> struct dtpath {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 buf_len;
> char *buf;
> };
> #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
>
> /* Return the device tree index for type/index */
> struct dtindex {
> u32 type; (0 = region, 1 = IRQ)
> u32 index;
> u32 prop_type;
> u32 prop_index;
> };
> #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
I think those need some work, but that doesn't impinge on the core
semantics.
> /* Reset the device */
> #define VFIO_DEVICE_RESET _IO(, ,)
>
>
> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
Why does this need seperate controls, rather than just treating MSIs
as interrupts beyond the first for PCI devices?
> Hope that covers it. Something I prefer about this interface is that
> everything can easily be generated on the fly, whereas reading out a
> table from the device means we really need to have that table somewhere
> in kernel memory to easily support reading random offsets.
> Thoughts?
I certainly prefer it to the previous proposal.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-30 8:46 ` David Gibson
@ 2011-09-30 16:37 ` Alex Williamson
-1 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-30 16:37 UTC (permalink / raw)
To: David Gibson
Cc: Alexander Graf, Stuart Yoder, Scott Wood, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On Fri, 2011-09-30 at 18:46 +1000, David Gibson wrote:
> On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> > On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
> [snip]
> > > Also, if you can come up with an interface that does not have variable
> > > length descriptors but is still able to export all the required
> > > generic information, please send a proposal to the list :)
> > >
> >
> > Hi,
> >
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
>
> Right, this also seems a reasonable possibility to me, depending on
> the details.
>
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> >
> > /* Return length for region index (may be zero) */
> > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> >
> > /* Return flags for region index
> > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> >
> > /* Return file offset for region index */
> > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
>
> The above 3 can be be folded into one "getregioninfo" call.
Yep, and the phys addr one below. We can use a flags bit to indicate
whether it's valid.
> > /* Return physical address for region index - not implemented for PCI */
> > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> >
> >
> >
> > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> >
> > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> >
> > /* Unmask IRQ index */
> > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> >
> > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> >
> >
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> >
> > /* Return the device tree index for type/index */
> > struct dtindex {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 prop_type;
> > u32 prop_index;
> > };
> > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
>
> I think those need some work, but that doesn't impinge on the core
> semantics.
>
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
> >
> >
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Why does this need seperate controls, rather than just treating MSIs
> as interrupts beyond the first for PCI devices?
Well, we could say that PCI will always report 3 for
VFIO_DEVICE_GET_NUM_IRQ where 0 = legacy, 1 = MSI, 2 = MSI-X. ioctls on
unimplemented IRQs will fail, UNMASK* ioctls on non-level triggered
interrupts will fail, and the parameter to SET_IRQ_EVENTFD becomes
arg[0] = index, arg[1] = count, arg[2-n] = fd. Maybe we'd then have a
GET_IRQ_INFO that takes something like:
struct vfio_irq_info {
int index;
unsigned int count;
u64 flags;
#define VFIO_IRQ_INFO_FLAGS_LEVEL (1 << 0)
};
count would be 0 on PCI if the type of interrupt isn't supported.
Better? Thanks,
Alex
> > Hope that covers it. Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets.
> > Thoughts?
>
> I certainly prefer it to the previous proposal.
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-30 16:37 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-30 16:37 UTC (permalink / raw)
To: David Gibson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Stuart Yoder,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com, Scott Wood
On Fri, 2011-09-30 at 18:46 +1000, David Gibson wrote:
> On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> > On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
> [snip]
> > > Also, if you can come up with an interface that does not have variable
> > > length descriptors but is still able to export all the required
> > > generic information, please send a proposal to the list :)
> > >
> >
> > Hi,
> >
> > The other obvious possibility is a pure ioctl interface. To match what
> > this proposal is trying to describe, plus the runtime interfaces, we'd
> > need something like:
>
> Right, this also seems a reasonable possibility to me, depending on
> the details.
>
> > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> >
> >
> > /* Return number of mmio/iop/config regions.
> > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> >
> > /* Return length for region index (may be zero) */
> > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> >
> > /* Return flags for region index
> > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> >
> > /* Return file offset for region index */
> > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
>
> The above 3 can be be folded into one "getregioninfo" call.
Yep, and the phys addr one below. We can use a flags bit to indicate
whether it's valid.
> > /* Return physical address for region index - not implemented for PCI */
> > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> >
> >
> >
> > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> >
> > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> >
> > /* Unmask IRQ index */
> > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> >
> > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> >
> >
> > /* Return the device tree path for type/index into the user
> > * allocated buffer */
> > struct dtpath {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 buf_len;
> > char *buf;
> > };
> > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> >
> > /* Return the device tree index for type/index */
> > struct dtindex {
> > u32 type; (0 = region, 1 = IRQ)
> > u32 index;
> > u32 prop_type;
> > u32 prop_index;
> > };
> > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
>
> I think those need some work, but that doesn't impinge on the core
> semantics.
>
> > /* Reset the device */
> > #define VFIO_DEVICE_RESET _IO(, ,)
> >
> >
> > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>
> Why does this need seperate controls, rather than just treating MSIs
> as interrupts beyond the first for PCI devices?
Well, we could say that PCI will always report 3 for
VFIO_DEVICE_GET_NUM_IRQ where 0 = legacy, 1 = MSI, 2 = MSI-X. ioctls on
unimplemented IRQs will fail, UNMASK* ioctls on non-level triggered
interrupts will fail, and the parameter to SET_IRQ_EVENTFD becomes
arg[0] = index, arg[1] = count, arg[2-n] = fd. Maybe we'd then have a
GET_IRQ_INFO that takes something like:
struct vfio_irq_info {
int index;
unsigned int count;
u64 flags;
#define VFIO_IRQ_INFO_FLAGS_LEVEL (1 << 0)
};
count would be 0 on PCI if the type of interrupt isn't supported.
Better? Thanks,
Alex
> > Hope that covers it. Something I prefer about this interface is that
> > everything can easily be generated on the fly, whereas reading out a
> > table from the device means we really need to have that table somewhere
> > in kernel memory to easily support reading random offsets.
> > Thoughts?
>
> I certainly prefer it to the previous proposal.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-30 8:46 ` David Gibson
@ 2011-09-30 21:59 ` Alex Williamson
-1 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-30 21:59 UTC (permalink / raw)
To: David Gibson
Cc: Alexander Graf, Stuart Yoder, Scott Wood, qemu-devel@nongnu.org,
kvm@vger.kernel.org, avi@redhat.com, anthony@codemonkey.ws,
Benjamin Herrenschmidt
On Fri, 2011-09-30 at 10:37 -0600, Alex Williamson wrote:
> On Fri, 2011-09-30 at 18:46 +1000, David Gibson wrote:
> > On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> > > On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > > > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
> > [snip]
> > > > Also, if you can come up with an interface that does not have variable
> > > > length descriptors but is still able to export all the required
> > > > generic information, please send a proposal to the list :)
> > > >
> > >
> > > Hi,
> > >
> > > The other obvious possibility is a pure ioctl interface. To match what
> > > this proposal is trying to describe, plus the runtime interfaces, we'd
> > > need something like:
> >
> > Right, this also seems a reasonable possibility to me, depending on
> > the details.
> >
> > > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> > >
> > >
> > > /* Return number of mmio/iop/config regions.
> > > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> > >
> > > /* Return length for region index (may be zero) */
> > > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> > >
> > > /* Return flags for region index
> > > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> > >
> > > /* Return file offset for region index */
> > > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
> >
> > The above 3 can be be folded into one "getregioninfo" call.
>
> Yep, and the phys addr one below. We can use a flags bit to indicate
> whether it's valid.
>
> > > /* Return physical address for region index - not implemented for PCI */
> > > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> > >
> > >
> > >
> > > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> > >
> > > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> > >
> > > /* Unmask IRQ index */
> > > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> > >
> > > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> > >
> > >
> > > /* Return the device tree path for type/index into the user
> > > * allocated buffer */
> > > struct dtpath {
> > > u32 type; (0 = region, 1 = IRQ)
> > > u32 index;
> > > u32 buf_len;
> > > char *buf;
> > > };
> > > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> > >
> > > /* Return the device tree index for type/index */
> > > struct dtindex {
> > > u32 type; (0 = region, 1 = IRQ)
> > > u32 index;
> > > u32 prop_type;
> > > u32 prop_index;
> > > };
> > > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
> >
> > I think those need some work, but that doesn't impinge on the core
> > semantics.
> >
> > > /* Reset the device */
> > > #define VFIO_DEVICE_RESET _IO(, ,)
> > >
> > >
> > > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Why does this need seperate controls, rather than just treating MSIs
> > as interrupts beyond the first for PCI devices?
>
> Well, we could say that PCI will always report 3 for
> VFIO_DEVICE_GET_NUM_IRQ where 0 = legacy, 1 = MSI, 2 = MSI-X. ioctls on
> unimplemented IRQs will fail, UNMASK* ioctls on non-level triggered
> interrupts will fail, and the parameter to SET_IRQ_EVENTFD becomes
> arg[0] = index, arg[1] = count, arg[2-n] = fd. Maybe we'd then have a
> GET_IRQ_INFO that takes something like:
>
> struct vfio_irq_info {
> int index;
> unsigned int count;
> u64 flags;
> #define VFIO_IRQ_INFO_FLAGS_LEVEL (1 << 0)
> };
>
> count would be 0 on PCI if the type of interrupt isn't supported.
> Better? Thanks,
FYI for all, I've pushed a branch out to github with the current state
of the re-write. You can find it here
https://awilliam@github.com/awilliam/linux-vfio.git
git://github.com/awilliam/linux-vfio.git
The vfio-ng branch is the latest. The framework is quite a bit more
solid now, so I figure it's time to move into the device and iommu
implementation. vfio-pci is now it's own module that depends on vfio, I
expect vfio-dt to be implemented the same. The PCI ioctl is in place
and supports the interface described above. I'll continue to port
pieces of the old vfio code into this new infrastructure. Comments and
patches welcome. Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
@ 2011-09-30 21:59 ` Alex Williamson
0 siblings, 0 replies; 49+ messages in thread
From: Alex Williamson @ 2011-09-30 21:59 UTC (permalink / raw)
To: David Gibson
Cc: Benjamin Herrenschmidt, kvm@vger.kernel.org, Stuart Yoder,
qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com, Scott Wood
On Fri, 2011-09-30 at 10:37 -0600, Alex Williamson wrote:
> On Fri, 2011-09-30 at 18:46 +1000, David Gibson wrote:
> > On Mon, Sep 26, 2011 at 12:34:52PM -0600, Alex Williamson wrote:
> > > On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > > > Am 26.09.2011 um 09:51 schrieb David Gibson <david@gibson.dropbear.id.au>:
> > [snip]
> > > > Also, if you can come up with an interface that does not have variable
> > > > length descriptors but is still able to export all the required
> > > > generic information, please send a proposal to the list :)
> > > >
> > >
> > > Hi,
> > >
> > > The other obvious possibility is a pure ioctl interface. To match what
> > > this proposal is trying to describe, plus the runtime interfaces, we'd
> > > need something like:
> >
> > Right, this also seems a reasonable possibility to me, depending on
> > the details.
> >
> > > /* :0 - PCI devices, :1 - Devices path device, 63:2 - reserved */
> > > #define VFIO_DEVICE_GET_FLAGS _IOR(, , u64)
> > >
> > >
> > > /* Return number of mmio/iop/config regions.
> > > * For PCI this is always 8 (BAR0-5 + ROM + Config) */
> > > #define VFIO_DEVICE_GET_NUM_REGIONS _IOR(, , int)
> > >
> > > /* Return length for region index (may be zero) */
> > > #define VFIO_DEVICE_GET_REGION_LEN _IOWR(, , u64)
> > >
> > > /* Return flags for region index
> > > * :0 - mmap'able, :1 - read-only, 63:2 - reserved */
> > > #define VFIO_DEVICE_GET_REGION_FLAGS _IOR(, , u64)
> > >
> > > /* Return file offset for region index */
> > > #define VFIO_DEVICE_GET_REGION_OFFSET _IOWR(, , u64)
> >
> > The above 3 can be be folded into one "getregioninfo" call.
>
> Yep, and the phys addr one below. We can use a flags bit to indicate
> whether it's valid.
>
> > > /* Return physical address for region index - not implemented for PCI */
> > > #define VFIO_DEVICE_GET_REGION_PHYS_ADDR _IOWR(, , u64)
> > >
> > >
> > >
> > > /* Return number of IRQs (Not including MSI/MSI-X for PCI) */
> > > #define VFIO_DEVICE_GET_NUM_IRQ _IOR(, , int)
> > >
> > > /* Set IRQ eventfd for IRQ index, arg[0] = index, arg[1] = fd */
> > > #define VFIO_DEVICE_SET_IRQ_EVENTFD _IOW(, , int)
> > >
> > > /* Unmask IRQ index */
> > > #define VFIO_DEVICE_UNMASK_IRQ _IOW(, , int)
> > >
> > > /* Set unmask eventfd for index, arg[0] = index, arg[1] = fd */
> > > #define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(, , int)
> > >
> > >
> > > /* Return the device tree path for type/index into the user
> > > * allocated buffer */
> > > struct dtpath {
> > > u32 type; (0 = region, 1 = IRQ)
> > > u32 index;
> > > u32 buf_len;
> > > char *buf;
> > > };
> > > #define VFIO_DEVICE_GET_DTPATH _IOWR(, , struct dtpath)
> > >
> > > /* Return the device tree index for type/index */
> > > struct dtindex {
> > > u32 type; (0 = region, 1 = IRQ)
> > > u32 index;
> > > u32 prop_type;
> > > u32 prop_index;
> > > };
> > > #define VFIO_DEVICE_GET_DTINDEX _IOWR(, , struct dtindex)
> >
> > I think those need some work, but that doesn't impinge on the core
> > semantics.
> >
> > > /* Reset the device */
> > > #define VFIO_DEVICE_RESET _IO(, ,)
> > >
> > >
> > > /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
> > > #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
> > > #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
> >
> > Why does this need seperate controls, rather than just treating MSIs
> > as interrupts beyond the first for PCI devices?
>
> Well, we could say that PCI will always report 3 for
> VFIO_DEVICE_GET_NUM_IRQ where 0 = legacy, 1 = MSI, 2 = MSI-X. ioctls on
> unimplemented IRQs will fail, UNMASK* ioctls on non-level triggered
> interrupts will fail, and the parameter to SET_IRQ_EVENTFD becomes
> arg[0] = index, arg[1] = count, arg[2-n] = fd. Maybe we'd then have a
> GET_IRQ_INFO that takes something like:
>
> struct vfio_irq_info {
> int index;
> unsigned int count;
> u64 flags;
> #define VFIO_IRQ_INFO_FLAGS_LEVEL (1 << 0)
> };
>
> count would be 0 on PCI if the type of interrupt isn't supported.
> Better? Thanks,
FYI for all, I've pushed a branch out to github with the current state
of the re-write. You can find it here
https://awilliam@github.com/awilliam/linux-vfio.git
git://github.com/awilliam/linux-vfio.git
The vfio-ng branch is the latest. The framework is quite a bit more
solid now, so I figure it's time to move into the device and iommu
implementation. vfio-pci is now it's own module that depends on vfio, I
expect vfio-dt to be implemented the same. The PCI ioctl is in place
and supports the interface described above. I'll continue to port
pieces of the old vfio code into this new infrastructure. Comments and
patches welcome. Thanks,
Alex
^ permalink raw reply [flat|nested] 49+ messages in thread* Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files
2011-09-30 21:59 ` Alex Williamson
(?)
@ 2011-10-06 23:18 ` Aaron Fabbri
-1 siblings, 0 replies; 49+ messages in thread
From: Aaron Fabbri @ 2011-10-06 23:18 UTC (permalink / raw)
To: kvm
Alex Williamson <alex.williamson <at> redhat.com> writes:
>
> On Fri, 2011-09-30 at 10:37 -0600, Alex Williamson wrote:
> > On Fri, 2011-09-30 at 18:46 +1000, David Gibson wrote:
> > > On Mon, Sep 26 at 12:34:52PM -0600, Alex Williamson wrote:
> > > > On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> > > > > Am 26.09.2011 um 09:51 schrieb David Gibson
> > > > > [snip]
> > > > The other obvious possibility is a pure ioctl interface.
[snip]
>
> FYI for all, I've pushed a branch out to github with the current state
> of the re-write. You can find it here
>
> https://awilliam <at> github.com/awilliam/linux-vfio.git
> git://github.com/awilliam/linux-vfio.git
I also vote for the ioctl approach. I'd rather not be parsing a stream,
no matter how easy it is.
Looking through the code now.
Aaron
^ permalink raw reply [flat|nested] 49+ messages in thread