qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
@ 2011-08-29 16:51 Yoder Stuart-B08248
  2011-08-29 19:04 ` Anthony Liguori
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Yoder Stuart-B08248 @ 2011-08-29 16:51 UTC (permalink / raw)
  To: alex.williamson@redhat.com, Benjamin Herrenschmidt
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Joerg.Roedel@amd.com,
	qemu-devel@nongnu.org, Alexander Graf, avi@redhat.com,
	David Gibson

Alex Graf, Scott Wood, and I met last week to try to flesh out
some details as to how vfio could work for non-PCI devices,
like we have in embedded systems.   This most likely will
require a different kernel driver than vfio-- for now we are
calling it "dtio" (for device tree I/O) as there is no way
to discover these devices except from the device tree.   But
the dtio driver would use the same architecture and interfaces
as vfio.

For devices on a system bus and represented in a device
tree we have some different requirements than PCI for what
is exposed in the device fd file.  A device may have multiple
address regions, multiple interrupts, a variable length device
tree path, whether a region is mmapable, etc.

With existing vfio, the device fd file layout is something
like:
  0xF Config space offset
  ...
  0x6 ROM offset
  0x5 BAR 5 offset
  0x4 BAR 4 offset
  0x3 BAR 3 offset
  0x2 BAR 2 offset
  0x1 BAR 1 offset
  0x0 BAR 0 offset

We have an alternate proposal that we think is more flexible,
extensible, and will accommodate both PCI and system bus
type devices (and others).

Instead of config space fixed at 0xf, we would propose
a header and multiple 'device info' records at offset 0x0 that would
encode everything that user space needs to know about
the device:

  0x0  +-------------+-------------+
       | magic       |   version   | u64   // magic u64 identifies the type of
       |   "vfio"    |             |       // passthru I/O, plus version #
       |   "dtio"    |             |       //   "vfio" - PCI devices
       +-------------+-------------+       //   "dtio" - device tree devices
       |         flags             | u32   // encodes any flags (TBD)
       +---------------------------+
       |  dev info record N        |
       |    type                   | u32   // type of record
       |    rec_len                | u32   // length in bytes of record
       |                           |          (including record header)
       |    flags                  | u32   // type specific flags
       |    ...content...          |       // record content, which could
       +---------------------------+       // include sub-records
       |  dev info record N+1      |
       +---------------------------+
       |  dev info record N+2      |
       +---------------------------+
       ...

The device info records following the file header have the following
record types each with content encoded in a record specific way:

 REGION  - describes an addressable address range for the device
 DTPATH - describes the device tree path for the device
 DTINDEX - describes the index into the related device tree
           property (reg,ranges,interrupts,interrupt-map)
 INTERRUPT - describes an interrupt for the device
 PCI_CONFIG_SPACE - describes config space for the device
 PCI_INFO - domain:bus:device:func
 PCI_BAR_INFO - information about the BAR for a device

For a device tree type device the file may look like:

 0x0+---------------------------+
    |        header             |      
    +---------------------------+
    |   type = REGION           |      
    |   rec_len                 |      
    |   flags =                 | u32 // region specific flags
    |       is_mmapable         | 
    |   offset                  | u64 // seek offset to region from
    |                           |        from beginning
    |   len                     | u64 // length of region
    |   addr                    | u64 // phys addr of region
    |                           |      
    +---------------------------+
     \   type = DTPATH          \  // a sub-region
      |   rec_len                |      
      |   flags                  |      
      |   dev tree path          | char[] // device tree path
    +---------------------------+
     \   type = DTINDEX         \  // a sub-region
      |   rec_len                |      
      |   flags                  |      
      |   prop_type              | u32  // REG, RANGES
      |   prop_index             | u32  // index  into resource list
    +---------------------------+
    |  type = INTERRUPT         |      
    |  rec_len                  |      
    |  flags                    | u32 
    |  ioctl_handle             | u32 // argument to ioctl to get interrupts
    |                           |      
    +---------------------------+
     \   type = DTPATH         \  // a sub-region    
      |   rec_len               |      
      |   flags                 |      
      |   dev tree path         |  char[] // device tree path
    +---------------------------+
      \   type = DTINDEX       \  // a sub-region 
      |   rec_len               |      
      |   flags                 |      
      |   prop_type             | u32 // INTERRUPT,INTERRUPT_MAP
      |   prop_index            | u32 // index


PCI devices would have a PCI specific encoding.  Instead of
config space and the mappable BAR regions being at specific
predetermined offsets, the device info records would describe
this.  Something like:

0x0 +---------------------------+
    |   type = PCI_CONFIG_SPACE |      
    |   rec_len                 |      
    |   flags = 0x0             |      
    |   offset                  | u64 // seek offset to config space
    |                           |        from beginnning
    |   config_space_len        | u32 // length of config space
    +---------------------------+
    |   type = PCI_INFO         |      
    |   rec_len                 |      
    |   flags = 0x0             |      
    |   dom:bus:dev:func        | u32 // pci device info
    +---------------------------+
    |   type = REGION           |      
    |   rec_len                 |      
    |   flags =                 |      
    |       is_mmapable         |      
    |   offset                  | u64 // seek offset to region from
    |                           |        from beginning
    |   len                     | u64 // length of region
    |   addr                    | u64 // physical addr of region
    +---------------------------+
     \   type = PCI_BAR_INFO    \      
      |   rec_len                |      
      |   flags                  |      
      |   bar_type               |  // pio
      |                          |  // prefetable mmio
      |                          |  // non-prefetchable mmmio
      |   bar_index              |  // index of bar in device
      +--------------------------+

There may be other more complex device or bus types that
need their own special encodings, and this format would
allow the definition of new records to define devices.  Two
other types that come to mind are Serial Rapid I/O busses
commonly used in our networking SoCs and the FSL DPAA
portals which are very strange devices that may require
their own unique interface exposed to user space.

In short, when user space opens up a device fd it needs
some information about what this device is, and this
proposal tries to address that.

Regards,
Stuart Yoder

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 16:51 [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files Yoder Stuart-B08248
@ 2011-08-29 19:04 ` Anthony Liguori
  2011-08-29 19:32   ` Scott Wood
  2011-08-29 19:51 ` Alex Williamson
  2011-09-01 20:00 ` Michael S. Tsirkin
  2 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2011-08-29 19:04 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Alexander Graf, alex.williamson@redhat.com, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:
> Alex Graf, Scott Wood, and I met last week to try to flesh out
> some details as to how vfio could work for non-PCI devices,
> like we have in embedded systems.   This most likely will
> require a different kernel driver than vfio-- for now we are
> calling it "dtio" (for device tree I/O) as there is no way
> to discover these devices except from the device tree.   But
> the dtio driver would use the same architecture and interfaces
> as vfio.
>
> For devices on a system bus and represented in a device
> tree we have some different requirements than PCI for what
> is exposed in the device fd file.  A device may have multiple
> address regions, multiple interrupts, a variable length device
> tree path, whether a region is mmapable, etc.
>
> With existing vfio, the device fd file layout is something
> like:
>    0xF Config space offset
>    ...
>    0x6 ROM offset
>    0x5 BAR 5 offset
>    0x4 BAR 4 offset
>    0x3 BAR 3 offset
>    0x2 BAR 2 offset
>    0x1 BAR 1 offset
>    0x0 BAR 0 offset
>
> We have an alternate proposal that we think is more flexible,
> extensible, and will accommodate both PCI and system bus
> type devices (and others).
>
> Instead of config space fixed at 0xf, we would propose
> a header and multiple 'device info' records at offset 0x0 that would
> encode everything that user space needs to know about
> the device:

Why not just use an ioctl with a proper struct?

The config space is weird for PCI access because it's mirroring a well 
known binary blob.  It's not something to replicate if you're inventing 
something new.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 19:04 ` Anthony Liguori
@ 2011-08-29 19:32   ` Scott Wood
  0 siblings, 0 replies; 13+ messages in thread
From: Scott Wood @ 2011-08-29 19:32 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Joerg.Roedel@amd.com,
	qemu-devel@nongnu.org, Alexander Graf, Yoder Stuart-B08248,
	alex.williamson@redhat.com, avi@redhat.com, David Gibson

On 08/29/2011 02:04 PM, Anthony Liguori wrote:
> On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:
>> Instead of config space fixed at 0xf, we would propose
>> a header and multiple 'device info' records at offset 0x0 that would
>> encode everything that user space needs to know about
>> the device:
> 
> Why not just use an ioctl with a proper struct?

This is more extensible than a struct -- both in features, and in the
number of each type of resource that you can have, length of strings you
can have, etc.

> The config space is weird for PCI access because it's mirroring a well
> known binary blob.  It's not something to replicate if you're inventing
> something new.

There's no intent to replicate config space in general -- config space
is provided as-is.  There's little overlap between config space and the
extra information provided.  Length can be had from config space, but
only by modifying it.  Physical address sort-of overlaps, though bus
addresess could be different from CPU physical addresses[1].  In both
cases, it'd be nice to stay consistent with device-tree regions.

"BAR type" is overlap, but doesn't seem too unreasonable to me.

-Scott

[1] The user is probably less likely to care about the physical address
at all in the PCI case, but consistency is nice.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 16:51 [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files Yoder Stuart-B08248
  2011-08-29 19:04 ` Anthony Liguori
@ 2011-08-29 19:51 ` Alex Williamson
  2011-08-29 21:58   ` Scott Wood
  2011-09-01 20:00 ` Michael S. Tsirkin
  2 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2011-08-29 19:51 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Alexander Graf, avi@redhat.com, Joerg.Roedel@amd.com,
	David Gibson

On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:
> Alex Graf, Scott Wood, and I met last week to try to flesh out
> some details as to how vfio could work for non-PCI devices,
> like we have in embedded systems.   This most likely will
> require a different kernel driver than vfio-- for now we are
> calling it "dtio" (for device tree I/O) as there is no way
> to discover these devices except from the device tree.   But
> the dtio driver would use the same architecture and interfaces
> as vfio.

Why is this a different kernel driver?  The difference will primarily be
in what bus types vfio registers drivers and the set of device types the
device fds support.  The group and iommu interfaces will be shared.
This sounds more like vfio .config options (CONFIG_VFIO_PCI,
CONFIG_VFIO_DT).

> For devices on a system bus and represented in a device
> tree we have some different requirements than PCI for what
> is exposed in the device fd file.  A device may have multiple
> address regions, multiple interrupts, a variable length device
> tree path, whether a region is mmapable, etc.
> 
> With existing vfio, the device fd file layout is something
> like:
>   0xF Config space offset
>   ...
>   0x6 ROM offset
>   0x5 BAR 5 offset
>   0x4 BAR 4 offset
>   0x3 BAR 3 offset
>   0x2 BAR 2 offset
>   0x1 BAR 1 offset
>   0x0 BAR 0 offset
> 
> We have an alternate proposal that we think is more flexible,
> extensible, and will accommodate both PCI and system bus
> type devices (and others).
> 
> Instead of config space fixed at 0xf, we would propose
> a header and multiple 'device info' records at offset 0x0 that would
> encode everything that user space needs to know about
> the device:
> 
>   0x0  +-------------+-------------+
>        | magic       |   version   | u64   // magic u64 identifies the type of
>        |   "vfio"    |             |       // passthru I/O, plus version #
>        |   "dtio"    |             |       //   "vfio" - PCI devices
>        +-------------+-------------+       //   "dtio" - device tree devices

Maybe magic = "pci", "dt", ...

>        |         flags             | u32   // encodes any flags (TBD)
>        +---------------------------+
>        |  dev info record N        |
>        |    type                   | u32   // type of record
>        |    rec_len                | u32   // length in bytes of record
>        |                           |          (including record header)
>        |    flags                  | u32   // type specific flags
>        |    ...content...          |       // record content, which could
>        +---------------------------+       // include sub-records
>        |  dev info record N+1      |
>        +---------------------------+
>        |  dev info record N+2      |
>        +---------------------------+
>        ...
> 
> The device info records following the file header have the following
> record types each with content encoded in a record specific way:
> 
>  REGION  - describes an addressable address range for the device
>  DTPATH - describes the device tree path for the device
>  DTINDEX - describes the index into the related device tree
>            property (reg,ranges,interrupts,interrupt-map)

I don't quite understand if these are physical or virtual.

>  INTERRUPT - describes an interrupt for the device
>  PCI_CONFIG_SPACE - describes config space for the device

I would have expected this to be a REGION with a property of
PCI_CONFIG_SPACE.

>  PCI_INFO - domain:bus:device:func

Not entirely sure we need this.  How are you imagining we get from a
group fd to a device fd (wondering if you're only including this for
enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
ioctl that takes a char* parameter that contains the dev_name() for the
device requested.  The list of devices under each group can be found by
read()ing the group fd.  If we keep this, we should make the interfaces
similar, in fact, maybe this is how we describe the capabilities of the
iommu too, reading a table from the iommu fd.

>  PCI_BAR_INFO - information about the BAR for a device
> 
> For a device tree type device the file may look like:
> 
>  0x0+---------------------------+
>     |        header             |      
>     +---------------------------+
>     |   type = REGION           |      
>     |   rec_len                 |      
>     |   flags =                 | u32 // region specific flags
>     |       is_mmapable         | 
>     |   offset                  | u64 // seek offset to region from
>     |                           |        from beginning
>     |   len                     | u64 // length of region
>     |   addr                    | u64 // phys addr of region

Would we only need to expose phys addr for 1:1 mapping requirements?
I'm not sure why we'd care to expose this otherwise.

>     |                           |      
>     +---------------------------+
>      \   type = DTPATH          \  // a sub-region
>       |   rec_len                |      
>       |   flags                  |      
>       |   dev tree path          | char[] // device tree path
>     +---------------------------+
>      \   type = DTINDEX         \  // a sub-region
>       |   rec_len                |      
>       |   flags                  |      
>       |   prop_type              | u32  // REG, RANGES
>       |   prop_index             | u32  // index  into resource list
>     +---------------------------+
>     |  type = INTERRUPT         |      
>     |  rec_len                  |      
>     |  flags                    | u32 
>     |  ioctl_handle             | u32 // argument to ioctl to get interrupts

Is this a dynamic ioctl number or just a u32 parameter to an ioctl like
VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)?

>     |                           |      
>     +---------------------------+
>      \   type = DTPATH         \  // a sub-region    
>       |   rec_len               |      
>       |   flags                 |      
>       |   dev tree path         |  char[] // device tree path
>     +---------------------------+
>       \   type = DTINDEX       \  // a sub-region 
>       |   rec_len               |      
>       |   flags                 |      
>       |   prop_type             | u32 // INTERRUPT,INTERRUPT_MAP
>       |   prop_index            | u32 // index
> 
> 
> PCI devices would have a PCI specific encoding.  Instead of
> config space and the mappable BAR regions being at specific
> predetermined offsets, the device info records would describe
> this.  Something like:
> 
> 0x0 +---------------------------+
>     |   type = PCI_CONFIG_SPACE |      
>     |   rec_len                 |      
>     |   flags = 0x0             |      
>     |   offset                  | u64 // seek offset to config space
>     |                           |        from beginnning
>     |   config_space_len        | u32 // length of config space

Again, not sure why this isn't just a REGION.

>     +---------------------------+
>     |   type = PCI_INFO         |      
>     |   rec_len                 |      
>     |   flags = 0x0             |      
>     |   dom:bus:dev:func        | u32 // pci device info
>     +---------------------------+
>     |   type = REGION           |      
>     |   rec_len                 |      
>     |   flags =                 |      
>     |       is_mmapable         |      
>     |   offset                  | u64 // seek offset to region from
>     |                           |        from beginning
>     |   len                     | u64 // length of region
>     |   addr                    | u64 // physical addr of region
>     +---------------------------+
>      \   type = PCI_BAR_INFO    \      
>       |   rec_len                |      
>       |   flags                  |      
>       |   bar_type               |  // pio
>       |                          |  // prefetable mmio
>       |                          |  // non-prefetchable mmmio
>       |   bar_index              |  // index of bar in device

Aren't a lot of these typical region attributes?  Wondering if we should
just make them part of the REGION flags or we'll have a growing number
of sub-regions describing common things.  Even for non-PCI we need to
know if the region is pio/mmio32/mmio64/prefetchable/etc.  BAR index
could really just translate to a REGION instance number.

>       +--------------------------+
> 
> There may be other more complex device or bus types that
> need their own special encodings, and this format would
> allow the definition of new records to define devices.  Two
> other types that come to mind are Serial Rapid I/O busses
> commonly used in our networking SoCs and the FSL DPAA
> portals which are very strange devices that may require
> their own unique interface exposed to user space.
> 
> In short, when user space opens up a device fd it needs
> some information about what this device is, and this
> proposal tries to address that.

Thanks for trying to come up with a specification.

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 19:51 ` Alex Williamson
@ 2011-08-29 21:58   ` Scott Wood
  2011-08-29 22:46     ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Scott Wood @ 2011-08-29 21:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	qemu-devel@nongnu.org, Yoder Stuart-B08248, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On 08/29/2011 02:51 PM, Alex Williamson wrote:
> On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:
>> The device info records following the file header have the following
>> record types each with content encoded in a record specific way:
>>
>>  REGION  - describes an addressable address range for the device
>>  DTPATH - describes the device tree path for the device
>>  DTINDEX - describes the index into the related device tree
>>            property (reg,ranges,interrupts,interrupt-map)
> 
> I don't quite understand if these are physical or virtual.

If what are physical or virtual?

>>  INTERRUPT - describes an interrupt for the device
>>  PCI_CONFIG_SPACE - describes config space for the device
> 
> I would have expected this to be a REGION with a property of
> PCI_CONFIG_SPACE.

Could be, if physical address is made optional.

>>  PCI_INFO - domain:bus:device:func
> 
> Not entirely sure we need this.  How are you imagining we get from a
> group fd to a device fd (wondering if you're only including this for
> enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
> ioctl that takes a char* parameter that contains the dev_name() for the
> device requested.  The list of devices under each group can be found by
> read()ing the group fd.  

Seems like it would be nice to keep this information around, rather than
require the user to pass it around separately.  Shouldn't cost much.

> If we keep this, we should make the interfaces similar, 

Yes.

>>  PCI_BAR_INFO - information about the BAR for a device
>>
>> For a device tree type device the file may look like:
>>
>>  0x0+---------------------------+
>>     |        header             |      
>>     +---------------------------+
>>     |   type = REGION           |      
>>     |   rec_len                 |      
>>     |   flags =                 | u32 // region specific flags
>>     |       is_mmapable         | 
>>     |   offset                  | u64 // seek offset to region from
>>     |                           |        from beginning
>>     |   len                     | u64 // length of region
>>     |   addr                    | u64 // phys addr of region
> 
> Would we only need to expose phys addr for 1:1 mapping requirements?
> I'm not sure why we'd care to expose this otherwise.

It's more important for non-PCI, where it avoids the need for userspace
to parse the device tree to find the guest address (we'll usually want
1:1), or to consolidate pages shared by multiple regions.  It could be
nice for debugging, as well.

Seems like something that's better to have and not need, than the other
way around.  It would need to be optional, though, if we want to be able
to describe things without a physical address like config space.

>>     |                           |      
>>     +---------------------------+
>>      \   type = DTPATH          \  // a sub-region
>>       |   rec_len                |      
>>       |   flags                  |      
>>       |   dev tree path          | char[] // device tree path
>>     +---------------------------+
>>      \   type = DTINDEX         \  // a sub-region
>>       |   rec_len                |      
>>       |   flags                  |      
>>       |   prop_type              | u32  // REG, RANGES
>>       |   prop_index             | u32  // index  into resource list
>>     +---------------------------+
>>     |  type = INTERRUPT         |      
>>     |  rec_len                  |      
>>     |  flags                    | u32 
>>     |  ioctl_handle             | u32 // argument to ioctl to get interrupts
> 
> Is this a dynamic ioctl number or just a u32 parameter to an ioctl like
> VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)?

It's a parameter to VFIO_DEVICE_GET_IRQ_FD.

>>     +---------------------------+
>>     |   type = PCI_INFO         |      
>>     |   rec_len                 |      
>>     |   flags = 0x0             |      
>>     |   dom:bus:dev:func        | u32 // pci device info
>>     +---------------------------+
>>     |   type = REGION           |      
>>     |   rec_len                 |      
>>     |   flags =                 |      
>>     |       is_mmapable         |      
>>     |   offset                  | u64 // seek offset to region from
>>     |                           |        from beginning
>>     |   len                     | u64 // length of region
>>     |   addr                    | u64 // physical addr of region
>>     +---------------------------+
>>      \   type = PCI_BAR_INFO    \      
>>       |   rec_len                |      
>>       |   flags                  |      
>>       |   bar_type               |  // pio
>>       |                          |  // prefetable mmio
>>       |                          |  // non-prefetchable mmmio
>>       |   bar_index              |  // index of bar in device
> 
> Aren't a lot of these typical region attributes?  Wondering if we should
> just make them part of the REGION flags or we'll have a growing number
> of sub-regions describing common things.

It'd be nice to keep the base region record common among
PCI/DT/whatever.

PCI_BAR_INFO itself can grow using flags if necessary.

> Even for non-PCI we need to
> know if the region is pio/mmio32/mmio64/prefetchable/etc.

Outside of PCI, what standardized form would you put such information
in?  Where would the kernel get this information?  What does
mmio32/mmio64 mean in this context?

> BAR index could really just translate to a REGION instance number.

How would that work if you make non-BAR things (such as config space)
into regions?

-Scott

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 21:58   ` Scott Wood
@ 2011-08-29 22:46     ` Alex Williamson
  2011-08-29 23:14       ` Scott Wood
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2011-08-29 22:46 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	qemu-devel@nongnu.org, Yoder Stuart-B08248, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
> On 08/29/2011 02:51 PM, Alex Williamson wrote:
> > On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:
> >> The device info records following the file header have the following
> >> record types each with content encoded in a record specific way:
> >>
> >>  REGION  - describes an addressable address range for the device
> >>  DTPATH - describes the device tree path for the device
> >>  DTINDEX - describes the index into the related device tree
> >>            property (reg,ranges,interrupts,interrupt-map)
> > 
> > I don't quite understand if these are physical or virtual.
> 
> If what are physical or virtual?

Can you give an example of a path vs an index?  I don't understand
enough about these to ask a useful question about what they're
describing.

> >>  INTERRUPT - describes an interrupt for the device
> >>  PCI_CONFIG_SPACE - describes config space for the device
> > 
> > I would have expected this to be a REGION with a property of
> > PCI_CONFIG_SPACE.
> 
> Could be, if physical address is made optional.

Or physical address is also a property, aka sub-region.

> >>  PCI_INFO - domain:bus:device:func
> > 
> > Not entirely sure we need this.  How are you imagining we get from a
> > group fd to a device fd (wondering if you're only including this for
> > enumeration)?  I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD
> > ioctl that takes a char* parameter that contains the dev_name() for the
> > device requested.  The list of devices under each group can be found by
> > read()ing the group fd.  
> 
> Seems like it would be nice to keep this information around, rather than
> require the user to pass it around separately.  Shouldn't cost much.

Seems redundant.  The user had to know the d:b:d.f to get the device fd
in the first place.  At best it's a sanity check.

> > If we keep this, we should make the interfaces similar, 
> 
> Yes.
> 
> >>  PCI_BAR_INFO - information about the BAR for a device
> >>
> >> For a device tree type device the file may look like:
> >>
> >>  0x0+---------------------------+
> >>     |        header             |      
> >>     +---------------------------+
> >>     |   type = REGION           |      
> >>     |   rec_len                 |      
> >>     |   flags =                 | u32 // region specific flags
> >>     |       is_mmapable         | 
> >>     |   offset                  | u64 // seek offset to region from
> >>     |                           |        from beginning
> >>     |   len                     | u64 // length of region
> >>     |   addr                    | u64 // phys addr of region
> > 
> > Would we only need to expose phys addr for 1:1 mapping requirements?
> > I'm not sure why we'd care to expose this otherwise.
> 
> It's more important for non-PCI, where it avoids the need for userspace
> to parse the device tree to find the guest address (we'll usually want
> 1:1), or to consolidate pages shared by multiple regions.  It could be
> nice for debugging, as well.

So the device tree path is ripped straight from the system, so it's the
actual 1:1, matching physical hardware, path.

> Seems like something that's better to have and not need, than the other
> way around.  It would need to be optional, though, if we want to be able
> to describe things without a physical address like config space.

sub-region?

> >>     |                           |      
> >>     +---------------------------+
> >>      \   type = DTPATH          \  // a sub-region
> >>       |   rec_len                |      
> >>       |   flags                  |      
> >>       |   dev tree path          | char[] // device tree path
> >>     +---------------------------+
> >>      \   type = DTINDEX         \  // a sub-region
> >>       |   rec_len                |      
> >>       |   flags                  |      
> >>       |   prop_type              | u32  // REG, RANGES
> >>       |   prop_index             | u32  // index  into resource list
> >>     +---------------------------+
> >>     |  type = INTERRUPT         |      
> >>     |  rec_len                  |      
> >>     |  flags                    | u32 
> >>     |  ioctl_handle             | u32 // argument to ioctl to get interrupts
> > 
> > Is this a dynamic ioctl number or just a u32 parameter to an ioctl like
> > VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)?
> 
> It's a parameter to VFIO_DEVICE_GET_IRQ_FD.
> 
> >>     +---------------------------+
> >>     |   type = PCI_INFO         |      
> >>     |   rec_len                 |      
> >>     |   flags = 0x0             |      
> >>     |   dom:bus:dev:func        | u32 // pci device info
> >>     +---------------------------+
> >>     |   type = REGION           |      
> >>     |   rec_len                 |      
> >>     |   flags =                 |      
> >>     |       is_mmapable         |      
> >>     |   offset                  | u64 // seek offset to region from
> >>     |                           |        from beginning
> >>     |   len                     | u64 // length of region
> >>     |   addr                    | u64 // physical addr of region
> >>     +---------------------------+
> >>      \   type = PCI_BAR_INFO    \      
> >>       |   rec_len                |      
> >>       |   flags                  |      
> >>       |   bar_type               |  // pio
> >>       |                          |  // prefetable mmio
> >>       |                          |  // non-prefetchable mmmio
> >>       |   bar_index              |  // index of bar in device
> > 
> > Aren't a lot of these typical region attributes?  Wondering if we should
> > just make them part of the REGION flags or we'll have a growing number
> > of sub-regions describing common things.
> 
> It'd be nice to keep the base region record common among
> PCI/DT/whatever.
> 
> PCI_BAR_INFO itself can grow using flags if necessary.
> 
> > Even for non-PCI we need to
> > know if the region is pio/mmio32/mmio64/prefetchable/etc.
> 
> Outside of PCI, what standardized form would you put such information
> in?  Where would the kernel get this information?  What does
> mmio32/mmio64 mean in this context?

I could imagine a platform device described by ACPI that might want to
differentiate.  The physical device doesn't get moved of course, but
guest drivers might care how the device is described if we need to
rebuild those ACPI tables.  ACPI might even be a good place to leverage
these data structures... /me ducks.

> > BAR index could really just translate to a REGION instance number.
> 
> How would that work if you make non-BAR things (such as config space)
> into regions?

Put their instance numbers outside of the BAR region?  We have a fixed
REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).  BTW, we
need a read-only flag for ROMs too.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 22:46     ` Alex Williamson
@ 2011-08-29 23:14       ` Scott Wood
  2011-08-30  4:55         ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Scott Wood @ 2011-08-29 23:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	qemu-devel@nongnu.org, Yoder Stuart-B08248, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On 08/29/2011 05:46 PM, Alex Williamson wrote:
> On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
>> On 08/29/2011 02:51 PM, Alex Williamson wrote:
>>> On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:
>>>> The device info records following the file header have the following
>>>> record types each with content encoded in a record specific way:
>>>>
>>>>  REGION  - describes an addressable address range for the device
>>>>  DTPATH - describes the device tree path for the device
>>>>  DTINDEX - describes the index into the related device tree
>>>>            property (reg,ranges,interrupts,interrupt-map)
>>>
>>> I don't quite understand if these are physical or virtual.
>>
>> If what are physical or virtual?
> 
> Can you give an example of a path vs an index?  I don't understand
> enough about these to ask a useful question about what they're
> describing.

You'd have both path and index.

Example, for this tree:

/ {
	...
	foo {
		...
		bar {
			reg = <0x1000 64 0x1800 64>;
			ranges = <0 0x20000 0x10000>;
			...

			child {
				reg = <0x100 0x100>;
				...
			};
		};
	};
};

There would be 4 regions if you bind to /foo/bar:

// this is 64 bytes at 0x1000
DTPATH "/foo/bar"
DTINDEX prop_type=REG prop_index=0

// this is 64 bytes at 0x1800
DTPATH "/foo/bar"
DTINDEX prop_type=REG prop_index=1

// this is 16K at 0x20000
DTPATH "/foo/bar"
DTINDEX prop_type=RANGES prop_index=0

// this is 256 bytes at 0x20100
DTPATH "/foo/bar/child"
DTINDEX prop_type=REG prop_index=0

Both ranges and the child reg are needed, since ranges could be a simple
"ranges;" that passes everything with no translation, and child nodes
could be absent-but-implied in some other cases (such as when they
represent PCI devices which can be probed -- we still need to map the
ranges that correspond to PCI controller windows).

>>>>  INTERRUPT - describes an interrupt for the device
>>>>  PCI_CONFIG_SPACE - describes config space for the device
>>>
>>> I would have expected this to be a REGION with a property of
>>> PCI_CONFIG_SPACE.
>>
>> Could be, if physical address is made optional.
> 
> Or physical address is also a property, aka sub-region.

A subrecord of REGION is fine with me.

>>> Would we only need to expose phys addr for 1:1 mapping requirements?
>>> I'm not sure why we'd care to expose this otherwise.
>>
>> It's more important for non-PCI, where it avoids the need for userspace
>> to parse the device tree to find the guest address (we'll usually want
>> 1:1), or to consolidate pages shared by multiple regions.  It could be
>> nice for debugging, as well.
> 
> So the device tree path is ripped straight from the system, so it's the
> actual 1:1, matching physical hardware, path.

Yes.

>>> Even for non-PCI we need to
>>> know if the region is pio/mmio32/mmio64/prefetchable/etc.
>>
>> Outside of PCI, what standardized form would you put such information
>> in?  Where would the kernel get this information?  What does
>> mmio32/mmio64 mean in this context?
> 
> I could imagine a platform device described by ACPI that might want to
> differentiate.  The physical device doesn't get moved of course, but
> guest drivers might care how the device is described if we need to
> rebuild those ACPI tables.  ACPI might even be a good place to leverage
> these data structures... /me ducks.

ACPI info could be another subrecord type, but in the device tree
system-bus case we generally don't have this information at the generic
infrastructure level.  Drivers are expected to know how their devices'
regions should be mapped.

>>> BAR index could really just translate to a REGION instance number.
>>
>> How would that work if you make non-BAR things (such as config space)
>> into regions?
> 
> Put their instance numbers outside of the BAR region?  We have a fixed
> REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
> instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).

Seems more awkward than just having each region say what it is.  What do
you do to fill in the gaps?

-Scott

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 23:14       ` Scott Wood
@ 2011-08-30  4:55         ` Alex Williamson
  2011-08-30 16:54           ` Scott Wood
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2011-08-30  4:55 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	qemu-devel@nongnu.org, Yoder Stuart-B08248, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On Mon, 2011-08-29 at 18:14 -0500, Scott Wood wrote:
> On 08/29/2011 05:46 PM, Alex Williamson wrote:
> > On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
> >> On 08/29/2011 02:51 PM, Alex Williamson wrote:
> >>> On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:
> >>>> The device info records following the file header have the following
> >>>> record types each with content encoded in a record specific way:
> >>>>
> >>>>  REGION  - describes an addressable address range for the device
> >>>>  DTPATH - describes the device tree path for the device
> >>>>  DTINDEX - describes the index into the related device tree
> >>>>            property (reg,ranges,interrupts,interrupt-map)
> >>>
> >>> I don't quite understand if these are physical or virtual.
> >>
> >> If what are physical or virtual?
> > 
> > Can you give an example of a path vs an index?  I don't understand
> > enough about these to ask a useful question about what they're
> > describing.
> 
> You'd have both path and index.
> 
> Example, for this tree:
> 
> / {
> 	...
> 	foo {
> 		...
> 		bar {
> 			reg = <0x1000 64 0x1800 64>;
> 			ranges = <0 0x20000 0x10000>;
> 			...
> 
> 			child {
> 				reg = <0x100 0x100>;
> 				...
> 			};
> 		};
> 	};
> };
> 
> There would be 4 regions if you bind to /foo/bar:
> 
> // this is 64 bytes at 0x1000
> DTPATH "/foo/bar"
> DTINDEX prop_type=REG prop_index=0
> 
> // this is 64 bytes at 0x1800
> DTPATH "/foo/bar"
> DTINDEX prop_type=REG prop_index=1
> 
> // this is 16K at 0x20000
> DTPATH "/foo/bar"
> DTINDEX prop_type=RANGES prop_index=0
> 
> // this is 256 bytes at 0x20100
> DTPATH "/foo/bar/child"
> DTINDEX prop_type=REG prop_index=0
> 
> Both ranges and the child reg are needed, since ranges could be a simple
> "ranges;" that passes everything with no translation, and child nodes
> could be absent-but-implied in some other cases (such as when they
> represent PCI devices which can be probed -- we still need to map the
> ranges that correspond to PCI controller windows).

Thanks for the example.  Is it always the case that you need a path and
an index?  If so, why are they separate sub-regions instead of combined
into a DT_INFO sub-region?

> >>>>  INTERRUPT - describes an interrupt for the device
> >>>>  PCI_CONFIG_SPACE - describes config space for the device
> >>>
> >>> I would have expected this to be a REGION with a property of
> >>> PCI_CONFIG_SPACE.
> >>
> >> Could be, if physical address is made optional.
> > 
> > Or physical address is also a property, aka sub-region.
> 
> A subrecord of REGION is fine with me.
> 
> >>> Would we only need to expose phys addr for 1:1 mapping requirements?
> >>> I'm not sure why we'd care to expose this otherwise.
> >>
> >> It's more important for non-PCI, where it avoids the need for userspace
> >> to parse the device tree to find the guest address (we'll usually want
> >> 1:1), or to consolidate pages shared by multiple regions.  It could be
> >> nice for debugging, as well.
> > 
> > So the device tree path is ripped straight from the system, so it's the
> > actual 1:1, matching physical hardware, path.
> 
> Yes.
> 
> >>> Even for non-PCI we need to
> >>> know if the region is pio/mmio32/mmio64/prefetchable/etc.
> >>
> >> Outside of PCI, what standardized form would you put such information
> >> in?  Where would the kernel get this information?  What does
> >> mmio32/mmio64 mean in this context?
> > 
> > I could imagine a platform device described by ACPI that might want to
> > differentiate.  The physical device doesn't get moved of course, but
> > guest drivers might care how the device is described if we need to
> > rebuild those ACPI tables.  ACPI might even be a good place to leverage
> > these data structures... /me ducks.
> 
> ACPI info could be another subrecord type, but in the device tree
> system-bus case we generally don't have this information at the generic
> infrastructure level.  Drivers are expected to know how their devices'
> regions should be mapped.

The device tree tells them how they're mapped, right?  Or maybe more
precisely, the device tree tells them where they're mapped and it
doesn't really matter whether they're 32bit or 64bit because they can't
be moved.

Maybe this is sub-region material.  It just feels wrong to enumerate a
region and not be able to include any basic properties beyond offset and
size in a common field.  For PCI, we can also describe the properties
via config space, so sub-regions could still be optional.

> >>> BAR index could really just translate to a REGION instance number.
> >>
> >> How would that work if you make non-BAR things (such as config space)
> >> into regions?
> > 
> > Put their instance numbers outside of the BAR region?  We have a fixed
> > REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
> > instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).
> 
> Seems more awkward than just having each region say what it is.  What do
> you do to fill in the gaps?

You don't, instance numbers would just be non-contiguous.  The reason I
cringe at PCI_INFO sub-regions is that all the info is already available
in PCI config space and I'm not sure if there's any benefit to duplicate
it in this table.  If not, the minimum we need to know is the file
offset to access each region, config/BARs/ROM.  We also have MSI/X
interrupts on PCI.  Do we need to describe those via info records or
parse PCI config space and know that vfio-pci device fds support the
VFIO_PCI_DEVICE_SET_MSI_FDS ioctl?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-30  4:55         ` Alex Williamson
@ 2011-08-30 16:54           ` Scott Wood
  0 siblings, 0 replies; 13+ messages in thread
From: Scott Wood @ 2011-08-30 16:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Alexander Graf,
	qemu-devel@nongnu.org, Yoder Stuart-B08248, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On 08/29/2011 11:55 PM, Alex Williamson wrote:
> Thanks for the example.  Is it always the case that you need a path and
> an index?  If so, why are they separate sub-regions instead of combined
> into a DT_INFO sub-region?

You'll always need both.  DTPATH was put into a separate subrecord
because it is of variable length.  Putting it at the end of DTINDEX
would inhibit DTINDEX's extensibility.

> On Mon, 2011-08-29 at 18:14 -0500, Scott Wood wrote:
>> On 08/29/2011 05:46 PM, Alex Williamson wrote:
>>> I could imagine a platform device described by ACPI that might want to
>>> differentiate.  The physical device doesn't get moved of course, but
>>> guest drivers might care how the device is described if we need to
>>> rebuild those ACPI tables.  ACPI might even be a good place to leverage
>>> these data structures... /me ducks.
>>
>> ACPI info could be another subrecord type, but in the device tree
>> system-bus case we generally don't have this information at the generic
>> infrastructure level.  Drivers are expected to know how their devices'
>> regions should be mapped.
> 
> The device tree tells them how they're mapped, right?
> Or maybe more precisely, the device tree tells them where they're mapped and it
> doesn't really matter whether they're 32bit or 64bit because they can't
> be moved.

The device tree says where they're mapped.  The concept of "32/64bit"
doesn't really apply to a fixed mapping -- or if not fixed, there may be
arbitrary constraints, and the OS should just treat it as fixed.

> Maybe this is sub-region material.  It just feels wrong to enumerate a
> region and not be able to include any basic properties beyond offset and
> size in a common field.

offset, size, and whether you can use mmap() are the only things we
could think of that were universal.  The rest can go in subrecords.

>>> Put their instance numbers outside of the BAR region?  We have a fixed
>>> REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 ==
>>> instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).
>>
>> Seems more awkward than just having each region say what it is.  What do
>> you do to fill in the gaps?
> 
> You don't, instance numbers would just be non-contiguous.

OK, maybe I'm not understanding what you mean by instance numbers -- I
thought you were talking about the order in which they appear.  If you
mean to put an index field in the common REGION struct, I'd prefer not
to.  It would lack meaning without the context of some particular
encoding scheme.  Better to put it in a typed subrecord where we can be
explicit about what it means.

> The reason I cringe at PCI_INFO sub-regions is that all the info is already available
> in PCI config space

BAR index isn't.  I'm neutral on whether it makes sense to include BAR type.

> We also have MSI/X
> interrupts on PCI.  Do we need to describe those via info records or
> parse PCI config space and know that vfio-pci device fds support the
> VFIO_PCI_DEVICE_SET_MSI_FDS ioctl?

I'm not very familiar with MSI-X, but if it's workable without it then
probably no need.

-Scott

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-08-29 16:51 [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files Yoder Stuart-B08248
  2011-08-29 19:04 ` Anthony Liguori
  2011-08-29 19:51 ` Alex Williamson
@ 2011-09-01 20:00 ` Michael S. Tsirkin
  2011-09-01 20:26   ` Scott Wood
  2 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2011-09-01 20:00 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	Alexander Graf, alex.williamson@redhat.com, avi@redhat.com,
	Joerg.Roedel@amd.com, David Gibson

On Mon, Aug 29, 2011 at 04:51:20PM +0000, Yoder Stuart-B08248 wrote:
> Alex Graf, Scott Wood, and I met last week to try to flesh out
> some details as to how vfio could work for non-PCI devices,
> like we have in embedded systems.   This most likely will
> require a different kernel driver than vfio-- for now we are
> calling it "dtio" (for device tree I/O) as there is no way
> to discover these devices except from the device tree.   But
> the dtio driver would use the same architecture and interfaces
> as vfio.
> 
> For devices on a system bus and represented in a device
> tree we have some different requirements than PCI for what
> is exposed in the device fd file.  A device may have multiple
> address regions, multiple interrupts, a variable length device
> tree path, whether a region is mmapable, etc.
> 
> With existing vfio, the device fd file layout is something
> like:
>   0xF Config space offset
>   ...
>   0x6 ROM offset
>   0x5 BAR 5 offset
>   0x4 BAR 4 offset
>   0x3 BAR 3 offset
>   0x2 BAR 2 offset
>   0x1 BAR 1 offset
>   0x0 BAR 0 offset
> 
> We have an alternate proposal that we think is more flexible,
> extensible, and will accommodate both PCI and system bus
> type devices (and others).
> 
> Instead of config space fixed at 0xf, we would propose
> a header and multiple 'device info' records at offset 0x0 that would
> encode everything that user space needs to know about
> the device:
> ....

That's a very rich interface, and easy to get wrong.
AFAIK the only reason vfio uses read/write for PCI was to avoid inventing
a custom interface. But if you do, it looks like a set of ioctls would
be much easier? You can even fit the existing uio infrastructure if you like.

> There may be other more complex device or bus types that
> need their own special encodings, and this format would
> allow the definition of new records to define devices.  Two
> other types that come to mind are Serial Rapid I/O busses
> commonly used in our networking SoCs and the FSL DPAA
> portals which are very strange devices that may require
> their own unique interface exposed to user space.
> 
> In short, when user space opens up a device fd it needs
> some information about what this device is, and this
> proposal tries to address that.
> 
> Regards,
> Stuart Yoder

Here's another idea:  all the information is likely already available
in sysfs. A way to query where the device is in sysfs
would give you *a ton* of information, including the type etc,
if the info is not there you will be able to add it
in a way that is useful to more than just vfio,
and you won't need to extend the interface each time
you find you need a new piece of info.


-- 
MST

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-09-01 20:00 ` Michael S. Tsirkin
@ 2011-09-01 20:26   ` Scott Wood
  2011-09-02 15:57     ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Scott Wood @ 2011-09-01 20:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Joerg.Roedel@amd.com,
	qemu-devel@nongnu.org, Alexander Graf, Yoder Stuart-B08248,
	alex.williamson@redhat.com, avi@redhat.com, David Gibson

On 09/01/2011 03:00 PM, Michael S. Tsirkin wrote:
> That's a very rich interface, and easy to get wrong.
> AFAIK the only reason vfio uses read/write for PCI was to avoid inventing
> a custom interface. But if you do, it looks like a set of ioctls would
> be much easier? You can even fit the existing uio infrastructure if you like.

How would it be easier than producing/parsing a static data structure?
What would it look like?

> Here's another idea:  all the information is likely already available
> in sysfs.

The only major thing that is likely available elsewhere is PCI config
space, and that was not new to this proposal.

Most other material is specifically related to the vfio/dtio interface
(e.g. offsets into the file handle, arguments to the "get irq fd" ioctl,
mapping of dtio regions/interrupts to device tree nodes), and could not
be "useful to more than just vfio".

> A way to query where the device is in sysfs
> would give you *a ton* of information, including the type etc,

For PCI, the user has domain/bus/dev/fn which should be sufficient to
find that, if desired.  For device-tree devices, there's a device tree
path provided for each region/interrupt.

-Scott

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-09-01 20:26   ` Scott Wood
@ 2011-09-02 15:57     ` Michael S. Tsirkin
  2011-09-02 17:50       ` Scott Wood
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2011-09-02 15:57 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Joerg.Roedel@amd.com,
	qemu-devel@nongnu.org, Alexander Graf, Yoder Stuart-B08248,
	alex.williamson@redhat.com, avi@redhat.com, David Gibson

On Thu, Sep 01, 2011 at 03:26:01PM -0500, Scott Wood wrote:
> On 09/01/2011 03:00 PM, Michael S. Tsirkin wrote:
> > That's a very rich interface, and easy to get wrong.
> > AFAIK the only reason vfio uses read/write for PCI was to avoid inventing
> > a custom interface. But if you do, it looks like a set of ioctls would
> > be much easier? You can even fit the existing uio infrastructure if you like.
> 
> How would it be easier than producing/parsing a static data structure?
> What would it look like?

For example, for a property X, instead of adding a structure
with identifier X, implement ioctl GET_X. Userspace
calls this ioctl, an error implies the property
is not present.


> > Here's another idea:  all the information is likely already available
> > in sysfs.
> 
> The only major thing that is likely available elsewhere is PCI config
> space, and that was not new to this proposal.
> Most other material is specifically related to the vfio/dtio interface
> (e.g. offsets into the file handle, arguments to the "get irq fd" ioctl,
> mapping of dtio regions/interrupts to device tree nodes), and could not
> be "useful to more than just vfio".

For example resources are already there in sysfs.

> > A way to query where the device is in sysfs
> > would give you *a ton* of information, including the type etc,
> 
> For PCI, the user has domain/bus/dev/fn which should be sufficient to
> find that, if desired.  For device-tree devices, there's a device tree
> path provided for each region/interrupt.
> 
> -Scott

So you are saying the user already has sysfs path?
I thought the point was to pass all info through
a single fd.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
  2011-09-02 15:57     ` Michael S. Tsirkin
@ 2011-09-02 17:50       ` Scott Wood
  0 siblings, 0 replies; 13+ messages in thread
From: Scott Wood @ 2011-09-02 17:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Wood Scott-B07421, kvm@vger.kernel.org, Joerg.Roedel@amd.com,
	qemu-devel@nongnu.org, Alexander Graf, Yoder Stuart-B08248,
	alex.williamson@redhat.com, avi@redhat.com, David Gibson

On 09/02/2011 10:57 AM, Michael S. Tsirkin wrote:
> On Thu, Sep 01, 2011 at 03:26:01PM -0500, Scott Wood wrote:
>> On 09/01/2011 03:00 PM, Michael S. Tsirkin wrote:
>>> That's a very rich interface, and easy to get wrong.
>>> AFAIK the only reason vfio uses read/write for PCI was to avoid inventing
>>> a custom interface. But if you do, it looks like a set of ioctls would
>>> be much easier? You can even fit the existing uio infrastructure if you like.
>>
>> How would it be easier than producing/parsing a static data structure?
>> What would it look like?
> 
> For example, for a property X, instead of adding a structure
> with identifier X, implement ioctl GET_X. Userspace
> calls this ioctl, an error implies the property
> is not present.

Why is this better?

>>> Here's another idea:  all the information is likely already available
>>> in sysfs.
>>
>> The only major thing that is likely available elsewhere is PCI config
>> space, and that was not new to this proposal.
>> Most other material is specifically related to the vfio/dtio interface
>> (e.g. offsets into the file handle, arguments to the "get irq fd" ioctl,
>> mapping of dtio regions/interrupts to device tree nodes), and could not
>> be "useful to more than just vfio".
> 
> For example resources are already there in sysfs.

Their relationship to vfio regions, and their offsets in the fd, is not.

>>> A way to query where the device is in sysfs
>>> would give you *a ton* of information, including the type etc,
>>
>> For PCI, the user has domain/bus/dev/fn which should be sufficient to
>> find that, if desired.  For device-tree devices, there's a device tree
>> path provided for each region/interrupt.
>>
>> -Scott
> 
> So you are saying the user already has sysfs path?
> I thought the point was to pass all info through
> a single fd.

No, as I understand it the point is to provide access through that fd,
as well as information that is specific to that fd.  Whether any
particular extra bit of information gets included there is a question of
convenience -- which should not be ignored in interface design.

-Scott

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-09-02 17:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-29 16:51 [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files Yoder Stuart-B08248
2011-08-29 19:04 ` Anthony Liguori
2011-08-29 19:32   ` Scott Wood
2011-08-29 19:51 ` Alex Williamson
2011-08-29 21:58   ` Scott Wood
2011-08-29 22:46     ` Alex Williamson
2011-08-29 23:14       ` Scott Wood
2011-08-30  4:55         ` Alex Williamson
2011-08-30 16:54           ` Scott Wood
2011-09-01 20:00 ` Michael S. Tsirkin
2011-09-01 20:26   ` Scott Wood
2011-09-02 15:57     ` Michael S. Tsirkin
2011-09-02 17:50       ` Scott Wood

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).