[Qemu-devel] device assignment for embedded Power


* [Qemu-devel] device assignment for embedded Power
@ 2011-06-30 15:59 Yoder Stuart-B08248
  2011-07-01  0:58 ` Benjamin Herrenschmidt
  2011-07-01 11:16 ` Paul Brook
  0 siblings, 2 replies; 29+ messages in thread
From: Yoder Stuart-B08248 @ 2011-06-30 15:59 UTC (permalink / raw)
  To: qemu-devel@nongnu.org
  Cc: Wood Scott-B07421, Alexander Graf, dwg@au1.ibm.com,
	blauwirbel@gmail.com, alex.williamson@redhat.com,
	paul@codesourcery.com, joerg.roedel@amd.com, armbru@redhat.com

One feature we need for QEMU/KVM on embedded Power Architecture is the 
ability to do passthru assignment of SoC I/O devices and memory.  An 
important use case in embedded is creating static partitions-- 
taking physical memory and I/O devices (non-PCI) and partitioning
them between the host Linux and several virtual machines.   Things like
live migration would not be needed or supported in these types of scenarios.

SoC devices do not sit on a probeable bus and there are no identifiers 
like 01:00.0 with PCI that we can use to identify devices--  the host
Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
device tree structure passed at boot.   QEMU needs to generate a
device tree to pass to the guest as well with all the guest's virtual
and physical resources.  Today a number of mostly complete guest device
trees are kept under ./pc-bios in QEMU, but this is too static and
inflexible.

Some new mechanism is needed to assign SoC devices to guests, and we
(FSL + Alex Graf) have been discussing a few possible approaches
for doing this from QEMU and would like some feedback.

Some possibilities:

1. Option 1.  Pass the host dev tree to QEMU and assign devices
   by device tree path

     -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000

   /soc/i2c@3000 is the device tree path to the assigned device.
   The device node 'i2c@3000' has some number of properties (e.g. 
   address, interrupt info) and possibly subnodes under
   it.   QEMU copies that node when generating the guest dev tree.
   See snippet of entire node:  http://paste2.org/p/1496460

2. Option 2.  Pass the entire assigned device node as a string to
   QEMU

     -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
      #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
      reg = <0xffe03000 0x100>; interrupts = <43 2>;
      interrupt-parent = <&mpic>; dfsrr;'

   This avoids needing to pass the host device tree, but could 
   get awkward-- the i2c example above is very simple, some device
   nodes are very large with a complex hierarchy of subnodes and 
   could be hundreds of lines of text to represent a single
   node.
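To make option 1 concrete, here is a minimal Python sketch of "assign by device tree path".  It is only an illustration: the tree is modeled as nested dicts rather than a real DTB, the helper names find_node and assign_by_path are hypothetical, and the property values are borrowed from the i2c node in option 2.

```python
import copy


def find_node(tree, path):
    """Return the node dict found at a '/'-separated device tree path."""
    node = tree
    for name in path.strip("/").split("/"):
        if name:
            node = node["children"][name]
    return node


def assign_by_path(host_tree, guest_tree, path):
    """Deep-copy the host node at `path` to the same path in the guest tree."""
    parts = path.strip("/").split("/")
    dest = guest_tree
    # Create any missing parent nodes (e.g. /soc) in the guest tree.
    for name in parts[:-1]:
        dest = dest["children"].setdefault(name, {"props": {}, "children": {}})
    dest["children"][parts[-1]] = copy.deepcopy(find_node(host_tree, path))
    return dest["children"][parts[-1]]


# Hypothetical host tree holding only the node being assigned.
host = {"props": {}, "children": {
    "soc": {"props": {}, "children": {
        "i2c@3000": {
            "props": {
                "compatible": "fsl-i2c",
                "reg": [0xffe03000, 0x100],
                "interrupts": [43, 2],
            },
            "children": {},
        },
    }},
}}
guest = {"props": {}, "children": {}}
node = assign_by_path(host, guest, "/soc/i2c@3000")
```

The copy is deliberately a deep copy, so later edits to the guest node never touch the host tree.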

It gets more complicated...

In some cases, modifications to device tree nodes may be needed.
An example-- sometimes a device tree property references another node 
and that relationship may not exist when assigned to a guest.
A "phy-handle" property may need to be deleted and a "fixed-link"
property added to a node representing a network device.

So in addition to assigning a device, a mechanism is needed to update 
device tree nodes.  So for the above example, maybe--

 -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

The types of modifications needed are deleting nodes, deleting properties,
adding nodes, adding properties, adding properties that reference other
nodes, and changing properties. This device tree transformation mechanism
is general enough that it could apply to any device tree based
embedded platform (e.g. ARM, MIPS).
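As a rough sketch of the property-level updates described above, the following Python models a node's properties as a dict.  The function name apply_updates and the "&phy0" placeholder are assumptions for illustration only; the fixed-link value is the one from the example command line.

```python
def apply_updates(props, delete=(), set_props=None):
    """Return a copy of `props` with some properties deleted and others set."""
    doomed = set(delete)
    updated = {name: value for name, value in props.items() if name not in doomed}
    updated.update(set_props or {})
    return updated


# Hypothetical host view of the ethernet node: it points at a PHY node
# that will not exist in the guest tree.
ethernet = {"phy-handle": "&phy0"}

guest_ethernet = apply_updates(
    ethernet,
    delete=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

Returning a new dict rather than editing in place keeps the host's view of the node unchanged.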

Another complexity relates to the IOMMU.  Here things get very company 
and IOMMU specific. Freescale has a proprietary IOMMU.
Devices have 1 or more logical I/O device numbers used to index into 
the IOMMU table. The IOMMU is limited in that it is designed to only 
support large, physically contiguous mappings per device.  It does not 
support any kind of page table.  The IOMMU hardware architecture 
assumes DMAs are typically targeted to just a few address regions.  
So, a common IOMMU setup for a device would be a device with a single 
IOMMU mapping covering the guest's main memory segment.  However, 
there are many much more complicated IOMMU setups that are common as 
well, such as doing "operation translations" where a device's write 
transaction is translated to "stash" directly into CPU caches.  We 
can't assume that all memory slots belonging to the guest are targets 
of DMA.

So for Freescale we would need some very Freescale-specific 
configuration mechanism to set up the IOMMU.  Here I think we would 
need the new qcfg approach to expressing nested
structures (http://wiki.qemu.org/Features/QCFG).   Device
assignment with IOMMU set up might look like the examples
below:

# device with multiple logical i/o device numbers

-device assigned-soc-dev,dev=/qman-portals/qman-portal@4000,
vcpu=1,fsl,iommu.stash-mem={
dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
liodn-index=1,
operation-mapping=0,
stash-dest=1},
fsl,iommu.stash-dqrr={
dma-window.guest-addr=0xff4200000,
dma-window.size=0x4000,
liodn-index=0,
operation-mapping=0,
stash-dest=1}

# assign pci-bus to a guest with multiple memory regions
#    addr       size
#    0x0         512MB
#    0x20000000  4KB  (for MSIs)
#    0x40000000  16MB (shared memory)
#    0xc0000000  64MB (shared memory)

-device assigned-soc-dev,dev=/pcie@ffe09000,
fsl,iommu={dma-window.guest-addr=0x0,
dma-window.size=0x100000000,
dma-window.subwindow-count=8,
dma-window.sub-window.0.guest-addr=0x0,
dma-window.sub-window.0.size=0x20000000,
dma-window.sub-window.1.guest-addr=0x20000000,
dma-window.sub-window.1.size=0x4000,
dma-window.sub-window.1.pci-msi-subwindow,
dma-window.sub-window.2.guest-addr=0x40000000,
dma-window.sub-window.2.size=0x01000000,
dma-window.sub-window.3.guest-addr=0xc0000000,
dma-window.sub-window.3.size=0x04000000}

The above are from some real examples based on the SoC device 
assignment mechanisms in the Freescale Embedded Hypervisor.
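To illustrate why we can't assume every guest address is a DMA target, here is a small Python sketch that checks whether a DMA access falls entirely inside one of the sub-windows from the pci-bus example.  The Window type and dma_allowed function are hypothetical names; the addresses and sizes are copied from the option string above, and real IOMMU behaviour may differ.

```python
from typing import NamedTuple


class Window(NamedTuple):
    guest_addr: int
    size: int

    def covers(self, addr, length):
        """True if [addr, addr + length) lies entirely inside this window."""
        return self.guest_addr <= addr and addr + length <= self.guest_addr + self.size


# Sub-windows 0..3 of the pci-bus example.
SUB_WINDOWS = [
    Window(0x0, 0x20000000),
    Window(0x20000000, 0x4000),        # MSI sub-window
    Window(0x40000000, 0x01000000),
    Window(0xc0000000, 0x04000000),
]


def dma_allowed(addr, length, windows=SUB_WINDOWS):
    """True if some single window covers the whole access."""
    return any(w.covers(addr, length) for w in windows)
```

With these windows, an access at 0x30000000 is rejected even though it lies inside the overall 4GB dma-window, because no sub-window covers it.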

A final thing...

Both options 1 and 2 above introduce an implementation complexity--
both need to be able to parse the textual device tree source (DTS) format:
option 2 because the entire node is passed as text, and both options in
order to do complex node updates.  QEMU would need to do syntactic and
semantic parsing of DTS syntax, basically needing parts of the front end of
dtc (the device tree compiler-- http://git.jdl.com/gitweb/).

Option 3.  So a 3rd approach could be an extension of options 1
or 2.  Instead of expressing nodes in ascii DTS format requiring
parsing, pass a compiled file in device tree binary format to QEMU
that expresses the Qdev properties.

So instead of:
 -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
  node-update="fixed-link = <2 1 1000 0 0>"

You might have a config file containing:

ethernet0 {
   compatible = "device";
   type = "assigned-soc-dev";
   dev = "/soc/ethernet@b2000";
   node-update {
      delete-prop="phy-handle";
      fixed-link = <2 1 1000 0 0>;
   };
};

You would compile the file into a DTB and then pass it to QEMU:

   -config-dtb ./myguest.dtb

The above is a very simple example-- the benefit of this approach is
in the much more complicated node updates that are sometimes needed.

The config-dtb is just an alternate way of getting complex
device tree data into QEMU.  It supplements and does not change
existing QEMU architecture.
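As an illustration of why a binary blob is easier to consume than DTS text, the sketch below validates the fixed-size header of a flattened device tree using only Python's struct module.  The field order is the standard FDT header layout; the function name parse_fdt_header and the synthetic header-only blob are assumptions for the example, and QEMU itself would use libfdt rather than code like this.

```python
import struct

FDT_MAGIC = 0xD00DFEED
HEADER = struct.Struct(">10I")  # ten big-endian 32-bit fields
FIELDS = (
    "magic", "totalsize", "off_dt_struct", "off_dt_strings",
    "off_mem_rsvmap", "version", "last_comp_version",
    "boot_cpuid_phys", "size_dt_strings", "size_dt_struct",
)


def parse_fdt_header(blob):
    """Decode and sanity-check the header of a flattened device tree blob."""
    if len(blob) < HEADER.size:
        raise ValueError("blob too short for an FDT header")
    header = dict(zip(FIELDS, HEADER.unpack_from(blob)))
    if header["magic"] != FDT_MAGIC:
        raise ValueError("not a flattened device tree (bad magic)")
    if header["totalsize"] > len(blob):
        raise ValueError("blob is shorter than its declared totalsize")
    return header


# Synthetic header-only blob (not a usable tree), just to exercise the parser.
blob = HEADER.pack(FDT_MAGIC, HEADER.size, 40, 40, 40, 17, 16, 0, 0, 0)
header = parse_fdt_header(blob)
```

No tokenizer or grammar is needed to get this far, which is the point of passing a compiled DTB instead of text.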

Some pluses of this approach:
   -avoids pulling in substantial complexity for parsing DTS
    syntax
   -device tree nodes are represented in their "native" DTB
    format
   -an available user space library (libfdt) is already part
    of QEMU for parsing DTBs
   -greatly simplifies handling node updates where nodes reference other
    nodes
   -could use either option 1 (assign node by reference) or option 2
    (assign node by
   -we've used an approach similar to this in the Freescale Embedded
    Hypervisor for 3+ years now and it's held up well

Regards,
Stuart Yoder


* Re: [Qemu-devel] device assignment for embedded Power
  2011-06-30 15:59 [Qemu-devel] device assignment for embedded Power Yoder Stuart-B08248
@ 2011-07-01  0:58 ` Benjamin Herrenschmidt
  2011-07-01 11:40   ` Alexander Graf
                     ` (3 more replies)
  2011-07-01 11:16 ` Paul Brook
  1 sibling, 4 replies; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-01  0:58 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, dwg@au1.ibm.com, blauwirbel@gmail.com,
	alex.williamson@redhat.com, paul@codesourcery.com,
	armbru@redhat.com

On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
> One feature we need for QEMU/KVM on embedded Power Architecture is the 
> ability to do passthru assignment of SoC I/O devices and memory.  An 
> important use case in embedded is creating static partitions-- 
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines.   Things like
> live migration would not be needed or supported in these types of scenarios.
> 
> SoC devices do not sit on a probeable bus and there are no identifiers 
> like 01:00.0 with PCI that we can use to identify devices--  the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
> device tree structure passed at boot.   QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources.  Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.
> 
> Some new mechanism is needed to assign SoC devices to guests, and we
> (FSL + Alex Graf) have been discussing a few possible approaches
> for doing this from QEMU and would like some feedback.
> 
> Some possibilities:
> 
> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>    by device tree path
>
>      -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
> 
>    /soc/i2c@3000 is the device tree path to the assigned device.
>    The device node 'i2c@3000' has some number of properties (e.g. 
>    address, interrupt info) and possibly subnodes under
>    it.   QEMU copies that node when generating the guest dev tree.
>    See snippet of entire node:  http://paste2.org/p/1496460

Yuck (see below)

> 2. Option 2.  Pass the entire assigned device node as a string to
>    QEMU
> 
>      -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>       #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>       reg = <0xffe03000 0x100>; interrupts = <43 2>;
>       interrupt-parent = <&mpic>; dfsrr;'

Beuark ! (see below)

>    This avoids needing to pass the host device tree, but could 
>    get awkward-- the i2c example above is very simple, some device
>    nodes are very large with a complex hierarchy of subnodes and 
>    could be hundreds of lines of text to represent a single
>    node.
> 
> It gets more complicated...


So, from a qemu command line perspective, all you should have to do is
pass qemu the device-tree -path- to the device you want to pass-through
(you may support passing a full hierarchy here).

That is for normal MMIO mapped SoC devices. Something else (individual
i2c, usb, ...) will use specific virtualization of the corresponding
busses.

Anything else sucks too much really.

From there, well, there are several approaches inside qemu/kvm to handle
that path. If you want to do things at the qemu level you can probably
parse /proc/device-tree. But I'd personally just make it a kernel thing.

IE. I would have an ioctl to "instantiate" a pass-through device, that
takes that path as an argument. I would make it return an anonymous fd
which you can then use to mmap the resources, etc...
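For the qemu-level route via /proc/device-tree, a minimal Python sketch might look like the following.  It relies on the fact that each node is exposed as a directory and each property as a file holding raw big-endian cells; the function names, and the default of one address cell and one size cell, are assumptions (the real counts come from the parent's #address-cells and #size-cells).

```python
import os
import struct


def read_cells(node_dir, prop):
    """Read a property file and decode it as big-endian 32-bit cells."""
    with open(os.path.join(node_dir, prop), "rb") as f:
        raw = f.read()
    if len(raw) % 4:
        raise ValueError("property %r is not a whole number of cells" % prop)
    return list(struct.unpack(">%dI" % (len(raw) // 4), raw))


def reg_regions(node_dir, address_cells=1, size_cells=1):
    """Decode 'reg' into (address, size) pairs.

    The addresses are relative to the parent bus; translating them through
    the parents' 'ranges' properties is not handled in this sketch.
    """
    cells = read_cells(node_dir, "reg")
    stride = address_cells + size_cells
    regions = []
    for i in range(0, len(cells) - stride + 1, stride):
        addr = 0
        for cell in cells[i:i + address_cells]:
            addr = (addr << 32) | cell
        size = 0
        for cell in cells[i + address_cells:i + stride]:
            size = (size << 32) | cell
        regions.append((addr, size))
    return regions
```

Something like reg_regions("/proc/device-tree/soc/i2c@3000") would be the intended call on a host that has such a node; the ioctl and anonymous fd variant would move this parsing into the kernel instead.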

> In some cases, modifications to device tree nodes may be needed.
> An example-- sometimes a device tree property references another node 
> and that relationship may not exist when assigned to a guest.
> A "phy-handle" property may need to be deleted and a "fixed-link"
> property added to a node representing a network device.

That's fishy. Why wouldn't you give full access to the MDIO ? It's
shared ? Such things are so device-specific that they would have to be
handled by device-specific quirks, which can live either in qemu or in
the kernel.

> So in addition to assigning a device, a mechanism is needed to update 
> device tree nodes.  So for the above example, maybe--
> 
>  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>   node-update="fixed-link = <2 1 1000 0 0>"

That's just so gross and error prone, borderline insane.

> The types of modifications needed--  deleting nodes, deleting properties, 
> adding nodes, adding properties, adding properties that reference other
> nodes, changing properties. This device tree transformation mechanism
> needed is general enough that it could apply to any device tree based
> embedded platform (e.g. ARM, MIPS)
>
> Another complexity relates to the IOMMU.  Here things get very company 
> and IOMMU specific. Freescale has a proprietary IOMMU.

Look at the work currently being done for a generic qemu iommu layer. We
need it for server power as well and from what I last saw coming from
Eduardo and David, it's not PCI specific.

> Devices have 1 or more logical I/O device numbers used to index into 
> the IOMMU table. The IOMMU is limited in that it is designed to only 
> support large, physically contiguous mappings per device.  It does not 
> support any kind of page table.  The IOMMU hardware architecture 
> assumes DMAs are typically targeted to just a few address regions.  
> So, a common IOMMU setup for a device would be a device with a single 
> IOMMU mapping covering the guest's main memory segment.  However, 
> there are many much more complicated IOMMU setups that are common as 
> well, such as doing "operation translations" where a device's write 
> transaction is translated to "stash" directly into CPU caches.  We 
> can't assume that all memory slots belonging to the guest are targets 
> of DMA.
> 
> So for Freescale we would need some very Freescale-specific 
> configuration mechanism to set up the IOMMU.  Here I think we would 
> need the new qcfg approach to expressing nested
> structures (http://wiki.qemu.org/Features/QCFG).   Device
> assignment with IOMMU set up might look like the examples
> below:

Cheers,
Ben.

> # device with multiple logical i/o device numbers
> 
> -device assigned-soc-dev,dev=/qman-portals/qman-portal@4000,
> vcpu=1,fsl,iommu.stash-mem={
> dma-window.guest-addr=0x0,
> dma-window.size=0x100000000,
> liodn-index=1,
> operation-mapping=0,
> stash-dest=1},
> fsl,iommu.stash-dqrr={
> dma-window.guest-addr=0xff4200000,
> dma-window.size=0x4000,
> liodn-index=0,
> operation-mapping=0,
> stash-dest=1}
> 
> # assign pci-bus to a guest with multiple memory regions
> #    addr       size
> #    0x0         512MB
> #    0x20000000  4KB  (for MSIs)
> #    0x40000000  16MB (shared memory)
> #    0xc0000000  64MB (shared memory)
> 
> -device assigned-soc-dev,dev=/pcie@ffe09000,
> fsl,iommu={dma-window.guest-addr=0x0,
> dma-window.size=0x100000000,
> dma-window.subwindow-count=8,
> dma-window.sub-window.0.guest-addr=0x0,
> dma-window.sub-window.0.size=0x20000000,
> dma-window.sub-window.1.guest-addr=0x20000000,
> dma-window.sub-window.1.size=0x4000,
> dma-window.sub-window.1.pci-msi-subwindow,
> dma-window.sub-window.2.guest-addr=0x40000000,
> dma-window.sub-window.2.size=0x01000000,
> dma-window.sub-window.3.guest-addr=0xc0000000,
> dma-window.sub-window.3.size=0x04000000}
> 
> The above are from some real examples based on the SoC device 
> assignment mechanisms in the Freescale Embedded Hypervisor.
> 
> A final thing...
> 
> Both options 1 and 2 above introduce an implementation complexity--
> both need to be able to parse text device tree syntax format.  In option
> 2 since the entire node is passed as text.  And both options for doing
> complex node updates.  QEMU would need to do syntactic and semantic
> parsing of DTS syntax, basically needing parts of the front end of
> dtc (the device tree compiler-- http://git.jdl.com/gitweb/).
> 
> Option 3.  So a 3rd approach could be an extension of options 1
> or 2.  Instead of expressing nodes in ascii DTS format requiring
> parsing, pass a compiled file in device tree binary format to QEMU
> that expresses the Qdev properties.
> 
> So instead of:
>  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>   node-update="fixed-link = <2 1 1000 0 0>"
> 
> You might have a config file containing:
> 
> ethernet0 {
>    compatible = "device";
>    type = "assigned-soc-dev";
>    dev = "/soc/ethernet@b2000";
>    node-update {
>       delete-prop="phy-handle";
>       fixed-link = <2 1 1000 0 0>;
>    }; 
> };
> 
> You would compile the file into a DTB and then pass it to QEMU:
> 
>    -config-dtb ./myguest.dtb
> 
> The above is a very simple example-- the benefit of this approach is
> in the much more complicated node updates that are sometimes needed.
> 
> The config-dtb is just an alternate way of getting complex
> device tree data into QEMU.  It supplements and does not change
> existing QEMU architecture.
> 
> Some pluses of this approach:
>    -avoids pulling in substantial complexity for parsing DTS
>     syntax
>    -device tree nodes are represented in their "native" DTB
>     format
>    -an available user space library (libfdt) is already part
>     of QEMU for parsing DTBs
>    -greatly simplifies handling node updates where nodes reference other
>     nodes
>    -could use either option 1 (assign node by reference) or option 2
>     (assign node by
>    -we've used an approach similar to this in the Freescale Embedded
>     Hypervisor for 3+ years now and it's held up well
> 
> 
> Regards,
> Stuart Yoder


* Re: [Qemu-devel] device assignment for embedded Power
  2011-06-30 15:59 [Qemu-devel] device assignment for embedded Power Yoder Stuart-B08248
  2011-07-01  0:58 ` Benjamin Herrenschmidt
@ 2011-07-01 11:16 ` Paul Brook
  2011-07-01 11:33   ` Alexander Graf
  2011-07-01 17:51   ` Scott Wood
  1 sibling, 2 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 11:16 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, Alexander Graf,
	blauwirbel@gmail.com, alex.williamson@redhat.com,
	joerg.roedel@amd.com, dwg@au1.ibm.com, armbru@redhat.com

> One feature we need for QEMU/KVM on embedded Power Architecture is the
> ability to do passthru assignment of SoC I/O devices and memory.  An
> important use case in embedded is creating static partitions--
> taking physical memory and I/O devices (non-PCI) and partitioning
> them between the host Linux and several virtual machines.   Things like
> live migration would not be needed or supported in these types of
> scenarios.
> 
> SoC devices do not sit on a probeable bus and there are no identifiers
> like 01:00.0 with PCI that we can use to identify devices--  the host
> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> device tree structure passed at boot.   QEMU needs to generate a
> device tree to pass to the guest as well with all the guest's virtual
> and physical resources.  Today a number of mostly complete guest device
> trees are kept under ./pc-bios in QEMU, but this is too static and
> inflexible.

I doubt you're going to get generic passthrough of arbitrary devices working 
in a useful way. My expectation is that, at minimum, you'll need a bus 
specific proxy device. i.e. create a virtual device in qemu that responds to 
the guest, and happens to poke at a host device rather than emulating things
directly.

For busses like I2C this is fairly trivial - all communication with the device 
goes down a single well defined and easily proxied channel.  For more complex 
busses you end up having to emulate a lot more.  Basically you have to emulate 
everything that is different between the host and guest.  If that happens to 
include device specific state then you lose.

Using PCI devices as an example: The resources provided by the device are 
self-describing, so proxying those is fairly straightforward, and doesn't even 
require manual configuration.  However replicating the environment seen by the 
device is trickier as PCI devices can initiate memory accesses (i.e. bus-
master).  For machines without an IOMMU this means passthrough in general 
can't work, and substantial amounts of device specific knowledge is required. 
You'd need to intercept and modify and/or proxy all data relating to DMA
addresses.  In practice you need to emulate an IOMMU inside qemu (so you can 
determine the address space accessed by the device), and arrange for the host 
IOMMU to present the same virtual address space to the real device.

Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 11:16 ` Paul Brook
@ 2011-07-01 11:33   ` Alexander Graf
  2011-07-01 11:55     ` Paul Brook
  2011-07-01 17:51   ` Scott Wood
  1 sibling, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-01 11:33 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com,
	armbru@redhat.com


On 01.07.2011, at 13:16, Paul Brook wrote:

>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>> ability to do passthru assignment of SoC I/O devices and memory.  An
>> important use case in embedded is creating static partitions--
>> taking physical memory and I/O devices (non-PCI) and partitioning
>> them between the host Linux and several virtual machines.   Things like
>> live migration would not be needed or supported in these types of
>> scenarios.
>> 
>> SoC devices do not sit on a probeable bus and there are no identifiers
>> like 01:00.0 with PCI that we can use to identify devices--  the host
>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>> device tree structure passed at boot.   QEMU needs to generate a
>> device tree to pass to the guest as well with all the guest's virtual
>> and physical resources.  Today a number of mostly complete guest device
>> trees are kept under ./pc-bios in QEMU, but this is too static and
>> inflexible.
> 
> I doubt you're going to get generic passthrough of arbitrary devices working 
> in a useful way. My expectation is that, at minimum, you'll need a bus 
> specific proxy device. i.e. create a virtual device in qemu that responds to 
> the guest, and happens to poke at a host device rather than emulating things
> directly.
> 
> For busses like I2C this is fairly trivial - all communication with the device 
> goes down a single well defined and easily proxied channel.  For more complex 
> busses you end up having to emulate a lot more.  Basically you have to emulate 
> everything that is different between the host and guest.  If that happens to 
> include device specific state then you lose.
> 
> Using PCI devices as an example: The resources provided by the device are 
> self-describing, so proxying those is fairly straightforward, and doesn't even 
> require manual configuration.  However replicating the environment seen by the 
> device is trickier as PCI devices can initiate memory accesses (i.e. bus-
> master).  For machines without an IOMMU this means passthrough in general 
> can't work, and substantial amounts of device specific knowledge is required. 
> You'd need to intercept and modify and/or proxy all data relating to DMA
> addresses.  In practice you need to emulate an IOMMU inside qemu (so you can 
> determine the address space accessed by the device), and arrange for the host 
> IOMMU to present the same virtual address space to the real device.

Well, for DMA the solution is reasonably simple. We have basically two choices:

  * run 1:1 mapped, so the guest physical address == host physical address, at which point DMA works, but everything is insecure
  * use an IOMMU

We can easily limit it to those two cases. The more challenging part here (and the main reason for the email) is the question of how to configure all of that in a flexible, yet simple way. We can find the IO regions for devices from the host device tree - no problem there.
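The two DMA cases can be sketched in a few lines of Python. This is only an illustration of the address arithmetic: dma_target and the (guest_base, host_base, size) tuple layout are made-up names, and the host base address in the usage note is hypothetical.

```python
def dma_target(guest_addr, iommu_map=None):
    """Return the host physical address that a device DMA should reach.

    iommu_map is None for the 1:1 case; otherwise it is a list of
    (guest_base, host_base, size) tuples describing the IOMMU windows.
    """
    if iommu_map is None:
        # 1:1 mapped: guest physical == host physical, nothing is checked.
        return guest_addr
    for guest_base, host_base, size in iommu_map:
        if guest_base <= guest_addr < guest_base + size:
            return host_base + (guest_addr - guest_base)
    raise LookupError("DMA address 0x%x is not mapped by the IOMMU" % guest_addr)
```

For example, with a single window (0x0, 0x80000000, 0x20000000), guest address 0x1000 maps to host address 0x80001000, while in the 1:1 case it stays 0x1000 and nothing prevents the device from reaching memory outside the guest.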

But the real challenge is how to expose the device to the guest device tree. Especially when it comes to links between dt nodes, interrupt maps, etc. We basically have 3 choices there:

  * take the host device tree pieces and modify them
  * provide device tree chunks for each device (manually or through qdev parameters)
  * use the device tree as machine config file and base everything on it (solves the linking problem)

The main question is which one would be the cleanest solution. And how would it be implemented.
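To show what the linking problem looks like in practice, here is a small Python sketch that finds references in a copied node whose target does not exist in the guest tree. It assumes a simplified model where a reference is stored as a "&label" string (in a real DTB it is just a phandle number), and dangling_refs is a hypothetical helper name; the i2c properties come from Stuart's option 2 example.

```python
def dangling_refs(node_props, guest_labels):
    """List (property, label) pairs whose '&label' target is not in the guest tree."""
    missing = []
    for name, value in node_props.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, str) and item.startswith("&") and item[1:] not in guest_labels:
                missing.append((name, item[1:]))
    return missing


# The i2c node from option 2, with its interrupt-parent reference.
i2c = {
    "compatible": "fsl-i2c",
    "reg": [0xffe03000, 0x100],
    "interrupts": [43, 2],
    "interrupt-parent": "&mpic",
}
```

If the guest tree has no node labelled mpic, dangling_refs(i2c, set()) reports the interrupt-parent link as broken; whichever of the three choices we pick has to repair or replace such links.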


Alex


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01  0:58 ` Benjamin Herrenschmidt
@ 2011-07-01 11:40   ` Alexander Graf
  2011-07-01 12:13     ` Anthony Liguori
  2011-07-01 12:10   ` Anthony Liguori
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-01 11:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
	dwg@au1.ibm.com, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	armbru@redhat.com


On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:

> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>> One feature we need for QEMU/KVM on embedded Power Architecture is the 
>> ability to do passthru assignment of SoC I/O devices and memory.  An 
>> important use case in embedded is creating static partitions-- 
>> taking physical memory and I/O devices (non-PCI) and partitioning
>> them between the host Linux and several virtual machines.   Things like
>> live migration would not be needed or supported in these types of scenarios.
>> 
>> SoC devices do not sit on a probeable bus and there are no identifiers 
>> like 01:00.0 with PCI that we can use to identify devices--  the host
>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a 
>> device tree structure passed at boot.   QEMU needs to generate a
>> device tree to pass to the guest as well with all the guest's virtual
>> and physical resources.  Today a number of mostly complete guest device
>> trees are kept under ./pc-bios in QEMU, but this is too static and
>> inflexible.
>> 
>> Some new mechanism is needed to assign SoC devices to guests, and we
>> (FSL + Alex Graf) have been discussing a few possible approaches
>> for doing this from QEMU and would like some feedback.
>> 
>> Some possibilities:
>> 
>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>   by device tree path
>> 
>>     -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>> 
>>   /soc/i2c@3000 is the device tree path to the assigned device.
>>   The device node 'i2c@3000' has some number of properties (e.g. 
>>   address, interrupt info) and possibly subnodes under
>>   it.   QEMU copies that node when generating the guest dev tree.
>>   See snippet of entire node:  http://paste2.org/p/1496460
> 
> Yuck (see below)
> 
>> 2. Option 2.  Pass the entire assigned device node as a string to
>>   QEMU
>> 
>>     -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>>      #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>>      reg = <0xffe03000 0x100>; interrupts = <43 2>;
>>      interrupt-parent = <&mpic>; dfsrr;'
> 
> Beuark ! (see below)
> 
>>   This avoids needing to pass the host device tree, but could 
>>   get awkward-- the i2c example above is very simple, some device
>>   nodes are very large with a complex hierarchy of subnodes and 
>>   could be hundreds of lines of text to represent a single
>>   node.
>> 
>> It gets more complicated...
> 
> 
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
> 
> Anything else sucks too much really.
> 
> From there, well, there are several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree. But I'd personally just make it a kernel thing.
> 
> IE. I would have an ioctl to "instantiate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...

Yeah, one idea was to use VFIO here. We could for example modify the host device tree to occupy the devices we want to pass through with a specific compatibility parameter. Or we could try to steal the node at runtime. But I agree, reading the device tree data from a VFIO node sounds reasonable, if it's required.

> 
>> In some cases, modifications to device tree nodes may be needed.
>> An example-- sometimes a device tree property references another node 
>> and that relationship may not exist when assigned to a guest.
>> A "phy-handle" property may need to be deleted and a "fixed-link"
>> property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ? Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.

Hrm, so you'd create a separate device for MDIO which can do pass-through of those?

> 
>> So in addition to assigning a device, a mechanism is needed to update 
>> device tree nodes.  So for the above example, maybe--
>> 
>> -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>>  node-update="fixed-link = <2 1 1000 0 0>"
> 
> That's just so gross and error prone, borderline insane.

Alternatives:

  * not modify the device tree (unlikely to work)
  * pass a full device tree chunk to qemu instead of modification commands
  * ?

> 
>> The types of modifications needed--  deleting nodes, deleting properties, 
>> adding nodes, adding properties, adding properties that reference other
>> nodes, changing properties. This device tree transformation mechanism
>> needed is general enough that it could apply to any device tree based
>> embedded platform (e.g. ARM, MIPS)
>> 
>> Another complexity relates to the IOMMU.  Here things get very company 
>> and IOMMU specific. Freescale has a proprietary IOMMU.
> 
> Look at the work currently being done for a generic qemu iommu layer. We
> need it for server power as well and from what I last saw coming from
> Eduardo and David, it's not PCI specific.

Well, but it only implements an IOMMU emulation layer inside the guest. That doesn't help us for the host side of things unfortunately :).


Alex


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 11:33   ` Alexander Graf
@ 2011-07-01 11:55     ` Paul Brook
  2011-07-01 12:02       ` Alexander Graf
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Brook @ 2011-07-01 11:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com,
	armbru@redhat.com


> But the real challenge is how to expose the device to the guest device
> tree. Especially when it comes to links between dt nodes, interrupt maps,
> etc. We basically have 3 choices there:
> 
>   * take the host device tree pieces and modify them
>   * provide device tree chunks for each device (manually or through qdev
>     parameters)
>   * use the device tree as machine config file and base everything on it
>     (solves the linking problem)
> 
> The main question is which one would be the cleanest solution. And how
> would it be implemented.

I don't think any of this is specific to device passthrough.  It occurs as 
soon as you have any user-configurable parts of the machine (or even just a 
nontrivial selection of machine variants).  My guess is the only reason you 
haven't hit it before is because you've only emulated a single hard-coded 
SoC/board.

Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 11:55     ` Paul Brook
@ 2011-07-01 12:02       ` Alexander Graf
  2011-07-01 12:14         ` Anthony Liguori
  0 siblings, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-01 12:02 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com,
	armbru@redhat.com


On 01.07.2011, at 13:55, Paul Brook wrote:

> 
>> But the real challenge is how to expose the device to the guest device
>> tree. Especially when it comes to links between dt nodes, interrupt maps,
>> etc. We basically have 3 choices there:
>> 
>>  * take the host device tree pieces and modify them
>>  * provide device tree chunks for each device (manually or through qdev
>>    parameters)
>>  * use the device tree as machine config file and base everything on it
>>    (solves the linking problem)
>> 
>> The main question is which one would be the cleanest solution. And how
>> would it be implemented.
> 
> I don't think any of this is specific to device passthrough.  It occurs as 
> soon as you have any user-configurable parts of the machine (or even just a 
> nontrivial selection of machine variants).  My guess is the only reason you 
> haven't hit it before is because you've only emulated a single hard-coded 
> SoC/board.

Well, the real reason we haven't hit this before is that we don't have any devices in Qemu that are generic. We only have specific device emulation. This however would be a device that can handle hundreds of different backing devices, all with different requirements.

The infrastructure we have today simply isn't made for this. The question is how can we model it so that it will? :)


Alex

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01  0:58 ` Benjamin Herrenschmidt
  2011-07-01 11:40   ` Alexander Graf
@ 2011-07-01 12:10   ` Anthony Liguori
  2011-07-01 12:52     ` Paul Brook
  2011-07-01 16:43     ` Scott Wood
  2011-07-01 16:34   ` Scott Wood
  2011-07-05 18:19   ` Yoder Stuart-B08248
  3 siblings, 2 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 12:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Graf, Wood Scott-B07421, joerg.roedel@amd.com,
	qemu-devel@nongnu.org, dwg@au1.ibm.com, blauwirbel@gmail.com,
	Yoder Stuart-B08248, alex.williamson@redhat.com,
	paul@codesourcery.com, armbru@redhat.com

On 06/30/2011 07:58 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>>     This avoids needing to pass the host device tree, but could
>>     get awkward-- the i2c example above is very simple, some device
>>     nodes are very large with a complex hierarchy of subnodes and
>>     could be hundreds of lines of text to represent a single
>>     node.
>>
>> It gets more complicated...
>
>
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-trough
> (you may support passing a full hierarchy here).

I agree in principle but I think it should be done in a slightly 
different way.

I think we ought to support composing a device by passthrough.  For 
instance, something like:

[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"

This should be independent of anything to do with device tree.  This 
would be useful for x86 too to assign platform devices (like the HPET).
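
A minimal sketch of what one region[N] entry above might boil down to, 
assuming the region is backed by a file that can be mmap'ed.  The 
map_region helper and its parameter names are hypothetical, chosen only 
to mirror the config keys; this is not QEMU code.  One detail it shows: 
mmap offsets must be aligned, so an offset like 0x23423400 has to be 
rounded down and the remainder skipped inside the mapping.

```python
import mmap
import os


def map_region(path, file_offset, size):
    """Map `size` bytes of `path` starting at `file_offset` (hypothetical helper).

    mmap requires the file offset to be a multiple of the allocation
    granularity, so round the offset down and remember the slack.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = file_offset - (file_offset % gran)
    slack = file_offset - aligned
    fd = os.open(path, os.O_RDWR)
    try:
        mapping = mmap.mmap(fd, slack + size, offset=aligned)
    finally:
        os.close(fd)  # the mapping stays valid after the fd is closed
    # Return the mapping plus the byte range inside it that is the region.
    return mapping, slack, slack + size
```

The caller would then expose mapping[start:end] at guest_address; how 
that is wired into the guest's address space is left out here.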

I think there should be a separate mechanism to manipulate the guest 
device tree, just like there are mechanisms to manipulate the guest's 
ACPI tables.

Given these two mechanisms, there should be a simple command line like 
Ben has suggested that just takes a host device tree path and Just 
Works.  It really is just a convenience interface though.

With raw mechanisms like I described above, it would give you the 
flexibility to pass through a device with a modified host tree fragment 
without having an overly complicated command line interface for the more 
common case.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 11:40   ` Alexander Graf
@ 2011-07-01 12:13     ` Anthony Liguori
  0 siblings, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 12:13 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	joerg.roedel@amd.com, armbru@redhat.com

On 07/01/2011 06:40 AM, Alexander Graf wrote:
>
> On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:
>
>> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory.  An
>>> important use case in embedded is creating static partitions--
>>> taking physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines.   Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>>
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices--  the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot.   QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources.  Today a number of mostly complete guest device
>>> trees are kept under ./pc-bios in QEMU, but this too static and
>>> inflexible.
>>>
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches
>>> for doing this from QEMU and would like some feedback.
>>>
>>> Some possibilities:
>>>
>>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>>    by device tree path
>>>
>>>      -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>>
>>>    /soc/i2c@3000 is the device tree path to the assigned device.
>>>    The device node 'i2c@3000' has some number of properties (e.g.
>>>    address, interrupt info) and possibly subnodes under
>>>    it.   QEMU copies that node when generating the guest dev tree.
>>>    See snippet of entire node:  http://paste2.org/p/1496460
>>
>> Yuck (see below)
>>
>>> 2. Option 2.  Pass the entire assigned device node as a string to
>>>    QEMU
>>>
>>>      -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells =<1>;
>>>       #size-cells =<0>; cell-index =<0>; compatible = "fsl-i2c";
>>>       reg =<0xffe03000 0x100>; interrupts =<43 2>;
>>>       interrupt-parent =<&mpic>; dfsrr;'
>>
>> Beuark ! (see below)
>>
>>>    This avoids needing to pass the host device tree, but could
>>>    get awkward-- the i2c example above is very simple, some device
>>>    nodes are very large with a complex hierarchy of subnodes and
>>>    could be hundreds of lines of text to represent a single
>>>    node.
>>>
>>> It gets more complicated...
>>
>>
>> So, from a qemu command line perspective, all you should have to do is
>> pass qemu the device-tree -path- to the device you want to pass-trough
>> (you may support passing a full hierarchy here).
>>
>> That is for normal MMIO mapped SoC devices. Something else (individual
>> i2c, usb, ...) will use specific virtualization of the corresponding
>> busses.
>>
>> Anything else sucks too much really.
>>
>>  From there, well, there's several approach inside qemu/kvm to handle
>> that path. If you want to do things at the qemu level you can probably
>> parse /proc/device-tree. But I'd personally just make it a kernel thing.
>>
>> IE. I would have an ioctl to "instanciate" a pass-through device, that
>> takes that path as an argument. I would make it return an anonymous fd
>> which you can then use to mmap the resources, etc...
>
> Yeah, one idea was to use VFIO here. We could for example modify the host device tree to occupy device we want to pass through with a specific compatibility parameter. Or we could try to steal the node during runtime. But I agree, reading the device tree data from a VFIO node sounds reasonable. If it's required.

That makes it very specific to systems that use device trees.

To do the same for ARM platforms or x86, you would need to invent yet 
another mechanism.

Passing through arbitrary MMIO is fairly straightforward (likewise with 
PIO).  Passing through IRQs is a bit less straightforward and perhaps 
VFIO is the answer here.

I don't see a problem with QEMU figuring out what a device's resources 
are and doing the assignment.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 12:02       ` Alexander Graf
@ 2011-07-01 12:14         ` Anthony Liguori
  0 siblings, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 12:14 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, Paul Brook, joerg.roedel@amd.com,
	armbru@redhat.com

On 07/01/2011 07:02 AM, Alexander Graf wrote:
>
> On 01.07.2011, at 13:55, Paul Brook wrote:
>
>>
>>> But the real challenge is how to expose the device to the guest device
>>> tree. Especially when it comes to links between dt nodes, interrupt maps,
>>> etc. We basically have 3 choices there:
>>>
>>>   * take the host device tree pieces and modify them
>>>   * provide device tree chunks for each device (manually or through qdev
>>>     parameters)
>>>   * use the device tree as machine config file and base everything on it
>>>     (solves the linking problem)
>>>
>>> The main question is which one would be the cleanest solution. And how
>>> would it be implemented.
>>
>> I don't think any of this is specific to device passthrough.  It occurs as
>> soon as you have any user-configurable parts of the machine (or even just a
>> nontrivial selection of machine variants).  My guess is the only reason you
>> haven't hit it before is because you've only emulated a single hard-coded
>> SoC/board.
>
> Well, the real reason we haven't hit this before is that we don't have any devices in Qemu that are generic. We only have specific device emulation. This however would be a device that can handle hundreds of different backing devices, all with different requirements.
>
> The infrastructure we have today simply isn't made for this. The question is how can we model it so that it will? :)

Our infrastructure is quite capable of handling this.  It has many other 
problems but I think the only thing really missing is the way to have 
lists of parameters.  That seems easy to solve though.

Regards,

Anthony Liguori

>
>
> Alex
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 12:10   ` Anthony Liguori
@ 2011-07-01 12:52     ` Paul Brook
  2011-07-01 13:33       ` Anthony Liguori
  2011-07-01 16:43     ` Scott Wood
  1 sibling, 1 reply; 29+ messages in thread
From: Paul Brook @ 2011-07-01 12:52 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, Alexander Graf,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

> > So, from a qemu command line perspective, all you should have to do is
> > pass qemu the device-tree -path- to the device you want to pass-trough
> > (you may support passing a full hierarchy here).
> 
> I agree in principle but I think it should be done in a slightly
> different way.
> 
> I think we ought to support composing a device by passthrough.  For
> instance, something like:
> 
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
> 
> This should be independent of anything to do with device tree.  This
> would be useful for x86 too to assign platform devices (like the HPET).

I'm not quite sure what you're getting at here.  IMO there should be little or 
no need for special knowledge of passthrough devices.  They should just be 
another qdev device, configured in the normal way.  e.g.:
   -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config
Should work the same as adding any other device. If it doesn't then we should 
fix that.  This is an example of why it's good to have device features (IRQs, 
MMIO regions, sockets, or whatever we call them) registered when the device is 
instantiated, not relying on pre-compiled device descriptors/property lists.  
In the latter case you probably need explicit variants for different numbers of 
IRQs, MMIO regions, etc.

While I'm thinking about it, we already have exactly this for USB (i.e. the 
usb-host device).

> I think there should be a separate mechanism to manipulate the guest
> device tree, just like there are mechanisms to manipulate the guest's
> ACPI tables.

I agree.  Any sort of device tree (IIUC ACPI tables are in principle giving 
the same information) is, in practice, going to need to be assembled at 
runtime.  This needs some mechanism for devices to describe themselves, 
probably largely independent of actual machine/device creation code.

We've got away without it thus far because the only real place where we have 
nontrivial user-specified machine variants is on the PCI bus.  Devices there 
are for the most part self-describing so the guest firmware/OS can probe 
hardware itself.

Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 12:52     ` Paul Brook
@ 2011-07-01 13:33       ` Anthony Liguori
  0 siblings, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 13:33 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
	Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, dwg@au1.ibm.com, armbru@redhat.com

On 07/01/2011 07:52 AM, Paul Brook wrote:
>>> So, from a qemu command line perspective, all you should have to do is
>>> pass qemu the device-tree -path- to the device you want to pass-trough
>>> (you may support passing a full hierarchy here).
>>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough.  For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree.  This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> I'm not quite sure what you're getting at here.  IMO there should be little or
> no need for special knowledge of passthrough devices.  They should just be
> another qdev device, configured in the normal way.  e.g.:
>     -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config

What I wrote about is just readconfig syntax.  It's the same as:

-device physical-device,id=mydev,region[0].file=/dev/mem,....
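
To make the equivalence concrete, here is a rough sketch that flattens a 
readconfig-style section into a single -device argument.  The 
physical-device driver and the indexed region[N]/irq[N] properties are 
only the proposal from this thread, not an existing QEMU device, and 
section_to_device_arg is a made-up helper name.

```python
def section_to_device_arg(driver, dev_id, props):
    """Flatten a readconfig-style section into one -device argument."""
    parts = [driver, f"id={dev_id}"]
    parts += [f"{key}={value}" for key, value in props.items()]
    return "-device " + ",".join(parts)


# The section from the earlier example; keys keep their config-file order.
props = {
    "region[0].file": "/dev/mem",
    "region[0].guest_address": "0x42232000",
    "region[0].file_offset": "0x23423400",
    "region[0].size": "4096",
    "irq[0].guest_irq": "10",
    "irq[0].host_irq": "10",
}
print(section_to_device_arg("physical-device", "mydev", props))
```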

Regards,

Anthony Liguori

>> I think there should be a separate mechanism to manipulate the guest
>> device tree, just like there are mechanisms to manipulate the guest's
>> ACPI tables.
>
> I agree.  Any sort of device tree (IIUC ACPI tables are in principle giving
> the same information) is, in practice, going to need to be assembled at
> runtime.  This needs some mechanism for devices to describe themselves,
> probably largely independent of actual machine/device creation code.
>
> We've got away without it thus far because the only real place where we have
> nontrivial user-specified machine variants is on the PCI bus.  Devices there
> are for the most part self-describing so the guest firmware/OS can probe
> hardware itself.
>
> Paul
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01  0:58 ` Benjamin Herrenschmidt
  2011-07-01 11:40   ` Alexander Graf
  2011-07-01 12:10   ` Anthony Liguori
@ 2011-07-01 16:34   ` Scott Wood
  2011-07-05 18:19   ` Yoder Stuart-B08248
  3 siblings, 0 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-01 16:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	dwg@au1.ibm.com, armbru@redhat.com

On Fri, 1 Jul 2011 10:58:14 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-trough
> (you may support passing a full hierarchy here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
> 
> Anything else sucks too much really.
> 
> From there, well, there's several approach inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree.

That's what option 1 is, except that instead of adding code to qemu to
parse /proc/device-tree, we'd use dtc to dump /proc/device-tree into a dtb
and let qemu use libfdt to look at the tree.  This is less Linux-specific,
more modular, and more flexible for doing the sort of insane hacks that are
going to happen in embedded-land whether you like them or not. :-)
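
For illustration only, a small sketch of the path lookup itself, done 
against a filesystem-style tree (directories are nodes, files are 
properties, which is the layout /proc/device-tree uses).  This is not the 
dtc + libfdt route described above; read_node is a hypothetical helper 
that just shows what "assign by device tree path" has to resolve.

```python
import os


def read_node(root, node_path):
    """Read one node from a filesystem-style device tree.

    Directories are nodes and regular files are properties.
    Returns (properties, subnode names).
    """
    node_dir = os.path.join(root, node_path.strip("/"))
    props, subnodes = {}, []
    for name in sorted(os.listdir(node_dir)):
        full = os.path.join(node_dir, name)
        if os.path.isdir(full):
            subnodes.append(name)
        else:
            with open(full, "rb") as f:
                props[name] = f.read()  # raw property bytes, big-endian cells
    return props, subnodes
```

Calling read_node("/proc/device-tree", "/soc/i2c@3000") on a host would 
return the raw reg, interrupts, etc. for the node to be copied.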

> But I'd personally just make it a kernel thing.

I'd rather keep the kernel interface simple -- assign this memory region,
assign that interrupt, use this IOMMU device ID, etc.  Getting the kernel
involved in preparing the guest device tree, and understanding guest
configuration, seems quite excessive.

> IE. I would have an ioctl to "instanciate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...
> 
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node 
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ? 

Yes, it's shared.  Yes, it sucks.

> Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.

Or in the configuration of qemu.  Not all users of the device want to do
the same thing.

> > So in addition to assigning a device, a mechanism is needed to update 
> > device tree nodes.  So for the above example, maybe--
> > 
> >  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
> >   node-update="fixed-link = <2 1 1000 0 0>"
> 
> That's just so gross and error prone, borderline insane.

Welcome to embedded. :-)

Here, users are going to want to be able to mess around under the hood in
a way that server or desktop users generally don't need or want to.
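
As a sketch of what the delete-prop/node-update pair amounts to, assuming 
a node is represented as a plain dict of properties (the representation, 
the transform_node name, and the "example,ethernet" compatible string are 
all placeholders, not anything QEMU has):

```python
import copy


def transform_node(node, delete_props=(), update_props=None):
    """Return a modified copy of a device tree node (a plain dict here)."""
    result = copy.deepcopy(node)
    for name in delete_props:
        result.pop(name, None)  # ignore properties that are already absent
    result.update(update_props or {})
    return result


# Host view of the node: it references a PHY the guest will not own.
host_node = {"compatible": "example,ethernet", "phy-handle": "&phy0"}
guest_node = transform_node(
    host_node,
    delete_props=["phy-handle"],
    update_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

The host node is left untouched; only the guest copy loses phy-handle and 
gains fixed-link.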

> > The types of modifications needed--  deleting nodes, deleting properties, 
> > adding nodes, adding properties, adding properties that reference other
> > nodes, changing properties. This device tree transformation mechanism
> > needed is general enough that it could apply to any device tree based
> > embedded platform (e.g. ARM, MIPS)
> >
> > Another complexity relates to the IOMMU.  Here things get very company 
> > and IOMMU specific. Freescale has a proprietary IOMMU.
> 
> Look at the work currently being done for a generic qemu iommu layer. We
> need it for server power as well and from what I last saw coming from
> Eduardo and David, it's not PCI specific.

The problem is that our current IOMMU doesn't implement full paging (yes,
the HW people have been screamed at, but we're stuck with it for current
chips).  You have to break things down into regions following certain
alignment rules, which may require user guidance as to which memory regions
actually need DMA access, especially if you're setting up discontiguous
shared memory regions and such.
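
To give a feel for the kind of region splitting involved: the sketch 
below ASSUMES the rule is "each window is a power-of-two size and 
naturally aligned".  The real alignment rules of our IOMMU are not 
spelled out in this thread, so treat the rule and the 
split_into_windows name as illustrative only.

```python
def split_into_windows(base, size):
    """Split [base, base + size) into naturally aligned power-of-two windows."""
    windows = []
    addr, remaining = base, size
    while remaining > 0:
        # Largest power of two that fits in what is left.
        chunk = 1 << (remaining.bit_length() - 1)
        if addr:
            # Largest power of two that addr is aligned to.
            chunk = min(chunk, addr & -addr)
        windows.append((addr, chunk))
        addr += chunk
        remaining -= chunk
    return windows
```

Even under this simple rule an oddly placed range turns into several 
windows, which is why user guidance about what really needs DMA helps.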

-Scott

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 12:10   ` Anthony Liguori
  2011-07-01 12:52     ` Paul Brook
@ 2011-07-01 16:43     ` Scott Wood
  2011-07-01 17:03       ` Paul Brook
  2011-07-01 22:32       ` Anthony Liguori
  1 sibling, 2 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-01 16:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, Alexander Graf,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	joerg.roedel@amd.com, dwg@au1.ibm.com, armbru@redhat.com

On Fri, 1 Jul 2011 07:10:45 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> I agree in principle but I think it should be done in a slightly 
> different way.
> 
> I think we ought to support composing a device by passthrough.  For 
> instance, something like:
> 
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
> 
> This should be independent of anything to do with device tree.  This 
> would be useful for x86 too to assign platform devices (like the HPET).

That's fine, as long as there's something layered on top of it for the case
where we do want to reference something in the device tree.  

However, we'll need to address the question of what it means to say "irq 10"
-- outside of PC-land there often isn't a global IRQ numberspace that isn't
a fiction created by some software layer.  Addressing this is one of the
device tree's strengths.
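
A small sketch of why the number alone is not enough: in a device tree 
an interrupt specifier is only meaningful relative to the node's 
interrupt parent, which is inherited from an ancestor when the node does 
not name one itself.  The dict-of-paths representation and the 
resolve_interrupt name are assumptions made for the example.

```python
def resolve_interrupt(tree, path):
    """Return (interrupt-parent, interrupts) for the node at `path`.

    `tree` maps node paths to property dicts. If a node has no
    interrupt-parent property, it is inherited from the nearest ancestor.
    """
    interrupts = tree[path].get("interrupts")
    current = path
    while True:
        parent = tree.get(current, {}).get("interrupt-parent")
        if parent is not None:
            return parent, interrupts
        if current == "/":
            return None, interrupts
        current = current.rsplit("/", 1)[0] or "/"


tree = {
    "/": {},
    "/soc": {"interrupt-parent": "&mpic"},
    "/soc/i2c@3000": {"interrupts": [43, 2]},
}
controller, spec = resolve_interrupt(tree, "/soc/i2c@3000")
```

So "irq 43" here really means "specifier <43 2> at the mpic", not a 
global number.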

-Scott

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 16:43     ` Scott Wood
@ 2011-07-01 17:03       ` Paul Brook
  2011-07-01 17:49         ` Scott Wood
  2011-07-01 22:35         ` Anthony Liguori
  2011-07-01 22:32       ` Anthony Liguori
  1 sibling, 2 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 17:03 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

> > irq[0].guest_irq = "10"
> > 
> > This should be independent of anything to do with device tree.  This
> > would be useful for x86 too to assign platform devices (like the HPET).
> 
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
> 
> However, we'll need to address the question of what it means to say "irq
> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
> isn't a fiction created by some software layer.  Addressing this is one of
> the device tree's strengths.

That's an entirely separate problem, though probably a prerequisite.

Basically you should start by implementing full emulation of a device with 
similar characteristics to the one you want to passthrough.

Then fix whatever is needed to allow the user to control instantiation of those 
devices. This almost certainly means using the -device commandline option.  
This currently only works for a fairly simple subset of devices (approximately 
PCI and USB), so you'll probably need to fix/implement the missing bits.  To 
do this you'll probably need to do some work on the various bits of the qdev 
relating to linking devices together.  See recent discussion about sockets in 
the "basic support for composing sysbus devices" thread.

To expose this to the guest you'll probably also need to implement some form 
of dynamic device tree assembly/manipulation.  Not strictly necessary (we can 
require the user supply a complete device tree that matches whatever devices 
they've configured), but probably highly desirable.

Once you've done all the above, host device passthrough should be relatively 
straightforward.  Just replace the emulation bits in the above device with 
code that pokes at a real device via the relevant kernel API.

Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 17:03       ` Paul Brook
@ 2011-07-01 17:49         ` Scott Wood
  2011-07-01 20:59           ` Paul Brook
  2011-07-01 22:35         ` Anthony Liguori
  1 sibling, 1 reply; 29+ messages in thread
From: Scott Wood @ 2011-07-01 17:49 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

On Fri, 1 Jul 2011 18:03:01 +0100
Paul Brook <paul@codesourcery.com> wrote:

> Basically you should start by implementing full emulation of a device with 
> similar characteristics to the one you want to passthrough.

That's not going to happen.

> Once you've done all the above, host device passthrough should be relatively 
> straightforward.  Just replace the emulation bits in the above device with 
> code that pokes at a real device via the relevant kernel API.

That's not what we mean by direct device assignment.

We're talking about directly mapping the registers into the guest.  The
whole point is performance.

-Scott

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 11:16 ` Paul Brook
  2011-07-01 11:33   ` Alexander Graf
@ 2011-07-01 17:51   ` Scott Wood
  1 sibling, 0 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-01 17:51 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

On Fri, 1 Jul 2011 12:16:35 +0100
Paul Brook <paul@codesourcery.com> wrote:

> > One feature we need for QEMU/KVM on embedded Power Architecture is the
> > ability to do passthru assignment of SoC I/O devices and memory.  An
> > important use case in embedded is creating static partitions--
> > taking physical memory and I/O devices (non-PCI) and partitioning
> > them between the host Linux and several virtual machines.   Things like
> > live migration would not be needed or supported in these types of
> > scenarios.
> > 
> > SoC devices do not sit on a probeable bus and there are no identifiers
> > like 01:00.0 with PCI that we can use to identify devices--  the host
> > Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> > device tree structure passed at boot.   QEMU needs to generate a
> > device tree to pass to the guest as well with all the guest's virtual
> > and physical resources.  Today a number of mostly complete guest device
> > trees are kept under ./pc-bios in QEMU, but this too static and
> > inflexible.
> 
> I doubt you're going to get generic passthrough of arbitrary devices working 
> in a useful way.

It's usefully working for us internally -- we're just trying to find a way
to improve it for upstream, with a better configuration mechanism.

> My expectation is that, at minimum, you'll need a bus 
> specific proxy device. i.e. create a virtual device in qemu that responds to 
> the guest, and happens poke at a host device rather than emulating things 
> directly.

Many of these embedded devices don't sit on any sort of software-visible
bus, and requiring that the I/O happen via MMIO traps would result in
unacceptable overhead.

> Basically you have to emulate  everything that is different between the host and guest.

Directly assigning a device means you don't get to have differences between
the actual hardware device and what the guest sees.  The kind of thin
wrapper you're suggesting might have some use cases, but it's a different
problem from what we're trying to solve.

-Scott

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 17:49         ` Scott Wood
@ 2011-07-01 20:59           ` Paul Brook
  2011-07-01 21:51             ` Scott Wood
  2011-07-01 23:05             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 20:59 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

> On Fri, 1 Jul 2011 18:03:01 +0100
> 
> Paul Brook <paul@codesourcery.com> wrote:
> > Basically you should start by implementing full emulation of a device
> > with similar characteristics to the one you want to passthrough.
> 
> That's not going to happen.

Why is your device so unique? How does it interact with the guest system and 
what features does it require that don't exist in any device that can be 
emulated?

I'm also extremely sceptical of anything that only works in a kvm environment.  
Makes me think it's an unmaintainable hack, and almost certainly going to 
cause you immense amounts of pain later.

> > I doubt you're going to get generic passthrough of arbitrary devices
> > working in a useful way.
> 
> It's usefully working for us internally -- we're just trying to find a way
> to improve it for upstream, with a better configuration mechanism.

I don't believe that either.  More likely you've got passthrough of a device 
hanging off your specific CPU bus, using only (or even a subset of) the 
facilities provided by that bus.

> > Basically you have to emulate  everything that is different between the
> > host and guest.
> 
> Directly assigning a device means you don't get to have differences between
> the actual hardware device and what the guest sees.  The kind of thin
> wrapper you're suggesting might have some use cases, but it's a different
> problem from what we're trying to solve.

That's the problem. You've skipped several steps and gone straight for
optimization before you've even got basic functionality working.

You've also missed the point I was making.  In order to do device passthrough 
you need to define a boundary along which the emulated machine state can be
fully replicated on the host machine.  Anything inside this boundary is (by
definition) the same on both the host and guest systems (we're effectively
using host hardware to emulate a device for us). Outside that boundary the 
host and guest systems will diverge.

For a device that merely responds to CPU initiated MMIO transfers this is 
pretty simple, it's the point at which MMIO transfers are generated. So the 
guest gets a proxy device that intercepts accesses to that memory region, and 
the host proxies some way for qemu to poke values at the host device.
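
For illustration only (this is not code from the thread or from QEMU), the
proxy arrangement described above can be sketched in Python.  HostDevice and
MmioProxy are hypothetical names, and a dict stands in for the real
device's registers:

```python
# Illustrative sketch only: none of these classes exist in QEMU.

class HostDevice:
    """Stand-in for the real device; a dict plays the role of its registers."""

    def __init__(self):
        self.regs = {}

    def read(self, offset):
        return self.regs.get(offset, 0)

    def write(self, offset, value):
        self.regs[offset] = value


class MmioProxy:
    """Guest-visible device that forwards every access in its window to the host."""

    def __init__(self, guest_base, size, host_dev):
        self.guest_base = guest_base
        self.size = size
        self.host_dev = host_dev

    def _offset(self, addr):
        # Translate a guest physical address into an offset within the device.
        if not (self.guest_base <= addr < self.guest_base + self.size):
            raise ValueError("address outside proxied region")
        return addr - self.guest_base

    def guest_read(self, addr):
        return self.host_dev.read(self._offset(addr))

    def guest_write(self, addr, value):
        self.host_dev.write(self._offset(addr), value)
```

Every guest access pays for a trap plus a forwarded call here, which is the
overhead that direct register mapping is meant to avoid.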

> > Once you've done all the above, host device passthrough should be
> > relatively straightforward.  Just replace the emulation bits in the
> > above device with code that pokes at a real device via the relevant
> > kernel API.
> 
> That's not what we mean by direct device assignment.

Maybe, but IMO it's a necessary prerequisite. You're trying to run before
you can walk.

> We're talking about directly mapping the registers into the guest.  The
> whole point is performance.

That's an additional step after you get passthrough working the normal way.
We already have mechanisms (or at least patches) for mapping file-like objects 
into guest physical memory.  That's largely independent of device passthrough.  
It's a relatively minor tweak to how the passthrough device sets up its MMIO 
regions.
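
As a rough illustration of what "mapping a file-like object" amounts to, the
sketch below maps a window of a file with Python's standard mmap module.  It
is not QEMU code: map_region and demo are hypothetical names, and a temporary
file stands in for something like /dev/mem.

```python
import mmap
import os
import tempfile


def map_region(path, file_offset, size):
    """Map `size` bytes of `path`, starting at an aligned `file_offset`."""
    if file_offset % mmap.ALLOCATIONGRANULARITY:
        raise ValueError("file_offset must be aligned to the allocation granularity")
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, size, offset=file_offset)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file


def demo():
    """Write through a mapping of the second 'page' and read it back from the file."""
    page = mmap.ALLOCATIONGRANULARITY
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (2 * page))
        path = f.name
    try:
        region = map_region(path, page, page)
        region[0:4] = b"\x01\x02\x03\x04"
        region.flush()
        region.close()
        with open(path, "rb") as f:
            f.seek(page)
            return f.read(4)
    finally:
        os.unlink(path)
```

The alignment check matters in practice: a device region that is not
page-aligned cannot be mapped on its own this way.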

Mapping host device MMIO regions into guest space is entirely uninteresting 
unless we already have some way of creating guest-host passthrough devices.  
Creating guest-device passthrough devices isn't going to happen until we can
create arbitrary devices (within the set emulated by qemu) that interact with 
the rest of the emulated machine in a similar way.

Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 20:59           ` Paul Brook
@ 2011-07-01 21:51             ` Scott Wood
  2011-07-01 23:33               ` Paul Brook
  2011-07-01 23:05             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 29+ messages in thread
From: Scott Wood @ 2011-07-01 21:51 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

On Fri, 1 Jul 2011 21:59:35 +0100
Paul Brook <paul@codesourcery.com> wrote:

> > On Fri, 1 Jul 2011 18:03:01 +0100
> > 
> > Paul Brook <paul@codesourcery.com> wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> > 
> > That's not going to happen.
> 
> Why is your device so unique? How does it interact with the guest system and 
> what features does it require that don't exist in any device that can be
> emulated?

Perhaps I misunderstood what you meant by "similar characteristics".  I see
no reason to spend a bunch of time implementing full emulation for a device,
that isn't going to be used, just because it seems like a nice
intermediary step.

What specifically is it you're suggesting we do full emulation of?

> I'm also extremely sceptical of anything that only works in a kvm environment.  
> Makes me think it's an unmaintainable hack, and almost certainly going to 
> cause you immense amounts of pain later.

I believe the only part of the device assignment stuff we've implemented so
far that is KVM specific is the interrupt routing.  I'm open to ways of
routing the interrupts to qemu in the non-KVM case, as long as we can
bypass it when KVM is used.

I'm not sure what the use case is for direct assignment of a device in an
otherwise completely emulated guest, but perhaps there is one.

> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> > 
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
> 
> I don't believe that either.  More likely you've got passthrough of a device
> hanging off your specific CPU bus, using only (or even a subset of) the 
> facilities provided by that bus.

There's nothing special about our "bus".  It's MMIO, DMA, and interrupts.

What specifically are you disbelieving?

> > > Basically you have to emulate  everything that is different between the
> > > host and guest.
> > 
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees.  The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
> 
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.

This is the basic functionality -- assign a piece of hardware to the
guest with minimal overhead.  Why go through contortions to construct some
intermediate phase that nobody's interested in using?

> You've also missed the point I was making.  In order to do device passthrough 
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine.  Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the 
> host and guest systems will diverge.

I'm still not sure what the point is, then.  By directly assigning the
device the user is placing everything about the device on the "same as
host" side of that boundary.

We're not using host hardware to emulate a device, we're using host
hardware to send and receive packets under control of the guest.
Whatever hardware that is, the guest will deal with it, just as if the
guest weren't running in a vm.

> For a device that merely responds to CPU initiated MMIO transfers this is 
> pretty simple, it's the point at which MMIO transfers are generated. So the 
> guest gets a proxy device that intercepts accesses to that memory region, and 
> the host proxies some way for qemu to poke values at the host device.

The point is to be faster than virtio, not slower.  There would be no
reason for us to do this otherwise.

Emulating some specific device is not our goal, at all.  I realize that
that's a major part of what qemu does, but it's not the only thing it's
used for.

> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward.  Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> > 
> > That's not what we mean by direct device assignment.
> 
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.

I disagree that it is a prerequisite.  It is a fundamentally different
thing, for a different purpose.

If it's a purpose that is important to you, and you think the proposed
config mechanisms don't accommodate that, then propose something that does.

> > We're talking about directly mapping the registers into the guest.  The
> > whole point is performance.
> 
> That's an additional step after you get passthrough working the normal way.

"normal"?

> We already have mechanisms (or at least patches) for mapping file-like objects 
> into guest physical memory.  That's largely independent of device passthrough.  
> It's a relatively minor tweak to how the passthrough device sets up its MMIO 
> regions.
> 
> Mapping host device MMIO regions into guest space is entirely uninteresting 
> unless we already have some way of creating guest-host passthrough devices.  

Isn't that what's being discussed?

> Creating guest-device passthrough devices isn't going to happen until we can
> create arbitrary devices (within the set emulated by qemu) that interact with 
> the rest of the emulated machine in a similar way.

What do you mean by "interact with the rest of the emulated machine in a
similar way"?

-Scott


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 16:43     ` Scott Wood
  2011-07-01 17:03       ` Paul Brook
@ 2011-07-01 22:32       ` Anthony Liguori
  2011-07-05 18:16         ` Scott Wood
  1 sibling, 1 reply; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 22:32 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
	Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	dwg@au1.ibm.com, armbru@redhat.com

On 07/01/2011 11:43 AM, Scott Wood wrote:
> On Fri, 1 Jul 2011 07:10:45 -0500
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough.  For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree.  This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say "irq 10"

It depends on what the bus is.  If you're going to declare "system bus" 
which is sort of what we call ISA for the PC, then it can map trivially 
to the interrupt controller's inputs.

> -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> a fiction created by some software layer.

PC's don't have a global IRQ number space FWIW.  When we say:

-device isa-serial,irq=4

This really means, "ISA irq 4", which is mapped to the PIIX3 and then 
routed through GSI, then the APIC architecture to correspond to some 
interrupt for some physical CPU.
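
A small sketch may make the chain of translations concrete.  It is
illustrative only: the table entries, including the final APIC value, are
invented, and resolve is a hypothetical helper rather than anything in qemu.

```python
# Each hop translates (domain, line) into the next (domain, line).
# All numbers are invented for illustration; real routing is board-specific.
ROUTES = {
    ("isa", 4): ("piix3", 4),
    ("piix3", 4): ("gsi", 4),
    ("gsi", 4): ("apic", 0x34),
}


def resolve(domain, line, routes=ROUTES):
    """Follow the routing chain and return every hop, starting with the input."""
    path = [(domain, line)]
    seen = set(path)
    while path[-1] in routes:
        nxt = routes[path[-1]]
        if nxt in seen:
            raise ValueError("routing loop")
        seen.add(nxt)
        path.append(nxt)
    return path
```

The number 4 only has meaning together with the domain it is quoted in,
which is why a bare "irq 10" is ambiguous without naming the bus.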

> Addressing this is one of the
> device tree's strengths.

Not really.  There's nothing magical about the device tree.  It's just a 
guest visible description of the platform hardware that isn't probe-able 
in some bus framework.  ACPI does exactly the same thing.  I'll concede 
that the device tree is far nicer than ACPI but again, it's not magical :-)

Regards,

Anthony Liguori

> -Scott
>
>


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 17:03       ` Paul Brook
  2011-07-01 17:49         ` Scott Wood
@ 2011-07-01 22:35         ` Anthony Liguori
  1 sibling, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 22:35 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
	armbru@redhat.com

On 07/01/2011 12:03 PM, Paul Brook wrote:
>>> irq[0].guest_irq = "10"
>>>
>>> This should be independent of anything to do with device tree.  This
>>> would be useful for x86 too to assign platform devices (like the HPET).
>>
>> That's fine, as long as there's something layered on top of it for the case
>> where we do want to reference something in the device tree.
>>
>> However, we'll need to address the question of what it means to say "irq
>> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
>> isn't a fiction created by some software layer.  Addressing this is one of
>> the device tree's strengths.
>
> That's an entirely separate problem, though probably a prerequisite.
>
> Basically you should start by implementing full emulation of a device with
> similar characteristics to the one you want to passthrough.

If you want to model interrupt remapping, you have to model device 
relationships.  If you cannot express the bus hierarchy/relationship 
then you cannot sanely model interrupt remapping.

You can only really ever think about passing through an entire subtree 
of the device hierarchy.  You can't have a partial subtree with some 
crazy hack logic to explain how the physical layer may remap interrupts. 
  That's just asking for pain.
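
As a sketch of what "passing through an entire subtree" could look like,
assume a toy device tree held as nested dicts.  The i2c@3000 properties come
from the example earlier in the thread; the rtc@68 child and the find_subtree
helper are hypothetical.

```python
import copy

# Toy device tree: every node has "props" and "children".
HOST_TREE = {
    "props": {},
    "children": {
        "soc": {
            "props": {},
            "children": {
                "i2c@3000": {
                    "props": {"compatible": "fsl-i2c", "interrupts": (43, 2)},
                    "children": {
                        # Hypothetical child, present only to show that the
                        # whole hierarchy travels with its parent.
                        "rtc@68": {"props": {"reg": 0x68}, "children": {}},
                    },
                },
            },
        },
    },
}


def find_subtree(tree, path):
    """Return a deep copy of the node at `path`, including all its descendants."""
    node = tree
    for name in (part for part in path.split("/") if part):
        node = node["children"][name]  # raises KeyError if the path is wrong
    return copy.deepcopy(node)
```

Copying the node together with its children keeps parent/child relationships
intact, so nothing below the assigned node is left dangling on the host side.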

Regards,

Anthony Liguori


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 20:59           ` Paul Brook
  2011-07-01 21:51             ` Scott Wood
@ 2011-07-01 23:05             ` Benjamin Herrenschmidt
  2011-07-01 23:50               ` Paul Brook
  1 sibling, 1 reply; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-01 23:05 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
	armbru@redhat.com

On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > On Fri, 1 Jul 2011 18:03:01 +0100
> > 
> > Paul Brook <paul@codesourcery.com> wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> > 
> > That's not going to happen.
> 
> Why is your device so unique? How does it interact with the guest system and 
> what features does it require that don't exist in any device that can be
> emulated?

Do you guys only support PCI pass-through by doing full emulation of
all possible supported PCI devices first? :-)

> I'm also extremely sceptical of anything that only works in a kvm environment.  
> Makes me think it's an unmaintainable hack, and almost certainly going to 
> cause you immense amounts of pain later.

See above question...

Cheers,
Ben.
 
> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> > 
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
> 
> I don't believe that either.  More likely you've got passthrough of a device
> hanging off your specific CPU bus, using only (or even a subset of) the 
> facilities provided by that bus.
> 
> > > Basically you have to emulate  everything that is different between the
> > > host and guest.
> > 
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees.  The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
> 
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.
> 
> You've also missed the point I was making.  In order to do device passthrough 
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine.  Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the 
> host and guest systems will diverge.
> 
> For a device that merely responds to CPU initiated MMIO transfers this is 
> pretty simple, it's the point at which MMIO transfers are generated. So the 
> guest gets a proxy device that intercepts accesses to that memory region, and 
> the host proxies some way for qemu to poke values at the host device.
> 
> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward.  Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> > 
> > That's not what we mean by direct device assignment.
> 
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.
> 
> > We're talking about directly mapping the registers into the guest.  The
> > whole point is performance.
> 
> That's an additional step after you get passthrough working the normal way.
> We already have mechanisms (or at least patches) for mapping file-like objects 
> into guest physical memory.  That's largely independent of device passthrough.  
> It's a relatively minor tweak to how the passthrough device sets up its MMIO 
> regions.
> 
> Mapping host device MMIO regions into guest space is entirely uninteresting 
> unless we already have some way of creating guest-host passthrough devices.  
> Creating guest-device passthrough devices isn't going to happen until we can
> create arbitrary devices (within the set emulated by qemu) that interact with 
> the rest of the emulated machine in a similar way.
> 
> Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 21:51             ` Scott Wood
@ 2011-07-01 23:33               ` Paul Brook
  0 siblings, 0 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 23:33 UTC (permalink / raw)
  To: Scott Wood
  Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
	armbru@redhat.com

> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that don't exist in any device that
> > can be emulated?
> 
> Perhaps I misunderstood what you meant by "similar characteristics".  I see
> no reason to spend a bunch of time implementing full emulation for a
> device, that isn't going to be used, just because it seems like a nice
> intermediary step.

You say your device has MMIO regions, generates IRQs and initiates DMA 
transactions.  Any device or selection of devices that between them use all 
those features will do the job. I'd expect most SoC to have several.  We don't 
care what the device actually does, only the ways it communicates with the 
rest of the machine.

I think you're coming at this problem from completely the wrong direction.  
Instead of "how do I wedge this passthrough into my machine", you should be 
asking "how do I create a machine without knowing the machine layout at 
compile time".  Once you fix that, hooking up the passthrough device should be 
fairly trivial.  You only have a single passthrough device, and the rest of us 
have none at all.  Anything restricted to the passthrough case is thus unlikely
to be the right answer to the second question, and I'd expect it to be 
removed/changed/broken when we do get round to implementing dynamic device 
creation.

> > > We're talking about directly mapping the registers into the guest.  The
> > > whole point is performance.
> > 
> > That's an additional step after you get passthrough working the normal
> > way.
> 
> "normal"?

Mapping a MMIO region into the guest is an additional complication, and purely 
a performance optimization.  qemu already needs to be in the loop to handle 
interrupts, probably DMA setup and the non-kvm case.

> I'm not sure what the use case is for direct assignment of a device in an
> otherwise completely emulated guest, but perhaps there is one.

Typically because the host system doesn't know how to talk to it, or there 
isn't a sensible way to relay the functionality provided by the device from 
the kernel to qemu.

> > We already have mechanisms (or at least patches) for mapping file-like
> > objects into guest physical memory.  That's largely independent of
> > device passthrough. It's a relatively minor tweak to how the passthrough
> > device sets up its MMIO regions.
> > 
> > Mapping host device MMIO regions into guest space is entirely
> > uninteresting unless we already have some way of creating guest-host
> > passthrough devices.
> 
> Isn't that what's being discussed?

It's your end goal, but I don't think it's particularly relevant to the 
problem you've encountered.

> > Creating guest-device passthrough devices isn't going to happen until we
> > can create arbitrary devices (within the set emulated by qemu) that
> > interact with the rest of the emulated machine in a similar way.
> 
> What do you mean by "interact with the rest of the emulated machine in a
> similar way"?

See first paragraph above.

Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 23:05             ` Benjamin Herrenschmidt
@ 2011-07-01 23:50               ` Paul Brook
  2011-07-02  2:17                 ` Alexander Graf
  0 siblings, 1 reply; 29+ messages in thread
From: Paul Brook @ 2011-07-01 23:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
	armbru@redhat.com

> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > > On Fri, 1 Jul 2011 18:03:01 +0100
> > > 
> > > Paul Brook <paul@codesourcery.com> wrote:
> > > > Basically you should start by implementing full emulation of a device
> > > > with similar characteristics to the one you want to passthrough.
> > > 
> > > That's not going to happen.
> > 
> > Why is your device so unique? How does it interact with the guest system
> > > and what features does it require that don't exist in any device that
> > can be emulated?
> 
> Do you guys only support PCI pass-through by doing full emulation of
> all possible supported PCI devices first? :-)

Absolutely not.  My point is that dynamic (user-driven) device creation is 
effectively a prerequisite for a passthrough device.

If you just want to make a very specific use-case then this doesn't need any 
code in qemu at all.  We just make the user provide the device tree 
themselves. If it doesn't match then they lose.  If you do choose an ugly
qemu hack then the chances are it'll be changed/removed once we do dynamic device
creation properly.  There have already been discussions about dynamic device
creation, so this isn't completely hypothetical.

If you integrate it properly, then you need to realise that there's a fair
chunk of infrastructure and user interface required.  Most of which has 
nothing to do with device passthrough.  Trying to implement both at the same 
time is just going to cause confusion and complicate things.  It's already a 
hard problem, combining it with something else is just going to cause you and 
everyone else even more pain.

Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 23:50               ` Paul Brook
@ 2011-07-02  2:17                 ` Alexander Graf
  2011-07-02 11:45                   ` Paul Brook
  0 siblings, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-02  2:17 UTC (permalink / raw)
  To: Paul Brook
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, Scott Wood,
	armbru@redhat.com


On 02.07.2011, at 01:50, Paul Brook wrote:

>> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
>>>> On Fri, 1 Jul 2011 18:03:01 +0100
>>>> 
>>>> Paul Brook <paul@codesourcery.com> wrote:
>>>>> Basically you should start by implementing full emulation of a device
>>>>> with similar characteristics to the one you want to passthrough.
>>>> 
>>>> That's not going to happen.
>>> 
>>> Why is your device so unique? How does it interact with the guest system
>>> and what features does it require that don't exist in any device that
>>> can be emulated?
>> 
>> Do you guys only support PCI pass-through by doing full emulation of
>> all possible supported PCI devices first? :-)
> 
> Absolutely not.  My point is that dynamic (user-driven) device creation is 
> effectively a prerequisite for a passthrough device.
> 
> If you just want to make a very specific use-case then this doesn't need any 
> code in qemu at all.  We just make the user provide the device tree 
> themselves. If it doesn't match then they lose.  If you do choose an ugly
> qemu hack then the chances are it'll be changed/removed once we do dynamic device
> creation properly.  There have already been discussions about dynamic device
> creation, so this isn't completely hypothetical.
> 
> If you integrate it properly, then you need to realise that there's a fair
> chunk of infrastructure and user interface required.  Most of which has 
> nothing to do with device passthrough.  Trying to implement both at the same 
> time is just going to cause confusion and complicate things.  It's already a 
> hard problem, combining it with something else is just going to cause you and 
> everyone else even more pain.

So you're basically saying we should tackle these 3 issues separately:

  * actually pass through a device
  * generate interrupt links
  * model the guest device tree dynamically based on whatever the user gives us

I tend to agree with that perspective. Still, the main issue stands: we don't have a concrete answer for all three issues :). Facing them one at a time might actually help solve them, though.


Alex


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-02  2:17                 ` Alexander Graf
@ 2011-07-02 11:45                   ` Paul Brook
  0 siblings, 0 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-02 11:45 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, joerg.roedel@amd.com, Scott Wood,
	armbru@redhat.com

> So you're basically saying we should tackle these 3 issues separately:
> 
>   * actually pass through a device
>   * generate interrupt links
>   * model the guest device tree dynamically based on whatever the user
> gives us

Yes.

Paul


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01 22:32       ` Anthony Liguori
@ 2011-07-05 18:16         ` Scott Wood
  0 siblings, 0 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-05 18:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
	Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
	alex.williamson@redhat.com, paul@codesourcery.com,
	dwg@au1.ibm.com, armbru@redhat.com

On Fri, 1 Jul 2011 17:32:43 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 07/01/2011 11:43 AM, Scott Wood wrote:
> > However, we'll need to address the question of what it means to say "irq 10"
> 
> It depends on what the bus is.  If you're going to declare "system bus" 
> which is sort of what we call ISA for the PC,

More like "arbitrary MMIO".  Could be an on-chip peripheral.  Could be some
external custom chip.  Could be an entire PCIe root complex.

> then it can map trivially to the interrupt controller's inputs.

Which interrupt controller?  We might want to assign an IRQ that's on some
cascaded controller.

We also have some things like MPIC IPIs and timers,
that are on the main interrupt controller but aren't normal numbered
interrupts.  We use the ability to have multiple cells in an interrupt
specifier to express these.  And while you could make up fake numbers for
these to force it to be linear, someone has to come up with this mapping and
get qemu, its users, and the kernel to agree on it.  We already have a
repository for such bindings for the device tree.
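
To illustrate the difference between a flat IRQ number and a device-tree
style specifier, here is a minimal Python sketch.  The class and function
names are hypothetical, the cascaded controller is invented, and only the
(43, 2) specifier is taken from the i2c example earlier in the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InterruptController:
    name: str
    interrupt_cells: int  # mirrors the "#interrupt-cells" property
    parent: Optional["InterruptController"] = None  # set for cascaded controllers


@dataclass(frozen=True)
class IrqSpec:
    controller: InterruptController
    cells: tuple

    def __post_init__(self):
        # A specifier is only meaningful relative to its controller's cell count.
        if len(self.cells) != self.controller.interrupt_cells:
            raise ValueError("specifier does not match #interrupt-cells")


def controller_chain(spec):
    """Controller names from the one the device is wired to, up to the root."""
    chain = []
    ctrl = spec.controller
    while ctrl is not None:
        chain.append(ctrl.name)
        ctrl = ctrl.parent
    return chain


mpic = InterruptController("mpic", 2)
cascade = InterruptController("cascade", 1, parent=mpic)  # hypothetical
i2c_irq = IrqSpec(mpic, (43, 2))  # interrupts = <43 2> from the i2c node
```

Because the controller is part of the specifier, no global numbering has to
be invented for interrupts that sit behind a cascade.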

That's not to say that the device tree should be forced onto platforms that
have some other reasonable way of doing it, of course -- just that it's
nice to be able to refer to it when it's there.

> > -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> > a fiction created by some software layer.
> 
> PC's don't have a global IRQ number space FWIW.  When we say:
> 
> -device isa-serial,irq=4
> 
> This really means, "ISA irq 4", which is mapped to the PIIX3 and then 
> routed through GSI, then the APIC architecture to correspond to some 
> interrupt for some physical CPU.

Well, it's been a while since I've dealt with such things on PCs...  I
thought there was at least some standard way of interpreting things like
IRQ numbers that the BIOS wrote into PCI config space.

> > Addressing this is one of the
> > device tree's strengths.
> 
> Not really.  There's nothing magical about the device tree.  It's just a 
> guest visible description of the platform hardware that isn't probe-able 
> in some bus framework.  ACPI does exactly the same thing.  I'll concede 
> that the device tree is far nicer than ACPI but again, it's not magical :-)

I didn't say it was the only way to express it -- just that the device tree,
or something like it, comes in useful here.

And we're not about to do ACPI on powerpc. :-)

-Scott


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-01  0:58 ` Benjamin Herrenschmidt
                     ` (2 preceding siblings ...)
  2011-07-01 16:34   ` Scott Wood
@ 2011-07-05 18:19   ` Yoder Stuart-B08248
  2011-07-05 22:23     ` Alexander Graf
  3 siblings, 1 reply; 29+ messages in thread
From: Yoder Stuart-B08248 @ 2011-07-05 18:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
	qemu-devel@nongnu.org, dwg@au1.ibm.com, blauwirbel@gmail.com,
	alex.williamson@redhat.com, paul@codesourcery.com,
	armbru@redhat.com



> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
> Sent: Thursday, June 30, 2011 7:58 PM
> To: Yoder Stuart-B08248
> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; alex.williamson@redhat.com;
> anthony@codemonkey.ws; dwg@au1.ibm.com; joerg.roedel@amd.com; paul@codesourcery.com;
> blauwirbel@gmail.com; armbru@redhat.com
> Subject: Re: device assignment for embedded Power
> 
> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
> > One feature we need for QEMU/KVM on embedded Power Architecture is the
> > ability to do passthru assignment of SoC I/O devices and memory.  An
> > important use case in embedded is creating static partitions-- taking
> > physical memory and I/O devices (non-PCI) and partitioning
> > them between the host Linux and several virtual machines.   Things like
> > live migration would not be needed or supported in these types of scenarios.
> >
> > SoC devices do not sit on a probeable bus and there are no identifiers
> > like 01:00.0 with PCI that we can use to identify devices--  the host
> > Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> > device tree structure passed at boot.   QEMU needs to generate a
> > device tree to pass to the guest as well with all the guest's virtual
> > and physical resources.  Today a number of mostly complete guest
> > device trees are kept under ./pc-bios in QEMU, but this too static and
> > inflexible.
> >
> > Some new mechanism is needed to assign SoC devices to guests, and we
> > (FSL + Alex Graf) have been discussing a few possible approaches for
> > doing this from QEMU and would like some feedback.
> >
> > Some possibilities:
> >
> > 1. Option 1.  Pass the host dev tree to QEMU and assign devices
> >    by device tree path
> >
> >      -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
> >
> >    /soc/i2c@3000 is the device tree path to the assigned device.
> >    The device node 'i2c@3000' has some number of properties (e.g.
> >    address, interrupt info) and possibly subnodes under
> >    it.   QEMU copies that node when generating the guest dev tree.
> >    See snippet of entire node:  http://paste2.org/p/1496460
> 
> Yuck (see below)
> 
> > 2. Option 2.  Pass the entire assigned device node as a string to
> >    QEMU
> >
> >      -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
> >       #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
> >       reg = <0xffe03000 0x100>; interrupts = <43 2>;
> >       interrupt-parent = <&mpic>; dfsrr;'
> 
> Beuark ! (see below)
> 
> >    This avoids needing to pass the host device tree, but could
> >    get awkward-- the i2c example above is very simple, some device
> >    nodes are very large with a complex hierarchy of subnodes and
> >    could be hundreds of lines of text to represent a single
> >    node.
> >
> > It gets more complicated...
> 
> 
> So, from a qemu command line perspective, all you should have to do is pass qemu the device-
> tree -path- to the device you want to pass-trough (you may support passing a full hierarchy
> here).
> 
> That is for normal MMIO mapped SoC devices. Something else (individual i2c, usb, ...) will use
> specific virtualization of the corresponding busses.

Then why 'yuck' to option 1 :)?   That is basically what was being proposed.

> Anything else sucks too much really.
> 
> From there, well, there's several approach inside qemu/kvm to handle that path. If you want to
> do things at the qemu level you can probably parse /proc/device-tree. But I'd personally just
> make it a kernel thing.
>
> IE. I would have an ioctl to "instanciate" a pass-through device, that takes that path as an
> argument. I would make it return an anonymous fd which you can then use to mmap the resources,
> etc...

Regarding implementation I think there are 3 things that need
to be set up--  1) mmapping the device's registers, 2) getting the iommu
set up (if there is one), 3) getting the interrupt(s) handled.
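
For step 1, a minimal sketch (not an actual QEMU or kernel interface) of how
the page-aligned window for mmapping a device's registers could be derived
from a node's "reg" property.  The function name mmio_window, the 4 KiB page
size, and the use of the i2c values from option 2 (reg = <0xffe03000 0x100>)
as an already-translated physical address are assumptions for illustration:

```python
def mmio_window(reg_base, reg_size, page_size=4096):
    """Return (map_offset, map_length, delta) for a register block.

    map_offset: page-aligned physical address to pass to mmap
    map_length: length of the mapping, rounded up to whole pages
    delta:      offset of the first register inside the mapping
    """
    if reg_size <= 0:
        raise ValueError("register block must have a positive size")
    # mmap works on whole pages, so round the base down to a page boundary.
    map_offset = reg_base & ~(page_size - 1)
    delta = reg_base - map_offset
    # Round the end of the block up to the next page boundary.
    span = reg_base + reg_size - map_offset
    map_length = (span + page_size - 1) // page_size * page_size
    return map_offset, map_length, delta


# Hypothetical input: the i2c node from option 2, reg = <0xffe03000 0x100>.
offset, length, delta = mmio_window(0xFFE03000, 0x100)
print(hex(offset), length, delta)
```

Note that a 0x100-byte block still occupies a full page in the mapping, so
whatever else lives in that page would be exposed along with it.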

> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
> 
> That's fishy. Why wouldn't you give full access to the MDIO ? It's shared ? Such things are so
> device-specific that they would have to be handled by device-specific quirks, which can live
> either in qemu or in the kernel.

It is shared, and in this case we didn't want the phy shared.   That was a super
simple example to illustrate the idea.  From our experience with the Freescale
Embedded Hypervisor we see this as a definite requirement-- nodes in the
hardware device tree may need modifications.  In the P4080 device tree there
are some complex relationships expressed between nodes of our 'data
path'.   In some cases the hardware device tree expresses configuration
information, and while it could be argued that config info does not belong
there, it's what some drivers expect and what we have right now.   So, a mechanism
to allow node updates is really needed.

> > So in addition to assigning a device, a mechanism is needed to update
> > device tree nodes.  So for the above example, maybe--
> >
> >  -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
> >   node-update="fixed-link = <2 1 1000 0 0>"
> 
> That's just so gross and error prone, borderline insane.

Not going to argue the gross/insane part, but it's reality.  I don't
think anyone would type all that in at the command line; it would
be in an init script or something, so I don't see it being more error
prone than messing around with device trees in general.

There's a small set of operations needed, based on our experience:
   - adding, deleting properties (including phandle references)
   - adding, deleting nodes (including subtrees)
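
A minimal sketch of those operations on an in-memory tree, applied to the
ethernet example above.  The nested-dict layout and the helper names
(find_node, set_prop, delete_prop, add_node, delete_node) are made up for
illustration and are not an existing QEMU or libfdt API; phandle references
are not modeled, and the property values are placeholders:

```python
def new_node():
    """An empty node: properties plus named child nodes."""
    return {"props": {}, "nodes": {}}


def find_node(tree, path):
    """Walk a path such as /soc/ethernet@b2000 from the root node."""
    node = tree
    for name in path.strip("/").split("/"):
        if name:
            node = node["nodes"][name]
    return node


def set_prop(tree, path, name, value):
    find_node(tree, path)["props"][name] = value


def delete_prop(tree, path, name):
    del find_node(tree, path)["props"][name]


def add_node(tree, parent_path, name):
    return find_node(tree, parent_path)["nodes"].setdefault(name, new_node())


def delete_node(tree, path):
    """Remove a node together with its whole subtree."""
    parent_path, _, name = path.rstrip("/").rpartition("/")
    del find_node(tree, parent_path)["nodes"][name]


# Hypothetical guest tree containing the assigned ethernet node.
guest = new_node()
add_node(guest, "/", "soc")
add_node(guest, "/soc", "ethernet@b2000")
set_prop(guest, "/soc/ethernet@b2000", "phy-handle", "<&phy0>")

# The update from the example: drop phy-handle, add fixed-link.
delete_prop(guest, "/soc/ethernet@b2000", "phy-handle")
set_prop(guest, "/soc/ethernet@b2000", "fixed-link", [2, 1, 1000, 0, 0])
```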

Stuart


* Re: [Qemu-devel] device assignment for embedded Power
  2011-07-05 18:19   ` Yoder Stuart-B08248
@ 2011-07-05 22:23     ` Alexander Graf
  0 siblings, 0 replies; 29+ messages in thread
From: Alexander Graf @ 2011-07-05 22:23 UTC (permalink / raw)
  To: Yoder Stuart-B08248
  Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
	blauwirbel@gmail.com, alex.williamson@redhat.com,
	paul@codesourcery.com, joerg.roedel@amd.com, armbru@redhat.com


On 05.07.2011, at 20:19, Yoder Stuart-B08248 wrote:

> 
> 
>> -----Original Message-----
>> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
>> Sent: Thursday, June 30, 2011 7:58 PM
>> To: Yoder Stuart-B08248
>> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; alex.williamson@redhat.com;
>> anthony@codemonkey.ws; dwg@au1.ibm.com; joerg.roedel@amd.com; paul@codesourcery.com;
>> blauwirbel@gmail.com; armbru@redhat.com
>> Subject: Re: device assignment for embedded Power
>> 
>> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory.  An
>>> important use case in embedded is creating static partitions-- taking
>>> physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines.   Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>> 
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices--  the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot.   QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources.  Today a number of mostly complete guest
>>> device trees are kept under ./pc-bios in QEMU, but this too static and
>>> inflexible.
>>> 
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches for
>>> doing this from QEMU and would like some feedback.
>>> 
>>> Some possibilities:
>>> 
>>> 1. Option 1.  Pass the host dev tree to QEMU and assign devices
>>>   by device tree path
>>> 
>>>     -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>> 
>>>   /soc/i2c@3000 is the device tree path to the assigned device.
>>>   The device node 'i2c@3000' has some number of properties (e.g.
>>>   address, interrupt info) and possibly subnodes under
>>>   it.   QEMU copies that node when generating the guest dev tree.
>>>   See snippet of entire node:  http://paste2.org/p/1496460
>> 
>> Yuck (see below)
>> 
>>> 2. Option 2.  Pass the entire assigned device node as a string to
>>>   QEMU
>>> 
>>>     -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>>>      #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>>>      reg = <0xffe03000 0x100>; interrupts = <43 2>;
>>>      interrupt-parent = <&mpic>; dfsrr;'
>> 
>> Beuark ! (see below)
>> 
>>>   This avoids needing to pass the host device tree, but could
>>>   get awkward-- the i2c example above is very simple, some device
>>>   nodes are very large with a complex hierarchy of subnodes and
>>>   could be hundreds of lines of text to represent a single
>>>   node.
>>> 
>>> It gets more complicated...
>> 
>> 
>> So, from a qemu command line perspective, all you should have to do is pass qemu the device-
>> tree -path- to the device you want to pass-trough (you may support passing a full hierarchy
>> here).
>> 
>> That is for normal MMIO mapped SoC devices. Something else (individual i2c, usb, ...) will use
>> specific virtualization of the corresponding busses.
> 
> Then why 'yuck' to option 1 :)?   That is basically what was being proposed.

Yes, and probably a good idea to go with for now. We can handle the guest device tree parts externally for now by passing in a fully populated device tree that just contains everything we need and pass qemu the configuration the way we did it in the device tree.


>> Anything else sucks too much really.
>> 
>> From there, well, there's several approach inside qemu/kvm to handle that path. If you want to
>> do things at the qemu level you can probably parse /proc/device-tree. But I'd personally just
>> make it a kernel thing.
>> 
>> IE. I would have an ioctl to "instanciate" a pass-through device, that takes that path as an
>> argument. I would make it return an anonymous fd which you can then use to mmap the resources,
>> etc...
> 
> Regarding implementation I think there are 3 things that need
> to be set up--  1) mmapping the device's registers, 2) getting the iommu
> set up (if there is one), 3) getting the interrupt(s) handled.

Yes :).

I guess we'll just have to sit down and implement something very simple that can at least pass through MMIO regions and interrupts, and then take it from there until we hit the many walls ahead.


Alex


end of thread, other threads:[~2011-07-05 22:37 UTC | newest]

Thread overview: 29+ messages
2011-06-30 15:59 [Qemu-devel] device assignment for embedded Power Yoder Stuart-B08248
2011-07-01  0:58 ` Benjamin Herrenschmidt
2011-07-01 11:40   ` Alexander Graf
2011-07-01 12:13     ` Anthony Liguori
2011-07-01 12:10   ` Anthony Liguori
2011-07-01 12:52     ` Paul Brook
2011-07-01 13:33       ` Anthony Liguori
2011-07-01 16:43     ` Scott Wood
2011-07-01 17:03       ` Paul Brook
2011-07-01 17:49         ` Scott Wood
2011-07-01 20:59           ` Paul Brook
2011-07-01 21:51             ` Scott Wood
2011-07-01 23:33               ` Paul Brook
2011-07-01 23:05             ` Benjamin Herrenschmidt
2011-07-01 23:50               ` Paul Brook
2011-07-02  2:17                 ` Alexander Graf
2011-07-02 11:45                   ` Paul Brook
2011-07-01 22:35         ` Anthony Liguori
2011-07-01 22:32       ` Anthony Liguori
2011-07-05 18:16         ` Scott Wood
2011-07-01 16:34   ` Scott Wood
2011-07-05 18:19   ` Yoder Stuart-B08248
2011-07-05 22:23     ` Alexander Graf
2011-07-01 11:16 ` Paul Brook
2011-07-01 11:33   ` Alexander Graf
2011-07-01 11:55     ` Paul Brook
2011-07-01 12:02       ` Alexander Graf
2011-07-01 12:14         ` Anthony Liguori
2011-07-01 17:51   ` Scott Wood
