* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 0:58 ` Benjamin Herrenschmidt
@ 2011-07-01 11:40 ` Alexander Graf
2011-07-01 12:13 ` Anthony Liguori
2011-07-01 12:10 ` Anthony Liguori
` (2 subsequent siblings)
3 siblings, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-01 11:40 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
dwg@au1.ibm.com, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
armbru@redhat.com
On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:
> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>> ability to do passthru assignment of SoC I/O devices and memory. An
>> important use case in embedded is creating static partitions--
>> taking physical memory and I/O devices (non-PCI) and partitioning
>> them between the host Linux and several virtual machines. Things like
>> live migration would not be needed or supported in these types of scenarios.
>>
>> SoC devices do not sit on a probeable bus and there are no identifiers
>> like 01:00.0 with PCI that we can use to identify devices-- the host
>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>> device tree structure passed at boot. QEMU needs to generate a
>> device tree to pass to the guest as well with all the guest's virtual
>> and physical resources. Today a number of mostly complete guest device
>> trees are kept under ./pc-bios in QEMU, but this is too static and
>> inflexible.
>>
>> Some new mechanism is needed to assign SoC devices to guests, and we
>> (FSL + Alex Graf) have been discussing a few possible approaches
>> for doing this from QEMU and would like some feedback.
>>
>> Some possibilities:
>>
>> 1. Option 1. Pass the host dev tree to QEMU and assign devices
>> by device tree path
>>
>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>
>> /soc/i2c@3000 is the device tree path to the assigned device.
>> The device node 'i2c@3000' has some number of properties (e.g.
>> address, interrupt info) and possibly subnodes under
>> it. QEMU copies that node when generating the guest dev tree.
>> See snippet of entire node: http://paste2.org/p/1496460
>
> Yuck (see below)
>
>> 2. Option 2. Pass the entire assigned device node as a string to
>> QEMU
>>
>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>> #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>> reg = <0xffe03000 0x100>; interrupts = <43 2>;
>> interrupt-parent = <&mpic>; dfsrr;'
>
> Beuark ! (see below)
>
>> This avoids needing to pass the host device tree, but could
>> get awkward-- the i2c example above is very simple, some device
>> nodes are very large with a complex hierarchy of subnodes and
>> could be hundreds of lines of text to represent a single
>> node.
>>
>> It gets more complicated...
>
>
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
>
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
>
> Anything else sucks too much really.
>
> From there, well, there are several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree. But I'd personally just make it a kernel thing.
>
> I.e., I would have an ioctl to "instantiate" a pass-through device that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...
Yeah, one idea was to use VFIO here. We could, for example, modify the
host device tree to occupy the device we want to pass through with a
specific compatibility parameter. Or we could try to steal the node at
runtime. But I agree, reading the device tree data from a VFIO node
sounds reasonable, if it's required.
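For illustration, a minimal sketch of the "parse /proc/device-tree at the
qemu level" idea quoted above, written in Python rather than QEMU's C. It
decodes a node's reg property into (address, size) pairs. The helper names
(decode_cells, decode_reg, read_reg) are made up for this sketch, the cell
counts are assumed to come from the parent node's #address-cells and
#size-cells, and no translation through "ranges" is attempted.

```python
import struct
from pathlib import Path


def decode_cells(raw):
    """Split a raw property value into 32-bit big-endian cells."""
    if len(raw) % 4:
        raise ValueError("property length is not a multiple of 4 bytes")
    return [cell for (cell,) in struct.iter_unpack(">I", raw)]


def decode_reg(raw, address_cells, size_cells):
    """Decode a 'reg' property into a list of (address, size) pairs."""
    cells = decode_cells(raw)
    stride = address_cells + size_cells
    if stride == 0 or len(cells) % stride:
        raise ValueError("reg length does not match the given cell counts")
    regions = []
    for i in range(0, len(cells), stride):
        addr = 0
        for cell in cells[i:i + address_cells]:
            addr = (addr << 32) | cell
        size = 0
        for cell in cells[i + address_cells:i + stride]:
            size = (size << 32) | cell
        regions.append((addr, size))
    return regions


def read_reg(node_path, address_cells=1, size_cells=1,
             root="/proc/device-tree"):
    """Read <root>/<node_path>/reg on a device-tree based Linux host.

    The default cell counts are an assumption; a real caller would read
    them from the parent node.
    """
    raw = (Path(root) / node_path.lstrip("/") / "reg").read_bytes()
    return decode_reg(raw, address_cells, size_cells)


# The bytes a property file would hold for reg = <0xffe03000 0x100>
example = bytes.fromhex("ffe0300000000100")
regions = decode_reg(example, address_cells=1, size_cells=1)
```

On a host with a device tree, read_reg("/soc/i2c@3000") would be the
corresponding call; whether this lives in QEMU or behind a kernel interface
is exactly the open question here.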
>
>> In some cases, modifications to device tree nodes may be needed.
>> An example-- sometimes a device tree property references another node
>> and that relationship may not exist when assigned to a guest.
>> A "phy-handle" property may need to be deleted and a "fixed-link"
>> property added to a node representing a network device.
>
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ? Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.
Hrm, so you'd create a separate device for MDIO which can do pass-through of those?
>
>> So in addition to assigning a device, a mechanism is needed to update
>> device tree nodes. So for the above example, maybe--
>>
>> -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
>> node-update="fixed-link = <2 1 1000 0 0>"
>
> That's just so gross and error prone, borderline insane.
Alternatives:
* not modify the device tree (unlikely to work)
* pass a full device tree chunk to qemu instead of modification commands
* ?
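To make the "modification commands" alternative concrete, here is a small
Python sketch of the kind of transformation being discussed: delete one
property and add another on a copy of a host node. The node is modelled as
a plain dict, apply_edits is a hypothetical helper, and the "compatible"
and "reg" values are placeholders, not taken from a real tree.

```python
import copy


def apply_edits(node, delete_props=(), set_props=None):
    """Return a copy of `node` with properties deleted, then added/updated."""
    out = copy.deepcopy(node)
    for name in delete_props:
        if name not in out:
            raise KeyError(f"cannot delete missing property {name!r}")
        del out[name]
    out.update(set_props or {})
    return out


# Placeholder host node for /soc/ethernet@b2000 (values are illustrative)
host_node = {
    "compatible": "example,ethernet",
    "reg": [0xb2000, 0x1000],
    "phy-handle": "&phy0",
}

# Equivalent of delete-prop=phy-handle plus node-update="fixed-link = <...>"
guest_node = apply_edits(
    host_node,
    delete_props=["phy-handle"],
    set_props={"fixed-link": [2, 1, 1000, 0, 0]},
)
```

The same edit list could just as well come from a tree fragment supplied by
the user instead of command line options; the sketch only shows that the
operation itself is small.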
>
>> The types of modifications needed-- deleting nodes, deleting properties,
>> adding nodes, adding properties, adding properties that reference other
>> nodes, changing properties. This device tree transformation mechanism
>> needed is general enough that it could apply to any device tree based
>> embedded platform (e.g. ARM, MIPS)
>>
>> Another complexity relates to the IOMMU. Here things get very company
>> and IOMMU specific. Freescale has a proprietary IOMMU.
>
> Look at the work currently being done for a generic qemu iommu layer. We
> need it for server power as well and from what I last saw coming from
> Eduardo and David, it's not PCI specific.
Well, but it only implements an IOMMU emulation layer inside the guest. That doesn't help us for the host side of things unfortunately :).
Alex
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 11:40 ` Alexander Graf
@ 2011-07-01 12:13 ` Anthony Liguori
0 siblings, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 12:13 UTC (permalink / raw)
To: Alexander Graf
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
joerg.roedel@amd.com, armbru@redhat.com
On 07/01/2011 06:40 AM, Alexander Graf wrote:
>
> On 01.07.2011, at 02:58, Benjamin Herrenschmidt wrote:
>
>> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory. An
>>> important use case in embedded is creating static partitions--
>>> taking physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines. Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>>
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices-- the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot. QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources. Today a number of mostly complete guest device
>>> trees are kept under ./pc-bios in QEMU, but this is too static and
>>> inflexible.
>>>
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches
>>> for doing this from QEMU and would like some feedback.
>>>
>>> Some possibilities:
>>>
>>> 1. Option 1. Pass the host dev tree to QEMU and assign devices
>>> by device tree path
>>>
>>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>>
>>> /soc/i2c@3000 is the device tree path to the assigned device.
>>> The device node 'i2c@3000' has some number of properties (e.g.
>>> address, interrupt info) and possibly subnodes under
>>> it. QEMU copies that node when generating the guest dev tree.
>>> See snippet of entire node: http://paste2.org/p/1496460
>>
>> Yuck (see below)
>>
>>> 2. Option 2. Pass the entire assigned device node as a string to
>>> QEMU
>>>
>>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells =<1>;
>>> #size-cells =<0>; cell-index =<0>; compatible = "fsl-i2c";
>>> reg =<0xffe03000 0x100>; interrupts =<43 2>;
>>> interrupt-parent =<&mpic>; dfsrr;'
>>
>> Beuark ! (see below)
>>
>>> This avoids needing to pass the host device tree, but could
>>> get awkward-- the i2c example above is very simple, some device
>>> nodes are very large with a complex hierarchy of subnodes and
>>> could be hundreds of lines of text to represent a single
>>> node.
>>>
>>> It gets more complicated...
>>
>>
>> So, from a qemu command line perspective, all you should have to do is
>> pass qemu the device-tree -path- to the device you want to pass-through
>> (you may support passing a full hierarchy here).
>>
>> That is for normal MMIO mapped SoC devices. Something else (individual
>> i2c, usb, ...) will use specific virtualization of the corresponding
>> busses.
>>
>> Anything else sucks too much really.
>>
>> From there, well, there are several approaches inside qemu/kvm to handle
>> that path. If you want to do things at the qemu level you can probably
>> parse /proc/device-tree. But I'd personally just make it a kernel thing.
>>
>> I.e., I would have an ioctl to "instantiate" a pass-through device that
>> takes that path as an argument. I would make it return an anonymous fd
>> which you can then use to mmap the resources, etc...
>
> Yeah, one idea was to use VFIO here. We could, for example, modify the
> host device tree to occupy the device we want to pass through with a
> specific compatibility parameter. Or we could try to steal the node at
> runtime. But I agree, reading the device tree data from a VFIO node
> sounds reasonable, if it's required.
That makes it very specific to systems that use device trees.
To do the same for ARM platforms or x86, you would need to invent yet
another mechanism.
Passing through arbitrary MMIO is fairly straightforward (likewise with
PIO). Passing through IRQs is a bit less straightforward and perhaps
VFIO is the answer here.
I don't see a problem with QEMU figuring out what a device's resources
are and doing the assignment.
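One detail worth keeping in mind when QEMU works out a device's resources:
memory mappings are made in whole pages, so a small register block drags
the rest of its page along. The Python sketch below (page_span is a
hypothetical helper, and a 4 KiB page is assumed in the examples) computes
the page-aligned window that would have to be mapped for a given region.

```python
import mmap


def page_span(offset, size, page_size=mmap.PAGESIZE):
    """Return (aligned_offset, map_length, delta) covering the region.

    aligned_offset is rounded down to a page boundary, map_length is a
    whole number of pages, and delta is where the region starts inside
    the mapping.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    aligned = offset - (offset % page_size)
    delta = offset - aligned
    pages = -(-(delta + size) // page_size)  # ceiling division
    return aligned, pages * page_size, delta


# The i2c block from the original post: reg = <0xffe03000 0x100>
i2c_window = page_span(0xffe03000, 0x100, page_size=4096)

# A region that is not page aligned needs a larger window
unaligned_window = page_span(0xffe03f80, 0x100, page_size=4096)
```

With these numbers the 0x100-byte i2c block still needs a full 4096-byte
mapping, so anything else sharing that page would become visible to the
guest as well.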
Regards,
Anthony Liguori
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 0:58 ` Benjamin Herrenschmidt
2011-07-01 11:40 ` Alexander Graf
@ 2011-07-01 12:10 ` Anthony Liguori
2011-07-01 12:52 ` Paul Brook
2011-07-01 16:43 ` Scott Wood
2011-07-01 16:34 ` Scott Wood
2011-07-05 18:19 ` Yoder Stuart-B08248
3 siblings, 2 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 12:10 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Alexander Graf, Wood Scott-B07421, joerg.roedel@amd.com,
qemu-devel@nongnu.org, dwg@au1.ibm.com, blauwirbel@gmail.com,
Yoder Stuart-B08248, alex.williamson@redhat.com,
paul@codesourcery.com, armbru@redhat.com
On 06/30/2011 07:58 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>> This avoids needing to pass the host device tree, but could
>> get awkward-- the i2c example above is very simple, some device
>> nodes are very large with a complex hierarchy of subnodes and
>> could be hundreds of lines of text to represent a single
>> node.
>>
>> It gets more complicated...
>
>
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
I agree in principle but I think it should be done in a slightly
different way.
I think we ought to support composing a device by passthrough. For
instance, something like:
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
This should be independent of anything to do with device tree. This
would be useful for x86 too to assign platform devices (like the HPET).
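As a rough illustration of how little is needed to consume a description
like the one above, here is a Python sketch that parses such a section
into a nested structure. The physical-device section and its
region[N]/irq[N] keys are only the proposal from this mail, not an existing
QEMU device, and parse_physical_device is a name invented for the sketch.

```python
import re

SECTION_RE = re.compile(r'^\[(?P<driver>[\w-]+)\s+"(?P<id>[^"]*)"\]$')
PROP_RE = re.compile(
    r'^(?P<kind>\w+)\[(?P<index>\d+)\]\.(?P<field>\w+)\s*=\s*"(?P<value>[^"]*)"$'
)
NUM_RE = re.compile(r"0[xX][0-9a-fA-F]+|0|[1-9][0-9]*")


def parse_physical_device(text):
    """Parse one section into {'driver', 'id', 'resources': {kind: {n: {...}}}}."""
    dev = None
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        m = SECTION_RE.match(line)
        if m:
            if dev is not None:
                raise ValueError(f"line {lineno}: only one section is supported")
            dev = {"driver": m["driver"], "id": m["id"], "resources": {}}
            continue
        m = PROP_RE.match(line)
        if m is None or dev is None:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
        entry = dev["resources"].setdefault(m["kind"], {}).setdefault(
            int(m["index"]), {})
        value = m["value"]
        # Numeric strings (decimal or 0x-prefixed hex) become integers
        entry[m["field"]] = int(value, 0) if NUM_RE.fullmatch(value) else value
    if dev is None:
        raise ValueError("no section header found")
    return dev


SAMPLE = '''
[physical-device "mydev"]
region[0].file = "/dev/mem"
region[0].guest_address = "0x42232000"
region[0].file_offset = "0x23423400"
region[0].size = "4096"
irq[0].guest_irq = "10"
irq[0].host_irq = "10"
'''

mydev = parse_physical_device(SAMPLE)
```

Nothing in the result refers to a device tree, which is the point: a
device-tree front end could generate this structure, and so could
something else on x86.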
I think there should be a separate mechanism to manipulate the guest
device tree, just like there are mechanisms to manipulate the guest's
ACPI tables.
Given these two mechanisms, there should be a simple command line like
Ben has suggested that just takes a host device tree path and Just
Works. It really is just a convenience interface though.
With raw mechanisms like I described above, it would give you the
flexibility to pass through a device with a modified host tree fragment
without having an overly complicated command line interface for the more
common case.
Regards,
Anthony Liguori
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 12:10 ` Anthony Liguori
@ 2011-07-01 12:52 ` Paul Brook
2011-07-01 13:33 ` Anthony Liguori
2011-07-01 16:43 ` Scott Wood
1 sibling, 1 reply; 29+ messages in thread
From: Paul Brook @ 2011-07-01 12:52 UTC (permalink / raw)
To: Anthony Liguori
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, Alexander Graf,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
> > So, from a qemu command line perspective, all you should have to do is
> > pass qemu the device-tree -path- to the device you want to pass-through
> > (you may support passing a full hierarchy here).
>
> I agree in principle but I think it should be done in a slightly
> different way.
>
> I think we ought to support composing a device by passthrough. For
> instance, something like:
>
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
>
> This should be independent of anything to do with device tree. This
> would be useful for x86 too to assign platform devices (like the HPET).
I'm not quite sure what you're getting at here. IMO there should be little or
no need for special knowledge of passthrough devices. They should just be
another qdev device, configured in the normal way. e.g.:
-device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config
Should work the same as adding any other device. If it doesn't then we should
fix that. This is an example of why it's good to have device features (IRQs,
MMIO regions, sockets, or whatever we call them) registered when the device is
instantiated, not relying on pre-compiled device descriptors/property lists.
In the latter case you probably need explicit variants for different numbers of
IRQs, MMIO regions, etc.
While I'm thinking about it, we already have exactly this for USB (i.e. the
usb-host device).
> I think there should be a separate mechanism to manipulate the guest
> device tree, just like there are mechanisms to manipulate the guest's
> ACPI tables.
I agree. Any sort of device tree (IIUC ACPI tables are in principle giving
the same information) is, in practice, going to need to be assembled at
runtime. This needs some mechanism for devices to describe themselves,
probably largely independent of actual machine/device creation code.
We've got away without it thus far because the only real place where we have
nontrivial user-specified machine variants is on the PCI bus. Devices there
are for the most part self-describing so the guest firmware/OS can probe
hardware itself.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 12:52 ` Paul Brook
@ 2011-07-01 13:33 ` Anthony Liguori
0 siblings, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 13:33 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, dwg@au1.ibm.com, armbru@redhat.com
On 07/01/2011 07:52 AM, Paul Brook wrote:
>>> So, from a qemu command line perspective, all you should have to do is
>>> pass qemu the device-tree -path- to the device you want to pass-through
>>> (you may support passing a full hierarchy here).
>>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough. For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree. This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> I'm not quite sure what you're getting at here. IMO there should be little or
> no need for special knowledge of passthrough devices. They should just be
> another qdev device, configured in the normal way. e.g.:
> -device sysbus-host,hostdev=whatever,normal_mmio_and_irq_config
What I wrote about is just readconfig syntax. It's the same as:
-device physical-device,id=mydev,region[0].file=/dev/mem,....
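The equivalence is mechanical, as this small Python sketch tries to show:
it flattens a set of properties into a single -device argument.
device_arg is a hypothetical helper and physical-device is still only the
proposed device, not something QEMU provides. The doubling of commas
inside values follows QEMU's option syntax as I understand it.

```python
def device_arg(driver, dev_id, props):
    """Build the argument string for -device from a property mapping."""
    def esc(value):
        # A literal comma inside a value is written as two commas
        return str(value).replace(",", ",,")

    parts = [driver, f"id={esc(dev_id)}"]
    parts += [f"{key}={esc(value)}" for key, value in props.items()]
    return ",".join(parts)


arg = device_arg(
    "physical-device",
    "mydev",
    {
        "region[0].file": "/dev/mem",
        "region[0].guest_address": "0x42232000",
        "region[0].file_offset": "0x23423400",
        "region[0].size": "4096",
        "irq[0].guest_irq": "10",
        "irq[0].host_irq": "10",
    },
)
```

When passing such a string from a shell, the whole argument should be
quoted so the square brackets are not treated as a glob pattern.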
Regards,
Anthony Liguori
>> I think there should be a separate mechanism to manipulate the guest
>> device tree, just like there are mechanisms to manipulate the guest's
>> ACPI tables.
>
> I agree. Any sort of device tree (IIUC ACPI tables are in principle giving
> the same information) is, in practice, going to need to be assembled at
> runtime. This needs some mechanism for devices to describe themselves,
> probably largely independent of actual machine/device creation code.
>
> We've got away without it thus far because the only real place where we have
> nontrivial user-specified machine variants is on the PCI bus. Devices there
> are for the most part self-describing so the guest firmware/OS can probe
> hardware itself.
>
> Paul
>
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 12:10 ` Anthony Liguori
2011-07-01 12:52 ` Paul Brook
@ 2011-07-01 16:43 ` Scott Wood
2011-07-01 17:03 ` Paul Brook
2011-07-01 22:32 ` Anthony Liguori
1 sibling, 2 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-01 16:43 UTC (permalink / raw)
To: Anthony Liguori
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, Alexander Graf,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
joerg.roedel@amd.com, dwg@au1.ibm.com, armbru@redhat.com
On Fri, 1 Jul 2011 07:10:45 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:
> I agree in principle but I think it should be done in a slightly
> different way.
>
> I think we ought to support composing a device by passthrough. For
> instance, something like:
>
> [physical-device "mydev"]
> region[0].file = "/dev/mem"
> region[0].guest_address = "0x42232000"
> region[0].file_offset = "0x23423400"
> region[0].size = "4096"
> irq[0].guest_irq = "10"
> irq[0].host_irq = "10"
>
> This should be independent of anything to do with device tree. This
> would be useful for x86 too to assign platform devices (like the HPET).
That's fine, as long as there's something layered on top of it for the case
where we do want to reference something in the device tree.
However, we'll need to address the question of what it means to say "irq 10"
-- outside of PC-land there often isn't a global IRQ numberspace that isn't
a fiction created by some software layer. Addressing this is one of the
device tree's strengths.
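A toy Python model of what the device tree offers here: an interrupt is
named by a controller plus a controller-specific specifier, not by a
global number. The tree is a dict keyed by path, the controller path
/soc/pic@40000 is invented for the example, and the lookup is simplified
(it only follows interrupt-parent up the tree and ignores interrupt-map).

```python
def interrupt_parent(tree, path):
    """Walk from `path` toward the root until 'interrupt-parent' is found."""
    parts = [p for p in path.split("/") if p]
    while True:
        node = tree["/" + "/".join(parts)]
        if "interrupt-parent" in node:
            return node["interrupt-parent"]
        if not parts:
            raise LookupError(f"no interrupt-parent found for {path}")
        parts.pop()


def resolve_interrupts(tree, path):
    """Return [(controller_path, specifier_tuple), ...] for a device node."""
    controller = interrupt_parent(tree, path)
    ncells = tree[controller]["#interrupt-cells"]
    cells = tree[path]["interrupts"]
    if len(cells) % ncells:
        raise ValueError("interrupts length does not match #interrupt-cells")
    return [
        (controller, tuple(cells[i:i + ncells]))
        for i in range(0, len(cells), ncells)
    ]


tree = {
    "/": {},
    "/soc": {"interrupt-parent": "/soc/pic@40000"},
    "/soc/pic@40000": {"interrupt-controller": True, "#interrupt-cells": 2},
    "/soc/i2c@3000": {"interrupts": [43, 2]},
}

i2c_irqs = resolve_interrupts(tree, "/soc/i2c@3000")
```

Here "43" only means something relative to that one controller, so a
host_irq = "10" style property would need either a controller reference or
an agreed software numbering behind it.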
-Scott
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 16:43 ` Scott Wood
@ 2011-07-01 17:03 ` Paul Brook
2011-07-01 17:49 ` Scott Wood
2011-07-01 22:35 ` Anthony Liguori
2011-07-01 22:32 ` Anthony Liguori
1 sibling, 2 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 17:03 UTC (permalink / raw)
To: Scott Wood
Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
> > irq[0].guest_irq = "10"
> >
> > This should be independent of anything to do with device tree. This
> > would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say "irq
> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
> isn't a fiction created by some software layer. Addressing this is one of
> the device tree's strengths.
That's an entirely separate problem, though probably a prerequisite.
Basically you should start by implementing full emulation of a device with
similar characteristics to the one you want to passthrough.
Then fix whatever is needed to allow the user to control instantiation of those
devices. This almost certainly means using the -device commandline option.
This currently only works for a fairly simple subset of devices (approximately
PCI and USB), so you'll probably need to fix/implement the missing bits. To
do this you'll probably need to do some work on the various bits of the qdev
relating to linking devices together. See recent discussion about sockets in
the "basic support for composing sysbus devices" thread.
To expose this to the guest you'll probably also need to implement some form
of dynamic device tree assembly/manipulation. Not strictly necessary (we can
require the user supply a complete device tree that matches whatever devices
they've configured), but probably highly desirable.
Once you've done all the above, host device passthrough should be relatively
straightforward. Just replace the emulation bits in the above device with
code that pokes at a real device via the relevant kernel API.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 17:03 ` Paul Brook
@ 2011-07-01 17:49 ` Scott Wood
2011-07-01 20:59 ` Paul Brook
2011-07-01 22:35 ` Anthony Liguori
1 sibling, 1 reply; 29+ messages in thread
From: Scott Wood @ 2011-07-01 17:49 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
On Fri, 1 Jul 2011 18:03:01 +0100
Paul Brook <paul@codesourcery.com> wrote:
> Basically you should start by implementing full emulation of a device with
> similar characteristics to the one you want to passthrough.
That's not going to happen.
> Once you've done all the above, host device passthrough should be relatively
> straightforward. Just replace the emulation bits in the above device with
> code that pokes at a real device via the relevant kernel API.
That's not what we mean by direct device assignment.
We're talking about directly mapping the registers into the guest. The
whole point is performance.
-Scott
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 17:49 ` Scott Wood
@ 2011-07-01 20:59 ` Paul Brook
2011-07-01 21:51 ` Scott Wood
2011-07-01 23:05 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 20:59 UTC (permalink / raw)
To: Scott Wood
Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
> On Fri, 1 Jul 2011 18:03:01 +0100
>
> Paul Brook <paul@codesourcery.com> wrote:
> > Basically you should start by implementing full emulation of a device
> > with similar characteristics to the one you want to passthrough.
>
> That's not going to happen.
Why is your device so unique? How does it interact with the guest system and
what features does it require that doesn't exist in any device that can be
emulated?
I'm also extremely sceptical of anything that only works in a kvm environment.
Makes me think it's an unmaintainable hack, and almost certainly going to
cause you immense amounts of pain later.
> > I doubt you're going to get generic passthrough of arbitrary devices
> > working in a useful way.
>
> It's usefully working for us internally -- we're just trying to find a way
> to improve it for upstream, with a better configuration mechanism.
I don't believe that either. More likely you've got passthrough of devices
hanging off your specific CPU bus, using only (or even a subset of) the
facilities provided by that bus.
> > Basically you have to emulate everything that is different between the
> > host and guest.
>
> Directly assigning a device means you don't get to have differences between
> the actual hardware device and what the guest sees. The kind of thin
> wrapper you're suggesting might have some use cases, but it's a different
> problem from what we're trying to solve.
That's the problem. You've skipped several steps and gone straight for
optimization before you've even got basic functionality working.
You've also missed the point I was making. In order to do device passthrough
you need to define a boundary along which the emulated machine state can be
fully replicated on the host machine. Anything inside this boundary is (by
definition) the same on both the host and guest systems (we're effectively
using host hardware to emulate a device for us). Outside that boundary the
host and guest systems will diverge.
For a device that merely responds to CPU initiated MMIO transfers this is
pretty simple, it's the point at which MMIO transfers are generated. So the
guest gets a proxy device that intercepts accesses to that memory region, and
the host proxies some way for qemu to poke values at the host device.
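A toy Python model of that proxy arrangement, only to pin down the terms:
every guest access to the window is trapped and forwarded to a backend.
ProxyRegion and FakeHostDevice are invented names, the backend is a byte
array standing in for real hardware, and the window reuses the i2c address
and size from the original post.

```python
class FakeHostDevice:
    """Stand-in for a real device: a block of big-endian 32-bit registers."""

    def __init__(self, size):
        self.regs = bytearray(size)

    def read32(self, offset):
        return int.from_bytes(self.regs[offset:offset + 4], "big")

    def write32(self, offset, value):
        self.regs[offset:offset + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")


class ProxyRegion:
    """Guest-visible MMIO window whose accesses are trapped and forwarded."""

    def __init__(self, guest_base, size, backend):
        self.guest_base = guest_base
        self.size = size
        self.backend = backend
        self.traps = 0  # number of accesses that went through the proxy

    def _offset(self, guest_addr):
        offset = guest_addr - self.guest_base
        if offset < 0 or offset + 4 > self.size:
            raise ValueError(f"access at {guest_addr:#x} is outside the window")
        return offset

    def read32(self, guest_addr):
        self.traps += 1
        return self.backend.read32(self._offset(guest_addr))

    def write32(self, guest_addr, value):
        self.traps += 1
        self.backend.write32(self._offset(guest_addr), value)


device = FakeHostDevice(0x100)
window = ProxyRegion(0xffe03000, 0x100, device)
window.write32(0xffe03004, 0xdeadbeef)
readback = window.read32(0xffe03004)
```

Each access costs one trip through the proxy, which is the per-access
overhead that mapping the registers directly into the guest would avoid.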
> > Once you've done all the above, host device passthrough should be
> > relatively straightforward. Just replace the emulation bits in the
> > above device with code that pokes at a real device via the relevant
> > kernel API.
>
> That's not what we mean by direct device assignment.
Maybe, but IMO it's a necessary prerequisite. You're trying to run before
you can walk.
> We're talking about directly mapping the registers into the guest. The
> whole point is performance.
That's an additional step after you get passthrough working the normal way.
We already have mechanisms (or at least patches) for mapping file-like objects
into guest physical memory. That's largely independent of device passthrough.
It's a relatively minor tweak to how the passthrough device sets up its MMIO
regions.
Mapping host device MMIO regions into guest space is entirely uninteresting
unless we already have some way of creating guest-host passthrough devices.
Creating guest-device passthrough devices isn't going to happen until we can
create arbitrary devices (within the set emulated by qemu) that interact with
the rest of the emulated machine in a similar way.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 20:59 ` Paul Brook
@ 2011-07-01 21:51 ` Scott Wood
2011-07-01 23:33 ` Paul Brook
2011-07-01 23:05 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 29+ messages in thread
From: Scott Wood @ 2011-07-01 21:51 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
On Fri, 1 Jul 2011 21:59:35 +0100
Paul Brook <paul@codesourcery.com> wrote:
> > On Fri, 1 Jul 2011 18:03:01 +0100
> >
> > Paul Brook <paul@codesourcery.com> wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> >
> > That's not going to happen.
>
> Why is your device so unique? How does it interact with the guest system and
> what features does it require that doesn't exist in any device that can be
> emulated?
Perhaps I misunderstood what you meant by "similar characteristics". I see
no reason to spend a bunch of time implementing full emulation for a device,
that isn't going to be used, just because it seems like a nice
intermediary step.
What specifically is it you're suggesting we do full emulation of?
> I'm also extremely sceptical of anything that only works in a kvm environment.
> Makes me think it's an unmaintainable hack, and almost certainly going to
> cause you immense amounts of pain later.
I believe the only part of the device assignment stuff we've implemented so
far that is KVM specific is the interrupt routing. I'm open to ways of
routing the interrupts to qemu in the non-KVM case, as long as we can
bypass it when KVM is used.
I'm not sure what the use case is for direct assignment of a device in an
otherwise completely emulated guest, but perhaps there is one.
> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> >
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
>
> I don't believe that either. More likely you've got passthrough of devices
> hanging off your specific CPU bus, using only (or even a subset of) the
> facilities provided by that bus.
There's nothing special about our "bus". It's MMIO, DMA, and interrupts.
What specifically are you disbelieving?
> > > Basically you have to emulate everything that is different between the
> > > host and guest.
> >
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees. The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
>
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.
This is the basic functionality -- assign a piece of hardware to the
guest with minimal overhead. Why go through contortions to construct some
intermediate phase that nobody's interested in using?
> You've also missed the point I was making. In order to do device passthrough
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine. Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the
> host and guest systems will diverge.
I'm still not sure what the point is, then. By directly assigning the
device the user is placing everything about the device on the "same as
host" side of that boundary.
We're not using host hardware to emulate a device, we're using host
hardware to send and receive packets under control of the guest.
Whatever hardware that is, the guest will deal with it, just as if the
guest weren't running in a vm.
> For a device that merely responds to CPU initiated MMIO transfers this is
> pretty simple, it's the point at which MMIO transfers are generated. So the
> guest gets a proxy device that intercepts accesses to that memory region, and
> the host proxies some way for qemu to poke values at the host device.
The point is to be faster than virtio, not slower. There would be no
reason for us to do this otherwise.
Emulating some specific device is not our goal, at all. I realize that
that's a major part of what qemu does, but it's not the only thing it's
used for.
> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward. Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> >
> > That's not what we mean by direct device assignment.
>
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.
I disagree that it is a prerequisite. It is a fundamentally different
thing, for a different purpose.
If it's a purpose that is important to you, and you think the proposed
config mechanisms don't accommodate that, then propose something that does.
> > We're talking about directly mapping the registers into the guest. The
> > whole point is performance.
>
> That's an additional step after you get passthrough working the normal way.
"normal"?
> We already have mechanisms (or at least patches) for mapping file-like objects
> into guest physical memory. That's largely independent of device passthrough.
> It's a relatively minor tweak to how the passthrough device sets up its MMIO
> regions.
>
> Mapping host device MMIO regions into guest space is entirely uninteresting
> unless we already have some way of creating guest-host passthrough devices.
Isn't that what's being discussed?
> Creating guest-device passthrough devices isn't going to happen until we can
> create arbitrary devices (within the set emulated by qemu) that interact with
> the rest of the emulated machine in a similar way.
What do you mean by "interact with the rest of the emulated machine in a
similar way"?
-Scott
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 21:51 ` Scott Wood
@ 2011-07-01 23:33 ` Paul Brook
0 siblings, 0 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-01 23:33 UTC (permalink / raw)
To: Scott Wood
Cc: Wood Scott-B07421, Alexander Graf, qemu-devel@nongnu.org,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, dwg@au1.ibm.com,
armbru@redhat.com
> > Why is your device so unique? How does it interact with the guest system
> > and what features does it require that doesn't exist in any device that
> > can be emulated?
>
> Perhaps I misunderstood what you meant by "similar characteristics". I see
> no reason to spend a bunch of time implementing full emulation for a
> device, that isn't going to be used, just because it seems like a nice
> intermediary step.
You say your device has MMIO regions, generates IRQs and initiates DMA
transactions. Any device or selection of devices that between them use all
those features will do the job. I'd expect most SoCs to have several. We don't
care what the device actually does, only the ways it communicates with the
rest of the machine.
I think you're coming at this problem from completely the wrong direction.
Instead of "how do I wedge this passthrough into my machine", you should be
asking "how do I create a machine without knowing the machine layout at
compile time". Once you fix that, hooking up the passthrough device should be
fairly trivial. You only have a single passthrough device, and the rest of us
have none at all. Anything restricted to the passthrough case is thus unlikely
to be the right answer to the second question, and I'd expect it to be
removed/changed/broken when we do get round to implementing dynamic device
creation.
> > > We're talking about directly mapping the registers into the guest. The
> > > whole point is performance.
> >
> > That's an additional step after you get passthrough working the normal
> > way.
>
> "normal"?
Mapping a MMIO region into the guest is an additional complication, and purely
a performance optimization. qemu already needs to be in the loop to handle
interrupts, probably DMA setup and the non-kvm case.
> I'm not sure what the use case is for direct assignment of a device in an
> otherwise completely emulated guest, but perhaps there is one.
Typically because the host system doesn't know how to talk to it, or there
isn't a sensible way to relay the functionality provided by the device from
the kernel to qemu.
> > We already have mechanisms (or at least patches) for mapping file-like
> > objects into guest physical memory. That's largely independent of
> > device passthrough. It's a relatively minor tweak to how the passthrough
> > device sets up its MMIO regions.
> >
> > Mapping host device MMIO regions into guest space is entirely
> > uninteresting unless we already have some way of creating guest-host
> > passthrough devices.
>
> Isn't that what's being discussed?
It's your end goal, but I don't think it's particularly relevant to the
problem you've encountered.
> > Creating guest-device passthrough devices isn't going to happen until the
> > we can create arbitrary devices (within the set emulated by qemu) that
> > interact with the rest of the emulated machine in a similar way.
>
> What do you mean by "interact with the rest of the emulated machine in a
> similar way"?
See first paragraph above.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 20:59 ` Paul Brook
2011-07-01 21:51 ` Scott Wood
@ 2011-07-01 23:05 ` Benjamin Herrenschmidt
2011-07-01 23:50 ` Paul Brook
1 sibling, 1 reply; 29+ messages in thread
From: Benjamin Herrenschmidt @ 2011-07-01 23:05 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
armbru@redhat.com
On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > On Fri, 1 Jul 2011 18:03:01 +0100
> >
> > Paul Brook <paul@codesourcery.com> wrote:
> > > Basically you should start by implementing full emulation of a device
> > > with similar characteristics to the one you want to passthrough.
> >
> > That's not going to happen.
>
> Why is your device so unique? How does it interact with the guest system and
> what features does it require that doesn't exist in any device that can be
> emulated?
Do you guys only support PCI pass-through by doing full emulation of
all possible supported PCI devices first? :-)
> I'm also extremely sceptical of anything that only works in a kvm environment.
> Makes me think it's an unmaintainable hack, and almost certainly going to
> cause you immense amounts of pain later.
See above question...
Cheers,
Ben.
> > > I doubt you're going to get generic passthrough of arbitrary devices
> > > working in a useful way.
> >
> > It's usefully working for us internally -- we're just trying to find a way
> > to improve it for upstream, with a better configuration mechanism.
>
> I don't believe that either. More likely you've got passthrough of device
> hanging off your specific CPU bus, using only (or even a subset of) the
> facilities provided by that bus.
>
> > > Basically you have to emulate everything that is different between the
> > > host and guest.
> >
> > Directly assigning a device means you don't get to have differences between
> > the actual hardware device and what the guest sees. The kind of thin
> > wrapper you're suggesting might have some use cases, but it's a different
> > problem from what we're trying to solve.
>
> That's the problem. You've skipped several steps and gone straight for
> optimization before you've even got basic functionality working.
>
> You've also missed the point I was making. In order to do device passthrough
> you need to define a boundary along which the emulated machine state can be
> fully replicated on the host machine. Anything inside this boundary is (by
> definition) the same on both the host and guest systems (we're effectively
> using host hardware to emulate a device for us). Outside that boundary the
> host and guest systems will diverge.
>
> For a device that merely responds to CPU initiated MMIO transfers this is
> pretty simple, it's the point at which MMIO transfers are generated. So the
> guest gets a proxy device that intercepts accesses to that memory region, and
> the host provides some way for qemu to poke values at the host device.
>
> > > Once you've done all the above, host device passthrough should be
> > > relatively straightforward. Just replace the emulation bits in the
> > > above device with code that pokes at a real device via the relevant
> > > kernel API.
> >
> > That's not what we mean by direct device assignment.
>
> Maybe, but IMO it's a necessary prerequisite. You're trying to run before
> you can walk.
>
> > We're talking about directly mapping the registers into the guest. The
> > whole point is performance.
>
> That's an additional step after you get passthrough working the normal way.
> We already have mechanisms (or at least patches) for mapping file-like objects
> into guest physical memory. That's largely independent of device passthrough.
> It's a relatively minor tweak to how the passthrough device sets up its MMIO
> regions.
>
> Mapping host device MMIO regions into guest space is entirely uninteresting
> unless we already have some way of creating guest-host passthrough devices.
> Creating guest-device passthrough devices isn't going to happen until we can
> create arbitrary devices (within the set emulated by qemu) that interact with
> the rest of the emulated machine in a similar way.
>
> Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 23:05 ` Benjamin Herrenschmidt
@ 2011-07-01 23:50 ` Paul Brook
2011-07-02 2:17 ` Alexander Graf
0 siblings, 1 reply; 29+ messages in thread
From: Paul Brook @ 2011-07-01 23:50 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
armbru@redhat.com
> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
> > > On Fri, 1 Jul 2011 18:03:01 +0100
> > >
> > > Paul Brook <paul@codesourcery.com> wrote:
> > > > Basically you should start by implementing full emulation of a device
> > > > with similar characteristics to the one you want to passthrough.
> > >
> > > That's not going to happen.
> >
> > Why is your device so unique? How does it interact with the guest system
> > > and what features does it require that doesn't exist in any device that
> > can be emulated?
>
> Do you guys only support PCI pass-through by doing full emulation of
> all possible supported PCI devices first? :-)
Absolutely not. My point is that dynamic (user-driven) device creation is
effectively a prerequisite for a passthrough device.
If you just want to make a very specific use-case then this doesn't need any
code in qemu at all. We just make the user provide the device tree
themselves. If it doesn't match then they lose. If you do choose an ugly
qemu hack then the chances are it'll be changed/removed once we do dynamic device
creation properly. There have already been discussions about dynamic device
creation, so this isn't completely hypothetical.
If you integrate it properly, then you need to realise that there's a fair
chunk of infrastructure and user interface required. Most of which has
nothing to do with device passthrough. Trying to implement both at the same
time is just going to cause confusion and complicate things. It's already a
hard problem, combining it with something else is just going to cause you and
everyone else even more pain.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 23:50 ` Paul Brook
@ 2011-07-02 2:17 ` Alexander Graf
2011-07-02 11:45 ` Paul Brook
0 siblings, 1 reply; 29+ messages in thread
From: Alexander Graf @ 2011-07-02 2:17 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, Scott Wood,
armbru@redhat.com
On 02.07.2011, at 01:50, Paul Brook wrote:
>> On Fri, 2011-07-01 at 21:59 +0100, Paul Brook wrote:
>>>> On Fri, 1 Jul 2011 18:03:01 +0100
>>>>
>>>> Paul Brook <paul@codesourcery.com> wrote:
>>>>> Basically you should start by implementing full emulation of a device
>>>>> with similar characteristics to the one you want to passthrough.
>>>>
>>>> That's not going to happen.
>>>
>>> Why is your device so unique? How does it interact with the guest system
>>> and what features does it require that doesn't exist in any device that
>>> can be emulated?
>>
>> Do you guys only support PCI pass-through by doing full emulation of
>> all possible supported PCI devices first? :-)
>
> Absolutely not. My point is that dynamic (user-driven) device creation is
> effectively a prerequisite for a passthrough device.
>
> If you just want to make a very specific use-case then this doesn't need any
> code in qemu at all. We just make the user provide the device tree
> themselves. If it doesn't match then they lose. If you do choose an ugly
> qemu hack then the chances are it'll be changed/removed once we do dynamic device
> creation properly. There have already been discussions about dynamic device
> creation, so this isn't completely hypothetical.
>
> If you integrate it properly, then you need to realise that there's a fair
> chunk of infrastructure and user interface required. Most of which has
> nothing to do with device passthrough. Trying to implement both at the same
> time is just going to cause confusion and complicate things. It's already a
> hard problem, combining it with something else is just going to cause you and
> everyone else even more pain.
So you're basically saying we should tackle these 3 issues separately:
* actually pass through a device
* generate interrupt links
* model the guest device tree dynamically based on whatever the user gives us
I tend to agree with that perspective. Still, the main issue stands: we don't have a concrete answer for all three issues :). Facing them one at a time might help in actually solving them, though.
Alex
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-02 2:17 ` Alexander Graf
@ 2011-07-02 11:45 ` Paul Brook
0 siblings, 0 replies; 29+ messages in thread
From: Paul Brook @ 2011-07-02 11:45 UTC (permalink / raw)
To: Alexander Graf
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, joerg.roedel@amd.com, Scott Wood,
armbru@redhat.com
> So you're basically saying we should tackle these 3 issues separately:
>
> * actually pass through a device
> * generate interrupt links
> * model the guest device tree dynamically based on whatever the user
> gives us
Yes.
Paul
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 17:03 ` Paul Brook
2011-07-01 17:49 ` Scott Wood
@ 2011-07-01 22:35 ` Anthony Liguori
1 sibling, 0 replies; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 22:35 UTC (permalink / raw)
To: Paul Brook
Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, Scott Wood, dwg@au1.ibm.com,
armbru@redhat.com
On 07/01/2011 12:03 PM, Paul Brook wrote:
>>> irq[0].guest_irq = "10"
>>>
>>> This should be independent of anything to do with device tree. This
>>> would be useful for x86 too to assign platform devices (like the HPET).
>>
>> That's fine, as long as there's something layered on top of it for the case
>> where we do want to reference something in the device tree.
>>
>> However, we'll need to address the question of what it means to say "irq
>> 10" -- outside of PC-land there often isn't a global IRQ numberspace that
>> isn't a fiction created by some software layer. Addressing this is one of
>> the device tree's strengths.
>
> That's an entirely separate problem, though probably a prerequisite.
>
> Basically you should start by implementing full emulation of a device with
> similar characteristics to the one you want to passthrough.
If you want to model interrupt remapping, you have to model device
relationships. If you cannot express the bus hierarchy/relationship
then you cannot sanely model interrupt remapping.
You can only really ever think about passing through an entire subtree
of the device hierarchy. You can't have a partial subtree with some
crazy hack logic to explain how the physical layer may remap interrupts.
That's just asking for pain.
Regards,
Anthony Liguori
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 16:43 ` Scott Wood
2011-07-01 17:03 ` Paul Brook
@ 2011-07-01 22:32 ` Anthony Liguori
2011-07-05 18:16 ` Scott Wood
1 sibling, 1 reply; 29+ messages in thread
From: Anthony Liguori @ 2011-07-01 22:32 UTC (permalink / raw)
To: Scott Wood
Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
dwg@au1.ibm.com, armbru@redhat.com
On 07/01/2011 11:43 AM, Scott Wood wrote:
> On Fri, 1 Jul 2011 07:10:45 -0500
> Anthony Liguori <anthony@codemonkey.ws> wrote:
>
>> I agree in principle but I think it should be done in a slightly
>> different way.
>>
>> I think we ought to support composing a device by passthrough. For
>> instance, something like:
>>
>> [physical-device "mydev"]
>> region[0].file = "/dev/mem"
>> region[0].guest_address = "0x42232000"
>> region[0].file_offset = "0x23423400"
>> region[0].size = "4096"
>> irq[0].guest_irq = "10"
>> irq[0].host_irq = "10"
>>
>> This should be independent of anything to do with device tree. This
>> would be useful for x86 too to assign platform devices (like the HPET).
>
> That's fine, as long as there's something layered on top of it for the case
> where we do want to reference something in the device tree.
>
> However, we'll need to address the question of what it means to say "irq 10"
It depends on what the bus is. If you're going to declare "system bus"
which is sort of what we call ISA for the PC, then it can map trivially
to the interrupt controller's inputs.
> -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> a fiction created by some software layer.
PCs don't have a global IRQ number space FWIW. When we say:
-device isa-serial,irq=4
This really means, "ISA irq 4", which is mapped to the PIIX3 and then
routed through GSI, then the APIC architecture to correspond to some
interrupt for some physical CPU.
> Addressing this is one of the
> device tree's strengths.
Not really. There's nothing magical about the device tree. It's just a
guest visible description of the platform hardware that isn't probe-able
in some bus framework. ACPI does exactly the same thing. I'll concede
that the device tree is far nicer than ACPI but again, it's not magical :-)
Regards,
Anthony Liguori
> -Scott
>
>
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 22:32 ` Anthony Liguori
@ 2011-07-05 18:16 ` Scott Wood
0 siblings, 0 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-05 18:16 UTC (permalink / raw)
To: Anthony Liguori
Cc: Wood Scott-B07421, joerg.roedel@amd.com, qemu-devel@nongnu.org,
Alexander Graf, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
dwg@au1.ibm.com, armbru@redhat.com
On Fri, 1 Jul 2011 17:32:43 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 07/01/2011 11:43 AM, Scott Wood wrote:
> > However, we'll need to address the question of what it means to say "irq 10"
>
> It depends on what the bus is. If you're going to declare "system bus"
> which is sort of what we call ISA for the PC,
More like "arbitrary MMIO". Could be an on-chip peripheral. Could be some
external custom chip. Could be an entire PCIe root complex.
> then it can map trivially to the interrupt controller's inputs.
Which interrupt controller? We might want to assign an IRQ that's on some
cascaded controller.
We also have some things, like MPIC IPIs and timers,
that are on the main interrupt controller but aren't normal numbered
interrupts. We use the ability to have multiple cells in an interrupt
specifier to express these. And while you could make up fake numbers for
these to force it to be linear, someone has to come up with this mapping and
get qemu, its users, and the kernel to agree on it. We already have a
repository for such bindings for the device tree.
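
To illustrate the point about multi-cell specifiers, here is a rough sketch
in Python (not anything from qemu or the kernel) that identifies an interrupt
by its controller plus the raw specifier cells instead of one flat number.
The nested-dict tree layout, the name resolve_interrupt, the "&mpic" label
string and the ethernet values are made up for illustration; the i2c values
are the ones from the example node quoted in this thread. It assumes one
interrupt per node and that interrupt-parent may be inherited from an
enclosing node.

```python
def resolve_interrupt(root, path):
    """Return (interrupt parent, specifier cells) for the node at `path`."""
    chain = [root]
    for name in filter(None, path.split("/")):
        chain.append(chain[-1]["children"][name])
    cells = tuple(chain[-1]["props"]["interrupts"])
    # Walk from the node towards the root; the nearest interrupt-parent wins.
    for node in reversed(chain):
        parent = node["props"].get("interrupt-parent")
        if parent is not None:
            return parent, cells
    raise LookupError("no interrupt-parent found for " + path)


# Hypothetical tree: the i2c node names its controller itself, the
# ethernet node (made-up values) inherits it from /soc.
tree = {"props": {}, "children": {"soc": {
    "props": {"interrupt-parent": "&mpic"},
    "children": {
        "i2c@3000": {"props": {"interrupts": [43, 2],
                               "interrupt-parent": "&mpic"},
                     "children": {}},
        "ethernet@b2000": {"props": {"interrupts": [29, 2]},
                           "children": {}},
    }}}}

i2c_irq = resolve_interrupt(tree, "/soc/i2c@3000")
eth_irq = resolve_interrupt(tree, "/soc/ethernet@b2000")
```

The result is a (controller, cells) pair, so nobody has to invent a linear
numbering that qemu, its users, and the kernel would all need to agree on.
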
That's not to say that the device tree should be forced onto platforms that
have some other reasonable way of doing it, of course -- just that it's
nice to be able to refer to it when it's there.
> > -- outside of PC-land there often isn't a global IRQ numberspace that isn't
> > a fiction created by some software layer.
>
> PCs don't have a global IRQ number space FWIW. When we say:
>
> -device isa-serial,irq=4
>
> This really means, "ISA irq 4", which is mapped to the PIIX3 and then
> routed through GSI, then the APIC architecture to correspond to some
> interrupt for some physical CPU.
Well, it's been a while since I've dealt with such things on PCs... I
thought there was at least some standard way of interpreting things like
IRQ numbers that the BIOS wrote into PCI config space.
> > Addressing this is one of the
> > device tree's strengths.
>
> Not really. There's nothing magical about the device tree. It's just a
> guest visible description of the platform hardware that isn't probe-able
> in some bus framework. ACPI does exactly the same thing. I'll concede
> that the device tree is far nicer than ACPI but again, it's not magical :-)
I didn't say it was the only way to express it -- just that the device tree,
or something like it, comes in useful here.
And we're not about to do ACPI on powerpc. :-)
-Scott
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 0:58 ` Benjamin Herrenschmidt
2011-07-01 11:40 ` Alexander Graf
2011-07-01 12:10 ` Anthony Liguori
@ 2011-07-01 16:34 ` Scott Wood
2011-07-05 18:19 ` Yoder Stuart-B08248
3 siblings, 0 replies; 29+ messages in thread
From: Scott Wood @ 2011-07-01 16:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
qemu-devel@nongnu.org, blauwirbel@gmail.com, Yoder Stuart-B08248,
alex.williamson@redhat.com, paul@codesourcery.com,
dwg@au1.ibm.com, armbru@redhat.com
On Fri, 1 Jul 2011 10:58:14 +1000
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> So, from a qemu command line perspective, all you should have to do is
> pass qemu the device-tree -path- to the device you want to pass-through
> (you may support passing a full hierarchy here).
>
> That is for normal MMIO mapped SoC devices. Something else (individual
> i2c, usb, ...) will use specific virtualization of the corresponding
> busses.
>
> Anything else sucks too much really.
>
> From there, well, there are several approaches inside qemu/kvm to handle
> that path. If you want to do things at the qemu level you can probably
> parse /proc/device-tree.
That's what option 1 is, except that instead of adding code to qemu to
parse /proc/device-tree, we'd use dtc to dump /proc/device-tree into a dtb
and let qemu use libfdt to look at the tree. This is less Linux-specific,
more modular, and more flexible for doing the sort of insane hacks that are
going to happen in embedded-land whether you like them or not. :-)
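
For readers unfamiliar with the layout being dumped: under Linux,
/proc/device-tree presents each node as a directory and each property as a
file holding the raw property bytes. The Python sketch below is purely
illustrative (read_fs_tree and the dict layout are invented here, and this is
not the dtc/libfdt route proposed above); it just walks such a directory, and
the demo runs against a throwaway directory rather than a real tree.

```python
import os
import tempfile


def read_fs_tree(path):
    """Load a /proc/device-tree style directory into nested dicts."""
    node = {"props": {}, "children": {}}
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            node["children"][entry] = read_fs_tree(full)   # subnode
        else:
            with open(full, "rb") as f:
                node["props"][entry] = f.read()            # raw property bytes
    return node


# Demo on a temporary directory so it runs without a real device tree.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "soc", "i2c@3000"))
    with open(os.path.join(top, "soc", "i2c@3000", "compatible"), "wb") as f:
        f.write(b"fsl-i2c\0")
    demo = read_fs_tree(top)
```
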
> But I'd personally just make it a kernel thing.
I'd rather keep the kernel interface simple -- assign this memory region,
assign that interrupt, use this IOMMU device ID, etc. Getting the kernel
involved in preparing the guest device tree, and understanding guest
configuration, seems quite excessive.
> IE. I would have an ioctl to "instantiate" a pass-through device, that
> takes that path as an argument. I would make it return an anonymous fd
> which you can then use to mmap the resources, etc...
>
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
>
> That's fishy. Why wouldn't you give full access to the MDIO ? It's
> shared ?
Yes, it's shared. Yes, it sucks.
> Such things are so device-specific that they would have to be
> handled by device-specific quirks, which can live either in qemu or in
> the kernel.
Or in the configuration of qemu. Not all users of the device want to do
the same thing.
> > So in addition to assigning a device, a mechanism is needed to update
> > device tree nodes. So for the above example, maybe--
> >
> > -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
> > node-update="fixed-link = <2 1 1000 0 0>"
>
> That's just so gross and error prone, borderline insane.
Welcome to embedded. :-)
Here, users are going to want to be able to mess around under the hood in
a way that server or desktop users generally don't need or want to.
> > The types of modifications needed-- deleting nodes, deleting properties,
> > adding nodes, adding properties, adding properties that reference other
> > nodes, changing properties. This device tree transformation mechanism
> > needed is general enough that it could apply to any device tree based
> > embedded platform (e.g. ARM, MIPS)
> >
> > Another complexity relates to the IOMMU. Here things get very company
> > and IOMMU specific. Freescale has a proprietary IOMMU.
>
> Look at the work currently being done for a generic qemu iommu layer. We
> need it for server power as well and from what I last saw coming from
> Eduardo and David, it's not PCI specific.
The problem is that our current IOMMU doesn't implement full paging (yes,
the HW people have been screamed at, but we're stuck with it for current
chips). You have to break things down into regions following certain
alignment rules, which may require user guidance as to which memory regions
actually need DMA access, especially if you're setting up discontiguous
shared memory regions and such.
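
A rough sketch of what "breaking things down into regions" can look like, in
Python. The exact alignment rules of the IOMMU are not spelled out in this
thread, so this assumes the common constraint that each window must be a
power-of-two size and aligned to its own size; split_aligned and the 4 KiB
minimum are hypothetical.

```python
def split_aligned(base, size, min_size=4096):
    """Split [base, base+size) into power-of-two chunks, each aligned to
    its own size (an assumed rule, not taken from any IOMMU manual)."""
    if min_size <= 0 or min_size & (min_size - 1):
        raise ValueError("min_size must be a power of two")
    if size <= 0 or base < 0 or base % min_size or size % min_size:
        raise ValueError("base and size must be multiples of min_size")
    chunks = []
    while size:
        # Largest power of two that still fits in the remaining size.
        fit = 1 << (size.bit_length() - 1)
        # Largest power of two dividing base (base 0 is aligned to anything).
        align = (base & -base) if base else fit
        chunk = min(align, fit)
        chunks.append((base, chunk))
        base += chunk
        size -= chunk
    return chunks


# A 12 KiB range starting at 4 KiB needs two windows under this rule.
windows = split_aligned(0x1000, 0x3000)
```

A discontiguous or oddly aligned range can expand into many windows, which
is one reason user guidance about which regions really need DMA access
matters when the number of windows is limited.
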
-Scott
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-01 0:58 ` Benjamin Herrenschmidt
` (2 preceding siblings ...)
2011-07-01 16:34 ` Scott Wood
@ 2011-07-05 18:19 ` Yoder Stuart-B08248
2011-07-05 22:23 ` Alexander Graf
3 siblings, 1 reply; 29+ messages in thread
From: Yoder Stuart-B08248 @ 2011-07-05 18:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Wood Scott-B07421, joerg.roedel@amd.com, Alexander Graf,
qemu-devel@nongnu.org, dwg@au1.ibm.com, blauwirbel@gmail.com,
alex.williamson@redhat.com, paul@codesourcery.com,
armbru@redhat.com
> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
> Sent: Thursday, June 30, 2011 7:58 PM
> To: Yoder Stuart-B08248
> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; alex.williamson@redhat.com;
> anthony@codemonkey.ws; dwg@au1.ibm.com; joerg.roedel@amd.com; paul@codesourcery.com;
> blauwirbel@gmail.com; armbru@redhat.com
> Subject: Re: device assignment for embedded Power
>
> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
> > One feature we need for QEMU/KVM on embedded Power Architecture is the
> > ability to do passthru assignment of SoC I/O devices and memory. An
> > important use case in embedded is creating static partitions-- taking
> > physical memory and I/O devices (non-PCI) and partitioning
> > them between the host Linux and several virtual machines. Things like
> > live migration would not be needed or supported in these types of scenarios.
> >
> > SoC devices do not sit on a probeable bus and there are no identifiers
> > like 01:00.0 with PCI that we can use to identify devices-- the host
> > Linux kernel is made aware of SoC I/O devices from nodes/properties in a
> > device tree structure passed at boot. QEMU needs to generate a
> > device tree to pass to the guest as well with all the guest's virtual
> > and physical resources. Today a number of mostly complete guest
> > device trees are kept under ./pc-bios in QEMU, but this is too static and
> > inflexible.
> >
> > Some new mechanism is needed to assign SoC devices to guests, and we
> > (FSL + Alex Graf) have been discussing a few possible approaches for
> > doing this from QEMU and would like some feedback.
> >
> > Some possibilities:
> >
> > 1. Option 1. Pass the host dev tree to QEMU and assign devices
> > by device tree path
> >
> > -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
> >
> > /soc/i2c@3000 is the device tree path to the assigned device.
> > The device node 'i2c@3000' has some number of properties (e.g.
> > address, interrupt info) and possibly subnodes under
> > it. QEMU copies that node when generating the guest dev tree.
> > See snippet of entire node: http://paste2.org/p/1496460
>
> Yuck (see below)
>
> > 2. Option 2. Pass the entire assigned device node as a string to
> > QEMU
> >
> > -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
> > #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
> > reg = <0xffe03000 0x100>; interrupts = <43 2>;
> > interrupt-parent = <&mpic>; dfsrr;'
>
> Beuark ! (see below)
>
> > This avoids needing to pass the host device tree, but could
> > get awkward-- the i2c example above is very simple, some device
> > nodes are very large with a complex hierarchy of subnodes and
> > could be hundreds of lines of text to represent a single
> > node.
> >
> > It gets more complicated...
>
>
> So, from a qemu command line perspective, all you should have to do is pass qemu the
> device-tree -path- to the device you want to pass-through (you may support passing a full
> hierarchy here).
>
> That is for normal MMIO mapped SoC devices. Something else (individual i2c, usb, ...) will use
> specific virtualization of the corresponding busses.
Then why 'yuck' to option 1 :)? That is basically what was being proposed.
> Anything else sucks too much really.
>
> From there, well, there are several approaches inside qemu/kvm to handle that path. If you want to
> do things at the qemu level you can probably parse /proc/device-tree. But I'd personally just
> make it a kernel thing.
>
> IE. I would have an ioctl to "instantiate" a pass-through device, that takes that path as an
> argument. I would make it return an anonymous fd which you can then use to mmap the resources,
> etc...
Regarding implementation I think there are 3 things that need
to be set up-- 1) mmapping the device's registers, 2) getting the iommu
set up (if there is one), 3) getting the interrupt(s) handled.
> > In some cases, modifications to device tree nodes may be needed.
> > An example-- sometimes a device tree property references another node
> > and that relationship may not exist when assigned to a guest.
> > A "phy-handle" property may need to be deleted and a "fixed-link"
> > property added to a node representing a network device.
>
> That's fishy. Why wouldn't you give full access to the MDIO ? It's shared ? Such things are so
> device-specific that they would have to be handled by device-specific quirks, which can live
> either in qemu or in the kernel.
It is shared, and in this case we didn't want the phy shared. That was a super
simple example to illustrate the idea. From our experience with the Freescale
Embedded Hypervisor we see this as a definite requirement-- nodes in the
hardware device tree may need modifications. In the P4080 device tree there
are some complex relationships expressed between nodes of our 'data
path'. In some cases the hardware device tree expresses configuration
information, and while it could be argued that config info does not belong
there, it's what some drivers expect and what we have right now. So, a mechanism
to allow node updates is really needed.
> > So in addition to assigning a device, a mechanism is needed to update
> > device tree nodes. So for the above example, maybe--
> >
> > -device assigned-soc-dev,dev=/soc/ethernet@b2000,delete-prop=phy-handle,
> > node-update="fixed-link = <2 1 1000 0 0>"
>
> That's just so gross and error prone, borderline insane.
I'm not going to argue the gross/insane part, but it's reality. I don't
think anyone would type all that in at the command line; it would
be in an init script or something, so I don't see it being more error
prone than messing around with device trees in general.
There's a small set of operations needed, based on our experience:
  - adding, deleting properties (including phandle references)
  - adding, deleting nodes (including subtrees)
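
To make the list above concrete, here is a minimal Python sketch of those
operations on an in-memory tree. Everything in it is an assumption for
illustration: the nested-dict layout, the helper names, and the phandle
value are made up and are not QEMU or libfdt APIs. Only the node path and
the fixed-link value are taken from the example earlier in this mail.

```python
def new_node(props=None):
    """Create an empty node: a dict of properties plus a dict of children."""
    return {"props": dict(props or {}), "children": {}}


def find_node(tree, path):
    """Walk the tree by device tree path, e.g. '/soc/ethernet@b2000'."""
    node = tree
    for name in [part for part in path.split("/") if part]:
        node = node["children"][name]  # KeyError if the path does not exist
    return node


def set_prop(tree, path, prop, value):
    """Add a property, or overwrite it if it already exists."""
    find_node(tree, path)["props"][prop] = value


def delete_prop(tree, path, prop):
    """Delete a property; raises KeyError if it is not present."""
    del find_node(tree, path)["props"][prop]


def add_node(tree, parent_path, name, props=None):
    """Add an empty child node (optionally with properties) under a parent."""
    find_node(tree, parent_path)["children"][name] = new_node(props)


def delete_node(tree, path):
    """Delete a node and its whole subtree."""
    parent_path, _, name = path.rpartition("/")
    del find_node(tree, parent_path)["children"][name]


# Build a tiny guest tree and apply the phy-handle / fixed-link update.
guest = new_node()
add_node(guest, "/", "soc")
add_node(guest, "/soc", "ethernet@b2000", {"phy-handle": 1})  # phandle value is made up
delete_prop(guest, "/soc/ethernet@b2000", "phy-handle")
set_prop(guest, "/soc/ethernet@b2000", "fixed-link", [2, 1, 1000, 0, 0])
```

A real implementation would also have to rewrite phandle references when
nodes are added or removed, which this sketch does not attempt.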
Stuart
* Re: [Qemu-devel] device assignment for embedded Power
2011-07-05 18:19 ` Yoder Stuart-B08248
@ 2011-07-05 22:23 ` Alexander Graf
0 siblings, 0 replies; 29+ messages in thread
From: Alexander Graf @ 2011-07-05 22:23 UTC (permalink / raw)
To: Yoder Stuart-B08248
Cc: Wood Scott-B07421, qemu-devel@nongnu.org, dwg@au1.ibm.com,
blauwirbel@gmail.com, alex.williamson@redhat.com,
paul@codesourcery.com, joerg.roedel@amd.com, armbru@redhat.com
On 05.07.2011, at 20:19, Yoder Stuart-B08248 wrote:
>
>
>> -----Original Message-----
>> From: Benjamin Herrenschmidt [mailto:benh@kernel.crashing.org]
>> Sent: Thursday, June 30, 2011 7:58 PM
>> To: Yoder Stuart-B08248
>> Cc: qemu-devel@nongnu.org; Wood Scott-B07421; Alexander Graf; alex.williamson@redhat.com;
>> anthony@codemonkey.ws; dwg@au1.ibm.com; joerg.roedel@amd.com; paul@codesourcery.com;
>> blauwirbel@gmail.com; armbru@redhat.com
>> Subject: Re: device assignment for embedded Power
>>
>> On Thu, 2011-06-30 at 15:59 +0000, Yoder Stuart-B08248 wrote:
>>> One feature we need for QEMU/KVM on embedded Power Architecture is the
>>> ability to do passthru assignment of SoC I/O devices and memory. An
>>> important use case in embedded is creating static partitions-- taking
>>> physical memory and I/O devices (non-PCI) and partitioning
>>> them between the host Linux and several virtual machines. Things like
>>> live migration would not be needed or supported in these types of scenarios.
>>>
>>> SoC devices do not sit on a probeable bus and there are no identifiers
>>> like 01:00.0 with PCI that we can use to identify devices-- the host
>>> Linux kernel is made aware of SoC I/O devices from nodes/properties in a
>>> device tree structure passed at boot. QEMU needs to generate a
>>> device tree to pass to the guest as well with all the guest's virtual
>>> and physical resources. Today a number of mostly complete guest
>>> device trees are kept under ./pc-bios in QEMU, but this is too static and
>>> inflexible.
>>>
>>> Some new mechanism is needed to assign SoC devices to guests, and we
>>> (FSL + Alex Graf) have been discussing a few possible approaches for
>>> doing this from QEMU and would like some feedback.
>>>
>>> Some possibilities:
>>>
>>> 1. Option 1. Pass the host dev tree to QEMU and assign devices
>>> by device tree path
>>>
>>> -dtb ./mpc8572ds.dtb -device assigned-soc-dev,dev=/soc/i2c@3000
>>>
>>> /soc/i2c@3000 is the device tree path to the assigned device.
>>> The device node 'i2c@3000' has some number of properties (e.g.
>>> address, interrupt info) and possibly subnodes under
>>> it. QEMU copies that node when generating the guest dev tree.
>>> See snippet of entire node: http://paste2.org/p/1496460
>>
>> Yuck (see below)
>>
>>> 2. Option 2. Pass the entire assigned device node as a string to
>>> QEMU
>>>
>>> -device assigned-soc-dev,dev=/i2c@3000,dev-node='#address-cells = <1>;
>>> #size-cells = <0>; cell-index = <0>; compatible = "fsl-i2c";
>>> reg = <0xffe03000 0x100>; interrupts = <43 2>;
>>> interrupt-parent = <&mpic>; dfsrr;'
>>
>> Beuark ! (see below)
>>
>>> This avoids needing to pass the host device tree, but could
>>> get awkward-- the i2c example above is very simple, some device
>>> nodes are very large with a complex hierarchy of subnodes and
>>> could be hundreds of lines of text to represent a single
>>> node.
>>>
>>> It gets more complicated...
>>
>>
>> So, from a qemu command line perspective, all you should have to do is pass qemu the
>> device-tree -path- to the device you want to pass through (you may support passing a full
>> hierarchy here).
>>
>> That is for normal MMIO mapped SoC devices. Something else (individual i2c, usb, ...) will use
>> specific virtualization of the corresponding busses.
>
> Then why 'yuck' to option 1 :)? That is basically what was being proposed.
Yes, and it's probably a good idea to go with that for now. We can handle the guest device tree parts externally for the time being: pass in a fully populated device tree that contains everything we need, and pass qemu the configuration the way we did it in the device tree.
>> Anything else sucks too much really.
>>
>> From there, well, there are several approaches inside qemu/kvm to handle that path. If you want to
>> do things at the qemu level you can probably parse /proc/device-tree. But I'd personally just
>> make it a kernel thing.
>>
>> IE. I would have an ioctl to "instantiate" a pass-through device, that takes that path as an
>> argument. I would make it return an anonymous fd which you can then use to mmap the resources,
>> etc...
>
> Regarding implementation I think there are 3 things that need
> to be set up-- 1) mmapping the device's registers, 2) getting the iommu
> set up (if there is one), 3) getting the interrupt(s) handled.
Yes :).
I guess we'll just have to sit down and implement something very simple that can at least pass through MMIO regions and interrupts, and then take it from there until we hit the walls, of which there will be plenty.
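
For the MMIO part, one detail any simple implementation runs into is that
mmap works on whole pages, while device register blocks can be smaller than
a page. The Python sketch below shows only that arithmetic. The function
name mmap_window and the default 4 KiB page size are assumptions for this
example, and what would actually be mapped (/dev/mem, or an fd returned by
the ioctl suggested above) is left open here.

```python
def mmap_window(phys_addr, size, page_size=4096):
    """Return (base, length, offset) for mapping a register block.

    base   -- page-aligned address to pass as the mmap offset
    length -- page-rounded length that covers the whole block
    offset -- where the block starts inside the mapping
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page_size must be a power of two")

    base = phys_addr & ~(page_size - 1)
    offset = phys_addr - base
    length = (offset + size + page_size - 1) & ~(page_size - 1)
    return base, length, offset


# The i2c block from this thread: reg = <0xffe03000 0x100>
window = mmap_window(0xFFE03000, 0x100)
```

Note that the mapping covers the whole page, so any other device whose
registers share that page becomes visible to the guest as well.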
Alex