* [Qemu-devel] PCI iommu issues

From: Benjamin Herrenschmidt @ 2015-01-30 5:45 UTC
To: qemu-devel; +Cc: Jan Kiszka, Le Tan, Michael S. Tsirkin

Hi folks!

I've looked at the intel iommu code to try to figure out how to properly
implement a Power8 "native" iommu model and encountered some issues.

Today the "pseries" ppc machine is paravirtualized and so uses a pretty
simplistic iommu model that essentially has one address space per host
bridge.

However, the real HW model I'm working on is closer to Intel in that we
have various tables walked by HW that match an originator RID to what we
call a "PE" (Partitionable Endpoint), to which an address space
corresponds.

So within a given domain, individual functions, or groups of devices,
can have different iommu address spaces & translation structures, all of
which can be configured dynamically by the guest OS. As far as I
understand, this is similar to the Intel model, though the details of
the implementation are very different.

So I implemented something along the lines of what you guys did for q35
and intel_iommu, and quickly discovered that it doesn't work, which
makes me wonder whether the intel stuff in qemu actually works, or
rather, whether it works once bridges & switches enter the picture.

I basically have two problems, and they are somewhat related. Firstly,
the way the intel code works is that it lazily creates context
structures that contain the address space; these get associated with
devices when pci_device_iommu_address_space() is called, which in turn
calls the bridge iommu_fn that performs the association.

The first problem is that the association is done based on the
bus/dev/fn of the device... at a time when bus numbers have not been
assigned yet.

In fact, bus numbers are assigned dynamically by SW (typically the
BIOS), and the OS can renumber things, so it's bogus to assume that the
RID (bus/dev/fn) of a PCI device/function is fixed. However, that's
exactly what the code does: it calls pci_device_iommu_address_space()
once at device instantiation time in qemu, before SW has had a chance to
assign anything.

So as far as I can tell, things will work as long as you are on bus 0
and there is no bridge; otherwise, it's broken by design, unless I'm
missing something...

I've hacked around that locally by using the PCIBus * pointer instead of
the bus number to match the device to the iommu context.

The second problem is that pci_device_iommu_address_space(), as it walks
up the hierarchy to find the iommu_fn, drops the original device
information. That means that if a device is below a switch or a p2p
bridge of some sort, once you reach the host bridge's top-level bus, all
we know is the bus & devfn of the last p2p entity along the path; we
lose the original bus & devfn information.

This is incorrect for that sort of iommu, at least while in the PCIe
domain, as the original RID is carried along with DMA transactions and
is thus needed to properly associate the device/function with a context.

One fix could be to populate the iommu_fn of every bus down the food
chain, but that's fairly cumbersome... unless we make PCI bridges
"inherit" their parent's iommu_fn by default.
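Purely to illustrate that last idea, a rough sketch; the direct access to
the bus's iommu_fn/iommu_opaque fields is an assumption about the PCIBus
layout, not a statement about the actual API:

#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bus.h"
#include "hw/pci/pci_bridge.h"

/* Sketch: give a bridge's secondary bus the same iommu_fn/opaque as the
 * bus the bridge sits on, so a host bridge doesn't have to populate
 * iommu_fn explicitly on every bus down the chain. */
static void pci_bridge_inherit_iommu(PCIBridge *br)
{
    PCIBus *parent = pci_get_bus(PCI_DEVICE(br)); /* bus the bridge is on */
    PCIBus *sec = pci_bridge_get_sec_bus(br);     /* bus behind the bridge */

    if (parent->iommu_fn) {
        pci_setup_iommu(sec, parent->iommu_fn, parent->iommu_opaque);
    }
}

With something like that wired into bridge realization, devices behind
the bridge would hit the same iommu_fn as devices on the root bus, with
their own bus pointer and devfn, instead of being folded into the
bridge's identity by the walk-up.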
Here, I've done a hack locally to keep the original device information
in pci_device_iommu_address_space(), but that's not a proper way to do
it either. Ultimately, each bridge needs to be able to tell whether it
properly forwards the RID information or not, so the bridge itself needs
some attribute to control that. Typically a PCIe switch or root complex
will always forward the full RID... while most PCI-E -> PCI-X bridges
are busted in that regard. Worse, some bridges forward only *some* bits
(a partial RID), which is even more broken, but I don't know whether we
can, or even care to, simulate that. Thankfully, most PCI-X or PCI
bridges behave properly and make it look like all DMAs are coming from
the bridge itself.

What do you guys reckon is the right approach for both problems?

Cheers,
Ben.
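To make that per-bridge attribute idea concrete, a rough sketch of what
the walk could look like; RidForwardMode, RequesterId and
bridge_rid_mode() are all invented for illustration and are not existing
QEMU code:

#include "qemu/osdep.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bus.h"

/* How a bridge presents DMA coming from its secondary side. */
typedef enum {
    RID_FORWARD_FULL, /* PCIe switch/root port: the original RID survives */
    RID_FORWARD_OWN,  /* legacy bridge: DMA appears to come from the bridge */
} RidForwardMode;

typedef struct {
    PCIBus *bus;      /* bus on which devfn is meaningful */
    uint8_t devfn;
} RequesterId;

static RidForwardMode bridge_rid_mode(PCIDevice *br)
{
    /* Stub: in a real model this would come from a bridge property;
     * assume PCIe-style full forwarding by default. */
    return RID_FORWARD_FULL;
}

static RequesterId pci_effective_requester_id(PCIDevice *dev)
{
    RequesterId rid = { .bus = pci_get_bus(dev), .devfn = dev->devfn };
    PCIBus *bus = pci_get_bus(dev);

    /* Walk towards the host bridge; each bridge that does not forward
     * the full RID replaces it with its own identity on its primary bus. */
    while (bus->parent_dev) {
        PCIDevice *br = bus->parent_dev;
        bus = pci_get_bus(br);
        if (bridge_rid_mode(br) == RID_FORWARD_OWN) {
            rid.bus = bus;
            rid.devfn = br->devfn;
        }
    }
    return rid;
}

An IOMMU model could then key its context lookup on the returned
(bus pointer, devfn) pair instead of on whatever devfn happens to reach
the host bridge after the walk-up.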
* Re: [Qemu-devel] PCI iommu issues

From: Jan Kiszka @ 2015-01-30 7:40 UTC
To: Benjamin Herrenschmidt, qemu-devel; +Cc: Knut Omang, Le Tan, Michael S. Tsirkin

Adding Knut to CC as he in particular looked into and fixed the bridging
issues of the vtd emulation. I will have to refresh my memory first.

Jan

On 2015-01-30 06:45, Benjamin Herrenschmidt wrote:
> [...]
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
* Re: [Qemu-devel] PCI iommu issues

From: Knut Omang @ 2015-01-31 14:42 UTC
To: Benjamin Herrenschmidt
Cc: Jan Kiszka, Alex Williamson, Michael S. Tsirkin, qemu-devel, Le Tan

On Fri, 2015-01-30 at 08:40 +0100, Jan Kiszka wrote:
> Adding Knut to CC as he in particular looked into and fixed the
> bridging issues of the vtd emulation. I will have to refresh my memory
> first.

Thanks for cc'ing me, Jan - yes, I did some work specifically to serve
my need of being able to use the intel iommu with a PCIe device in a
root port.

> Jan
>
> On 2015-01-30 06:45, Benjamin Herrenschmidt wrote:
> > [...]
> >
> > So I implemented something along the lines of what you guys did for
> > q35 and intel_iommu, and quickly discovered that it doesn't work,
> > which makes me wonder whether the intel stuff in qemu actually works,
> > or rather, whether it works once bridges & switches enter the
> > picture.
> >
> > I basically have two problems, and they are somewhat related.
> > Firstly, the way the intel code works is that it lazily creates
> > context structures that contain the address space; these get
> > associated with devices when pci_device_iommu_address_space() is
> > called, which in turn calls the bridge iommu_fn that performs the
> > association.
> >
> > The first problem is that the association is done based on the
> > bus/dev/fn of the device... at a time when bus numbers have not been
> > assigned yet.
> >
> > In fact, bus numbers are assigned dynamically by SW (typically the
> > BIOS), and the OS can renumber things, so it's bogus to assume that
> > the RID (bus/dev/fn) of a PCI device/function is fixed. However,
> > that's exactly what the code does: it calls
> > pci_device_iommu_address_space() once at device instantiation time in
> > qemu, before SW has had a chance to assign anything.

This is very much in line with my understanding, and I implemented a
solution that works for me in the root port and downstream cases on
Intel in the patch leading up to this:

http://lists.gnu.org/archive/html/qemu-devel/2014-10/msg03121.html

I still have on my TODO list to follow up on your feedback, Michael, but
I figured it needs a bit of continuous focus, and unfortunately I have
had far too much to do on other fronts lately to make progress on this.

As Jan argued, my solution cannot be used with non-PCI devices, and I
didn't have much to work on wrt other architectures, since the
implementations there, as you indicate, are fairly simple so far.
I also understand there has been quite some work on "platform" devices
lately; maybe that makes it easier to integrate at a more generic level?

> > So as far as I can tell, things will work as long as you are on bus 0
> > and there is no bridge; otherwise, it's broken by design, unless I'm
> > missing something...

Yes, as it stands today it does not work with bridges on Intel.

> > I've hacked around that locally by using the PCIBus * pointer instead
> > of the bus number to match the device to the iommu context.
> >
> > The second problem is that pci_device_iommu_address_space(), as it
> > walks up the hierarchy to find the iommu_fn, drops the original
> > device information. That means that if a device is below a switch or
> > a p2p bridge of some sort, once you reach the host bridge's top-level
> > bus, all we know is the bus & devfn of the last p2p entity along the
> > path; we lose the original bus & devfn information.
> >
> > This is incorrect for that sort of iommu, at least while in the PCIe
> > domain, as the original RID is carried along with DMA transactions
> > and is thus needed to properly associate the device/function with a
> > context.
> >
> > One fix could be to populate the iommu_fn of every bus down the food
> > chain, but that's fairly cumbersome... unless we make PCI bridges
> > "inherit" their parent's iommu_fn by default.
> >
> > Here, I've done a hack locally to keep the original device
> > information in pci_device_iommu_address_space(), but that's not a
> > proper way to do it either. Ultimately, each bridge needs to be able
> > to tell whether it properly forwards the RID information or not, so
> > the bridge itself needs some attribute to control that. Typically a
> > PCIe switch or root complex will always forward the full RID... while
> > most PCI-E -> PCI-X bridges are busted in that regard. Worse, some
> > bridges forward only *some* bits (a partial RID), which is even more
> > broken, but I don't know whether we can, or even care to, simulate
> > that. Thankfully, most PCI-X or PCI bridges behave properly and make
> > it look like all DMAs are coming from the bridge itself.
> >
> > What do you guys reckon is the right approach for both problems?

I agree with you that they are two distinct problems, with perhaps a 3rd
problem of how to generalize this to other devices than PCI devices.

Right now the handling of requester IDs across bridges seems very
rudimentary in the QEMU model, and as you indicate, it is perhaps not
necessary to emulate all the details of broken switches etc., but maybe
we need some kind of requester ID translation function (similar to
pci_bridge_map_irq()) that a bridge can choose to implement, where the
default is the identity translation?

Wrt. solving the original issue of IOMMU support for devices on
secondary buses (and beyond?), my approach in the patch above is just to
pass a PCIDevice pointer instead of PCIBus* + devfn, e.g.:

-typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef AddressSpace *(*PCIIOMMUFunc)(PCIDevice *, void *);

For the current Intel iommu implementation, that seems to work fine, as
no additional logic is needed to "propagate" the enumeration: accurate
IDs can be found via the device pointer.

The current API leaves the task of maintaining multiple address spaces
to each IOMMU implementation; maybe a more explicit notion of an IOMMU
group, similar to the way it is used by VFIO, is closer to how a common
denominator of hardware works?
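For what it is worth, with a PCIDevice*-based callback the RID can be
computed from the device pointer whenever it is needed; a rough sketch,
where pci_get_bus(), pci_bus_num() and PCI_BUILD_BDF() are existing QEMU
helpers and MyIOMMUState / my_iommu_context_to_as() are invented for
illustration:

static AddressSpace *my_iommu_find_as(PCIDevice *dev, void *opaque)
{
    MyIOMMUState *s = opaque;                   /* invented IOMMU state */

    /* Because the callback sees the device itself, the (bus number,
     * devfn) pair can be recomputed at lookup/translation time rather
     * than being frozen at device creation, so later renumbering by
     * firmware or the OS does not break the association. */
    uint16_t rid = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(dev)),
                                 dev->devfn);

    return my_iommu_context_to_as(s, rid);      /* invented lookup helper */
}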
Adding Alex as well, as he might have more insight into the details of
different IOMMU implementations from his VFIO work.

Knut

> > Cheers,
> > Ben.
* Re: [Qemu-devel] PCI iommu issues

From: Benjamin Herrenschmidt @ 2015-01-31 20:18 UTC
To: Knut Omang
Cc: Jan Kiszka, Alex Williamson, Michael S. Tsirkin, qemu-devel, Le Tan

On Sat, 2015-01-31 at 15:42 +0100, Knut Omang wrote:
> I agree with you that they are two distinct problems, with perhaps a
> 3rd problem of how to generalize this to other devices than PCI
> devices.
>
> Right now the handling of requester IDs across bridges seems very
> rudimentary in the QEMU model, and as you indicate, it is perhaps not
> necessary to emulate all the details of broken switches etc., but maybe
> we need some kind of requester ID translation function (similar to
> pci_bridge_map_irq()) that a bridge can choose to implement, where the
> default is the identity translation?

That could work.

> Wrt. solving the original issue of IOMMU support for devices on
> secondary buses (and beyond?), my approach in the patch above is just
> to pass a PCIDevice pointer instead of PCIBus* + devfn, e.g.:
>
> -typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> +typedef AddressSpace *(*PCIIOMMUFunc)(PCIDevice *, void *);

This is something I was going to suggest :-)

> For the current Intel iommu implementation, that seems to work fine, as
> no additional logic is needed to "propagate" the enumeration: accurate
> IDs can be found via the device pointer.

Agreed.

> The current API leaves the task of maintaining multiple address spaces
> to each IOMMU implementation; maybe a more explicit notion of an IOMMU
> group, similar to the way it is used by VFIO, is closer to how a common
> denominator of hardware works? Adding Alex as well, as he might have
> more insight into the details of different IOMMU implementations from
> his VFIO work.

I would first make it work, let more than one implementation be written,
and only then see what can be factored.

As for non-PCI, I wouldn't worry too much yet; ie, I don't see a point
in delaying fixing broken PCI IOMMUs for a hypothetical generic layer
that may or may not work, or may take a year in the making.

As for Michael's feedback, well, if we feel we can hack all the way into
the core DMA functions, I would have gone further and not attached
address spaces to devices permanently like we do... The address spaces
are really the domains, so to mimic the HW properly, ideally we would
capture the bus number on config space writes and lazily resolve the
address space on each DMA... We can invalidate the cached pointer when
cfg space is written or when the iommu invalidates its cache, but it's
more work and it could be difficult with vfio...

Also, it makes it harder to use a memory region as a child of the
address space that we enable/disable to handle the bus master enable bit
(but then, the way we do that prevents us from implementing broken
devices that ignore that bit :-)

Cheers,
Ben.

> Knut
>
> > > Cheers,
> > > Ben.
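A very rough sketch of that lazy model, purely for illustration;
DeviceDmaState, current_requester_id() and iommu_resolve_rid_to_domain()
are all invented names, not QEMU API:

/* Instead of binding an AddressSpace to a device once at instantiation,
 * resolve it on demand from the current bus number and IOMMU tables,
 * with a cached pointer that is dropped when config space or the IOMMU
 * context caches change. */
typedef struct DeviceDmaState {
    PCIDevice *dev;
    AddressSpace *cached_as;   /* NULL means "resolve again on next DMA" */
} DeviceDmaState;

static AddressSpace *device_dma_address_space(DeviceDmaState *s)
{
    if (!s->cached_as) {
        uint16_t rid = current_requester_id(s->dev);      /* invented */
        s->cached_as = iommu_resolve_rid_to_domain(rid);  /* invented */
    }
    return s->cached_as;
}

/* Hooked from config-space writes that touch bus numbers and from the
 * IOMMU's context-cache invalidation path. */
static void device_dma_invalidate(DeviceDmaState *s)
{
    s->cached_as = NULL;
}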