From: Jan Kiszka <jan.kiszka@siemens.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>, qemu-devel@nongnu.org
Cc: Knut Omang <knut.omang@oracle.com>, Le Tan <tamlokveer@gmail.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] PCI iommu issues
Date: Fri, 30 Jan 2015 08:40:51 +0100
Message-ID: <54CB3583.10309@siemens.com>
In-Reply-To: <1422596700.6246.13.camel@kernel.crashing.org>

Adding Knut to CC as he looked into and fixed the bridging issues of the
vtd emulation in particular. I will have to refresh my memory first.

Jan

On 2015-01-30 06:45, Benjamin Herrenschmidt wrote:
> Hi folks !
>
>
> I've looked at the intel iommu code to try to figure out how to properly
> implement a Power8 "native" iommu model and encountered some issues.
>
> Today "pseries" ppc machine is paravirtualized and so uses a pretty
> simplistic iommu model that essentially has one address space per host
> bridge.
>
> However, the real HW model I'm working on is closer to Intel in that we
> have various tables walked by HW that match an originator RID to what
> we called a "PE" (Partitionable Endpoint) to which corresponds an
> address space.
>
> So within a given PCI domain, individual functions, or groups of devices,
> can have different iommu address spaces & translation structures, all of
> which can be configured dynamically by the guest OS. As far as I
> understand things, this is similar to the Intel model, though the details
> of the implementation are very different.
>
> So I implemented something along the lines of what you guys did for q35
> and intel_iommu, and quickly discovered that it doesn't work, which
> makes me wonder whether the intel stuff in qemu actually works, or
> rather, does it work when adding bridges & switches into the picture.
>
> I basically have two problems, and they are somewhat related. First,
> the way the intel code works is that it lazily creates context
> structures that contain the address space, which get associated with
> devices when pci_device_iommu_address_space() is called; this in
> turn calls the bridge's iommu_fn, which performs the association.
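
For reference, this is roughly what that lookup does today, if I remember
correctly (a simplified sketch of the hw/pci/pci.c code from memory, so
field and helper names may not match the tree exactly):

    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
    {
        PCIBus *bus = dev->bus;

        if (bus->iommu_fn) {
            /* association happens here, keyed on this bus + dev->devfn */
            return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn);
        }
        if (bus->parent_dev) {
            /* recurse with the bridge device: the original RID is lost */
            return pci_device_iommu_address_space(bus->parent_dev);
        }
        return &address_space_memory;
    }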
>
> The first problem is that the association is done based on the bus/dev/fn
> of the device... at a time when bus numbers have not been assigned yet.
>
> In fact, the bus numbers are assigned dynamically by SW (typically the
> BIOS), and the OS can renumber things, so it's bogus to assume that the
> RID (bus/dev/fn) of a PCI device/function is fixed. However, that's
> exactly what the code does, as it calls
> pci_device_iommu_address_space() only once, at device instantiation time
> in qemu, even before SW has had a chance to assign anything.
>
> So as far as I can tell, things will work as long as you are on bus 0
> and there is no bridge; otherwise, it's broken by design, unless I'm
> missing something...
>
> I've hacked that locally in my code by using the PCIBus * pointer
> instead of the bus number to match the device to the iommu context.
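
Something along these lines, I suppose. This is a minimal sketch with
made-up names (MyIOMMUState, MyIOMMUContext and my_iommu_context_new are
hypothetical), just to illustrate keying the context on the PCIBus
pointer plus devfn rather than on the software-assigned bus number:

    typedef struct IOMMUCtxKey {
        PCIBus *bus;        /* stable for the lifetime of the machine */
        uint8_t devfn;
    } IOMMUCtxKey;

    static guint ctx_key_hash(gconstpointer v)
    {
        const IOMMUCtxKey *k = v;
        return g_direct_hash(k->bus) ^ k->devfn;
    }

    static gboolean ctx_key_equal(gconstpointer a, gconstpointer b)
    {
        const IOMMUCtxKey *ka = a, *kb = b;
        return ka->bus == kb->bus && ka->devfn == kb->devfn;
    }

    static AddressSpace *my_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
    {
        MyIOMMUState *s = opaque;
        IOMMUCtxKey key = { .bus = bus, .devfn = devfn };
        MyIOMMUContext *ctx = g_hash_table_lookup(s->ctx_by_bus_devfn, &key);

        if (!ctx) {
            ctx = my_iommu_context_new(s, bus, devfn);   /* sets up ctx->as */
            g_hash_table_insert(s->ctx_by_bus_devfn,
                                g_memdup(&key, sizeof(key)), ctx);
        }
        return &ctx->as;
    }

    /* at host bridge realize time: */
    s->ctx_by_bus_devfn = g_hash_table_new(ctx_key_hash, ctx_key_equal);
    pci_setup_iommu(bus, my_iommu_find_as, s);

The PCIBus pointer never changes, so the association survives whatever
renumbering the firmware or the OS does later.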
>
> The second problem is that pci_device_iommu_address_space(), as it walks
> up the hierarchy to find the iommu_fn, drops the original device
> information. That means that if a device is below a switch or a p2p
> bridge of some sort, once you reach the host bridge's top-level bus, all
> we know is the bus & devfn of the last p2p entity along the path; we
> lose the original bus & devfn information.
>
> This is incorrect for that sort of iommu, at least while in the PCIe
> domain, as the original RID is carried along with DMA transactions and is
> thus needed to properly associate the device/function with a context.
>
> One fix could be to populate the iommu_fn of every bus down the food
> chain, but that's fairly cumbersome... unless we make PCI bridges
> "inherit" their parent's iommu_fn by default.
>
> Here, I've done a local hack to keep the original device information
> in pci_device_iommu_address_space(), but that's not a proper way to do it
> either. Ultimately, each bridge needs to be able to tell whether it
> properly forwards the RID information or not, so the bridge itself needs
> some attribute to control that. Typically a PCIe switch or root
> complex will always forward the full RID... while most PCIe -> PCI-X
> bridges are busted in that regard. Worse, some bridges forward only *some*
> bits (a partial RID), which is even more broken, but I don't know if we
> can, or even care to, simulate that. Thankfully, most PCI-X or PCI bridges
> behave properly and make it look like all DMAs are coming from the
> bridge itself.
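
In other words, roughly the following shape, where the per-bridge
bridge_forwards_rid() helper/property is hypothetical:

    /* Sketch only: walk up to the bus that has an iommu_fn, but keep track
     * of the (bus, devfn) the IOMMU will actually see, substituting the
     * bridge's own RID whenever a bridge that does not forward the original
     * RID is crossed.  bridge_forwards_rid() is a hypothetical per-bridge
     * attribute. */
    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
    {
        PCIBus *bus = dev->bus;
        PCIBus *iommu_bus = bus;
        uint8_t devfn = dev->devfn;

        while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
            PCIDevice *bridge = iommu_bus->parent_dev;

            if (!bridge_forwards_rid(bridge)) {
                /* upstream of this bridge, DMA appears to come from it */
                bus = bridge->bus;
                devfn = bridge->devfn;
            }
            iommu_bus = bridge->bus;
        }
        if (iommu_bus && iommu_bus->iommu_fn) {
            return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
        }
        return &address_space_memory;
    }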
>
> What do you guys reckon is the right approach for both problems?
>
> Cheers,
> Ben.
>
>
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux