From: Jan Kiszka <jan.kiszka@siemens.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>, qemu-devel@nongnu.org
Cc: Knut Omang <knut.omang@oracle.com>, Le Tan <tamlokveer@gmail.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] PCI iommu issues
Date: Fri, 30 Jan 2015 08:40:51 +0100
Message-ID: <54CB3583.10309@siemens.com>
In-Reply-To: <1422596700.6246.13.camel@kernel.crashing.org>

Adding Knut to CC as he particularly looked into and fixed the bridging
issues of the vtd emulation. I will have to refresh my memory first.

Jan

On 2015-01-30 06:45, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> 
> I've looked at the intel iommu code to try to figure out how to properly
> implement a Power8 "native" iommu model and encountered some issues.
> 
> Today the "pseries" ppc machine is paravirtualized and so uses a pretty
> simplistic iommu model that essentially has one address space per host
> bridge.
> 
> However, the real HW model I'm working on is closer to Intel in that we
> have various tables walked by HW that match an originator RID to what
> we called a "PE" (Partitionable Endpoint) to which corresponds an
> address space.
> 
> So on a given domain, individual functions can have different iommu
> address spaces & translation structures, or groups of devices etc...
> which can all be configured dynamically by the guest OS. This is similar
> as far as I understand things to the Intel model though the details of
> the implementation are very different.
> 
> So I implemented something along the lines of what you guys did for q35
> and intel_iommu, and quickly discovered that it doesn't work, which
> makes me wonder whether the intel stuff in qemu actually works, or
> rather, does it work when adding bridges & switches into the picture.
> 
> I basically have two problems but they are somewhat related. Firstly
> the way the intel code works is that it lazily creates context
> structures that contain the address space, and these get associated
> with devices when pci_device_iommu_address_space() is called, which in
> turn calls the bridge iommu_fn, which performs the association.
> 
> The first problem is that the association is done based on bus/dev/fn
> of the device... at a time where bus numbers have not been assigned yet.
> 
> In fact, the bus numbers are assigned dynamically by SW, the BIOS
> typically, but the OS can renumber things and it's bogus to assume thus
> that the RID (bus/dev/fn) of a PCI device/function is fixed. However
> that's exactly what the code does as it calls
> pci_device_iommu_address_space() once at device instantiation time in
> qemu, even before SW had a chance to assign anything.
> 
> So as far as I can tell, things will work as long as you are on bus 0
> and there is no bridge, otherwise, it's broken by design, unless I'm
> missing something...
> 
> I've hacked that locally in my code by using the PCIBus * pointer
> instead of the bus number to match the device to the iommu context.
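
A minimal sketch of the idea in the paragraph above: key the iommu context
on the bus object's pointer plus devfn, and derive the RID only when it is
needed. All names here (Bus, Context, ContextTable, ctx_lookup,
ctx_current_rid) are hypothetical stand-ins for illustration, not QEMU's
actual types or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI bus object (not QEMU's PCIBus). */
typedef struct Bus {
    uint8_t bus_num;        /* assigned later by firmware/OS, may change */
} Bus;

/* Hypothetical iommu context, keyed on bus pointer + devfn. */
typedef struct Context {
    const Bus *bus;         /* stable identity, valid before enumeration */
    uint8_t devfn;          /* (device << 3) | function */
    int in_use;
} Context;

#define MAX_CTX 16

typedef struct ContextTable {
    Context slots[MAX_CTX];
} ContextTable;

/* Find the context for (bus, devfn), creating it lazily on first use.
 * Returns NULL if the table is full. */
static Context *ctx_lookup(ContextTable *t, const Bus *bus, uint8_t devfn)
{
    Context *free_slot = NULL;

    assert(t != NULL && bus != NULL);
    for (size_t i = 0; i < MAX_CTX; i++) {
        Context *c = &t->slots[i];
        if (c->in_use && c->bus == bus && c->devfn == devfn) {
            return c;
        }
        if (!c->in_use && free_slot == NULL) {
            free_slot = c;
        }
    }
    if (free_slot != NULL) {
        free_slot->bus = bus;
        free_slot->devfn = devfn;
        free_slot->in_use = 1;
    }
    return free_slot;
}

/* Compute the RID from the bus number as it is right now, rather than
 * caching whatever it was at device creation time. */
static uint16_t ctx_current_rid(const Context *c)
{
    assert(c != NULL && c->bus != NULL);
    return (uint16_t)((c->bus->bus_num << 8) | c->devfn);
}
```

With this keying, renumbering a bus after the context was created does not
orphan the context: the same lookup still finds it, and the RID reflects
the new bus number.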
> 
> The second problem is that pci_device_iommu_address_space(), as it walks
> up the hierarchy to find the iommu_fn, drops the original device
> information. That means that if a device is below a switch or a p2p
> bridge of some sort, once you reach the host bridge top level bus, all
> we know is the bus & devfn of the last p2p entity along the path, we
> lose the original bus & devfn information.
> 
> This is incorrect for that sort of iommu, at least while in the PCIe
> domain, as the original RID is carried along with DMA transactions and is
> thus needed to properly associate the device/function with a context.
> 
> One fix could be to populate the iommu_fn of every bus down the food
> chain but that's fairly cumbersome... unless we make the PCI bridges by
> default "inherit" from their parent iommu_fn.
> 
> Here, I've done a hack locally to keep the original device information
> in pci_device_iommu_address_space() but it's not a proper way to do it
> either. Ultimately, each bridge needs to be able to tell whether it
> properly forwards the RID information or not, so the bridge itself needs
> to have some attribute to control that. Typically a PCIe switch or root
> complex will always forward the full RID... while most PCI-E -> PCI-X
> bridges are busted in that regard. Worse, some bridges forward *some*
> bits (partial RID) which is even more broken, but I don't know if we can
> simulate that, or even care to. Thankfully most PCI-X or PCI bridges
> will behave properly and make it look like all DMAs are coming from the
> bridge itself.
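
A rough sketch of the per-bridge attribute described above, assuming a
simple boolean "forwards the full RID" flag. Node, forwards_rid and
effective_requester are hypothetical names for illustration only; the
partial-RID case mentioned above is deliberately not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node in the PCI hierarchy: an endpoint or a bridge. */
typedef struct Node {
    const struct Node *parent_bridge; /* NULL when on the root bus */
    int forwards_rid;                 /* bridges only: passes the
                                         originator RID upstream intact */
    uint16_t rid;                     /* the node's own bus/dev/fn */
} Node;

/* Walk from the device up to the root bus and return the node whose RID
 * the iommu would see. The original device is kept unless some bridge on
 * the path does not forward RIDs, in which case DMA appears to come from
 * that bridge (the topmost such bridge wins). */
static const Node *effective_requester(const Node *dev)
{
    const Node *req = dev;

    assert(dev != NULL);
    for (const Node *br = dev->parent_bridge; br != NULL;
         br = br->parent_bridge) {
        if (!br->forwards_rid) {
            req = br;
        }
    }
    return req;
}
```

The point of the sketch is that the walk carries the originator along
instead of dropping it at the first bridge; only a bridge that explicitly
does not forward RIDs replaces it.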
> 
> What do you guys reckon is the right approach for both problems ?
> 
> Cheers,
> Ben.
> 
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Thread overview: 4+ messages
2015-01-30  5:45 [Qemu-devel] PCI iommu issues Benjamin Herrenschmidt
2015-01-30  7:40 ` Jan Kiszka [this message]
2015-01-31 14:42   ` Knut Omang
2015-01-31 20:18     ` Benjamin Herrenschmidt
