From: Jan Kiszka <jan.kiszka@siemens.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>, qemu-devel@nongnu.org
Cc: Knut Omang <knut.omang@oracle.com>, Le Tan <tamlokveer@gmail.com>,
"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [Qemu-devel] PCI iommu issues
Date: Fri, 30 Jan 2015 08:40:51 +0100
Message-ID: <54CB3583.10309@siemens.com>
In-Reply-To: <1422596700.6246.13.camel@kernel.crashing.org>

Adding Knut to CC as he looked into and fixed the bridging issues of the
vtd emulation in particular. I will have to refresh my memory first.

Jan

On 2015-01-30 06:45, Benjamin Herrenschmidt wrote:
> Hi folks !
>
>
> I've looked at the intel iommu code to try to figure out how to properly
> implement a Power8 "native" iommu model and encountered some issues.
>
> Today "pseries" ppc machine is paravirtualized and so uses a pretty
> simplistic iommu model that essentially has one address space per host
> bridge.
>
> However, the real HW model I'm working on is closer to Intel in that we
> have various tables walked by HW that match an originator RID to what
> we called a "PE" (Partitionable Endpoint) to which corresponds an
> address space.
>
> So within a given PCI domain, individual functions, or groups of devices,
> can have different iommu address spaces & translation structures, all of
> which can be configured dynamically by the guest OS. As far as I
> understand things, this is similar to the Intel model, though the details
> of the implementation are very different.
>
> So I implemented something along the lines of what you guys did for q35
> and intel_iommu, and quickly discovered that it doesn't work, which
> makes me wonder whether the intel stuff in qemu actually works, or
> rather, does it work when adding bridges & switches into the picture.
>
> I basically have two problems, and they are somewhat related. First,
> the way the intel code works is that it lazily creates context
> structures that contain the address space, which get associated with
> devices when pci_device_iommu_address_space() is called; this in
> turn calls the bridge's iommu_fn, which performs the association.
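
For reference, this is roughly what that lookup does today, if I remember
correctly (a simplified sketch of the hw/pci/pci.c code from memory, so
field and helper names may not match the tree exactly):

    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
    {
        PCIBus *bus = dev->bus;

        if (bus->iommu_fn) {
            /* association happens here, keyed on this bus + dev->devfn */
            return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn);
        }
        if (bus->parent_dev) {
            /* recurse with the bridge device: the original RID is lost */
            return pci_device_iommu_address_space(bus->parent_dev);
        }
        return &address_space_memory;
    }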
>
> The first problem is that the association is done based on the bus/dev/fn
> of the device... at a time when bus numbers have not been assigned yet.
>
> In fact, the bus numbers are assigned dynamically by SW (typically the
> BIOS), and the OS can renumber things, so it's bogus to assume that the
> RID (bus/dev/fn) of a PCI device/function is fixed. However, that's
> exactly what the code does, as it calls
> pci_device_iommu_address_space() only once, at device instantiation time
> in qemu, even before SW has had a chance to assign anything.
>
> So as far as I can tell, things will work as long as you are on bus 0
> and there is no bridge; otherwise, it's broken by design, unless I'm
> missing something...
>
> I've hacked that locally in my code by using the PCIBus * pointer
> instead of the bus number to match the device to the iommu context.
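
Something along these lines, I suppose. This is a minimal sketch with
made-up names (MyIOMMUState, MyIOMMUContext and my_iommu_context_new are
hypothetical), just to illustrate keying the context on the PCIBus
pointer plus devfn rather than on the software-assigned bus number:

    typedef struct IOMMUCtxKey {
        PCIBus *bus;        /* stable for the lifetime of the machine */
        uint8_t devfn;
    } IOMMUCtxKey;

    static guint ctx_key_hash(gconstpointer v)
    {
        const IOMMUCtxKey *k = v;
        return g_direct_hash(k->bus) ^ k->devfn;
    }

    static gboolean ctx_key_equal(gconstpointer a, gconstpointer b)
    {
        const IOMMUCtxKey *ka = a, *kb = b;
        return ka->bus == kb->bus && ka->devfn == kb->devfn;
    }

    static AddressSpace *my_iommu_find_as(PCIBus *bus, void *opaque, int devfn)
    {
        MyIOMMUState *s = opaque;
        IOMMUCtxKey key = { .bus = bus, .devfn = devfn };
        MyIOMMUContext *ctx = g_hash_table_lookup(s->ctx_by_bus_devfn, &key);

        if (!ctx) {
            ctx = my_iommu_context_new(s, bus, devfn);   /* sets up ctx->as */
            g_hash_table_insert(s->ctx_by_bus_devfn,
                                g_memdup(&key, sizeof(key)), ctx);
        }
        return &ctx->as;
    }

    /* at host bridge realize time: */
    s->ctx_by_bus_devfn = g_hash_table_new(ctx_key_hash, ctx_key_equal);
    pci_setup_iommu(bus, my_iommu_find_as, s);

The PCIBus pointer never changes, so the association survives whatever
renumbering the firmware or the OS does later.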
>
> The second problem is that pci_device_iommu_address_space(), as it walks
> up the hierarchy to find the iommu_fn, drops the original device
> information. That means that if a device is below a switch or a p2p
> bridge of some sort, once you reach the host bridge's top-level bus, all
> we know is the bus & devfn of the last p2p entity along the path; we
> lose the original bus & devfn information.
>
> This is incorrect for that sort of iommu, at least while in the PCIe
> domain, as the original RID is carried along with DMA transactions and is
> thus needed to properly associate the device/function with a context.
>
> One fix could be to populate the iommu_fn of every bus down the food
> chain, but that's fairly cumbersome... unless we make PCI bridges
> "inherit" their parent's iommu_fn by default.
>
> Here, I've done a local hack to keep the original device information
> in pci_device_iommu_address_space(), but that's not a proper way to do it
> either. Ultimately, each bridge needs to be able to tell whether it
> properly forwards the RID information or not, so the bridge itself needs
> some attribute to control that. Typically a PCIe switch or root
> complex will always forward the full RID... while most PCIe -> PCI-X
> bridges are busted in that regard. Worse, some bridges forward only *some*
> bits (a partial RID), which is even more broken, but I don't know if we
> can, or even care to, simulate that. Thankfully, most PCI-X or PCI bridges
> behave properly and make it look like all DMAs are coming from the
> bridge itself.
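
In other words, roughly the following shape, where the per-bridge
bridge_forwards_rid() helper/property is hypothetical:

    /* Sketch only: walk up to the bus that has an iommu_fn, but keep track
     * of the (bus, devfn) the IOMMU will actually see, substituting the
     * bridge's own RID whenever a bridge that does not forward the original
     * RID is crossed.  bridge_forwards_rid() is a hypothetical per-bridge
     * attribute. */
    AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
    {
        PCIBus *bus = dev->bus;
        PCIBus *iommu_bus = bus;
        uint8_t devfn = dev->devfn;

        while (iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
            PCIDevice *bridge = iommu_bus->parent_dev;

            if (!bridge_forwards_rid(bridge)) {
                /* upstream of this bridge, DMA appears to come from it */
                bus = bridge->bus;
                devfn = bridge->devfn;
            }
            iommu_bus = bridge->bus;
        }
        if (iommu_bus && iommu_bus->iommu_fn) {
            return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
        }
        return &address_space_memory;
    }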
>
> What do you guys reckon is the right approach for both problems?
>
> Cheers,
> Ben.
>
>
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux