From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Alexander Graf <agraf@suse.de>,
Paolo Bonzini <pbonzini@redhat.com>,
qemu-ppc@nongnu.org, qemu-devel@nongnu.org,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH v5 07/11] vfio: Add guest side IOMMU support
Date: Fri, 21 Mar 2014 18:59:24 +1100
Message-ID: <532BF15C.7030802@ozlabs.ru>
In-Reply-To: <1395259061.8201.242.camel@ul30vt.home>
On 03/20/2014 06:57 AM, Alex Williamson wrote:
> On Wed, 2014-03-12 at 16:52 +1100, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
>> sufficient capability and granularity to match the guest side. This works
>> by tracking all map and unmap operations on the guest IOMMU using the
>> notifiers, and mirroring them into VFIO.
>>
>> There are a number of FIXMEs, and the scheme involves rather more notifier
>> structures than I'd like, but it should make for a reasonable proof of
>> concept.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> ---
>> Changes:
>> v4:
>> * fixed list objects naming
>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
>> place only, it is also easier to review the changes
>> * fixes boundary check not to fail on sections == 2^64 bytes,
>> the "vfio: Fix debug output for int128 values" patch is required;
>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
>> for 128 bit memory section sizes" patch proposal
>> ---
>> hw/misc/vfio.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 120 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 038010b..4f6f5da 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -159,10 +159,18 @@ typedef struct VFIOContainer {
>> };
>> void (*release)(struct VFIOContainer *);
>> } iommu_data;
>> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> QLIST_HEAD(, VFIOGroup) group_list;
>> QLIST_ENTRY(VFIOContainer) next;
>> } VFIOContainer;
>>
>> +typedef struct VFIOGuestIOMMU {
>> + VFIOContainer *container;
>> + MemoryRegion *iommu;
>> + Notifier n;
>> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>> typedef struct VFIOMSIXInfo {
>> uint8_t table_bar;
>> @@ -2241,8 +2249,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>
>> static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> {
>> - return !memory_region_is_ram(section->mr) ||
>> - /*
>> + return (!memory_region_is_ram(section->mr) &&
>> + !memory_region_is_iommu(section->mr)) ||
>> + /*
>
> White space damage
>
>> * Sizing an enabled 64-bit BAR can cause spurious mappings to
>> * addresses in the upper part of the 64-bit address space. These
>> * are never accessed by the CPU and beyond the address width of
>> @@ -2251,6 +2260,61 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> section->offset_within_address_space & (1ULL << 63);
>> }
>>
>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
>> +{
>> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> + VFIOContainer *container = giommu->container;
>> + IOMMUTLBEntry *iotlb = data;
>> + MemoryRegion *mr;
>> + hwaddr xlat;
>> + hwaddr len = iotlb->addr_mask + 1;
>> + void *vaddr;
>> + int ret;
>> +
>> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> + iotlb->iova, iotlb->iova + iotlb->addr_mask);
>> +
>> + /*
>> + * The IOMMU TLB entry we have just covers translation through
>> + * this IOMMU to its immediate target. We need to translate
>> + * it the rest of the way through to memory.
>> + */
>> + mr = address_space_translate(&address_space_memory,
>> + iotlb->translated_addr,
>> + &xlat, &len, iotlb->perm & IOMMU_WO);
>
> Write-only? Is this supposed to be read-write to mask just 2 bits?
The last parameter of address_space_translate() is a bool is_write, so I do
not really understand the problem here.
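To spell out what the call does with the perm bits (just a sketch, relying
on the IOMMUAccessFlags values from include/exec/memory.h):

    /*
     * iotlb->perm is a two-bit mask (IOMMU_NONE=0, IOMMU_RO=1,
     * IOMMU_WO=2, IOMMU_RW=3), so testing the write bit alone tells us
     * whether the entry is writable, and a single bool is all the
     * is_write argument of address_space_translate() can carry anyway.
     */
    bool is_write = (iotlb->perm & IOMMU_WO) != 0;

    mr = address_space_translate(&address_space_memory,
                                 iotlb->translated_addr,
                                 &xlat, &len, is_write);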
>> + if (!memory_region_is_ram(mr)) {
>> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>> + xlat);
>> + return;
>> + }
>> + if (len & iotlb->addr_mask) {
>> + DPRINTF("iommu has granularity incompatible with target AS\n");
>
> Is this possible? Assuming len is initially a power-of-2, would the
> translate function change it? Maybe worth a comment to explain.
Oh. address_space_translate() actually changes @len to min(len,
TARGET_PAGE_SIZE), and TARGET_PAGE_SIZE is hardcoded to 4K. So far that has
been fine, but lately I have been implementing a huge DMA window (plus one
sPAPRTCETable and one VFIOGuestIOMMU object) which currently operates with
16MB pages (it can do 64K pages too), and now this "granularity incompatible"
case does happen.
I have disabled that check for now, but I need to think of a better fix...
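One direction I am considering (a rough, untested sketch; the actual
vfio_dma_map/unmap call and error reporting are omitted) is to walk the
entry in whatever chunks address_space_translate() is prepared to hand back
instead of insisting that one call covers the whole TCE:

    hwaddr done = 0, total = iotlb->addr_mask + 1;

    while (done < total) {
        hwaddr xlat, plen = total - done;
        MemoryRegion *mr;

        mr = address_space_translate(&address_space_memory,
                                     iotlb->translated_addr + done,
                                     &xlat, &plen,
                                     iotlb->perm & IOMMU_WO);
        if (!memory_region_is_ram(mr)) {
            return;
        }
        /* map or unmap [iova + done, iova + done + plen) against
         * memory_region_get_ram_ptr(mr) + xlat here */
        done += plen;
    }

The obvious downside is one translate call per 4K of a 16MB TCE, so it may
just move the cost around rather than fix it.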
Adding Paolo to cc, maybe he can pick up the context and give a good piece
of advice :)
>
>> + return;
>> + }
>> +
>> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
>
> This lookup isn't free and the unmap path doesn't need it, maybe move
> the variable and lookup into the first branch below?
>
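Fair enough. If I push the lookup down, the tail of the function would look
roughly like this (a sketch only, with the error_report() calls from the
patch elided):

    if (iotlb->perm != IOMMU_NONE) {
        /* only the map path needs the RAM pointer */
        void *vaddr = memory_region_get_ram_ptr(mr) + xlat;

        ret = vfio_dma_map(container, iotlb->iova,
                           iotlb->addr_mask + 1, vaddr,
                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
        ...
    } else {
        /* the unmap path does not need the lookup at all */
        ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
        ...
    }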
>> +
>> + if (iotlb->perm != IOMMU_NONE) {
>> + ret = vfio_dma_map(container, iotlb->iova,
>> + iotlb->addr_mask + 1, vaddr,
>> + !(iotlb->perm & IOMMU_WO) || mr->readonly);
>> + if (ret) {
>> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> + "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> + container, iotlb->iova,
>> + iotlb->addr_mask + 1, vaddr, ret);
>> + }
>> + } else {
>> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>> + if (ret) {
>> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> + "0x%"HWADDR_PRIx") = %d (%m)",
>> + container, iotlb->iova,
>> + iotlb->addr_mask + 1, ret);
>> + }
>> + }
>> +}
>> +
>> static void vfio_listener_region_add(MemoryListener *listener,
>> MemoryRegionSection *section)
>> {
>> @@ -2261,8 +2325,6 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> void *vaddr;
>> int ret;
>>
>> - assert(!memory_region_is_iommu(section->mr));
>> -
>> if (vfio_listener_skipped_section(section)) {
>> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>> section->offset_within_address_space,
>> @@ -2286,15 +2348,47 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> return;
>> }
>>
>> + memory_region_ref(section->mr);
>> +
>> + if (memory_region_is_iommu(section->mr)) {
>> + VFIOGuestIOMMU *giommu;
>> +
>> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> + iova, int128_get64(int128_sub(llend, int128_one())));
>> + /*
>> + * FIXME: We should do some checking to see if the
>> + * capabilities of the host VFIO IOMMU are adequate to model
>> + * the guest IOMMU
>> + *
>> + * FIXME: This assumes that the guest IOMMU is empty of
>> + * mappings at this point - we should either enforce this, or
>> + * loop through existing mappings to map them into VFIO.
>> + *
>> + * FIXME: For VFIO iommu types which have KVM acceleration to
>> + * avoid bouncing all map/unmaps through qemu this way, this
>> + * would be the right place to wire that up (tell the KVM
>> + * device emulation the VFIO iommu handles to use).
>> + */
>
> That's a lot of FIXMEs... The second one in particular looks like it
> needs to expand a bit on why this is likely a valid assumption. The
> last one is more of a TODO than a FIXME.
>
>> + giommu = g_malloc0(sizeof(*giommu));
>> + giommu->iommu = section->mr;
>> + giommu->container = container;
>> + giommu->n.notify = vfio_iommu_map_notify;
>> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +
>> + return;
>> + }
>> +
>> + /* Here we assume that memory_region_is_ram(section->mr)==true */
>> +
>> end = int128_get64(llend);
>> vaddr = memory_region_get_ram_ptr(section->mr) +
>> section->offset_within_region +
>> (iova - section->offset_within_address_space);
>>
>> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>> + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
>> iova, end - 1, vaddr);
>>
>> - memory_region_ref(section->mr);
>> ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
>> if (ret) {
>> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> @@ -2338,6 +2432,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
>> return;
>> }
>>
>> + if (memory_region_is_iommu(section->mr)) {
>> + VFIOGuestIOMMU *giommu;
>> +
>> + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> + if (giommu->iommu == section->mr) {
>> + memory_region_unregister_iommu_notifier(&giommu->n);
>> + QLIST_REMOVE(giommu, giommu_next);
>> + g_free(giommu);
>> + break;
>> + }
>> + }
>> +
>> + /*
>> + * FIXME: We assume the one big unmap below is adequate to
>> + * remove any individual page mappings in the IOMMU which
>> + * might have been copied into VFIO. That may not be true for
>> + * all IOMMU types
>> + */
>
> We assume this because the IOVA that gets unmapped is the same
> regardless of whether a guest IOMMU is present?
What exactly is meant by "guest IOMMU is present"? Working on the second DMA
window, I am now really confused about the terminology :(
>> + }
>> +
>> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> end = (section->offset_within_address_space + int128_get64(section->size)) &
>> TARGET_PAGE_MASK;
>
>
>
--
Alexey