All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
@ 2016-12-06 10:36 Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
                   ` (16 more replies)
  0 siblings, 17 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

This RFC series is a continue work for Aviv B.D.'s vfio enablement
series with vt-d. Aviv has done a great job there, and what we still
lack there are mostly the following:

(1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
    memory region.

(2) VT-d still haven't provide a correct replay() mechanism (e.g.,
    when IOMMU domain switches, things will broke).

Here I'm trying to solve the above two issues.

(1) is solved by patch 7, (2) is solved by patch 11-12.

Basically it contains the following:

patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix

patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
            specific commit message for details, there are patches
            that I thought may be suitable for 2.8 as well, but looks
            like it's too late)

patch 7:    Solve the issue that vfio is notified more than once for
            IOTLB notifications with Aviv's patches

patch 8-10: Some trivial memory APIs added for further patches, and
            add customize replay() support for MemoryRegion (I see
            Aviv's latest v7 contains similar replay, I can rebase
            onto that, merely the same thing)

patch 11:   Provide a valid vt-d replay() callback, using page walk

patch 12:   Enable the domain switch support - we replay() when
            context entry got invalidated

patch 13:   Enhancement for existing invalidation notification,
            instead of using translate() for each page, we leverage
            the new vtd_page_walk() interface, which should be faster.

I would glad to hear about any review comments for above patches
(especially patch 8-13, which is the main part of this series),
especially any issue I missed in the series.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simply tested with x86_64, assigning two PCI devices to a single VM,
boot the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,cache-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With parameter:

  ./vfio-bind-group 00:03.0 00:04.0

Check host side trace log, I can see pages are replayed and mapped in
00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x301 lo 0x3be77001
vtd_page_walk Page walk for ce (0x301, 0x3be77001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x3be77000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x3c88a000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x366cb000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xb000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xc000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xd000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xe000 -> gpa 0x366cb000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address-space: A better solution in the future may be -
  we maintain one address space per IOMMU domain in the guest (so
  multiple devices can share a same address space if they are sharing
  the same IOMMU domains in the guest), rather than one address space
  per device (which is current implementation of vt-d). However that's
  a step further than this series, and let's see whether we can first
  provide a workable version of device assignment with vt-d
  protection.

- more to come...

Thanks,

Jason Wang (1):
  intel_iommu: allocate new key when creating new address space

Peter Xu (12):
  intel_iommu: simplify irq region translation
  intel_iommu: renaming gpa to iova where proper
  intel_iommu: fix trace for inv desc handling
  intel_iommu: fix trace for addr translation
  intel_iommu: vtd_slpt_level_shift check level
  memory: add section range info for IOMMU notifier
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: do replay when context invalidate
  intel_iommu: use page_walk for iotlb inv notify

 hw/i386/intel_iommu.c | 521 ++++++++++++++++++++++++++++++++------------------
 hw/i386/trace-events  |  27 +++
 hw/vfio/common.c      |   7 +-
 include/exec/memory.h |  30 +++
 memory.c              |  42 +++-
 5 files changed, 432 insertions(+), 195 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
@ 2016-12-29  7:38 Liu, Yi L
  2016-12-29  9:55 ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-29  7:38 UTC (permalink / raw)
  To: ', Peter Xu'
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

> The default replay() don't work for VT-d since vt-d will have a huge
> default memory region which covers address range 0-(2^64-1). This will
> normally bring a dead loop when guest starts.
> 
> The solution is simple - we don't walk over all the regions. Instead, we
> jump over the regions when we found that the page directories are empty.
> It'll greatly reduce the time to walk the whole region.
> 
> To achieve this, we provided a page walk helper to do that, invoking
> corresponding hook function when we found an page we are interested in.
> vtd_page_walk_level() is the core logic for the page walking. It's
> interface is designed to suite further use case, e.g., to invalidate a
> range of addresses.
> 
> Signed-off-by: Peter Xu <address@hidden>
> ---
>  hw/i386/intel_iommu.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/i386/trace-events  |   8 ++
>  include/exec/memory.h |   2 +
>  3 files changed, 217 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 46b8a2f..2fcd7af 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -620,6 +620,22 @@ static inline uint32_t 
> vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>      return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>  }
>  
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> +    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> +    return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> +    /*
> +     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
> +     */
> +    return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
>  static const uint64_t vtd_paging_entry_rsvd_field[] = {
>      [0] = ~0ULL,
>      /* For not large page */
> @@ -656,13 +672,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t 
> iova,
>      uint32_t level = vtd_get_level_from_context_entry(ce);
>      uint32_t offset;
>      uint64_t slpte;
> -    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check = 0;
>  
> -    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> -     * in CAP_REG and AW in context-entry.
> -     */
> -    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +    if (!vtd_iova_range_check(iova, ce)) {
>          error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
> @@ -718,6 +730,166 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, 
> uint64_t iova,
>      }
>  }
>  
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @skipped: accumulated skipped ranges
> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +                               uint64_t end, vtd_page_walk_hook hook_fn,
> +                               void *private, uint32_t level,
> +                               bool read, bool write, uint64_t *skipped,
> +                               bool notify_unmap)
> +{
> +    bool read_cur, write_cur, entry_valid;
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint64_t subpage_size, subpage_mask;
> +    IOMMUTLBEntry entry;
> +    uint64_t iova = start;
> +    uint64_t iova_next;
> +    uint64_t skipped_local = 0;
> +    int ret = 0;
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);
> +
> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> +    subpage_mask = vtd_slpt_level_page_mask(level);
> +
> +    while (iova < end) {
> +        iova_next = (iova & subpage_mask) + subpage_size;
> +
> +        offset = vtd_iova_level_offset(iova, level);
> +        slpte = vtd_get_slpte(addr, offset);
> +
> +        /*
> +         * When one of the following case happens, we assume the whole
> +         * range is invalid:
> +         *
> +         * 1. read block failed
> +         * 3. reserved area non-zero
> +         * 2. both read & write flag are not set
> +         */
> +
> +        if (slpte == (uint64_t)-1) {
> +            trace_vtd_page_walk_skip_read(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        /* Permissions are stacked with parents' */
> +        read_cur = read && (slpte & VTD_SL_R);
> +        write_cur = write && (slpte & VTD_SL_W);
> +
> +        /*
> +         * As long as we have either read/write permission, this is
> +         * a valid entry. The rule works for both page or page tables.
> +         */
> +        entry_valid = read_cur | write_cur;
> +
> +        if (vtd_is_last_slpte(slpte, level)) {
> +            entry.target_as = &address_space_memory;
> +            entry.iova = iova & subpage_mask;
> +            /*
> +             * This might be meaningless addr if (!read_cur &&
> +             * !write_cur), but after all this field will be
> +             * meaningless in that case, so let's share the code to
> +             * generate the IOTLBs no matter it's an MAP or UNMAP
> +             */
> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
> +            entry.addr_mask = ~subpage_mask;
> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +            if (!entry_valid && !notify_unmap) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }
> +            trace_vtd_page_walk_one(level, entry.iova, addr,
> +                                    entry.addr_mask, entry.perm);
> +            if (hook_fn) {
> +                ret = hook_fn(&entry, private);
> +                if (ret < 0) {
> +                    error_report("Detected error in page walk hook "
> +                                 "function, stop walk.");
> +                    return ret;
> +                }
> +            }
> +        } else {
> +            if (!entry_valid) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> +                                      MIN(iova_next, end), hook_fn, private,
> +                                      level - 1, read_cur, write_cur,
> +                                      &skipped_local, notify_unmap);
> +            if (ret < 0) {
> +                error_report("Detected page walk error on addr 0x%"PRIx64
> +                             " level %"PRIu32", stop walk.", addr, level - 1);
> +                return ret;
> +            }
> +        }
> +
> +next:
> +        iova = iova_next;
> +    }
> +
> +    if (skipped) {
> +        *skipped += skipped_local;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: size of the IOVA range to walk over
> + * @hook_fn: the hook that to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +                         vtd_page_walk_hook hook_fn, void *private)
> +{
> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> +    uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> +    if (!vtd_iova_range_check(start, ce)) {
> +        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
> +                     start, end);
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    if (!vtd_iova_range_check(end, ce)) {
> +        /* Fix end so that it reaches the maximum */
> +        end = vtd_iova_limit(ce);
> +    }
> +
> +    trace_vtd_page_walk(ce->hi, ce->lo, start, end);
> +
> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> +                               level, true, true, NULL, false);
> +}
> +
>  /* Map a device to its corresponding domain (context-entry) */
>  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                      uint8_t devfn, VTDContextEntry *ce)
> @@ -2430,6 +2602,35 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
> PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
> +    return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    VTDContextEntry ce;
> +
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {

Hi Peter,

Again I'd like to give my cheers on this patch set.

Hereby, I have a minor concern on this part.
The replay would be performed in vfio_listener_region_add() when device is
assigned. Since guest is just started, so there will be no valid context entry for the
assigned device. So there will be no vtd_page_walk. Thus no change to SL page
table in host. The SL page table would still be a IOVA->HPA mapping. It's true
that the gIOVA->HPA mapping can be built by page map/unmap when guest
trying to issue dma. So it is just ok without relpay in the beginning. Hot plug should
be all the same. I guess it is the design here?

However, it may be incompatible with a future work in which we expose an
vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
would be expected to have a replay in the time device is assigned.

Regards,
Yi L

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-12-30  9:19 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
2016-12-12  8:16   ` Liu, Yi L
2016-12-14  3:05     ` Peter Xu
2016-12-14  3:24       ` Liu, Yi L
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 02/13] intel_iommu: simplify irq region translation Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper Peter Xu
2016-12-18  8:39   ` Liu, Yi L
2016-12-19  9:23     ` Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 04/13] intel_iommu: fix trace for inv desc handling Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 05/13] intel_iommu: fix trace for addr translation Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 06/13] intel_iommu: vtd_slpt_level_shift check level Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 07/13] memory: add section range info for IOMMU notifier Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 08/13] memory: provide iommu_replay_all() Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 09/13] memory: introduce memory_region_notify_one() Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 10/13] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Peter Xu
2016-12-08  3:01   ` Lan Tianyu
2016-12-09  1:56     ` Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 12/13] intel_iommu: do replay when context invalidate Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 13/13] intel_iommu: use page_walk for iotlb inv notify Peter Xu
2016-12-06 10:49 ` [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
2016-12-13  7:45 ` Peter Xu
2016-12-18  8:42 ` Liu, Yi L
2016-12-19  9:22   ` Peter Xu
2016-12-19 11:25     ` Liu, Yi L
2016-12-19  9:33 ` Peter Xu
  -- strict thread matches above, loose matches on Subject: below --
2016-12-29  7:38 [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Liu, Yi L
2016-12-29  9:55 ` Peter Xu
2016-12-29 10:23   ` Liu, Yi L
2016-12-30  3:43     ` Peter Xu
2016-12-30  4:39       ` Liu, Yi L
2016-12-30  9:18         ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.