[Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
@ 2016-12-06 10:36 Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
                   ` (16 more replies)
  0 siblings, 17 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

This RFC series is a continue work for Aviv B.D.'s vfio enablement
series with vt-d. Aviv has done a great job there, and what we still
lack there are mostly the following:

(1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
    memory region.

(2) VT-d still haven't provide a correct replay() mechanism (e.g.,
    when IOMMU domain switches, things will broke).

Here I'm trying to solve the above two issues.

(1) is solved by patch 7, (2) is solved by patch 11-12.

Basically it contains the following:

patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix

patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
            specific commit message for details, there are patches
            that I thought may be suitable for 2.8 as well, but looks
            like it's too late)

patch 7:    Solve the issue that vfio is notified more than once for
            IOTLB notifications with Aviv's patches

patch 8-10: Some trivial memory APIs added for further patches, and
            add customize replay() support for MemoryRegion (I see
            Aviv's latest v7 contains similar replay, I can rebase
            onto that, merely the same thing)

patch 11:   Provide a valid vt-d replay() callback, using page walk

patch 12:   Enable the domain switch support - we replay() when
            context entry got invalidated

patch 13:   Enhancement for existing invalidation notification,
            instead of using translate() for each page, we leverage
            the new vtd_page_walk() interface, which should be faster.

I would glad to hear about any review comments for above patches
(especially patch 8-13, which is the main part of this series),
especially any issue I missed in the series.

=========
Test Done
=========

Build test passed for x86_64/arm/ppc64.

Simply tested with x86_64, assigning two PCI devices to a single VM,
boot the VM using:

bin=x86_64-softmmu/qemu-system-x86_64
$bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
     -device intel-iommu,intremap=on,eim=off,cache-mode=on \
     -netdev user,id=net0,hostfwd=tcp::5555-:22 \
     -device virtio-net-pci,netdev=net0 \
     -device vfio-pci,host=03:00.0 \
     -device vfio-pci,host=02:00.0 \
     -trace events=".trace.vfio" \
     /var/lib/libvirt/images/vm1.qcow2

pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
vtd_page_walk*
vtd_replay*
vtd_inv_desc*

Then, in the guest, run the following tool:

  https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c

With parameter:

  ./vfio-bind-group 00:03.0 00:04.0

Check host side trace log, I can see pages are replayed and mapped in
00:04.0 device address space, like:

...
vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x301 lo 0x3be77001
vtd_page_walk Page walk for ce (0x301, 0x3be77001) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x3be77000, level=3) iova range 0x0 - 0x8000000000
vtd_page_walk_level Page walk (base=0x3c88a000, level=2) iova range 0x0 - 0x40000000
vtd_page_walk_level Page walk (base=0x366cb000, level=1) iova range 0x0 - 0x200000
vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xb000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xc000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xd000 -> gpa 0x366cb000 mask 0xfff perm 3
vtd_page_walk_one Page walk detected map level 0x1 iova 0xe000 -> gpa 0x366cb000 mask 0xfff perm 3
...

=========
Todo List
=========

- error reporting for the assigned devices (as Tianyu has mentioned)

- per-domain address-space: A better solution in the future may be -
  we maintain one address space per IOMMU domain in the guest (so
  multiple devices can share a same address space if they are sharing
  the same IOMMU domains in the guest), rather than one address space
  per device (which is current implementation of vt-d). However that's
  a step further than this series, and let's see whether we can first
  provide a workable version of device assignment with vt-d
  protection.

- more to come...

Thanks,

Jason Wang (1):
  intel_iommu: allocate new key when creating new address space

Peter Xu (12):
  intel_iommu: simplify irq region translation
  intel_iommu: renaming gpa to iova where proper
  intel_iommu: fix trace for inv desc handling
  intel_iommu: fix trace for addr translation
  intel_iommu: vtd_slpt_level_shift check level
  memory: add section range info for IOMMU notifier
  memory: provide iommu_replay_all()
  memory: introduce memory_region_notify_one()
  memory: add MemoryRegionIOMMUOps.replay() callback
  intel_iommu: provide its own replay() callback
  intel_iommu: do replay when context invalidate
  intel_iommu: use page_walk for iotlb inv notify

 hw/i386/intel_iommu.c | 521 ++++++++++++++++++++++++++++++++------------------
 hw/i386/trace-events  |  27 +++
 hw/vfio/common.c      |   7 +-
 include/exec/memory.h |  30 +++
 memory.c              |  42 +++-
 5 files changed, 432 insertions(+), 195 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-12  8:16   ` Liu, Yi L
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 02/13] intel_iommu: simplify irq region translation Peter Xu
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

From: Jason Wang <jasowang@redhat.com>

We use the pointer to stack for key for new address space, this will
break hash table searching, fixing by g_malloc() a new key instead.

Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 708770e..92e4064 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2426,12 +2426,13 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     VTDAddressSpace *vtd_dev_as;
 
     if (!vtd_bus) {
+        uintptr_t *new_key = g_malloc(sizeof(*new_key));
+        *new_key = (uintptr_t)bus;
         /* No corresponding free() */
         vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
                             X86_IOMMU_PCI_DEVFN_MAX);
         vtd_bus->bus = bus;
-        key = (uintptr_t)bus;
-        g_hash_table_insert(s->vtd_as_by_busptr, &key, vtd_bus);
+        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
     }
 
     vtd_dev_as = vtd_bus->dev_as[devfn];
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
@ 2016-12-12  8:16   ` Liu, Yi L
  2016-12-14  3:05     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-12  8:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv,
	alex.williamson, jasowang, yi.liu

On Tue, Dec 06, 2016 at 06:36:16PM +0800, Peter Xu wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> We use the pointer to stack for key for new address space, this will
> break hash table searching, fixing by g_malloc() a new key instead.
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Eduardo Habkost <ehabkost@redhat.com>
> Acked-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 708770e..92e4064 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2426,12 +2426,13 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
>      VTDAddressSpace *vtd_dev_as;
>  
>      if (!vtd_bus) {
> +        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> +        *new_key = (uintptr_t)bus;
>          /* No corresponding free() */
>          vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
>                              X86_IOMMU_PCI_DEVFN_MAX);
>          vtd_bus->bus = bus;
> -        key = (uintptr_t)bus;
> -        g_hash_table_insert(s->vtd_as_by_busptr, &key, vtd_bus);
> +        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
Hi Peter,
Your fix seems to answer an issue I encountered back in Oct. The symptom is: use the same bus value to
searcha previous inserted entry in s->vtd_as_by_busptr, the result is not found.

really grt fix. could explain it a bit on why this change would fix the issue?

Regards,
Yi L
>      }
>  
>      vtd_dev_as = vtd_bus->dev_as[devfn];
> -- 
> 2.7.4
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space
  2016-12-12  8:16   ` Liu, Yi L
@ 2016-12-14  3:05     ` Peter Xu
  2016-12-14  3:24       ` Liu, Yi L
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-14  3:05 UTC (permalink / raw)
  To: Liu, Yi L, Jason Wang
  Cc: qemu-devel, tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv,
	alex.williamson, yi.liu

On Mon, Dec 12, 2016 at 04:16:50PM +0800, Liu, Yi L wrote:
> On Tue, Dec 06, 2016 at 06:36:16PM +0800, Peter Xu wrote:
> > From: Jason Wang <jasowang@redhat.com>
> > 
> > We use the pointer to stack for key for new address space, this will
> > break hash table searching, fixing by g_malloc() a new key instead.
> > 
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <rth@twiddle.net>
> > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > Acked-by: Peter Xu <peterx@redhat.com>
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/i386/intel_iommu.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 708770e..92e4064 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2426,12 +2426,13 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> >      VTDAddressSpace *vtd_dev_as;
> >  
> >      if (!vtd_bus) {
> > +        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> > +        *new_key = (uintptr_t)bus;
> >          /* No corresponding free() */
> >          vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> >                              X86_IOMMU_PCI_DEVFN_MAX);
> >          vtd_bus->bus = bus;
> > -        key = (uintptr_t)bus;
> > -        g_hash_table_insert(s->vtd_as_by_busptr, &key, vtd_bus);
> > +        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> Hi Peter,
> Your fix seems to answer an issue I encountered back in Oct. The symptom is: use the same bus value to
> searcha previous inserted entry in s->vtd_as_by_busptr, the result is not found.
> 
> really grt fix. could explain it a bit on why this change would fix the issue?

The old code is doing g_hash_table_insert() with "&key" as the hash
key. However variable "key" is allocated on stack, so it's value might
change after we return from vtd_find_add_as() (stack variables can
only be used inside its functional scope). The patch switched to use
g_malloc0(), that'll use heap memory rather than stack, which is safe.

(Forwarding this thankfulness to Jason who is the real author of this
 fix :-)

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space
  2016-12-14  3:05     ` Peter Xu
@ 2016-12-14  3:24       ` Liu, Yi L
  0 siblings, 0 replies; 33+ messages in thread
From: Liu, Yi L @ 2016-12-14  3:24 UTC (permalink / raw)
  To: Peter Xu, Liu, Yi L, Jason Wang
  Cc: Lan, Tianyu, Tian, Kevin, mst@redhat.com, jan.kiszka@siemens.com,
	bd.aviv@gmail.com, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, Wu, Feng

> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> Behalf Of Peter Xu
> Sent: Wednesday, December 14, 2016 11:06 AM
> To: Liu, Yi L <yi.l.liu@linux.intel.com>; Jason Wang <jasowang@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> mst@redhat.com; jan.kiszka@siemens.com; bd.aviv@gmail.com; qemu-
> devel@nongnu.org; alex.williamson@redhat.com; Liu, Yi <yi.liu@intel.com>
> Subject: Re: [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when
> creating new address space
> 
> On Mon, Dec 12, 2016 at 04:16:50PM +0800, Liu, Yi L wrote:
> > On Tue, Dec 06, 2016 at 06:36:16PM +0800, Peter Xu wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > >
> > > We use the pointer to stack for key for new address space, this will
> > > break hash table searching, fixing by g_malloc() a new key instead.
> > >
> > > Cc: Michael S. Tsirkin <mst@redhat.com>
> > > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > > Cc: Richard Henderson <rth@twiddle.net>
> > > Cc: Eduardo Habkost <ehabkost@redhat.com>
> > > Acked-by: Peter Xu <peterx@redhat.com>
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  hw/i386/intel_iommu.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 708770e..92e4064 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -2426,12 +2426,13 @@ VTDAddressSpace
> *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> > >      VTDAddressSpace *vtd_dev_as;
> > >
> > >      if (!vtd_bus) {
> > > +        uintptr_t *new_key = g_malloc(sizeof(*new_key));
> > > +        *new_key = (uintptr_t)bus;
> > >          /* No corresponding free() */
> > >          vtd_bus = g_malloc0(sizeof(VTDBus) + sizeof(VTDAddressSpace *) * \
> > >                              X86_IOMMU_PCI_DEVFN_MAX);
> > >          vtd_bus->bus = bus;
> > > -        key = (uintptr_t)bus;
> > > -        g_hash_table_insert(s->vtd_as_by_busptr, &key, vtd_bus);
> > > +        g_hash_table_insert(s->vtd_as_by_busptr, new_key, vtd_bus);
> > Hi Peter,
> > Your fix seems to answer an issue I encountered back in Oct. The symptom is: use
> the same bus value to
> > searcha previous inserted entry in s->vtd_as_by_busptr, the result is not found.
> >
> > really grt fix. could explain it a bit on why this change would fix the issue?
> 
> The old code is doing g_hash_table_insert() with "&key" as the hash
> key. However variable "key" is allocated on stack, so it's value might
> change after we return from vtd_find_add_as() (stack variables can
> only be used inside its functional scope). The patch switched to use
> g_malloc0(), that'll use heap memory rather than stack, which is safe.
> 
> (Forwarding this thankfulness to Jason who is the real author of this
>  fix :-)

Thanks for the explanation, Peter. I see. Then it is sure this really fix what
I encountered in Oct. Really a wonderful fix Jason~ 

> 
> Thanks,
> 
> -- peterx


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 02/13] intel_iommu: simplify irq region translation
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper Peter Xu
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Before we have int-remap, we need to bypass interrupt write requests.
That's not necessary now - we have supported int-remap, and all the irq
region requests should be redirected there. Cleaning up the block with
an assertion instead.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 92e4064..f19a8b3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -842,29 +842,12 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool writes = true;
     VTDIOTLBEntry *iotlb_entry;
 
-    /* Check if the request is in interrupt address range */
-    if (vtd_is_interrupt_addr(addr)) {
-        if (flags & IOMMU_WO) {
-            /* FIXME: since we don't know the length of the access here, we
-             * treat Non-DWORD length write requests without PASID as
-             * interrupt requests, too. Withoud interrupt remapping support,
-             * we just use 1:1 mapping.
-             */
-            VTD_DPRINTF(MMU, "write request to interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            entry->iova = addr & VTD_PAGE_MASK_4K;
-            entry->translated_addr = addr & VTD_PAGE_MASK_4K;
-            entry->addr_mask = ~VTD_PAGE_MASK_4K;
-            entry->perm = IOMMU_WO;
-            return;
-        } else {
-            VTD_DPRINTF(GENERAL, "error: read request from interrupt address "
-                        "gpa 0x%"PRIx64, addr);
-            vtd_report_dmar_fault(s, source_id, addr, VTD_FR_READ,
-                    flags & IOMMU_WO);
-            return;
-        }
-    }
+    /*
+     * We have standalone memory region for interrupt addresses, we
+     * should never receive translation requests in this region.
+     */
+    assert(!vtd_is_interrupt_addr(addr));
+
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 02/13] intel_iommu: simplify irq region translation Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-18  8:39   ` Liu, Yi L
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 04/13] intel_iommu: fix trace for inv desc handling Peter Xu
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

There are lots of places in current intel_iommu.c codes that named
"iova" as "gpa". It is really confusing to use a name "gpa" in these
places (which is very easily to be understood as "Guest Physical
Address", while it's not). To make the codes (much) easier to be read, I
decided to do this once and for all.

No functional change is made. Only literal ones.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f19a8b3..3d98797 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -279,7 +279,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                 " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
                 domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
@@ -595,12 +595,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
     return slpte;
 }
 
-/* Given a gpa and the level of paging structure, return the offset of current
- * level.
+/* Given a iova and the level of paging structure, return the offset
+ * of current level.
  */
-static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
+static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 {
-    return (gpa >> vtd_slpt_level_shift(level)) &
+    return (iova >> vtd_slpt_level_shift(level)) &
             ((1ULL << VTD_SL_LEVEL_BITS) - 1);
 }
 
@@ -648,13 +648,13 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     }
 }
 
-/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
+/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
-                            IOMMUAccessFlags flags,
-                            uint64_t *slptep, uint32_t *slpte_level,
-                            bool *reads, bool *writes)
+static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
+                             IOMMUAccessFlags flags,
+                             uint64_t *slptep, uint32_t *slpte_level,
+                             bool *reads, bool *writes)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -663,11 +663,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
     uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check = 0;
 
-    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
-     * and AW in context-entry.
+    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
      */
-    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
+    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -683,13 +683,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
     }
 
     while (true) {
-        offset = vtd_gpa_level_offset(gpa, level);
+        offset = vtd_iova_level_offset(iova, level);
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
             VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
-                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
-                        level, gpa);
+                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
+                        level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
@@ -701,8 +701,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check) && !(flags & IOMMU_NO_FAIL)) {
             VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (flags & IOMMU_WO ? "write" : "read"), gpa, slpte);
+                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
+                        (flags & IOMMU_WO ? "write" : "read"), iova, slpte);
             return (flags & IOMMU_WO) ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
@@ -851,7 +851,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
+        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
                     " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
                     iotlb_entry->slpte, iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
@@ -894,8 +894,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
-    ret_fr = vtd_gpa_to_slpte(&ce, addr, flags, &slpte, &level,
-                              &reads, &writes);
+    ret_fr = vtd_iova_to_slpte(&ce, addr, flags, &slpte, &level,
+                               &reads, &writes);
     if (ret_fr) {
         ret_fr = -ret_fr;
         if (!(flags & IOMMU_NO_FAIL)) {
@@ -2032,7 +2032,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
                            flags, &ret);
     VTD_DPRINTF(MMU,
                 "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
-                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
+                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
                 VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
                 vtd_as->devfn, addr, ret.translated_addr);
     return ret;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2016-12-18  8:39   ` Liu, Yi L
  2016-12-19  9:23     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-18  8:39 UTC (permalink / raw)
  To: Peter Xu; +Cc: kevin.tian, tianyu.lan, jasowang, qemu-devel, yi.l.liu

On Tue, Dec 06, 2016 at 06:36:18PM +0800, Peter Xu wrote:
> There are lots of places in current intel_iommu.c codes that named
> "iova" as "gpa". It is really confusing to use a name "gpa" in these
> places (which is very easily to be understood as "Guest Physical
> Address", while it's not). To make the codes (much) easier to be read, I
> decided to do this once and for all.
> 
> No functional change is made. Only literal ones.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f19a8b3..3d98797 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -279,7 +279,7 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
>      uint64_t *key = g_malloc(sizeof(*key));
>      uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
>  
> -    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                  " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
>                  domain_id);
>      if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
> @@ -595,12 +595,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
>      return slpte;
>  }
>  
> -/* Given a gpa and the level of paging structure, return the offset of current
> - * level.
> +/* Given a iova and the level of paging structure, return the offset
maybe "an iova" instead of "a iova"
> + * of current level.
>   */
> -static inline uint32_t vtd_gpa_level_offset(uint64_t gpa, uint32_t level)
> +static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
>  {
> -    return (gpa >> vtd_slpt_level_shift(level)) &
> +    return (iova >> vtd_slpt_level_shift(level)) &
>              ((1ULL << VTD_SL_LEVEL_BITS) - 1);
>  }
>  
> @@ -648,13 +648,13 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>      }
>  }
>  
> -/* Given the @gpa, get relevant @slptep. @slpte_level will be the last level
> +/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
> -                            IOMMUAccessFlags flags,
> -                            uint64_t *slptep, uint32_t *slpte_level,
> -                            bool *reads, bool *writes)
> +static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
> +                             IOMMUAccessFlags flags,
> +                             uint64_t *slptep, uint32_t *slpte_level,
> +                             bool *reads, bool *writes)
>  {
>      dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>      uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -663,11 +663,11 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
>      uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check = 0;
>  
> -    /* Check if @gpa is above 2^X-1, where X is the minimum of MGAW in CAP_REG
> -     * and AW in context-entry.
> +    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
>       */
> -    if (gpa & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> -        VTD_DPRINTF(GENERAL, "error: gpa 0x%"PRIx64 " exceeds limits", gpa);
> +    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
> @@ -683,13 +683,13 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
>      }
>  
>      while (true) {
> -        offset = vtd_gpa_level_offset(gpa, level);
> +        offset = vtd_iova_level_offset(iova, level);
>          slpte = vtd_get_slpte(addr, offset);
>  
>          if (slpte == (uint64_t)-1) {
>              VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
> -                        "entry at level %"PRIu32 " for gpa 0x%"PRIx64,
> -                        level, gpa);
> +                        "entry at level %"PRIu32 " for iova 0x%"PRIx64,
> +                        level, iova);
>              if (level == vtd_get_level_from_context_entry(ce)) {
>                  /* Invalid programming of context-entry */
>                  return -VTD_FR_CONTEXT_ENTRY_INV;
> @@ -701,8 +701,8 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t gpa,
>          *writes = (*writes) && (slpte & VTD_SL_W);
>          if (!(slpte & access_right_check) && !(flags & IOMMU_NO_FAIL)) {
>              VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
> -                        "gpa 0x%"PRIx64 " slpte 0x%"PRIx64,
> -                        (flags & IOMMU_WO ? "write" : "read"), gpa, slpte);
> +                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
> +                        (flags & IOMMU_WO ? "write" : "read"), iova, slpte);
>              return (flags & IOMMU_WO) ? -VTD_FR_WRITE : -VTD_FR_READ;
>          }
>          if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> @@ -851,7 +851,7 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      /* Try to fetch slpte form IOTLB */
>      iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
>      if (iotlb_entry) {
> -        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " gpa 0x%"PRIx64
> +        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
>                      " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
>                      iotlb_entry->slpte, iotlb_entry->domain_id);
>          slpte = iotlb_entry->slpte;
> @@ -894,8 +894,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
>  
> -    ret_fr = vtd_gpa_to_slpte(&ce, addr, flags, &slpte, &level,
> -                              &reads, &writes);
> +    ret_fr = vtd_iova_to_slpte(&ce, addr, flags, &slpte, &level,
> +                               &reads, &writes);
>      if (ret_fr) {
>          ret_fr = -ret_fr;
>          if (!(flags & IOMMU_NO_FAIL)) {
> @@ -2032,7 +2032,7 @@ static IOMMUTLBEntry vtd_iommu_translate(MemoryRegion *iommu, hwaddr addr,
>                             flags, &ret);
>      VTD_DPRINTF(MMU,
>                  "bus %"PRIu8 " slot %"PRIu8 " func %"PRIu8 " devfn %"PRIu8
> -                " gpa 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
> +                " iova 0x%"PRIx64 " hpa 0x%"PRIx64, pci_bus_num(vtd_as->bus),
>                  VTD_PCI_SLOT(vtd_as->devfn), VTD_PCI_FUNC(vtd_as->devfn),
>                  vtd_as->devfn, addr, ret.translated_addr);
>      return ret;
> -- 
> 2.7.4
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper
  2016-12-18  8:39   ` Liu, Yi L
@ 2016-12-19  9:23     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-19  9:23 UTC (permalink / raw)
  To: Liu, Yi L; +Cc: kevin.tian, tianyu.lan, jasowang, qemu-devel, yi.l.liu

On Sun, Dec 18, 2016 at 04:39:11PM +0800, Liu, Yi L wrote:

[...]

> > @@ -595,12 +595,12 @@ static uint64_t vtd_get_slpte(dma_addr_t base_addr, uint32_t index)
> >      return slpte;
> >  }
> >  
> > -/* Given a gpa and the level of paging structure, return the offset of current
> > - * level.
> > +/* Given a iova and the level of paging structure, return the offset
> maybe "an iova" instead of "a iova"

Will fix. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 04/13] intel_iommu: fix trace for inv desc handling
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (2 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 05/13] intel_iommu: fix trace for addr translation Peter Xu
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

VT-d codes are still using static DEBUG_INTEL_IOMMU macro. That's not
good, and we should end the day when we need to recompile the code
before getting useful debugging information for vt-d. Time to switch to
the trace system.

This is the first patch to do it.

Generally, the rule of mine is:

- for the old GENERAL typed message, I use error_report() directly if
  apply. Those are something shouldn't happen, and we should print those
  errors in all cases, even without enabling debug and tracing.

- for the non-GENERAL typed messages, remove those VTD_PRINTF()s that
  looks hardly used, and convert the rest lines into trace_*().

- for useless DPRINTFs, I removed them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 98 ++++++++++++++++++++++++---------------------------
 hw/i386/trace-events  | 12 +++++++
 2 files changed, 58 insertions(+), 52 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3d98797..35fbfbe 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
 #include "sysemu/kvm.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm_i386.h"
+#include "trace.h"
 
 /*#define DEBUG_INTEL_IOMMU*/
 #ifdef DEBUG_INTEL_IOMMU
@@ -494,22 +495,19 @@ static void vtd_handle_inv_queue_error(IntelIOMMUState *s)
 /* Set the IWC field and try to generate an invalidation completion interrupt */
 static void vtd_generate_completion_event(IntelIOMMUState *s)
 {
-    VTD_DPRINTF(INV, "completes an invalidation wait command with "
-                "Interrupt Flag");
     if (vtd_get_long_raw(s, DMAR_ICS_REG) & VTD_ICS_IWC) {
-        VTD_DPRINTF(INV, "there is a previous interrupt condition to be "
-                    "serviced by software, "
-                    "new invalidation event is not generated");
+        trace_vtd_inv_desc_wait_irq("One pending, skip current");
         return;
     }
     vtd_set_clear_mask_long(s, DMAR_ICS_REG, 0, VTD_ICS_IWC);
     vtd_set_clear_mask_long(s, DMAR_IECTL_REG, 0, VTD_IECTL_IP);
     if (vtd_get_long_raw(s, DMAR_IECTL_REG) & VTD_IECTL_IM) {
-        VTD_DPRINTF(INV, "IM filed in IECTL_REG is set, new invalidation "
-                    "event is not generated");
+        trace_vtd_inv_desc_wait_irq("IM in IECTL_REG is set, "
+                                    "new event not generated");
         return;
     } else {
         /* Generate the interrupt event */
+        trace_vtd_inv_desc_wait_irq("Generating complete event");
         vtd_generate_interrupt(s, DMAR_IEADDR_REG, DMAR_IEDATA_REG);
         vtd_set_clear_mask_long(s, DMAR_IECTL_REG, VTD_IECTL_IP, 0);
     }
@@ -952,6 +950,7 @@ static void vtd_interrupt_remap_table_setup(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_inv_desc_cc_global();
     s->context_cache_gen++;
     if (s->context_cache_gen == VTD_CONTEXT_CACHE_GEN_MAX) {
         vtd_reset_context_cache(s);
@@ -991,9 +990,11 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
     uint16_t mask;
     VTDBus *vtd_bus;
     VTDAddressSpace *vtd_as;
-    uint16_t devfn;
+    uint8_t bus_n, devfn;
     uint16_t devfn_it;
 
+    trace_vtd_inv_desc_cc_devices(source_id, func_mask);
+
     switch (func_mask & 3) {
     case 0:
         mask = 0;   /* No bits in the SID field masked */
@@ -1009,16 +1010,16 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
         break;
     }
     mask = ~mask;
-    VTD_DPRINTF(INV, "device-selective invalidation source 0x%"PRIx16
-                    " mask %"PRIu16, source_id, mask);
-    vtd_bus = vtd_find_as_from_bus_num(s, VTD_SID_TO_BUS(source_id));
+
+    bus_n = VTD_SID_TO_BUS(source_id);
+    vtd_bus = vtd_find_as_from_bus_num(s, bus_n);
     if (vtd_bus) {
         devfn = VTD_SID_TO_DEVFN(source_id);
         for (devfn_it = 0; devfn_it < X86_IOMMU_PCI_DEVFN_MAX; ++devfn_it) {
             vtd_as = vtd_bus->dev_as[devfn_it];
             if (vtd_as && ((devfn_it & mask) == (devfn & mask))) {
-                VTD_DPRINTF(INV, "invalidate context-cahce of devfn 0x%"PRIx16,
-                            devfn_it);
+                trace_vtd_inv_desc_cc_device(bus_n, (devfn_it >> 3) & 0x1f,
+                                             devfn_it & 3);
                 vtd_as->context_cache_entry.context_cache_gen = 0;
             }
         }
@@ -1371,7 +1372,7 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 {
     if ((inv_desc->hi & VTD_INV_DESC_WAIT_RSVD_HI) ||
         (inv_desc->lo & VTD_INV_DESC_WAIT_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Invalidation "
+        error_report("Non-zero reserved field in Invalidation "
                     "Wait Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1385,21 +1386,20 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
         /* FIXME: need to be masked with HAW? */
         dma_addr_t status_addr = inv_desc->hi;
-        VTD_DPRINTF(INV, "status data 0x%x, status addr 0x%"PRIx64,
-                    status_data, status_addr);
+        trace_vtd_inv_desc_wait_sw(status_addr, status_data);
         status_data = cpu_to_le32(status_data);
         if (dma_memory_write(&address_space_memory, status_addr, &status_data,
                              sizeof(status_data))) {
-            VTD_DPRINTF(GENERAL, "error: fail to perform a coherent write");
+            error_report("Invalidate Desc Wait status write failed");
             return false;
         }
     } else if (inv_desc->lo & VTD_INV_DESC_WAIT_IF) {
         /* Interrupt flag */
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor interrupt completion");
         vtd_generate_completion_event(s);
     } else {
-        VTD_DPRINTF(GENERAL, "error: invalid Invalidation Wait Descriptor: "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, inv_desc->hi, inv_desc->lo);
+        error_report("invalid Invalidation Wait Descriptor: "
+                     "hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1408,30 +1408,32 @@ static bool vtd_process_wait_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 static bool vtd_process_context_cache_desc(IntelIOMMUState *s,
                                            VTDInvDesc *inv_desc)
 {
+    uint16_t sid, fmask;
+
     if ((inv_desc->lo & VTD_INV_DESC_CC_RSVD) || inv_desc->hi) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in Context-cache "
-                    "Invalidate Descriptor");
+        error_report("non-zero reserved field in Context-cache "
+                     "Invalidate Descriptor");
         return false;
     }
     switch (inv_desc->lo & VTD_INV_DESC_CC_G) {
     case VTD_INV_DESC_CC_DOMAIN:
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
+        trace_vtd_inv_desc_cc_domain(
+            (uint16_t)VTD_INV_DESC_CC_DID(inv_desc->lo));
         /* Fall through */
     case VTD_INV_DESC_CC_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
         vtd_context_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_CC_DEVICE:
-        vtd_context_device_invalidate(s, VTD_INV_DESC_CC_SID(inv_desc->lo),
-                                      VTD_INV_DESC_CC_FM(inv_desc->lo));
+        sid = VTD_INV_DESC_CC_SID(inv_desc->lo);
+        fmask = VTD_INV_DESC_CC_FM(inv_desc->lo);
+        vtd_context_device_invalidate(s, sid, fmask);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in Context-cache "
-                    "Invalidate Descriptor hi 0x%"PRIx64  " lo 0x%"PRIx64,
-                    inv_desc->hi, inv_desc->lo);
+        error_report("invalid granularity in Context-cache "
+                     "Invalidate Descriptor hi 0x%"PRIx64" lo 0x%"PRIx64,
+                     inv_desc->hi, inv_desc->lo);
         return false;
     }
     return true;
@@ -1445,7 +1447,7 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     if ((inv_desc->lo & VTD_INV_DESC_IOTLB_RSVD_LO) ||
         (inv_desc->hi & VTD_INV_DESC_IOTLB_RSVD_HI)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in IOTLB "
+        error_report("non-zero reserved field in IOTLB "
                     "Invalidate Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1453,14 +1455,13 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
 
     switch (inv_desc->lo & VTD_INV_DESC_IOTLB_G) {
     case VTD_INV_DESC_IOTLB_GLOBAL:
-        VTD_DPRINTF(INV, "global invalidation");
+        trace_vtd_inv_desc_iotlb_global();
         vtd_iotlb_global_invalidate(s);
         break;
 
     case VTD_INV_DESC_IOTLB_DOMAIN:
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
-        VTD_DPRINTF(INV, "domain-selective invalidation domain 0x%"PRIx16,
-                    domain_id);
+        trace_vtd_inv_desc_iotlb_domain(domain_id);
         vtd_iotlb_domain_invalidate(s, domain_id);
         break;
 
@@ -1468,18 +1469,17 @@ static bool vtd_process_iotlb_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
         domain_id = VTD_INV_DESC_IOTLB_DID(inv_desc->lo);
         addr = VTD_INV_DESC_IOTLB_ADDR(inv_desc->hi);
         am = VTD_INV_DESC_IOTLB_AM(inv_desc->hi);
-        VTD_DPRINTF(INV, "page-selective invalidation domain 0x%"PRIx16
-                    " addr 0x%"PRIx64 " mask %"PRIu8, domain_id, addr, am);
+        trace_vtd_inv_desc_iotlb_pages(domain_id, addr, am);
         if (am > VTD_MAMV) {
-            VTD_DPRINTF(GENERAL, "error: supported max address mask value is "
-                        "%"PRIu8, (uint8_t)VTD_MAMV);
+            error_report("supported max address mask value is %"PRIu8,
+                         (uint8_t)VTD_MAMV);
             return false;
         }
         vtd_iotlb_page_invalidate(s, domain_id, addr, am);
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: invalid granularity in IOTLB Invalidate "
+        error_report("invalid granularity in IOTLB Invalidate "
                     "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
                     inv_desc->hi, inv_desc->lo);
         return false;
@@ -1507,7 +1507,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     VTDInvDesc inv_desc;
     uint8_t desc_type;
 
-    VTD_DPRINTF(INV, "iq head %"PRIu16, s->iq_head);
     if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
         s->iq_last_desc_type = VTD_INV_DESC_NONE;
         return false;
@@ -1518,42 +1517,37 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
 
     switch (desc_type) {
     case VTD_INV_DESC_CC:
-        VTD_DPRINTF(INV, "Context-cache Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("context-cache", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_context_cache_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IOTLB:
-        VTD_DPRINTF(INV, "IOTLB Invalidate Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iotlb", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_iotlb_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_WAIT:
-        VTD_DPRINTF(INV, "Invalidation Wait Descriptor hi 0x%"PRIx64
-                    " lo 0x%"PRIx64, inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     case VTD_INV_DESC_IEC:
-        VTD_DPRINTF(INV, "Invalidation Interrupt Entry Cache "
-                    "Descriptor hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    inv_desc.hi, inv_desc.lo);
+        trace_vtd_inv_desc("iec", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_inv_iec_desc(s, &inv_desc)) {
             return false;
         }
         break;
 
     default:
-        VTD_DPRINTF(GENERAL, "error: unkonw Invalidation Descriptor type "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64 " type %"PRIu8,
-                    inv_desc.hi, inv_desc.lo, desc_type);
+        error_report("Unkonw Invalidation Descriptor type "
+                     "hi 0x%"PRIx64" lo 0x%"PRIx64" type %"PRIu8,
+                     inv_desc.hi, inv_desc.lo, desc_type);
         return false;
     }
     s->iq_head++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index d2b4973..c83ebae 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -10,6 +10,18 @@ xen_pv_mmio_write(uint64_t addr) "WARNING: write to Xen PV Device MMIO space (ad
 # hw/i386/x86-iommu.c
 x86_iommu_iec_notify(bool global, uint32_t index, uint32_t mask) "Notify IEC invalidation: global=%d index=%" PRIu32 " mask=%" PRIu32
 
+# hw/i386/intel_iommu.c
+vtd_inv_desc(const char *type, uint64_t hi, uint64_t lo) "invalidate desc type %s high 0x%"PRIx64" low 0x%"PRIx64
+vtd_inv_desc_cc_domain(uint16_t domain) "context invalidate domain 0x%"PRIx16
+vtd_inv_desc_cc_global(void) "context invalidate globally"
+vtd_inv_desc_cc_device(uint8_t bus, uint8_t dev, uint8_t fn) "context invalidate device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_inv_desc_cc_devices(uint16_t sid, uint16_t fmask) "context invalidate devices sid 0x%"PRIx16" fmask 0x%"PRIx16
+vtd_inv_desc_iotlb_global(void) "iotlb invalidate global"
+vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PRIx16
+vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
+vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
+vtd_inv_desc_wait_irq(const char *msg) "%s"
+
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
 amdvi_cache_update(uint16_t domid, uint8_t bus, uint8_t slot, uint8_t func, uint64_t gpa, uint64_t txaddr) " update iotlb domid 0x%"PRIx16" devid: %02x:%02x.%x gpa 0x%"PRIx64" hpa 0x%"PRIx64
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 05/13] intel_iommu: fix trace for addr translation
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (3 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 04/13] intel_iommu: fix trace for inv desc handling Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 06/13] intel_iommu: vtd_slpt_level_shift check level Peter Xu
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Another patch to convert the DPRINTF() stuffs. This patch focuses on the
address translation path and caching.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 87 ++++++++++++++++++++++++---------------------------
 hw/i386/trace-events  |  7 +++++
 2 files changed, 48 insertions(+), 46 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 35fbfbe..0f8387e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -280,11 +280,9 @@ static void vtd_update_iotlb(IntelIOMMUState *s, uint16_t source_id,
     uint64_t *key = g_malloc(sizeof(*key));
     uint64_t gfn = vtd_get_iotlb_gfn(addr, level);
 
-    VTD_DPRINTF(CACHE, "update iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr, slpte,
-                domain_id);
+    trace_vtd_iotlb_page_update(source_id, addr, slpte, domain_id);
     if (g_hash_table_size(s->iotlb) >= VTD_IOTLB_MAX_SIZE) {
-        VTD_DPRINTF(CACHE, "iotlb exceeds size limit, forced to reset");
+        trace_vtd_iotlb_reset("iotlb exceeds size limit");
         vtd_reset_iotlb(s);
     }
 
@@ -525,8 +523,8 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access root-entry at 0x%"PRIx64
-                    " + %"PRIu8, s->root, index);
+        error_report("Fail to access root-entry at 0x%"PRIx64
+                     " index %"PRIu8, s->root, index);
         re->val = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
@@ -545,13 +543,12 @@ static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
     dma_addr_t addr;
 
     if (!vtd_root_entry_present(root)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry is not present");
+        error_report("Root-entry is not present");
         return -VTD_FR_ROOT_ENTRY_P;
     }
     addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
     if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
-        VTD_DPRINTF(GENERAL, "error: fail to access context-entry at 0x%"PRIx64
-                    " + %"PRIu8,
+        error_report("Fail to access context-entry at 0x%"PRIx64" ind %"PRIu8,
                     (uint64_t)(root->val & VTD_ROOT_ENTRY_CTP), index);
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
@@ -665,7 +662,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
      * in CAP_REG and AW in context-entry.
      */
     if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
-        VTD_DPRINTF(GENERAL, "error: iova 0x%"PRIx64 " exceeds limits", iova);
+        error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
@@ -685,7 +682,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
         slpte = vtd_get_slpte(addr, offset);
 
         if (slpte == (uint64_t)-1) {
-            VTD_DPRINTF(GENERAL, "error: fail to access second-level paging "
+            error_report("Fail to access second-level paging "
                         "entry at level %"PRIu32 " for iova 0x%"PRIx64,
                         level, iova);
             if (level == vtd_get_level_from_context_entry(ce)) {
@@ -698,13 +695,13 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
         *reads = (*reads) && (slpte & VTD_SL_R);
         *writes = (*writes) && (slpte & VTD_SL_W);
         if (!(slpte & access_right_check) && !(flags & IOMMU_NO_FAIL)) {
-            VTD_DPRINTF(GENERAL, "error: lack of %s permission for "
-                        "iova 0x%"PRIx64 " slpte 0x%"PRIx64,
-                        (flags & IOMMU_WO ? "write" : "read"), iova, slpte);
+            error_report("Lack of %s permission for iova 0x%"PRIx64
+                         " slpte 0x%"PRIx64,
+                         (flags & IOMMU_WO ? "write" : "read"), iova, slpte);
             return (flags & IOMMU_WO) ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
         if (vtd_slpte_nonzero_rsvd(slpte, level)) {
-            VTD_DPRINTF(GENERAL, "error: non-zero reserved field in second "
+            error_report("Non-zero reserved field in second "
                         "level paging entry level %"PRIu32 " slpte 0x%"PRIx64,
                         level, slpte);
             return -VTD_FR_PAGING_ENTRY_RSVD;
@@ -733,12 +730,13 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_root_entry_present(&re)) {
-        VTD_DPRINTF(GENERAL, "error: root-entry #%"PRIu8 " is not present",
-                    bus_num);
+        /* Not error - it's okay we don't have root entry. */
+        trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     } else if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD)) {
-        VTD_DPRINTF(GENERAL, "error: non-zero reserved field in root-entry "
-                    "hi 0x%"PRIx64 " lo 0x%"PRIx64, re.rsvd, re.val);
+        error_report("Non-zero reserved field in root-entry bus_num %d "
+                     "hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     bus_num, re.rsvd, re.val);
         return -VTD_FR_ROOT_ENTRY_RSVD;
     }
 
@@ -748,27 +746,25 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     }
 
     if (!vtd_context_entry_present(ce)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: context-entry #%"PRIu8 "(bus #%"PRIu8 ") "
-                    "is not present", devfn, bus_num);
+        /* Not error - it's okay we don't have context entry. */
+        trace_vtd_ce_not_present(bus_num, devfn);
         return -VTD_FR_CONTEXT_ENTRY_P;
     } else if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
                (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO)) {
-        VTD_DPRINTF(GENERAL,
-                    "error: non-zero reserved field in context-entry "
+        error_report("Non-zero reserved field in context-entry"
                     "hi 0x%"PRIx64 " lo 0x%"PRIx64, ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_RSVD;
     }
     /* Check if the programming of context-entry is valid */
     if (!vtd_is_level_supported(s, vtd_get_level_from_context_entry(ce))) {
-        VTD_DPRINTF(GENERAL, "error: unsupported Address Width value in "
-                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    ce->hi, ce->lo);
+        error_report("Unsupported Address Width value in "
+                     "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     } else if (ce->lo & VTD_CONTEXT_ENTRY_TT) {
-        VTD_DPRINTF(GENERAL, "error: unsupported Translation Type in "
-                    "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
-                    ce->hi, ce->lo);
+        error_report("Unsupported Translation Type in "
+                     "context-entry hi 0x%"PRIx64 " lo 0x%"PRIx64,
+                     ce->hi, ce->lo);
         return -VTD_FR_CONTEXT_ENTRY_INV;
     }
     return 0;
@@ -849,9 +845,8 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     /* Try to fetch slpte form IOTLB */
     iotlb_entry = vtd_lookup_iotlb(s, source_id, addr);
     if (iotlb_entry) {
-        VTD_DPRINTF(CACHE, "hit iotlb sid 0x%"PRIx16 " iova 0x%"PRIx64
-                    " slpte 0x%"PRIx64 " did 0x%"PRIx16, source_id, addr,
-                    iotlb_entry->slpte, iotlb_entry->domain_id);
+        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->slpte,
+                                 iotlb_entry->domain_id);
         slpte = iotlb_entry->slpte;
         reads = iotlb_entry->read_flags;
         writes = iotlb_entry->write_flags;
@@ -860,10 +855,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     }
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
-        VTD_DPRINTF(CACHE, "hit context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 ")",
-                    bus_num, devfn, cc_entry->context_entry.hi,
-                    cc_entry->context_entry.lo, cc_entry->context_cache_gen);
+        trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
+                               cc_entry->context_entry.lo,
+                               cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
     } else {
@@ -873,9 +867,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
             ret_fr = -ret_fr;
             if (!(flags & IOMMU_NO_FAIL)) {
                 if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                    VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
-                            "requests through this context-entry "
-                            "(with FPD Set)");
+                    error_report("Fault processing is disabled for DMA "
+                                 "requests through this context-entry "
+                                 "(with FPD Set)");
                 } else {
                     vtd_report_dmar_fault(s, source_id, addr, ret_fr,
                             flags & IOMMU_WO);
@@ -884,10 +878,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
             return;
         }
         /* Update context-cache */
-        VTD_DPRINTF(CACHE, "update context-cache bus %d devfn %d "
-                    "(hi %"PRIx64 " lo %"PRIx64 " gen %"PRIu32 "->%"PRIu32 ")",
-                    bus_num, devfn, ce.hi, ce.lo,
-                    cc_entry->context_cache_gen, s->context_cache_gen);
+        trace_vtd_iotlb_cc_update(bus_num, devfn, ce.hi, ce.lo,
+                                  cc_entry->context_cache_gen,
+                                  s->context_cache_gen);
         cc_entry->context_entry = ce;
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
@@ -898,8 +891,9 @@ static void vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         ret_fr = -ret_fr;
         if (!(flags & IOMMU_NO_FAIL)) {
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
-                VTD_DPRINTF(FLOG, "fault processing is disabled for DMA "
-                        "requests through this context-entry (with FPD Set)");
+                error_report("Fault processing is disabled for DMA "
+                             "requests through this context-entry "
+                             "(with FPD Set)");
             } else {
                 vtd_report_dmar_fault(s, source_id, addr, ret_fr, flags);
             }
@@ -1060,6 +1054,7 @@ static uint64_t vtd_context_cache_invalidate(IntelIOMMUState *s, uint64_t val)
 
 static void vtd_iotlb_global_invalidate(IntelIOMMUState *s)
 {
+    trace_vtd_iotlb_reset("global invalidation recved");
     vtd_reset_iotlb(s);
 }
 
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c83ebae..e64a4e2 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -21,6 +21,13 @@ vtd_inv_desc_iotlb_domain(uint16_t domain) "iotlb invalidate whole domain 0x%"PR
 vtd_inv_desc_iotlb_pages(uint16_t domain, uint64_t addr, uint8_t mask) "iotlb invalidate domain 0x%"PRIx16" addr 0x%"PRIx64" mask 0x%"PRIx8
 vtd_inv_desc_wait_sw(uint64_t addr, uint32_t data) "wait invalidate status write addr 0x%"PRIx64" data 0x%"PRIx32
 vtd_inv_desc_wait_irq(const char *msg) "%s"
+vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
+vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
+vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page update sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
+vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
+vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
+vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 06/13] intel_iommu: vtd_slpt_level_shift check level
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (4 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 05/13] intel_iommu: fix trace for addr translation Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 07/13] memory: add section range info for IOMMU notifier Peter Xu
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

This helps in debugging incorrect level passed in.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0f8387e..46b8a2f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -188,6 +188,7 @@ static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
 /* The shift of an addr for a certain level of paging structure */
 static inline uint32_t vtd_slpt_level_shift(uint32_t level)
 {
+    assert(level != 0);
     return VTD_PAGE_SHIFT_4K + (level - 1) * VTD_SL_LEVEL_BITS;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 07/13] memory: add section range info for IOMMU notifier
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (5 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 06/13] intel_iommu: vtd_slpt_level_shift check level Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 08/13] memory: provide iommu_replay_all() Peter Xu
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

In this patch, IOMMUNotifier.{start|end} are introduced to store section
information for a specific notifier. When notification occurs, we not
only check the notification type (MAP|UNMAP), but also check whether the
notified iova is in the range of specific IOMMU notifier, and skip those
notifiers if not in the listened range.

When removing an region, we need to make sure we removed the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- replace offset_within_address_space with offset_within_region since
  IOTLB iova is relative to region [David]
---
 hw/vfio/common.c      | 7 ++++++-
 include/exec/memory.h | 3 +++
 memory.c              | 4 +++-
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..6f648da 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -455,6 +455,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
         giommu->container = container;
         giommu->n.notify = vfio_iommu_map_notify;
         giommu->n.notifier_flags = IOMMU_NOTIFIER_ALL;
+        giommu->n.start = section->offset_within_region;
+        llend = int128_add(int128_make64(giommu->n.start), section->size);
+        llend = int128_sub(llend, int128_one());
+        giommu->n.end = int128_get64(llend);
         QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
 
         memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
@@ -525,7 +529,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
-            if (giommu->iommu == section->mr) {
+            if (giommu->iommu == section->mr &&
+                giommu->n.start == section->offset_within_region) {
                 memory_region_unregister_iommu_notifier(giommu->iommu,
                                                         &giommu->n);
                 QLIST_REMOVE(giommu, giommu_next);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2d7ee54..cb2d432 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -85,6 +85,9 @@ typedef enum {
 struct IOMMUNotifier {
     void (*notify)(struct IOMMUNotifier *notifier, IOMMUTLBEntry *data);
     IOMMUNotifierFlag notifier_flags;
+    /* Notify for address space range start <= addr <= end */
+    hwaddr start;
+    hwaddr end;
     QLIST_ENTRY(IOMMUNotifier) node;
 };
 typedef struct IOMMUNotifier IOMMUNotifier;
diff --git a/memory.c b/memory.c
index 9b88638..f73c897 100644
--- a/memory.c
+++ b/memory.c
@@ -1663,7 +1663,9 @@ void memory_region_notify_iommu(MemoryRegion *mr,
     }
 
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags) {
+        if (iommu_notifier->notifier_flags & request_flags &&
+            iommu_notifier->start <= entry.iova &&
+            iommu_notifier->end >= entry.iova) {
             iommu_notifier->notify(iommu_notifier, &entry);
         }
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 08/13] memory: provide iommu_replay_all()
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (6 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 07/13] memory: add section range info for IOMMU notifier Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 09/13] memory: introduce memory_region_notify_one() Peter Xu
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

This is an "global" version of exising memory_region_iommu_replay() - we
announce the translations to all the registered notifiers, instead of a
specific one.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 8 ++++++++
 memory.c              | 9 +++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index cb2d432..1669c7b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -693,6 +693,14 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
                                 bool is_write);
 
 /**
+ * memory_region_iommu_replay_all: replay existing IOMMU translations
+ * to all the notifiers registered.
+ *
+ * @mr: the memory region to observe
+ */
+void memory_region_iommu_replay_all(MemoryRegion *mr);
+
+/**
  * memory_region_unregister_iommu_notifier: unregister a notifier for
  * changes to IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index f73c897..62ca6e0 100644
--- a/memory.c
+++ b/memory.c
@@ -1641,6 +1641,15 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     }
 }
 
+void memory_region_iommu_replay_all(MemoryRegion *mr)
+{
+    IOMMUNotifier *notifier;
+
+    QLIST_FOREACH(notifier, &mr->iommu_notify, node) {
+        memory_region_iommu_replay(mr, notifier, false);
+    }
+}
+
 void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
                                              IOMMUNotifier *n)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 09/13] memory: introduce memory_region_notify_one()
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (7 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 08/13] memory: provide iommu_replay_all() Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 10/13] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Generalizing the notify logic in memory_region_notify_iommu() into a
single function. This can be further used in customized replay()
functions for IOMMUs.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 15 +++++++++++++++
 memory.c              | 29 ++++++++++++++++++-----------
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1669c7b..9902e9e 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -668,6 +668,21 @@ void memory_region_notify_iommu(MemoryRegion *mr,
                                 IOMMUTLBEntry entry);
 
 /**
+ * memory_region_notify_one: notify a change in an IOMMU translation
+ *                           entry to a single notifier
+ *
+ * This works just like memory_region_notify_iommu(), but it only
+ * notifies a specific notifier, not all of them.
+ *
+ * @notifier: the notifier to be notified
+ * @entry: the new entry in the IOMMU translation table.  The entry
+ *         replaces all old entries for the same virtual I/O address range.
+ *         Deleted entries have .@perm == 0.
+ */
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry);
+
+/**
  * memory_region_register_iommu_notifier: register a notifier for changes to
  * IOMMU translation entries.
  *
diff --git a/memory.c b/memory.c
index 62ca6e0..84c91fa 100644
--- a/memory.c
+++ b/memory.c
@@ -1657,26 +1657,33 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
     memory_region_update_iommu_notify_flags(mr);
 }
 
-void memory_region_notify_iommu(MemoryRegion *mr,
-                                IOMMUTLBEntry entry)
+void memory_region_notify_one(IOMMUNotifier *notifier,
+                              IOMMUTLBEntry *entry)
 {
-    IOMMUNotifier *iommu_notifier;
     IOMMUNotifierFlag request_flags;
 
-    assert(memory_region_is_iommu(mr));
-
-    if (entry.perm & IOMMU_RW) {
+    if (entry->perm & IOMMU_RW) {
         request_flags = IOMMU_NOTIFIER_MAP;
     } else {
         request_flags = IOMMU_NOTIFIER_UNMAP;
     }
 
+    if (notifier->notifier_flags & request_flags &&
+        notifier->start <= entry->iova &&
+        notifier->end >= entry->iova) {
+        notifier->notify(notifier, entry);
+    }
+}
+
+void memory_region_notify_iommu(MemoryRegion *mr,
+                                IOMMUTLBEntry entry)
+{
+    IOMMUNotifier *iommu_notifier;
+
+    assert(memory_region_is_iommu(mr));
+
     QLIST_FOREACH(iommu_notifier, &mr->iommu_notify, node) {
-        if (iommu_notifier->notifier_flags & request_flags &&
-            iommu_notifier->start <= entry.iova &&
-            iommu_notifier->end >= entry.iova) {
-            iommu_notifier->notify(iommu_notifier, &entry);
-        }
+        memory_region_notify_one(iommu_notifier, &entry);
     }
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 10/13] memory: add MemoryRegionIOMMUOps.replay() callback
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (8 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 09/13] memory: introduce memory_region_notify_one() Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Peter Xu
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Originally we have one memory_region_iommu_replay() function, which is
the default behavior to replay the translations of the whole IOMMU
region. However, on some platform like x86, we may want our own replay
logic for IOMMU regions. This patch add one more hook for IOMMUOps for
the callback, and it'll override the default if set.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 9902e9e..6bdd12c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -183,6 +183,8 @@ struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index 84c91fa..9ad6319 100644
--- a/memory.c
+++ b/memory.c
@@ -1624,6 +1624,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (9 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 10/13] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-08  3:01   ` Lan Tianyu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 12/13] intel_iommu: do replay when context invalidate Peter Xu
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

The default replay() don't work for VT-d since vt-d will have a huge
default memory region which covers address range 0-(2^64-1). This will
normally bring a dead loop when guest starts.

The solution is simple - we don't walk over all the regions. Instead, we
jump over the regions when we found that the page directories are empty.
It'll greatly reduce the time to walk the whole region.

To achieve this, we provided a page walk helper to do that, invoking
corresponding hook function when we found an page we are interested in.
vtd_page_walk_level() is the core logic for the page walking. It's
interface is designed to suite further use case, e.g., to invalidate a
range of addresses.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/i386/trace-events  |   8 ++
 include/exec/memory.h |   2 +
 3 files changed, 217 insertions(+), 5 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 46b8a2f..2fcd7af 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -620,6 +620,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
+{
+    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
+    return 1ULL << MIN(ce_agaw, VTD_MGAW);
+}
+
+/* Return true if IOVA passes range check, otherwise false. */
+static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
+{
+    /*
+     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
+     * in CAP_REG and AW in context-entry.
+     */
+    return !(iova & ~(vtd_iova_limit(ce) - 1));
+}
+
 static const uint64_t vtd_paging_entry_rsvd_field[] = {
     [0] = ~0ULL,
     /* For not large page */
@@ -656,13 +672,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
     uint32_t level = vtd_get_level_from_context_entry(ce);
     uint32_t offset;
     uint64_t slpte;
-    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
     uint64_t access_right_check = 0;
 
-    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
-     * in CAP_REG and AW in context-entry.
-     */
-    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
+    if (!vtd_iova_range_check(iova, ce)) {
         error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
@@ -718,6 +730,166 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova,
     }
 }
 
+typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
+
+/**
+ * vtd_page_walk_level - walk over specific level for IOVA range
+ *
+ * @addr: base GPA addr to start the walk
+ * @start: IOVA range start address
+ * @end: IOVA range end address (start <= addr < end)
+ * @hook_fn: hook func to be called when detected page
+ * @private: private data to be passed into hook func
+ * @read: whether parent level has read permission
+ * @write: whether parent level has write permission
+ * @skipped: accumulated skipped ranges
+ * @notify_unmap: whether we should notify invalid entries
+ */
+static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
+                               uint64_t end, vtd_page_walk_hook hook_fn,
+                               void *private, uint32_t level,
+                               bool read, bool write, uint64_t *skipped,
+                               bool notify_unmap)
+{
+    bool read_cur, write_cur, entry_valid;
+    uint32_t offset;
+    uint64_t slpte;
+    uint64_t subpage_size, subpage_mask;
+    IOMMUTLBEntry entry;
+    uint64_t iova = start;
+    uint64_t iova_next;
+    uint64_t skipped_local = 0;
+    int ret = 0;
+
+    trace_vtd_page_walk_level(addr, level, start, end);
+
+    subpage_size = 1ULL << vtd_slpt_level_shift(level);
+    subpage_mask = vtd_slpt_level_page_mask(level);
+
+    while (iova < end) {
+        iova_next = (iova & subpage_mask) + subpage_size;
+
+        offset = vtd_iova_level_offset(iova, level);
+        slpte = vtd_get_slpte(addr, offset);
+
+        /*
+         * When one of the following case happens, we assume the whole
+         * range is invalid:
+         *
+         * 1. read block failed
+         * 3. reserved area non-zero
+         * 2. both read & write flag are not set
+         */
+
+        if (slpte == (uint64_t)-1) {
+            trace_vtd_page_walk_skip_read(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+            trace_vtd_page_walk_skip_reserve(iova, iova_next);
+            skipped_local++;
+            goto next;
+        }
+
+        /* Permissions are stacked with parents' */
+        read_cur = read && (slpte & VTD_SL_R);
+        write_cur = write && (slpte & VTD_SL_W);
+
+        /*
+         * As long as we have either read/write permission, this is
+         * a valid entry. The rule works for both page or page tables.
+         */
+        entry_valid = read_cur | write_cur;
+
+        if (vtd_is_last_slpte(slpte, level)) {
+            entry.target_as = &address_space_memory;
+            entry.iova = iova & subpage_mask;
+            /*
+             * This might be meaningless addr if (!read_cur &&
+             * !write_cur), but after all this field will be
+             * meaningless in that case, so let's share the code to
+             * generate the IOTLBs no matter it's an MAP or UNMAP
+             */
+            entry.translated_addr = vtd_get_slpte_addr(slpte);
+            entry.addr_mask = ~subpage_mask;
+            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+            if (!entry_valid && !notify_unmap) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            trace_vtd_page_walk_one(level, entry.iova, addr,
+                                    entry.addr_mask, entry.perm);
+            if (hook_fn) {
+                ret = hook_fn(&entry, private);
+                if (ret < 0) {
+                    error_report("Detected error in page walk hook "
+                                 "function, stop walk.");
+                    return ret;
+                }
+            }
+        } else {
+            if (!entry_valid) {
+                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                skipped_local++;
+                goto next;
+            }
+            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
+                                      MIN(iova_next, end), hook_fn, private,
+                                      level - 1, read_cur, write_cur,
+                                      &skipped_local, notify_unmap);
+            if (ret < 0) {
+                error_report("Detected page walk error on addr 0x%"PRIx64
+                             " level %"PRIu32", stop walk.", addr, level - 1);
+                return ret;
+            }
+        }
+
+next:
+        iova = iova_next;
+    }
+
+    if (skipped) {
+        *skipped += skipped_local;
+    }
+
+    return 0;
+}
+
+/**
+ * vtd_page_walk - walk specific IOVA range, and call the hook
+ *
+ * @ce: context entry to walk upon
+ * @start: IOVA address to start the walk
+ * @end: size of the IOVA range to walk over
+ * @hook_fn: the hook that to be called for each detected area
+ * @private: private data for the hook function
+ */
+static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+                         vtd_page_walk_hook hook_fn, void *private)
+{
+    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
+    uint32_t level = vtd_get_level_from_context_entry(ce);
+
+    if (!vtd_iova_range_check(start, ce)) {
+        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
+                     start, end);
+        return -VTD_FR_ADDR_BEYOND_MGAW;
+    }
+
+    if (!vtd_iova_range_check(end, ce)) {
+        /* Fix end so that it reaches the maximum */
+        end = vtd_iova_limit(ce);
+    }
+
+    trace_vtd_page_walk(ce->hi, ce->lo, start, end);
+
+    return vtd_page_walk_level(addr, start, end, hook_fn, private,
+                               level, true, true, NULL, false);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -2430,6 +2602,35 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
     return vtd_dev_as;
 }
 
+static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
+{
+    memory_region_notify_one((IOMMUNotifier *)private, entry);
+    return 0;
+}
+
+static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
+{
+    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_n = pci_bus_num(vtd_as->bus);
+    VTDContextEntry ce;
+
+    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+        /*
+         * Scanned a valid context entry, walk over the pages and
+         * notify when needed.
+         */
+        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                  PCI_FUNC(vtd_as->devfn), ce.hi, ce.lo);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+    } else {
+        trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
+                                    PCI_FUNC(vtd_as->devfn));
+    }
+
+    return;
+}
+
 /* Do the initialization. It will also be called when reset, so pay
  * attention when adding new initialization stuff.
  */
@@ -2444,6 +2645,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->iommu_ops.translate = vtd_iommu_translate;
     s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
+    s->iommu_ops.replay = vtd_iommu_replay;
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index e64a4e2..922d543 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,14 @@ vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t doma
 vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
+vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
+vtd_page_walk(uint64_t hi, uint64_t lo, uint64_t start, uint64_t end) "Page walk for ce (0x%"PRIx64", 0x%"PRIx64") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "Page walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
+vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "Page walk detected map level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
+vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
+vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
+vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
 
 # hw/i386/amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6bdd12c..e66a0d0 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -60,6 +60,8 @@ typedef enum {
     IOMMU_NO_FAIL  = 4, /* may not be present, don't repport error to guest */
 } IOMMUAccessFlags;
 
+#define IOMMU_ACCESS_FLAG(r, w) ((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0)
+
 struct IOMMUTLBEntry {
     AddressSpace    *target_as;
     hwaddr           iova;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Peter Xu
@ 2016-12-08  3:01   ` Lan Tianyu
  2016-12-09  1:56     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Lan Tianyu @ 2016-12-08  3:01 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson, jasowang

On 2016年12月06日 18:36, Peter Xu wrote:
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: size of the IOVA range to walk over


*·@end:·IOVA·range·end·address·(start·<=·addr·<·end)
This comment for vtd_page_walk_level() maybe more accurate here?


-- 
Best regards
Tianyu Lan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-08  3:01   ` Lan Tianyu
@ 2016-12-09  1:56     ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-09  1:56 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: qemu-devel, kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson,
	jasowang

On Thu, Dec 08, 2016 at 11:01:53AM +0800, Lan Tianyu wrote:
> On 2016年12月06日 18:36, Peter Xu wrote:
> > +/**
> > + * vtd_page_walk - walk specific IOVA range, and call the hook
> > + *
> > + * @ce: context entry to walk upon
> > + * @start: IOVA address to start the walk
> > + * @end: size of the IOVA range to walk over
> 
> 
> *·@end:·IOVA·range·end·address·(start·<=·addr·<·end)
> This comment for vtd_page_walk_level() maybe more accurate here?

You are right, thanks. :)

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 12/13] intel_iommu: do replay when context invalidate
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (10 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 13/13] intel_iommu: use page_walk for iotlb inv notify Peter Xu
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Before this one we only invalidate context cache when we receive context
entry invalidations. However it's possible that the invalidation also
contains a domain switch (only if cache-mode is enabled for vIOMMU). In
that case we need to notify all the registered components about the new
mapping.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2fcd7af..0220e63 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1188,6 +1188,7 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
                 trace_vtd_inv_desc_cc_device(bus_n, (devfn_it >> 3) & 0x1f,
                                              devfn_it & 3);
                 vtd_as->context_cache_entry.context_cache_gen = 0;
+                memory_region_iommu_replay_all(&vtd_as->iommu);
             }
         }
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [Qemu-devel] [RFC PATCH 13/13] intel_iommu: use page_walk for iotlb inv notify
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (11 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 12/13] intel_iommu: do replay when context invalidate Peter Xu
@ 2016-12-06 10:36 ` Peter Xu
  2016-12-06 10:49 ` [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:36 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, peterx,
	alex.williamson, jasowang

Instead of translate() every page for iotlb invalidations (which is
slower), we walk the pages when needed and notify in a hook function.
This will also simplify the code a bit.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 64 +++++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 45 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0220e63..226dbcd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -149,23 +149,6 @@ static uint64_t vtd_set_clear_mask_quad(IntelIOMMUState *s, hwaddr addr,
     return new_val;
 }
 
-static int vtd_get_did_dev(IntelIOMMUState *s, uint8_t bus_num, uint8_t devfn,
-                           uint16_t *domain_id)
-{
-    VTDContextEntry ce;
-    int ret_fr;
-
-    assert(domain_id);
-
-    ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
-    if (ret_fr) {
-        return -1;
-    }
-
-    *domain_id =  VTD_CONTEXT_ENTRY_DID(ce.hi);
-    return 0;
-}
-
 /* GHashTable functions */
 static gboolean vtd_uint64_equal(gconstpointer v1, gconstpointer v2)
 {
@@ -868,7 +851,8 @@ next:
  * @private: private data for the hook function
  */
 static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
-                         vtd_page_walk_hook hook_fn, void *private)
+                         vtd_page_walk_hook hook_fn, void *private,
+                         bool notify_unmap)
 {
     dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
     uint32_t level = vtd_get_level_from_context_entry(ce);
@@ -887,7 +871,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
     trace_vtd_page_walk(ce->hi, ce->lo, start, end);
 
     return vtd_page_walk_level(addr, start, end, hook_fn, private,
-                               level, true, true, NULL, false);
+                               level, true, true, NULL, notify_unmap);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1238,39 +1222,29 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
                                 &domain_id);
 }
 
+static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
+                                           void *private)
+{
+    memory_region_notify_iommu((MemoryRegion *)private, *entry);
+    return 0;
+}
+
 static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
                                            uint16_t domain_id, hwaddr addr,
                                            uint8_t am)
 {
     IntelIOMMUNotifierNode *node;
+    VTDContextEntry ce;
+    int ret;
 
     QLIST_FOREACH(node, &(s->notifiers_list), next) {
         VTDAddressSpace *vtd_as = node->vtd_as;
-        uint16_t vfio_domain_id;
-        int ret = vtd_get_did_dev(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
-                                  &vfio_domain_id);
-        if (!ret && domain_id == vfio_domain_id) {
-            hwaddr original_addr = addr;
-
-            while (addr < original_addr + (1 << am) * VTD_PAGE_SIZE) {
-                IOMMUTLBEntry entry = s->iommu_ops.translate(
-                                                         &node->vtd_as->iommu,
-                                                         addr,
-                                                         IOMMU_NO_FAIL);
-
-                if (entry.perm == IOMMU_NONE &&
-                        node->notifier_flag & IOMMU_NOTIFIER_UNMAP) {
-                    entry.target_as = &address_space_memory;
-                    entry.iova = addr & VTD_PAGE_MASK_4K;
-                    entry.translated_addr = 0;
-                    entry.addr_mask = ~VTD_PAGE_MASK(VTD_PAGE_SHIFT);
-                    memory_region_notify_iommu(&node->vtd_as->iommu, entry);
-                    addr += VTD_PAGE_SIZE;
-                } else if (node->notifier_flag & IOMMU_NOTIFIER_MAP) {
-                        memory_region_notify_iommu(&node->vtd_as->iommu, entry);
-                        addr += entry.addr_mask + 1;
-                }
-            }
+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
+                                       vtd_as->devfn, &ce);
+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
+                          vtd_page_invalidate_notify_hook,
+                          (void *)&vtd_as->iommu, true);
         }
     }
 }
@@ -2623,7 +2597,7 @@ static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
          */
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn), ce.hi, ce.lo);
-        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
+        vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n, false);
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
                                     PCI_FUNC(vtd_as->devfn));
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (12 preceding siblings ...)
  2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 13/13] intel_iommu: use page_walk for iotlb inv notify Peter Xu
@ 2016-12-06 10:49 ` Peter Xu
  2016-12-13  7:45 ` Peter Xu
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-06 10:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson,
	jasowang, Paolo Bonzini, David Gibson

On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> This RFC series is a continue work for Aviv B.D.'s vfio enablement
> series with vt-d. Aviv has done a great job there, and what we still
> lack there are mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
>     memory region.
> 
> (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
>     when IOMMU domain switches, things will broke).
> 
> Here I'm trying to solve the above two issues.
> 
> (1) is solved by patch 7, (2) is solved by patch 11-12.

One thing to mention: this series is based on Aviv's series v6:

  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01452.html

I pushed an online branch for better reference:

  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement

Forgot to CC Paolo & DavidG. Adding in.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (13 preceding siblings ...)
  2016-12-06 10:49 ` [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
@ 2016-12-13  7:45 ` Peter Xu
  2016-12-18  8:42 ` Liu, Yi L
  2016-12-19  9:33 ` Peter Xu
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-13  7:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson,
	jasowang

On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> This RFC series is a continue work for Aviv B.D.'s vfio enablement
> series with vt-d. Aviv has done a great job there, and what we still
> lack there are mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
>     memory region.
> 
> (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
>     when IOMMU domain switches, things will broke).
> 
> Here I'm trying to solve the above two issues.

Ping.

Alex/Michael/Aviv, do any of you have time to help review this series?

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (14 preceding siblings ...)
  2016-12-13  7:45 ` Peter Xu
@ 2016-12-18  8:42 ` Liu, Yi L
  2016-12-19  9:22   ` Peter Xu
  2016-12-19  9:33 ` Peter Xu
  16 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-18  8:42 UTC (permalink / raw)
  To: Peter Xu; +Cc: kevin.tian, tianyu.lan, jasowang, qemu-devel, yi.l.liu

On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> This RFC series is a continue work for Aviv B.D.'s vfio enablement
> series with vt-d. Aviv has done a great job there, and what we still
> lack there are mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
>     memory region.
> 
> (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
>     when IOMMU domain switches, things will broke).
> 
> Here I'm trying to solve the above two issues.
> 
> (1) is solved by patch 7, (2) is solved by patch 11-12.
> 
> Basically it contains the following:
> 
> patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix
> 
> patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
>             specific commit message for details, there are patches
>             that I thought may be suitable for 2.8 as well, but looks
>             like it's too late)
> 
> patch 7:    Solve the issue that vfio is notified more than once for
>             IOTLB notifications with Aviv's patches
> 
> patch 8-10: Some trivial memory APIs added for further patches, and
>             add customize replay() support for MemoryRegion (I see
>             Aviv's latest v7 contains similar replay, I can rebase
>             onto that, merely the same thing)
> 
> patch 11:   Provide a valid vt-d replay() callback, using page walk
> 
Peter,
Does your patch set based on Aviv's patch？ I found the page cannot be
applied in my side.

BTW. it may be better if you can split the patches for mis cleanup
and the patches for replay/"fix duplicate notify".

Thanks,
Yi L
> patch 12:   Enable the domain switch support - we replay() when
>             context entry got invalidated
> 
> patch 13:   Enhancement for existing invalidation notification,
>             instead of using translate() for each page, we leverage
>             the new vtd_page_walk() interface, which should be faster.
> 
> I would glad to hear about any review comments for above patches
> (especially patch 8-13, which is the main part of this series),
> especially any issue I missed in the series.
> 
> =========
> Test Done
> =========
> 
> Build test passed for x86_64/arm/ppc64.
> 
> Simply tested with x86_64, assigning two PCI devices to a single VM,
> boot the VM using:
> 
> bin=x86_64-softmmu/qemu-system-x86_64
> $bin -M q35,accel=kvm,kernel-irqchip=split -m 1G \
>      -device intel-iommu,intremap=on,eim=off,cache-mode=on \
>      -netdev user,id=net0,hostfwd=tcp::5555-:22 \
>      -device virtio-net-pci,netdev=net0 \
>      -device vfio-pci,host=03:00.0 \
>      -device vfio-pci,host=02:00.0 \
>      -trace events=".trace.vfio" \
>      /var/lib/libvirt/images/vm1.qcow2
> 
> pxdev:bin [vtd-vfio-enablement]# cat .trace.vfio
> vtd_page_walk*
> vtd_replay*
> vtd_inv_desc*
> 
> Then, in the guest, run the following tool:
> 
>   https://github.com/xzpeter/clibs/blob/master/gpl/userspace/vfio-bind-group/vfio-bind-group.c
> 
> With parameter:
> 
>   ./vfio-bind-group 00:03.0 00:04.0
> 
> Check host side trace log, I can see pages are replayed and mapped in
> 00:04.0 device address space, like:
> 
> ...
> vtd_replay_ce_valid replay valid context device 00:04.00 hi 0x301 lo 0x3be77001
> vtd_page_walk Page walk for ce (0x301, 0x3be77001) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x3be77000, level=3) iova range 0x0 - 0x8000000000
> vtd_page_walk_level Page walk (base=0x3c88a000, level=2) iova range 0x0 - 0x40000000
> vtd_page_walk_level Page walk (base=0x366cb000, level=1) iova range 0x0 - 0x200000
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x0 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x1000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x2000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x3000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x4000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x5000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x6000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x7000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x8000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0x9000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xa000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xb000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xc000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xd000 -> gpa 0x366cb000 mask 0xfff perm 3
> vtd_page_walk_one Page walk detected map level 0x1 iova 0xe000 -> gpa 0x366cb000 mask 0xfff perm 3
> ...
> 
> =========
> Todo List
> =========
> 
> - error reporting for the assigned devices (as Tianyu has mentioned)
> 
> - per-domain address-space: A better solution in the future may be -
>   we maintain one address space per IOMMU domain in the guest (so
>   multiple devices can share a same address space if they are sharing
>   the same IOMMU domains in the guest), rather than one address space
>   per device (which is current implementation of vt-d). However that's
>   a step further than this series, and let's see whether we can first
>   provide a workable version of device assignment with vt-d
>   protection.
> 
> - more to come...
> 
> Thanks,
> 
> Jason Wang (1):
>   intel_iommu: allocate new key when creating new address space
> 
> Peter Xu (12):
>   intel_iommu: simplify irq region translation
>   intel_iommu: renaming gpa to iova where proper
>   intel_iommu: fix trace for inv desc handling
>   intel_iommu: fix trace for addr translation
>   intel_iommu: vtd_slpt_level_shift check level
>   memory: add section range info for IOMMU notifier
>   memory: provide iommu_replay_all()
>   memory: introduce memory_region_notify_one()
>   memory: add MemoryRegionIOMMUOps.replay() callback
>   intel_iommu: provide its own replay() callback
>   intel_iommu: do replay when context invalidate
>   intel_iommu: use page_walk for iotlb inv notify
> 
>  hw/i386/intel_iommu.c | 521 ++++++++++++++++++++++++++++++++------------------
>  hw/i386/trace-events  |  27 +++
>  hw/vfio/common.c      |   7 +-
>  include/exec/memory.h |  30 +++
>  memory.c              |  42 +++-
>  5 files changed, 432 insertions(+), 195 deletions(-)
> 
> -- 
> 2.7.4
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-18  8:42 ` Liu, Yi L
@ 2016-12-19  9:22   ` Peter Xu
  2016-12-19 11:25     ` Liu, Yi L
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-19  9:22 UTC (permalink / raw)
  To: Liu, Yi L; +Cc: kevin.tian, tianyu.lan, jasowang, qemu-devel, yi.l.liu

On Sun, Dec 18, 2016 at 04:42:50PM +0800, Liu, Yi L wrote:
> On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> > This RFC series is a continue work for Aviv B.D.'s vfio enablement
> > series with vt-d. Aviv has done a great job there, and what we still
> > lack there are mostly the following:
> > 
> > (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
> >     memory region.
> > 
> > (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
> >     when IOMMU domain switches, things will broke).
> > 
> > Here I'm trying to solve the above two issues.
> > 
> > (1) is solved by patch 7, (2) is solved by patch 11-12.
> > 
> > Basically it contains the following:
> > 
> > patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix
> > 
> > patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
> >             specific commit message for details, there are patches
> >             that I thought may be suitable for 2.8 as well, but looks
> >             like it's too late)
> > 
> > patch 7:    Solve the issue that vfio is notified more than once for
> >             IOTLB notifications with Aviv's patches
> > 
> > patch 8-10: Some trivial memory APIs added for further patches, and
> >             add customize replay() support for MemoryRegion (I see
> >             Aviv's latest v7 contains similar replay, I can rebase
> >             onto that, merely the same thing)
> > 
> > patch 11:   Provide a valid vt-d replay() callback, using page walk
> > 
> Peter,
> Does your patch set based on Aviv's patch？ I found the page cannot be
> applied in my side.

Hi, Yi,

This series is based on Aviv's v6 series. If you wanna try it, you may
want to fetch the tree from:

 https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement

So you won't need to bother on the applying.

> 
> BTW. it may be better if you can split the patches for mis cleanup
> and the patches for replay/"fix duplicate notify".

Yes. Here I just want to make sure things are stick together (e.g., to
test the replay, I will need to use the traces). And I feel it awkward
to maintain several series upstream while they interact with each
other. Sorry for the troublesome.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-19  9:22   ` Peter Xu
@ 2016-12-19 11:25     ` Liu, Yi L
  0 siblings, 0 replies; 33+ messages in thread
From: Liu, Yi L @ 2016-12-19 11:25 UTC (permalink / raw)
  To: Peter Xu, Liu, Yi L
  Cc: Tian, Kevin, Lan, Tianyu, jasowang@redhat.com,
	qemu-devel@nongnu.org

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Monday, December 19, 2016 5:23 PM
> To: Liu, Yi L <yi.l.liu@linux.intel.com>
> Cc: Tian, Kevin <kevin.tian@intel.com>; Lan, Tianyu <tianyu.lan@intel.com>;
> jasowang@redhat.com; qemu-devel@nongnu.org; Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
> 
> On Sun, Dec 18, 2016 at 04:42:50PM +0800, Liu, Yi L wrote:
> > On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> > > This RFC series is a continue work for Aviv B.D.'s vfio enablement
> > > series with vt-d. Aviv has done a great job there, and what we still
> > > lack there are mostly the following:
> > >
> > > (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
> > >     memory region.
> > >
> > > (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
> > >     when IOMMU domain switches, things will broke).
> > >
> > > Here I'm trying to solve the above two issues.
> > >
> > > (1) is solved by patch 7, (2) is solved by patch 11-12.
> > >
> > > Basically it contains the following:
> > >
> > > patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix
> > >
> > > patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
> > >             specific commit message for details, there are patches
> > >             that I thought may be suitable for 2.8 as well, but looks
> > >             like it's too late)
> > >
> > > patch 7:    Solve the issue that vfio is notified more than once for
> > >             IOTLB notifications with Aviv's patches
> > >
> > > patch 8-10: Some trivial memory APIs added for further patches, and
> > >             add customize replay() support for MemoryRegion (I see
> > >             Aviv's latest v7 contains similar replay, I can rebase
> > >             onto that, merely the same thing)
> > >
> > > patch 11:   Provide a valid vt-d replay() callback, using page walk
> > >
> > Peter,
> > Does your patch set based on Aviv's patch？ I found the page cannot be
> > applied in my side.
> 
> Hi, Yi,
> 
> This series is based on Aviv's v6 series. If you wanna try it, you may
> want to fetch the tree from:
> 
>  https://github.com/xzpeter/qemu/tree/vtd-vfio-enablement
> 
> So you won't need to bother on the applying.
> 
Aha, looks like you mentioned you github link previously. I forgot it. Would
try it then. I may rebase my current svm work. thx for your contribution.^_^

> >
> > BTW. it may be better if you can split the patches for mis cleanup
> > and the patches for replay/"fix duplicate notify".
> 
> Yes. Here I just want to make sure things are stick together (e.g., to
> test the replay, I will need to use the traces). And I feel it awkward
> to maintain several series upstream while they interact with each
> other. Sorry for the troublesome.
> 
I totally understand what bothers you. It'll be fine. Follow your plan~

Thanks,
Yi L

> Thanks,
> 
> -- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup
  2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
                   ` (15 preceding siblings ...)
  2016-12-18  8:42 ` Liu, Yi L
@ 2016-12-19  9:33 ` Peter Xu
  16 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-19  9:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: tianyu.lan, kevin.tian, mst, jan.kiszka, bd.aviv, alex.williamson,
	jasowang

On Tue, Dec 06, 2016 at 06:36:15PM +0800, Peter Xu wrote:
> This RFC series is a continue work for Aviv B.D.'s vfio enablement
> series with vt-d. Aviv has done a great job there, and what we still
> lack there are mostly the following:
> 
> (1) VFIO got duplicated IOTLB notifications due to splitted VT-d IOMMU
>     memory region.
> 
> (2) VT-d still haven't provide a correct replay() mechanism (e.g.,
>     when IOMMU domain switches, things will broke).
> 
> Here I'm trying to solve the above two issues.
> 
> (1) is solved by patch 7, (2) is solved by patch 11-12.
> 
> Basically it contains the following:
> 
> patch 1:    picked up from Jason's vhost DMAR series, which is a bugfix
> 
> patch 2-6:  Cleanups/Enhancements for existing vt-d codes (please see
>             specific commit message for details, there are patches
>             that I thought may be suitable for 2.8 as well, but looks
>             like it's too late)
> 
> patch 7:    Solve the issue that vfio is notified more than once for
>             IOTLB notifications with Aviv's patches
> 
> patch 8-10: Some trivial memory APIs added for further patches, and
>             add customize replay() support for MemoryRegion (I see
>             Aviv's latest v7 contains similar replay, I can rebase
>             onto that, merely the same thing)
> 
> patch 11:   Provide a valid vt-d replay() callback, using page walk
> 
> patch 12:   Enable the domain switch support - we replay() when
>             context entry got invalidated
> 
> patch 13:   Enhancement for existing invalidation notification,
>             instead of using translate() for each page, we leverage
>             the new vtd_page_walk() interface, which should be faster.
> 
> I would glad to hear about any review comments for above patches
> (especially patch 8-13, which is the main part of this series),
> especially any issue I missed in the series.

Hi, Michael,

Could you help have a look on this series? Hope we can move this
series forward a bit since it's still lack of review.

Btw, IMHO we can merge patch 1 now since people might encounter issue
without it (and it has been dangling quite a long time upstream).

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
@ 2016-12-29  7:38 Liu, Yi L
  2016-12-29  9:55 ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-29  7:38 UTC (permalink / raw)
  To: ', Peter Xu'
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

> The default replay() don't work for VT-d since vt-d will have a huge
> default memory region which covers address range 0-(2^64-1). This will
> normally bring a dead loop when guest starts.
> 
> The solution is simple - we don't walk over all the regions. Instead, we
> jump over the regions when we found that the page directories are empty.
> It'll greatly reduce the time to walk the whole region.
> 
> To achieve this, we provided a page walk helper to do that, invoking
> corresponding hook function when we found an page we are interested in.
> vtd_page_walk_level() is the core logic for the page walking. It's
> interface is designed to suite further use case, e.g., to invalidate a
> range of addresses.
> 
> Signed-off-by: Peter Xu <address@hidden>
> ---
>  hw/i386/intel_iommu.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/i386/trace-events  |   8 ++
>  include/exec/memory.h |   2 +
>  3 files changed, 217 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 46b8a2f..2fcd7af 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -620,6 +620,22 @@ static inline uint32_t 
> vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
>      return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
>  }
>  
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> +    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> +    return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> +    /*
> +     * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> +     * in CAP_REG and AW in context-entry.
> +     */
> +    return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
>  static const uint64_t vtd_paging_entry_rsvd_field[] = {
>      [0] = ~0ULL,
>      /* For not large page */
> @@ -656,13 +672,9 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t 
> iova,
>      uint32_t level = vtd_get_level_from_context_entry(ce);
>      uint32_t offset;
>      uint64_t slpte;
> -    uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
>      uint64_t access_right_check = 0;
>  
> -    /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> -     * in CAP_REG and AW in context-entry.
> -     */
> -    if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> +    if (!vtd_iova_range_check(iova, ce)) {
>          error_report("IOVA 0x%"PRIx64 " exceeds limits", iova);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
> @@ -718,6 +730,166 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, 
> uint64_t iova,
>      }
>  }
>  
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @skipped: accumulated skipped ranges
> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> +                               uint64_t end, vtd_page_walk_hook hook_fn,
> +                               void *private, uint32_t level,
> +                               bool read, bool write, uint64_t *skipped,
> +                               bool notify_unmap)
> +{
> +    bool read_cur, write_cur, entry_valid;
> +    uint32_t offset;
> +    uint64_t slpte;
> +    uint64_t subpage_size, subpage_mask;
> +    IOMMUTLBEntry entry;
> +    uint64_t iova = start;
> +    uint64_t iova_next;
> +    uint64_t skipped_local = 0;
> +    int ret = 0;
> +
> +    trace_vtd_page_walk_level(addr, level, start, end);
> +
> +    subpage_size = 1ULL << vtd_slpt_level_shift(level);
> +    subpage_mask = vtd_slpt_level_page_mask(level);
> +
> +    while (iova < end) {
> +        iova_next = (iova & subpage_mask) + subpage_size;
> +
> +        offset = vtd_iova_level_offset(iova, level);
> +        slpte = vtd_get_slpte(addr, offset);
> +
> +        /*
> +         * When one of the following case happens, we assume the whole
> +         * range is invalid:
> +         *
> +         * 1. read block failed
> +         * 3. reserved area non-zero
> +         * 2. both read & write flag are not set
> +         */
> +
> +        if (slpte == (uint64_t)-1) {
> +            trace_vtd_page_walk_skip_read(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +            trace_vtd_page_walk_skip_reserve(iova, iova_next);
> +            skipped_local++;
> +            goto next;
> +        }
> +
> +        /* Permissions are stacked with parents' */
> +        read_cur = read && (slpte & VTD_SL_R);
> +        write_cur = write && (slpte & VTD_SL_W);
> +
> +        /*
> +         * As long as we have either read/write permission, this is
> +         * a valid entry. The rule works for both page or page tables.
> +         */
> +        entry_valid = read_cur | write_cur;
> +
> +        if (vtd_is_last_slpte(slpte, level)) {
> +            entry.target_as = &address_space_memory;
> +            entry.iova = iova & subpage_mask;
> +            /*
> +             * This might be meaningless addr if (!read_cur &&
> +             * !write_cur), but after all this field will be
> +             * meaningless in that case, so let's share the code to
> +             * generate the IOTLBs no matter it's an MAP or UNMAP
> +             */
> +            entry.translated_addr = vtd_get_slpte_addr(slpte);
> +            entry.addr_mask = ~subpage_mask;
> +            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +            if (!entry_valid && !notify_unmap) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }
> +            trace_vtd_page_walk_one(level, entry.iova, addr,
> +                                    entry.addr_mask, entry.perm);
> +            if (hook_fn) {
> +                ret = hook_fn(&entry, private);
> +                if (ret < 0) {
> +                    error_report("Detected error in page walk hook "
> +                                 "function, stop walk.");
> +                    return ret;
> +                }
> +            }
> +        } else {
> +            if (!entry_valid) {
> +                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                skipped_local++;
> +                goto next;
> +            }
> +            ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> +                                      MIN(iova_next, end), hook_fn, private,
> +                                      level - 1, read_cur, write_cur,
> +                                      &skipped_local, notify_unmap);
> +            if (ret < 0) {
> +                error_report("Detected page walk error on addr 0x%"PRIx64
> +                             " level %"PRIu32", stop walk.", addr, level - 1);
> +                return ret;
> +            }
> +        }
> +
> +next:
> +        iova = iova_next;
> +    }
> +
> +    if (skipped) {
> +        *skipped += skipped_local;
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: size of the IOVA range to walk over
> + * @hook_fn: the hook that to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> +                         vtd_page_walk_hook hook_fn, void *private)
> +{
> +    dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> +    uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> +    if (!vtd_iova_range_check(start, ce)) {
> +        error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
> +                     start, end);
> +        return -VTD_FR_ADDR_BEYOND_MGAW;
> +    }
> +
> +    if (!vtd_iova_range_check(end, ce)) {
> +        /* Fix end so that it reaches the maximum */
> +        end = vtd_iova_limit(ce);
> +    }
> +
> +    trace_vtd_page_walk(ce->hi, ce->lo, start, end);
> +
> +    return vtd_page_walk_level(addr, start, end, hook_fn, private,
> +                               level, true, true, NULL, false);
> +}
> +
>  /* Map a device to its corresponding domain (context-entry) */
>  static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>                                      uint8_t devfn, VTDContextEntry *ce)
> @@ -2430,6 +2602,35 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, 
> PCIBus *bus, int devfn)
>      return vtd_dev_as;
>  }
>  
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> +    memory_region_notify_one((IOMMUNotifier *)private, entry);
> +    return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> +    VTDContextEntry ce;
> +
> +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {

Hi Peter,

Again I'd like to give my cheers on this patch set.

Hereby, I have a minor concern on this part.
The replay would be performed in vfio_listener_region_add() when device is
assigned. Since guest is just started, so there will be no valid context entry for the
assigned device. So there will be no vtd_page_walk. Thus no change to SL page
table in host. The SL page table would still be a IOVA->HPA mapping. It's true
that the gIOVA->HPA mapping can be built by page map/unmap when guest
trying to issue dma. So it is just ok without relpay in the beginning. Hot plug should
be all the same. I guess it is the design here?

However, it may be incompatible with a future work in which we expose an
vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
would be expected to have a replay in the time device is assigned.

Regards,
Yi L

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-29  7:38 [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Liu, Yi L
@ 2016-12-29  9:55 ` Peter Xu
  2016-12-29 10:23   ` Liu, Yi L
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-29  9:55 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

On Thu, Dec 29, 2016 at 07:38:20AM +0000, Liu, Yi L wrote:

[...]

> > +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> > +{
> > +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> > +    IntelIOMMUState *s = vtd_as->iommu_state;
> > +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> > +    VTDContextEntry ce;
> > +
> > +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> 
> Hi Peter,
> 
> Again I'd like to give my cheers on this patch set.

Thanks.

> 
> Hereby, I have a minor concern on this part.
> The replay would be performed in vfio_listener_region_add() when device is
> assigned. Since guest is just started, so there will be no valid context entry for the
> assigned device. So there will be no vtd_page_walk. Thus no change to SL page
> table in host. The SL page table would still be a IOVA->HPA mapping. It's true
> that the gIOVA->HPA mapping can be built by page map/unmap when guest
> trying to issue dma. So it is just ok without relpay in the beginning. Hot plug should
> be all the same. I guess it is the design here?

I am not sure whether I get your point here - I think it's by design.
We don't really need replay if we are sure that the IOMMU address
space has totally no mapping (i.e., when the guest init). Replay is
only useful when there are existing mappings in the IOMMU address
space.

Another thing to mention: IMHO IOVA -> HPA mapping should be built by
page map/unmap when guest issues IOTLB invalidations, not DMA (guest
will first issue IOTLB invalidations, only after that the iova address
would be a valid DMA address).

> 
> However, it may be incompatible with a future work in which we expose an
> vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
> table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
> rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
> would be expected to have a replay in the time device is assigned.

If you need a GPA -> HPA mapping (AFAICT you mean virtualized SVM and
the so-called first level translation), IMHO the default mapping
behavior of vfio_listener_region_add() suites here (when
memory_region_is_iommu(section->mr) == false), which maps the whole
guest physical address space, just like the case when we use vfio
devices without vIOMMU.

Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-29  9:55 ` Peter Xu
@ 2016-12-29 10:23   ` Liu, Yi L
  2016-12-30  3:43     ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-29 10:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Thursday, December 29, 2016 5:56 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Cc: Tian, Kevin <kevin.tian@intel.com>; Lan, Tianyu <tianyu.lan@intel.com>; 'qemu-
> devel@nongnu.org' <qemu-devel@nongnu.org>; 'bd.aviv@gmail.com'
> <bd.aviv@gmail.com>
> Subject: Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay()
> callback
> 
> On Thu, Dec 29, 2016 at 07:38:20AM +0000, Liu, Yi L wrote:
> 
> [...]
> 
> > > +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> > > +{
> > > +    VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> > > +    IntelIOMMUState *s = vtd_as->iommu_state;
> > > +    uint8_t bus_n = pci_bus_num(vtd_as->bus);
> > > +    VTDContextEntry ce;
> > > +
> > > +    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> >
> > Hi Peter,
> >
> > Again I'd like to give my cheers on this patch set.
> 
> Thanks.
> 
> >
> > Hereby, I have a minor concern on this part.
> > The replay would be performed in vfio_listener_region_add() when device is
> > assigned. Since guest is just started, so there will be no valid context entry for the
> > assigned device. So there will be no vtd_page_walk. Thus no change to SL page
> > table in host. The SL page table would still be a IOVA->HPA mapping. It's true
> > that the gIOVA->HPA mapping can be built by page map/unmap when guest
> > trying to issue dma. So it is just ok without relpay in the beginning. Hot plug should
> > be all the same. I guess it is the design here?
> 
> I am not sure whether I get your point here - I think it's by design.
> We don't really need replay if we are sure that the IOMMU address
> space has totally no mapping (i.e., when the guest init). Replay is
> only useful when there are existing mappings in the IOMMU address
> space.
yes, if we are sure no existing mapping, it is ok. There is another case which may
complain it. If a device is hotplug in, then there should be existing mapping. Just
looked code again. If corresponding context entry is not present, the
vtd_page_walk should be bypassed.

> 
> Another thing to mention: IMHO IOVA -> HPA mapping should be built by
> page map/unmap when guest issues IOTLB invalidations, not DMA (guest
> will first issue IOTLB invalidations, only after that the iova address
> would be a valid DMA address).
Exactly. thx for the correction.

> 
> >
> > However, it may be incompatible with a future work in which we expose an
> > vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
> > table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
> > rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
> > would be expected to have a replay in the time device is assigned.
> 
> If you need a GPA -> HPA mapping (AFAICT you mean virtualized SVM and
> the so-called first level translation), IMHO the default mapping
> behavior of vfio_listener_region_add() suites here (when
> memory_region_is_iommu(section->mr) == false), which maps the whole
> guest physical address space, just like the case when we use vfio
> devices without vIOMMU.
Yes, I mean SVM virtualizaion. To virtualize SVM, an vIOMMU would be exposed to
guest. So memory_region_is_iommu(section->mr) would be true. Then, the original
mapping the whole guest physical address space would not run. Thus no GPA->HPA
mapping would be built. This is what I really concern here.

Thanks,
Yi L

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-29 10:23   ` Liu, Yi L
@ 2016-12-30  3:43     ` Peter Xu
  2016-12-30  4:39       ` Liu, Yi L
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Xu @ 2016-12-30  3:43 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

On Thu, Dec 29, 2016 at 10:23:21AM +0000, Liu, Yi L wrote:

[...]

> > >
> > > Hereby, I have a minor concern on this part.
> > > The replay would be performed in vfio_listener_region_add() when device is
> > > assigned. Since guest is just started, so there will be no valid context entry for the
> > > assigned device. So there will be no vtd_page_walk. Thus no change to SL page
> > > table in host. The SL page table would still be a IOVA->HPA mapping. It's true
> > > that the gIOVA->HPA mapping can be built by page map/unmap when guest
> > > trying to issue dma. So it is just ok without relpay in the beginning. Hot plug should
> > > be all the same. I guess it is the design here?
> > 
> > I am not sure whether I get your point here - I think it's by design.
> > We don't really need replay if we are sure that the IOMMU address
> > space has totally no mapping (i.e., when the guest init). Replay is
> > only useful when there are existing mappings in the IOMMU address
> > space.
> yes, if we are sure no existing mapping, it is ok. There is another case which may
> complain it. If a device is hotplug in, then there should be existing mapping. Just
> looked code again. If corresponding context entry is not present, the
> vtd_page_walk should be bypassed.

Yes, if the context entry is not present, the replay will be bypassed.
And if it exists, then the replay will work (along with the page walk
process). Shouldn't that the behavior we want?

Please clarify if I misunderstood your question.

[...]

> > >
> > > However, it may be incompatible with a future work in which we expose an
> > > vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
> > > table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
> > > rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
> > > would be expected to have a replay in the time device is assigned.
> > 
> > If you need a GPA -> HPA mapping (AFAICT you mean virtualized SVM and
> > the so-called first level translation), IMHO the default mapping
> > behavior of vfio_listener_region_add() suites here (when
> > memory_region_is_iommu(section->mr) == false), which maps the whole
> > guest physical address space, just like the case when we use vfio
> > devices without vIOMMU.
> Yes, I mean SVM virtualizaion. To virtualize SVM, an vIOMMU would be exposed to
> guest. So memory_region_is_iommu(section->mr) would be true. Then, the original
> mapping the whole guest physical address space would not run. Thus no GPA->HPA
> mapping would be built. This is what I really concern here.

What I meant above was, we might be able to leverage existing
no-viommu logic to build the SLPT for virtualized SVM. For example, we
can switch the if block from:

  if (memory_region_is_iommu(section->mr))

into something similiar like:

  if (memory_region_is_iommu(section->mr) &&
      !FIRST_LEVEL_TRANSLATION_ENABLED(section->mr))

Just a quick thought, not sure whether that would be an applicable
idea. Thanks,

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-30  3:43     ` Peter Xu
@ 2016-12-30  4:39       ` Liu, Yi L
  2016-12-30  9:18         ` Peter Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Liu, Yi L @ 2016-12-30  4:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Friday, December 30, 2016 11:44 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Cc: Tian, Kevin <kevin.tian@intel.com>; Lan, Tianyu <tianyu.lan@intel.com>; 'qemu-
> devel@nongnu.org' <qemu-devel@nongnu.org>; 'bd.aviv@gmail.com'
> <bd.aviv@gmail.com>
> Subject: Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay()
> callback
> 
> On Thu, Dec 29, 2016 at 10:23:21AM +0000, Liu, Yi L wrote:
> 
> [...]
> 
> > > >
> > > > Hereby, I have a minor concern on this part.
> > > > The replay would be performed in vfio_listener_region_add() when device is
> > > > assigned. Since guest is just started, so there will be no valid context entry for
> the
> > > > assigned device. So there will be no vtd_page_walk. Thus no change to SL page
> > > > table in host. The SL page table would still be a IOVA->HPA mapping. It's true
> > > > that the gIOVA->HPA mapping can be built by page map/unmap when guest
> > > > trying to issue dma. So it is just ok without relpay in the beginning. Hot plug
> should
> > > > be all the same. I guess it is the design here?
> > >
> > > I am not sure whether I get your point here - I think it's by design.
> > > We don't really need replay if we are sure that the IOMMU address
> > > space has totally no mapping (i.e., when the guest init). Replay is
> > > only useful when there are existing mappings in the IOMMU address
> > > space.
> > yes, if we are sure no existing mapping, it is ok. There is another case which may
> > complain it. If a device is hotplug in, then there should be existing mapping. Just
> > looked code again. If corresponding context entry is not present, the
> > vtd_page_walk should be bypassed.
> 
> Yes, if the context entry is not present, the replay will be bypassed.
> And if it exists, then the replay will work (along with the page walk
> process). Shouldn't that the behavior we want?
> 
> Please clarify if I misunderstood your question.
I was just assuming a replay should be performed when a new device is plug in.
However, according to your statements, it is not required. Is this the conclusion in
the previous discussion?

> [...]
> 
> > > >
> > > > However, it may be incompatible with a future work in which we expose an
> > > > vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
> > > > table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
> > > > rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
> > > > would be expected to have a replay in the time device is assigned.
> > >
> > > If you need a GPA -> HPA mapping (AFAICT you mean virtualized SVM and
> > > the so-called first level translation), IMHO the default mapping
> > > behavior of vfio_listener_region_add() suites here (when
> > > memory_region_is_iommu(section->mr) == false), which maps the whole
> > > guest physical address space, just like the case when we use vfio
> > > devices without vIOMMU.
> > Yes, I mean SVM virtualizaion. To virtualize SVM, an vIOMMU would be exposed to
> > guest. So memory_region_is_iommu(section->mr) would be true. Then, the
> original
> > mapping the whole guest physical address space would not run. Thus no GPA->HPA
> > mapping would be built. This is what I really concern here.
> 
> What I meant above was, we might be able to leverage existing
> no-viommu logic to build the SLPT for virtualized SVM. For example, we
> can switch the if block from:
> 
>   if (memory_region_is_iommu(section->mr))
> 
> into something similiar like:
> 
>   if (memory_region_is_iommu(section->mr) &&
>       !FIRST_LEVEL_TRANSLATION_ENABLED(section->mr))
> 
> Just a quick thought, not sure whether that would be an applicable
> idea. Thanks,
If I remember correctly, with vIOMMU exposed, section->mr here should
not be a ram, the logic would fail in the following code. I agree with you
that map the whole guest physical address space should work with SVM
virtualization. May need to think more on how to make it happen.

hw/i386/common.c, a snippet:
    /* Here we assume that memory_region_is_ram(section->mr)==true */

    vaddr = memory_region_get_ram_ptr(section->mr) +
            section->offset_within_region +
            (iova - section->offset_within_address_space);

Thanks,
Yi L

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback
  2016-12-30  4:39       ` Liu, Yi L
@ 2016-12-30  9:18         ` Peter Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Xu @ 2016-12-30  9:18 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Lan, Tianyu, 'qemu-devel@nongnu.org',
	'bd.aviv@gmail.com'

On Fri, Dec 30, 2016 at 04:39:49AM +0000, Liu, Yi L wrote:
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Friday, December 30, 2016 11:44 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Cc: Tian, Kevin <kevin.tian@intel.com>; Lan, Tianyu <tianyu.lan@intel.com>; 'qemu-
> > devel@nongnu.org' <qemu-devel@nongnu.org>; 'bd.aviv@gmail.com'
> > <bd.aviv@gmail.com>
> > Subject: Re: [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay()
> > callback
> > 
> > On Thu, Dec 29, 2016 at 10:23:21AM +0000, Liu, Yi L wrote:
> > 
> > [...]
> > 
> > > > >
> > > > > Hereby, I have a minor concern on this part.
> > > > > The replay would be performed in vfio_listener_region_add() when device is
> > > > > assigned. Since guest is just started, so there will be no valid context entry for
> > the
> > > > > assigned device. So there will be no vtd_page_walk. Thus no change to SL page
> > > > > table in host. The SL page table would still be a IOVA->HPA mapping. It's true
> > > > > that the gIOVA->HPA mapping can be built by page map/unmap when guest
> > > > > trying to issue dma. So it is just ok without relpay in the beginning. Hot plug
> > should
> > > > > be all the same. I guess it is the design here?
> > > >
> > > > I am not sure whether I get your point here - I think it's by design.
> > > > We don't really need replay if we are sure that the IOMMU address
> > > > space has totally no mapping (i.e., when the guest init). Replay is
> > > > only useful when there are existing mappings in the IOMMU address
> > > > space.
> > > yes, if we are sure no existing mapping, it is ok. There is another case which may
> > > complain it. If a device is hotplug in, then there should be existing mapping. Just
> > > looked code again. If corresponding context entry is not present, the
> > > vtd_page_walk should be bypassed.
> > 
> > Yes, if the context entry is not present, the replay will be bypassed.
> > And if it exists, then the replay will work (along with the page walk
> > process). Shouldn't that the behavior we want?
> > 
> > Please clarify if I misunderstood your question.
> I was just assuming a replay should be performed when a new device is plug in.
> However, according to your statements, it is not required. Is this the conclusion in
> the previous discussion?

Hmm, I mean whether the replay is needed for the hotplugged device
should depend on which iommu domain it joins when it is plugged. I
think it should belong to nowhere when plugged but I am not quite
sure... IIUC that'll depend on whether the context entry of that
plugged device is valid - and I think it should be an invalid one.

> 
> > [...]
> > 
> > > > >
> > > > > However, it may be incompatible with a future work in which we expose an
> > > > > vIOMMU to guest but guest disable IOVA usage. In this scenario, the SL page
> > > > > table in host would be a GPA->HPA instead of a gIOVA->HPA mapping. It would
> > > > > rely on memory_replay to build a GPA->HPA mapping. If I'm correct, I think it
> > > > > would be expected to have a replay in the time device is assigned.
> > > >
> > > > If you need a GPA -> HPA mapping (AFAICT you mean virtualized SVM and
> > > > the so-called first level translation), IMHO the default mapping
> > > > behavior of vfio_listener_region_add() suites here (when
> > > > memory_region_is_iommu(section->mr) == false), which maps the whole
> > > > guest physical address space, just like the case when we use vfio
> > > > devices without vIOMMU.
> > > Yes, I mean SVM virtualizaion. To virtualize SVM, an vIOMMU would be exposed to
> > > guest. So memory_region_is_iommu(section->mr) would be true. Then, the
> > original
> > > mapping the whole guest physical address space would not run. Thus no GPA->HPA
> > > mapping would be built. This is what I really concern here.
> > 
> > What I meant above was, we might be able to leverage existing
> > no-viommu logic to build the SLPT for virtualized SVM. For example, we
> > can switch the if block from:
> > 
> >   if (memory_region_is_iommu(section->mr))
> > 
> > into something similiar like:
> > 
> >   if (memory_region_is_iommu(section->mr) &&
> >       !FIRST_LEVEL_TRANSLATION_ENABLED(section->mr))
> > 
> > Just a quick thought, not sure whether that would be an applicable
> > idea. Thanks,
> If I remember correctly, with vIOMMU exposed, section->mr here should
> not be a ram, the logic would fail in the following code. I agree with you
> that map the whole guest physical address space should work with SVM
> virtualization. May need to think more on how to make it happen.
> 
> hw/i386/common.c, a snippet:
>     /* Here we assume that memory_region_is_ram(section->mr)==true */
> 
>     vaddr = memory_region_get_ram_ptr(section->mr) +
>             section->offset_within_region +
>             (iova - section->offset_within_address_space);

Right. That code won't work if only with a change in the if() clause -
I wasn't showing any workable codes (even it's not pesudo), only to
show what I meant. :)

-- peterx

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-12-30  9:19 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-12-06 10:36 [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 01/13] intel_iommu: allocate new key when creating new address space Peter Xu
2016-12-12  8:16   ` Liu, Yi L
2016-12-14  3:05     ` Peter Xu
2016-12-14  3:24       ` Liu, Yi L
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 02/13] intel_iommu: simplify irq region translation Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 03/13] intel_iommu: renaming gpa to iova where proper Peter Xu
2016-12-18  8:39   ` Liu, Yi L
2016-12-19  9:23     ` Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 04/13] intel_iommu: fix trace for inv desc handling Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 05/13] intel_iommu: fix trace for addr translation Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 06/13] intel_iommu: vtd_slpt_level_shift check level Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 07/13] memory: add section range info for IOMMU notifier Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 08/13] memory: provide iommu_replay_all() Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 09/13] memory: introduce memory_region_notify_one() Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 10/13] memory: add MemoryRegionIOMMUOps.replay() callback Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Peter Xu
2016-12-08  3:01   ` Lan Tianyu
2016-12-09  1:56     ` Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 12/13] intel_iommu: do replay when context invalidate Peter Xu
2016-12-06 10:36 ` [Qemu-devel] [RFC PATCH 13/13] intel_iommu: use page_walk for iotlb inv notify Peter Xu
2016-12-06 10:49 ` [Qemu-devel] [RFC PATCH 00/13] VT-d replay and misc cleanup Peter Xu
2016-12-13  7:45 ` Peter Xu
2016-12-18  8:42 ` Liu, Yi L
2016-12-19  9:22   ` Peter Xu
2016-12-19 11:25     ` Liu, Yi L
2016-12-19  9:33 ` Peter Xu
  -- strict thread matches above, loose matches on Subject: below --
2016-12-29  7:38 [Qemu-devel] [RFC PATCH 11/13] intel_iommu: provide its own replay() callback Liu, Yi L
2016-12-29  9:55 ` Peter Xu
2016-12-29 10:23   ` Liu, Yi L
2016-12-30  3:43     ` Peter Xu
2016-12-30  4:39       ` Liu, Yi L
2016-12-30  9:18         ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.