[Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode
@ 2019-02-28 13:47 Yi Sun
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation Yi Sun
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Yi Sun @ 2019-02-28 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

Intel vt-d rev3.0 [1] introduces a new translation mode called
'scalable mode', which enables PASID-granular translations for
first level, second level, nested and pass-through modes. The
vt-d scalable mode is the key ingredient to enable Scalable I/O
Virtualization (Scalable IOV) [2] [3], which allows sharing a
device in minimal possible granularity (ADI - Assignable Device
Interface). As a result, previous Extended Context (ECS) mode
is deprecated (no production ever implements ECS).

This patch set emulates a minimal capability set of VT-d scalable
mode, equivalent to what is available in VT-d legacy mode today:
    1. Scalable mode root entry, context entry and PASID table
    2. Seconds level translation under scalable mode
    3. Queued invalidation (with 256 bits descriptor)
    4. Pass-through mode

Corresponding intel-iommu driver support will be included in
kernel 5.0:
    https://www.spinics.net/lists/kernel/msg2985279.html

We will add emulation of full scalable mode capability along with
guest iommu driver progress later, e.g.:
    1. First level translation
    2. Nested translation
    3. Per-PASID invalidation descriptors
    4. Page request services for handling recoverable faults

To verify the patches, below cases were tested according to Peter Xu's
suggestions.
    +---------+----------------------------------------------------------------+----------------------------------------------------------------+
    |         |                      w/ Device Passthr                         |                     w/o Device Passthr                         |
    |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
    |         | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      |
    |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
    |         | netperf | kernel bld | data cp| netperf | kernel bld | data cp | netperf | kernel bld | data cp| netperf | kernel bld | data cp |
    +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
    | Legacy  | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
    +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
    | Scalable| Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
    +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+

References:
[1] https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
[2] https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[3] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf
---
v1->v2

Patch 1:
    - remove unnecessary macros.
    - rename macros to capital.
    - make 're->hi' assignment be unconditional to simplify codes.
    - remove 'vtd_get_context_base' to embed its content into caller.
    - remove 'vtd_context_entry_format' to to embed its content into caller.
    - remove unnecessary memset for 'pe->val'.
    - use 'INTEL_IOMMU_DEVICE' to get 'IntelIOMMUState' to remove input
      'IntelIOMMUState *s' parameter.
    - call 'vtd_get_domain_id' to get domain_id.
    - check error code returned by 'vtd_ce_get_rid2pasid_entry' in
      'vtd_dev_pt_enabled'.
    - check '!is_fpd_set' of context entry before handing pasid entry.
    - move 's->root_scalable' assignment to patch 3.
    - add comment for 'VTD_FR_PASID_TABLE_INV'.
    - remove not used 'VTD_ROOT_ENTRY_SIZE'.
    - change 'VTD_CTX_ENTRY_LECY_SIZE' to 'VTD_CTX_ENTRY_LEGACY_SIZE'.
    - change 'VTD_CTX_ENTRY_SM_SIZE' to 'VTD_CTX_ENTRY_SCALABLE_SIZE'.
    - use union in 'struct VTDContextEntry' to reduce code changes.
Patch 2:
    - modify s-o-b position.
    - remove unnecessary macros.
    - change 'iq_dw' type to bool.
    - remove initialization to 'inv_desc->val[]'.
    - modify 'VTDInvDesc' to add a union 'val[4]' to be compatible
      with both legacy mode and scalable mode.
Patch 3:
    - rename "scalable-mode" to "x-scalable-mode".
    - remove caching_mode check when scalable_mode is set.
    - check dma_drain check when scalable_mode is set. This is requested
      by spec.
    - remove redundant macros.
---

Liu, Yi L (2):
  intel_iommu: scalable mode emulation
  intel_iommu: add 256 bits qi_desc support

Yi Sun (1):
  intel_iommu: add scalable-mode option to make scalable mode work

 hw/i386/intel_iommu.c          | 540 ++++++++++++++++++++++++++++++++++-------
 hw/i386/intel_iommu_internal.h |  54 ++++-
 hw/i386/trace-events           |   2 +-
 include/hw/i386/intel_iommu.h  |  28 ++-
 4 files changed, 534 insertions(+), 90 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation
  2019-02-28 13:47 [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Yi Sun
@ 2019-02-28 13:47 ` Yi Sun
  2019-03-01  6:52   ` Peter Xu
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Yi Sun @ 2019-02-28 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

From: "Liu, Yi L" <yi.l.liu@intel.com>

Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
replace extended context mode. This patch extends current emulator to
support Scalable Mode which includes root table, context table and new
pasid table format change. Now intel_iommu emulates both legacy mode
and scalable mode (with legacy-equivalent capability set).

The key points are below:
1. Extend root table operations to support both legacy mode and scalable
   mode.
2. Extend context table operations to support both legacy mode and
   scalable mode.
3. Add pasid tabled operations to support scalable mode.

Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to contribute much to refine the whole commit.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
v2:
    - remove unnecessary macros.
    - rename macros to capital.
    - make 're->hi' assignment be unconditional to simplify codes.
    - remove 'vtd_get_context_base' to embed its content into caller.
    - remove 'vtd_context_entry_format' to to embed its content into caller.
    - remove unnecessary memset for 'pe->val'.
    - use 'INTEL_IOMMU_DEVICE' to get 'IntelIOMMUState' to remove input
      'IntelIOMMUState *s' parameter.
    - call 'vtd_get_domain_id' to get domain_id.
    - check error code returned by 'vtd_ce_get_rid2pasid_entry' in
      'vtd_dev_pt_enabled'.
    - check '!is_fpd_set' of context entry before handing pasid entry.
    - move 's->root_scalable' assignment to patch 3.
    - add comment for 'VTD_FR_PASID_TABLE_INV'.
    - remove not used 'VTD_ROOT_ENTRY_SIZE'.
    - change 'VTD_CTX_ENTRY_LECY_SIZE' to 'VTD_CTX_ENTRY_LEGACY_SIZE'.
    - change 'VTD_CTX_ENTRY_SM_SIZE' to 'VTD_CTX_ENTRY_SCALABLE_SIZE'.
    - use union in 'struct VTDContextEntry' to reduce code changes.
---
 hw/i386/intel_iommu.c          | 476 ++++++++++++++++++++++++++++++++++-------
 hw/i386/intel_iommu_internal.h |  41 +++-
 hw/i386/trace-events           |   2 +-
 include/hw/i386/intel_iommu.h  |  24 ++-
 4 files changed, 466 insertions(+), 77 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ee22e75..109fdbc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -37,6 +37,16 @@
 #include "kvm_i386.h"
 #include "trace.h"
 
+/* context entry operations */
+#define VTD_CE_GET_RID2PASID(ce) \
+    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
+#define VTD_CE_GET_PASID_DIR_TABLE(ce) \
+    ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
+
+/* pe operations */
+#define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
+#define VTD_PE_GET_LEVEL(pe) (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
+
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
@@ -512,9 +522,15 @@ static void vtd_generate_completion_event(IntelIOMMUState *s)
     }
 }
 
-static inline bool vtd_root_entry_present(VTDRootEntry *root)
+static inline bool vtd_root_entry_present(IntelIOMMUState *s,
+                                          VTDRootEntry *re,
+                                          uint8_t devfn)
 {
-    return root->val & VTD_ROOT_ENTRY_P;
+    if (s->root_scalable && devfn > UINT8_MAX / 2) {
+        return re->hi & VTD_ROOT_ENTRY_P;
+    }
+
+    return re->lo & VTD_ROOT_ENTRY_P;
 }
 
 static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
@@ -524,30 +540,48 @@ static int vtd_get_root_entry(IntelIOMMUState *s, uint8_t index,
 
     addr = s->root + index * sizeof(*re);
     if (dma_memory_read(&address_space_memory, addr, re, sizeof(*re))) {
-        re->val = 0;
+        re->lo = 0;
         return -VTD_FR_ROOT_TABLE_INV;
     }
-    re->val = le64_to_cpu(re->val);
+    re->lo = le64_to_cpu(re->lo);
+    re->hi = le64_to_cpu(re->hi);
     return 0;
 }
 
-static inline bool vtd_ce_present(VTDContextEntry *context)
+static inline bool vtd_ce_present(VTDContextEntry *ce)
 {
-    return context->lo & VTD_CONTEXT_ENTRY_P;
+    return ce->lo & VTD_CONTEXT_ENTRY_P;
 }
 
-static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
+static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
+                                           VTDRootEntry *re,
+                                           uint8_t index,
                                            VTDContextEntry *ce)
 {
-    dma_addr_t addr;
+    dma_addr_t addr, ce_size;
 
     /* we have checked that root entry is present */
-    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
-    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
+    ce_size = s->root_scalable ? VTD_CTX_ENTRY_SCALABLE_SIZE :
+              VTD_CTX_ENTRY_LEGACY_SIZE;
+
+    if (s->root_scalable && index > UINT8_MAX / 2) {
+        index = index & (~VTD_DEVFN_CHECK_MASK);
+        addr = re->hi & VTD_ROOT_ENTRY_CTP;
+    } else {
+        addr = re->lo & VTD_ROOT_ENTRY_CTP;
+    }
+
+    addr = addr + index * ce_size;
+    if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
         return -VTD_FR_CONTEXT_TABLE_INV;
     }
+
     ce->lo = le64_to_cpu(ce->lo);
     ce->hi = le64_to_cpu(ce->hi);
+    if (s->root_scalable) {
+        ce->val[2] = le64_to_cpu(ce->val[2]);
+        ce->val[3] = le64_to_cpu(ce->val[3]);
+    }
     return 0;
 }
 
@@ -600,6 +634,144 @@ static inline bool vtd_is_level_supported(IntelIOMMUState *s, uint32_t level)
            (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
 }
 
+/* Return true if check passed, otherwise false */
+static inline bool vtd_pe_type_check(X86IOMMUState *x86_iommu,
+                                     VTDPASIDEntry *pe)
+{
+    switch (VTD_PE_GET_TYPE(pe)) {
+    case VTD_SM_PASID_ENTRY_FLT:
+    case VTD_SM_PASID_ENTRY_SLT:
+    case VTD_SM_PASID_ENTRY_NESTED:
+        break;
+    case VTD_SM_PASID_ENTRY_PT:
+        if (!x86_iommu->pt_supported) {
+            return false;
+        }
+        break;
+    default:
+        /* Unknwon type */
+        return false;
+    }
+    return true;
+}
+
+static inline int vtd_get_pasid_dire(dma_addr_t pasid_dir_base,
+                                     uint32_t pasid,
+                                     VTDPASIDDirEntry *pdire)
+{
+    uint32_t index;
+    dma_addr_t addr, entry_size;
+
+    index = VTD_PASID_DIR_INDEX(pasid);
+    entry_size = VTD_PASID_DIR_ENTRY_SIZE;
+    addr = pasid_dir_base + index * entry_size;
+    if (dma_memory_read(&address_space_memory, addr, pdire, entry_size)) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    return 0;
+}
+
+static inline int vtd_get_pasid_entry(IntelIOMMUState *s,
+                                      uint32_t pasid,
+                                      VTDPASIDDirEntry *pdire,
+                                      VTDPASIDEntry *pe)
+{
+    uint32_t index;
+    dma_addr_t addr, entry_size;
+    X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
+
+    index = VTD_PASID_TABLE_INDEX(pasid);
+    entry_size = VTD_PASID_ENTRY_SIZE;
+    addr = pdire->val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+    addr = addr + index * entry_size;
+    if (dma_memory_read(&address_space_memory, addr, pe, entry_size)) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    /* Do translation type check */
+    if (!vtd_pe_type_check(x86_iommu, pe)) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    if (!vtd_is_level_supported(s, VTD_PE_GET_LEVEL(pe))) {
+        return -VTD_FR_PASID_TABLE_INV;
+    }
+
+    return 0;
+}
+
+static int vtd_get_pasid_entry_from_pasid(IntelIOMMUState *s,
+                                          dma_addr_t pasid_dir_base,
+                                          uint32_t pasid,
+                                          VTDPASIDEntry *pe)
+{
+    int ret;
+    VTDPASIDDirEntry pdire;
+
+    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
+    if (ret) {
+        return ret;
+    }
+
+    ret = vtd_get_pasid_entry(s, pasid, &pdire, pe);
+    if (ret) {
+        return ret;
+    }
+
+    return ret;
+}
+
+static inline int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+                                             VTDContextEntry *ce,
+                                             VTDPASIDEntry *pe)
+{
+    uint32_t pasid;
+    dma_addr_t pasid_dir_base;
+    int ret = 0;
+
+    pasid = VTD_CE_GET_RID2PASID(ce);
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
+    ret = vtd_get_pasid_entry_from_pasid(s, pasid_dir_base, pasid, pe);
+
+    return ret;
+}
+
+static inline int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
+                                       VTDContextEntry *ce,
+                                       bool *pe_fpd_set)
+{
+    int ret;
+    uint32_t pasid;
+    dma_addr_t pasid_dir_base;
+    VTDPASIDDirEntry pdire;
+    VTDPASIDEntry pe;
+
+    pasid = VTD_CE_GET_RID2PASID(ce);
+    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
+
+    ret = vtd_get_pasid_dire(pasid_dir_base, pasid, &pdire);
+    if (ret) {
+        return ret;
+    }
+
+    if (pdire.val & VTD_PASID_DIR_FPD) {
+        *pe_fpd_set = true;
+        return 0;
+    }
+
+    ret = vtd_get_pasid_entry(s, pasid, &pdire, &pe);
+    if (ret) {
+        return ret;
+    }
+
+    if (pe.val[0] & VTD_PASID_ENTRY_FPD) {
+        *pe_fpd_set = true;
+    }
+
+    return 0;
+}
+
 /* Get the page-table level that hardware should use for the second-level
  * page-table walk from the Address Width field of context-entry.
  */
@@ -608,11 +780,37 @@ static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
     return 2 + (ce->hi & VTD_CONTEXT_ENTRY_AW);
 }
 
+static inline uint32_t vtd_get_iova_level(IntelIOMMUState *s,
+                                          VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return VTD_PE_GET_LEVEL(&pe);
+    }
+
+    return vtd_ce_get_level(ce);
+}
+
 static inline uint32_t vtd_ce_get_agaw(VTDContextEntry *ce)
 {
     return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
 }
 
+static inline uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
+    }
+
+    return vtd_ce_get_agaw(ce);
+}
+
 static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
 {
     return ce->lo & VTD_CONTEXT_ENTRY_TT;
@@ -622,6 +820,17 @@ static inline uint32_t vtd_ce_get_type(VTDContextEntry *ce)
 static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
                                      VTDContextEntry *ce)
 {
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(x86_iommu);
+
+    if (s->root_scalable) {
+        /*
+         * Translation Type locates in context entry only when VTD is in
+         * legacy mode. For scalable mode, need to return true to avoid
+         * unnecessary fault.
+         */
+        return true;
+    }
+
     switch (vtd_ce_get_type(ce)) {
     case VTD_CONTEXT_TT_MULTI_LEVEL:
         /* Always supported */
@@ -639,7 +848,7 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
         }
         break;
     default:
-        /* Unknwon type */
+        /* Unknown type */
         error_report_once("%s: unknown ce type: %"PRIu32, __func__,
                           vtd_ce_get_type(ce));
         return false;
@@ -647,21 +856,36 @@ static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
     return true;
 }
 
-static inline uint64_t vtd_iova_limit(VTDContextEntry *ce, uint8_t aw)
+static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
+                                      VTDContextEntry *ce, uint8_t aw)
 {
-    uint32_t ce_agaw = vtd_ce_get_agaw(ce);
+    uint32_t ce_agaw = vtd_get_iova_agaw(s, ce);
     return 1ULL << MIN(ce_agaw, aw);
 }
 
 /* Return true if IOVA passes range check, otherwise false. */
-static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce,
+static inline bool vtd_iova_range_check(IntelIOMMUState *s,
+                                        uint64_t iova, VTDContextEntry *ce,
                                         uint8_t aw)
 {
     /*
      * Check if @iova is above 2^X-1, where X is the minimum of MGAW
      * in CAP_REG and AW in context-entry.
      */
-    return !(iova & ~(vtd_iova_limit(ce, aw) - 1));
+    return !(iova & ~(vtd_iova_limit(s, ce, aw) - 1));
+}
+
+static inline dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
+                                                 VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
+    }
+
+    return vtd_ce_get_slpt_base(ce);
 }
 
 /*
@@ -707,17 +931,18 @@ static VTDBus *vtd_find_as_from_bus_num(IntelIOMMUState *s, uint8_t bus_num)
 /* Given the @iova, get relevant @slptep. @slpte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
+static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
+                             uint64_t iova, bool is_write,
                              uint64_t *slptep, uint32_t *slpte_level,
                              bool *reads, bool *writes, uint8_t aw_bits)
 {
-    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
-    uint32_t level = vtd_ce_get_level(ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
+    uint32_t level = vtd_get_iova_level(s, ce);
     uint32_t offset;
     uint64_t slpte;
     uint64_t access_right_check;
 
-    if (!vtd_iova_range_check(iova, ce, aw_bits)) {
+    if (!vtd_iova_range_check(s, iova, ce, aw_bits)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ")",
                           __func__, iova);
         return -VTD_FR_ADDR_BEYOND_MGAW;
@@ -733,7 +958,7 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
         if (slpte == (uint64_t)-1) {
             error_report_once("%s: detected read error on DMAR slpte "
                               "(iova=0x%" PRIx64 ")", __func__, iova);
-            if (level == vtd_ce_get_level(ce)) {
+            if (level == vtd_get_iova_level(s, ce)) {
                 /* Invalid programming of context-entry */
                 return -VTD_FR_CONTEXT_ENTRY_INV;
             } else {
@@ -962,29 +1187,96 @@ next:
 /**
  * vtd_page_walk - walk specific IOVA range, and call the hook
  *
+ * @s: intel iommu state
  * @ce: context entry to walk upon
  * @start: IOVA address to start the walk
  * @end: IOVA range end address (start <= addr < end)
  * @info: page walking information struct
  */
-static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
+static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
+                         uint64_t start, uint64_t end,
                          vtd_page_walk_info *info)
 {
-    dma_addr_t addr = vtd_ce_get_slpt_base(ce);
-    uint32_t level = vtd_ce_get_level(ce);
+    dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce);
+    uint32_t level = vtd_get_iova_level(s, ce);
 
-    if (!vtd_iova_range_check(start, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, start, ce, info->aw)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_range_check(end, ce, info->aw)) {
+    if (!vtd_iova_range_check(s, end, ce, info->aw)) {
         /* Fix end so that it reaches the maximum */
-        end = vtd_iova_limit(ce, info->aw);
+        end = vtd_iova_limit(s, ce, info->aw);
     }
 
     return vtd_page_walk_level(addr, start, end, level, true, true, info);
 }
 
+static int vtd_root_entry_rsvd_bits_check(IntelIOMMUState *s,
+                                          VTDRootEntry *re)
+{
+    /* Legacy Mode reserved bits check */
+    if (!s->root_scalable &&
+        (re->hi || (re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        goto rsvd_err;
+
+    /* Scalable Mode reserved bits check */
+    if (s->root_scalable &&
+        ((re->lo & VTD_ROOT_ENTRY_RSVD(s->aw_bits)) ||
+         (re->hi & VTD_ROOT_ENTRY_RSVD(s->aw_bits))))
+        goto rsvd_err;
+
+    return 0;
+
+rsvd_err:
+    error_report_once("%s: invalid root entry: hi=0x%"PRIx64
+                      ", lo=0x%"PRIx64,
+                      __func__, re->hi, re->lo);
+    return -VTD_FR_ROOT_ENTRY_RSVD;
+}
+
+static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
+                                                    VTDContextEntry *ce)
+{
+    if (!s->root_scalable &&
+        (ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI ||
+         ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
+        error_report_once("%s: invalid context entry: hi=%"PRIx64
+                          ", lo=%"PRIx64" (reserved nonzero)",
+                          __func__, ce->hi, ce->lo);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+
+    if (s->root_scalable &&
+        (ce->val[0] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(s->aw_bits) ||
+         ce->val[1] & VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 ||
+         ce->val[2] ||
+         ce->val[3])) {
+        error_report_once("%s: invalid context entry: val[3]=%"PRIx64
+                          ", val[2]=%"PRIx64
+                          ", val[1]=%"PRIx64
+                          ", val[0]=%"PRIx64" (reserved nonzero)",
+                          __func__, ce->val[3], ce->val[2],
+                          ce->val[1], ce->val[0]);
+        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    }
+
+    return 0;
+}
+
+static inline int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    /*
+     * Make sure in Scalable Mode, a present context entry
+     * has valid rid2pasid setting, which includes valid
+     * rid2pasid field and corresponding pasid entry setting
+     */
+    return vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+}
+
 /* Map a device to its corresponding domain (context-entry) */
 static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
                                     uint8_t devfn, VTDContextEntry *ce)
@@ -998,20 +1290,18 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return ret_fr;
     }
 
-    if (!vtd_root_entry_present(&re)) {
+    if (!vtd_root_entry_present(s, &re, devfn)) {
         /* Not error - it's okay we don't have root entry. */
         trace_vtd_re_not_present(bus_num);
         return -VTD_FR_ROOT_ENTRY_P;
     }
 
-    if (re.rsvd || (re.val & VTD_ROOT_ENTRY_RSVD(s->aw_bits))) {
-        error_report_once("%s: invalid root entry: rsvd=0x%"PRIx64
-                          ", val=0x%"PRIx64" (reserved nonzero)",
-                          __func__, re.rsvd, re.val);
-        return -VTD_FR_ROOT_ENTRY_RSVD;
+    ret_fr = vtd_root_entry_rsvd_bits_check(s, &re);
+    if (ret_fr) {
+        return ret_fr;
     }
 
-    ret_fr = vtd_get_context_entry_from_root(&re, devfn, ce);
+    ret_fr = vtd_get_context_entry_from_root(s, &re, devfn, ce);
     if (ret_fr) {
         return ret_fr;
     }
@@ -1022,19 +1312,18 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return -VTD_FR_CONTEXT_ENTRY_P;
     }
 
-    if ((ce->hi & VTD_CONTEXT_ENTRY_RSVD_HI) ||
-               (ce->lo & VTD_CONTEXT_ENTRY_RSVD_LO(s->aw_bits))) {
-        error_report_once("%s: invalid context entry: hi=%"PRIx64
-                          ", lo=%"PRIx64" (reserved nonzero)",
-                          __func__, ce->hi, ce->lo);
-        return -VTD_FR_CONTEXT_ENTRY_RSVD;
+    ret_fr = vtd_context_entry_rsvd_bits_check(s, ce);
+    if (ret_fr) {
+        return ret_fr;
     }
 
     /* Check if the programming of context-entry is valid */
-    if (!vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
+    if (!s->root_scalable &&
+        !vtd_is_level_supported(s, vtd_ce_get_level(ce))) {
         error_report_once("%s: invalid context entry: hi=%"PRIx64
                           ", lo=%"PRIx64" (level %d not supported)",
-                          __func__, ce->hi, ce->lo, vtd_ce_get_level(ce));
+                          __func__, ce->hi, ce->lo,
+                          vtd_ce_get_level(ce));
         return -VTD_FR_CONTEXT_ENTRY_INV;
     }
 
@@ -1044,6 +1333,19 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         return -VTD_FR_CONTEXT_ENTRY_INV;
     }
 
+    /*
+     * Check if the programming of context-entry.rid2pasid
+     * and corresponding pasid setting is valid, and thus
+     * avoids to check pasid entry fetching result in future
+     * helper function calling.
+     */
+    if (s->root_scalable) {
+        ret_fr = vtd_ce_rid2pasid_check(s, ce);
+        if (ret_fr) {
+            return ret_fr;
+        }
+    }
+
     return 0;
 }
 
@@ -1054,6 +1356,19 @@ static int vtd_sync_shadow_page_hook(IOMMUTLBEntry *entry,
     return 0;
 }
 
+static inline uint16_t vtd_get_domain_id(IntelIOMMUState *s,
+                                         VTDContextEntry *ce)
+{
+    VTDPASIDEntry pe;
+
+    if (s->root_scalable) {
+        vtd_ce_get_rid2pasid_entry(s, ce, &pe);
+        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+    }
+
+    return VTD_CONTEXT_ENTRY_DID(ce->hi);
+}
+
 static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
                                             VTDContextEntry *ce,
                                             hwaddr addr, hwaddr size)
@@ -1065,10 +1380,10 @@ static int vtd_sync_shadow_page_table_range(VTDAddressSpace *vtd_as,
         .notify_unmap = true,
         .aw = s->aw_bits,
         .as = vtd_as,
-        .domain_id = VTD_CONTEXT_ENTRY_DID(ce->hi),
+        .domain_id = vtd_get_domain_id(s, ce),
     };
 
-    return vtd_page_walk(ce, addr, addr + size, &info);
+    return vtd_page_walk(s, ce, addr, addr + size, &info);
 }
 
 static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
@@ -1103,35 +1418,24 @@ static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
 }
 
 /*
- * Fetch translation type for specific device. Returns <0 if error
- * happens, otherwise return the shifted type to check against
- * VTD_CONTEXT_TT_*.
+ * Check if specific device is configed to bypass address
+ * translation for DMA requests. In Scalable Mode, bypass
+ * 1st-level translation or 2nd-level translation, it depends
+ * on PGTT setting.
  */
-static int vtd_dev_get_trans_type(VTDAddressSpace *as)
+static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
 {
     IntelIOMMUState *s;
     VTDContextEntry ce;
+    VTDPASIDEntry pe;
     int ret;
 
-    s = as->iommu_state;
+    assert(as);
 
+    s = as->iommu_state;
     ret = vtd_dev_to_context_entry(s, pci_bus_num(as->bus),
                                    as->devfn, &ce);
     if (ret) {
-        return ret;
-    }
-
-    return vtd_ce_get_type(&ce);
-}
-
-static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
-{
-    int ret;
-
-    assert(as);
-
-    ret = vtd_dev_get_trans_type(as);
-    if (ret < 0) {
         /*
          * Possibly failed to parse the context entry for some reason
          * (e.g., during init, or any guest configuration errors on
@@ -1141,7 +1445,17 @@ static bool vtd_dev_pt_enabled(VTDAddressSpace *as)
         return false;
     }
 
-    return ret == VTD_CONTEXT_TT_PASS_THROUGH;
+    if (s->root_scalable) {
+        ret = vtd_ce_get_rid2pasid_entry(s, &ce, &pe);
+        if (ret) {
+            error_report_once("%s: vtd_ce_get_rid2pasid_entry error: %"PRId32,
+                              __func__, ret);
+            return false;
+        }
+        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+    }
+
+    return (vtd_ce_get_type(&ce) == VTD_CONTEXT_TT_PASS_THROUGH);
 }
 
 /* Return whether the device is using IOMMU translation. */
@@ -1322,9 +1636,24 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
                                cc_entry->context_cache_gen);
         ce = cc_entry->context_entry;
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        if (!is_fpd_set && s->root_scalable) {
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+            if (ret_fr) {
+                ret_fr = -ret_fr;
+                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
+                    trace_vtd_fault_disabled();
+                } else {
+                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
+                }
+                goto error;
+            }
+        }
     } else {
         ret_fr = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
+        if (!ret_fr && !is_fpd_set && s->root_scalable) {
+            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
+        }
         if (ret_fr) {
             ret_fr = -ret_fr;
             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
@@ -1367,7 +1696,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
-    ret_fr = vtd_iova_to_slpte(&ce, addr, is_write, &slpte, &level,
+    ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &slpte, &level,
                                &reads, &writes, s->aw_bits);
     if (ret_fr) {
         ret_fr = -ret_fr;
@@ -1381,7 +1710,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     page_mask = vtd_slpt_level_page_mask(level);
     access_flags = IOMMU_ACCESS_FLAG(reads, writes);
-    vtd_update_iotlb(s, source_id, VTD_CONTEXT_ENTRY_DID(ce.hi), addr, slpte,
+    vtd_update_iotlb(s, source_id, vtd_get_domain_id(s, &ce), addr, slpte,
                      access_flags, level);
 out:
     vtd_iommu_unlock(s);
@@ -1573,7 +1902,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
-            domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+            domain_id == vtd_get_domain_id(s, &ce)) {
             vtd_sync_shadow_page_table(vtd_as);
         }
     }
@@ -1591,7 +1920,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
-        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
+        if (!ret && domain_id == vtd_get_domain_id(s, &ce)) {
             if (vtd_as_has_map_notifier(vtd_as)) {
                 /*
                  * As long as we have MAP notifications registered in
@@ -2629,6 +2958,7 @@ static const VMStateDescription vtd_vmstate = {
         VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
         VMSTATE_UINT8(iq_last_desc_type, IntelIOMMUState),
         VMSTATE_BOOL(root_extended, IntelIOMMUState),
+        VMSTATE_BOOL(root_scalable, IntelIOMMUState),
         VMSTATE_BOOL(dmar_enabled, IntelIOMMUState),
         VMSTATE_BOOL(qi_enabled, IntelIOMMUState),
         VMSTATE_BOOL(intr_enabled, IntelIOMMUState),
@@ -3098,9 +3428,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
-        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
+        trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" : "",
+                                  bus_n, PCI_SLOT(vtd_as->devfn),
                                   PCI_FUNC(vtd_as->devfn),
-                                  VTD_CONTEXT_ENTRY_DID(ce.hi),
+                                  vtd_get_domain_id(s, &ce),
                                   ce.hi, ce.lo);
         if (vtd_as_has_map_notifier(vtd_as)) {
             /* This is required only for MAP typed notifiers */
@@ -3110,10 +3441,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                 .notify_unmap = false,
                 .aw = s->aw_bits,
                 .as = vtd_as,
-                .domain_id = VTD_CONTEXT_ENTRY_DID(ce.hi),
+                .domain_id = vtd_get_domain_id(s, &ce),
             };
 
-            vtd_page_walk(&ce, 0, ~0ULL, &info);
+            vtd_page_walk(s, &ce, 0, ~0ULL, &info);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3137,6 +3468,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     s->root = 0;
     s->root_extended = false;
+    s->root_scalable = false;
     s->dmar_enabled = false;
     s->intr_enabled = false;
     s->iq_head = 0;
@@ -3199,7 +3531,7 @@ static void vtd_init(IntelIOMMUState *s)
     vtd_define_long(s, DMAR_GCMD_REG, 0, 0xff800000UL, 0);
     vtd_define_long_wo(s, DMAR_GCMD_REG, 0xff800000UL);
     vtd_define_long(s, DMAR_GSTS_REG, 0, 0, 0);
-    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffff000ULL, 0);
+    vtd_define_quad(s, DMAR_RTADDR_REG, 0, 0xfffffffffffffc00ULL, 0);
     vtd_define_quad(s, DMAR_CCMD_REG, 0, 0xe0000003ffffffffULL, 0);
     vtd_define_quad_wo(s, DMAR_CCMD_REG, 0x3ffff0000ULL);
 
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 00e9edb..fe72bc3 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -172,6 +172,7 @@
 
 /* RTADDR_REG */
 #define VTD_RTADDR_RTT              (1ULL << 11)
+#define VTD_RTADDR_SMT              (1ULL << 10)
 #define VTD_RTADDR_ADDR_MASK(aw)    (VTD_HAW_MASK(aw) ^ 0xfffULL)
 
 /* IRTA_REG */
@@ -294,6 +295,8 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_PASID_TABLE_INV = 0x58,  /*Invalid PASID table entry */
+
     /* This is not a normal fault reason. We use this to indicate some faults
      * that are not referenced by the VT-d specification.
      * Fault event with such reason should not be recorded.
@@ -411,8 +414,8 @@ typedef struct VTDIOTLBPageInvInfo VTDIOTLBPageInvInfo;
 #define VTD_PAGE_MASK_1G            (~((1ULL << VTD_PAGE_SHIFT_1G) - 1))
 
 struct VTDRootEntry {
-    uint64_t val;
-    uint64_t rsvd;
+    uint64_t lo;
+    uint64_t hi;
 };
 typedef struct VTDRootEntry VTDRootEntry;
 
@@ -423,6 +426,8 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_ROOT_ENTRY_NR           (VTD_PAGE_SIZE / sizeof(VTDRootEntry))
 #define VTD_ROOT_ENTRY_RSVD(aw)     (0xffeULL | ~VTD_HAW_MASK(aw))
 
+#define VTD_DEVFN_CHECK_MASK        0x80
+
 /* Masks for struct VTDContextEntry */
 /* lo */
 #define VTD_CONTEXT_ENTRY_P         (1ULL << 0)
@@ -441,6 +446,38 @@ typedef struct VTDRootEntry VTDRootEntry;
 
 #define VTD_CONTEXT_ENTRY_NR        (VTD_PAGE_SIZE / sizeof(VTDContextEntry))
 
+#define VTD_CTX_ENTRY_LEGACY_SIZE     16
+#define VTD_CTX_ENTRY_SCALABLE_SIZE   32
+
+#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
+#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
+#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
+
+/* PASID Table Related Definitions */
+#define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
+#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
+#define VTD_PASID_DIR_ENTRY_SIZE      8
+#define VTD_PASID_ENTRY_SIZE          64
+#define VTD_PASID_DIR_BITS_MASK       (0x3fffULL)
+#define VTD_PASID_DIR_INDEX(pasid)    (((pasid) >> 6) & VTD_PASID_DIR_BITS_MASK)
+#define VTD_PASID_DIR_FPD             (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
+#define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
+#define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+
+/* PASID Granular Translation Type Mask */
+#define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
+#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
+#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
+#define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
+#define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
+
+#define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
+#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+
+/* Second Level Page Translation Pointer*/
+#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
+
 /* Paging Structure common */
 #define VTD_SL_PT_PAGE_SIZE_MASK    (1ULL << 7)
 /* Bits to decide the offset for each level */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 77244fc..cae1b76 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -30,7 +30,7 @@ vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32
 vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
 vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
 vtd_fault_disabled(void) "Fault processing disabled for context entry"
-vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
+vtd_replay_ce_valid(const char *mode, uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "%s: replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
 vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
 vtd_page_walk_one(uint16_t domain, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "domain 0x%"PRIu16" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index a321cc9..72c5ca6 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -66,11 +66,20 @@ typedef struct VTDIOTLBEntry VTDIOTLBEntry;
 typedef struct VTDBus VTDBus;
 typedef union VTD_IR_TableEntry VTD_IR_TableEntry;
 typedef union VTD_IR_MSIAddress VTD_IR_MSIAddress;
+typedef struct VTDPASIDDirEntry VTDPASIDDirEntry;
+typedef struct VTDPASIDEntry VTDPASIDEntry;
 
 /* Context-Entry */
 struct VTDContextEntry {
-    uint64_t lo;
-    uint64_t hi;
+    union {
+        struct {
+            uint64_t lo;
+            uint64_t hi;
+        };
+        struct {
+            uint64_t val[4];
+        };
+    };
 };
 
 struct VTDContextCacheEntry {
@@ -81,6 +90,16 @@ struct VTDContextCacheEntry {
     struct VTDContextEntry context_entry;
 };
 
+/* PASID Directory Entry */
+struct VTDPASIDDirEntry {
+    uint64_t val;
+};
+
+/* PASID Table Entry */
+struct VTDPASIDEntry {
+    uint64_t val[8];
+};
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -212,6 +231,7 @@ struct IntelIOMMUState {
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
+    bool root_scalable;             /* Type of root table (scalable or not) */
     bool dmar_enabled;              /* Set if DMA remapping is enabled */
 
     uint16_t iq_head;               /* Current invalidation queue head */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation Yi Sun
@ 2019-03-01  6:52   ` Peter Xu
  2019-03-01  7:51     ` Yi Sun
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-03-01  6:52 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 28, 2019 at 09:47:55PM +0800, Yi Sun wrote:
> From: "Liu, Yi L" <yi.l.liu@intel.com>
> 
> Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
> replace extended context mode. This patch extends current emulator to
> support Scalable Mode which includes root table, context table and new
> pasid table format change. Now intel_iommu emulates both legacy mode
> and scalable mode (with legacy-equivalent capability set).
> 
> The key points are below:
> 1. Extend root table operations to support both legacy mode and scalable
>    mode.
> 2. Extend context table operations to support both legacy mode and
>    scalable mode.
> 3. Add pasid tabled operations to support scalable mode.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> [Yi Sun is co-developer to contribute much to refine the whole commit.]
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>

Hi, Yi,

The Patch looks very good to me already, though I still have some
trivial comments below.

[...]

> -static inline bool vtd_ce_present(VTDContextEntry *context)
> +static inline bool vtd_ce_present(VTDContextEntry *ce)
>  {
> -    return context->lo & VTD_CONTEXT_ENTRY_P;
> +    return ce->lo & VTD_CONTEXT_ENTRY_P;

The renaming seems not needed.

>  }
>  
> -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
> +static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
> +                                           VTDRootEntry *re,
> +                                           uint8_t index,
>                                             VTDContextEntry *ce)
>  {
> -    dma_addr_t addr;
> +    dma_addr_t addr, ce_size;
>  
>      /* we have checked that root entry is present */
> -    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
> -    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> +    ce_size = s->root_scalable ? VTD_CTX_ENTRY_SCALABLE_SIZE :
> +              VTD_CTX_ENTRY_LEGACY_SIZE;
> +
> +    if (s->root_scalable && index > UINT8_MAX / 2) {
> +        index = index & (~VTD_DEVFN_CHECK_MASK);
> +        addr = re->hi & VTD_ROOT_ENTRY_CTP;
> +    } else {
> +        addr = re->lo & VTD_ROOT_ENTRY_CTP;
> +    }
> +
> +    addr = addr + index * ce_size;
> +    if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
>          return -VTD_FR_CONTEXT_TABLE_INV;
>      }
> +
>      ce->lo = le64_to_cpu(ce->lo);
>      ce->hi = le64_to_cpu(ce->hi);
> +    if (s->root_scalable) {

(or use ce_size which might be more obvious)

> +        ce->val[2] = le64_to_cpu(ce->val[2]);
> +        ce->val[3] = le64_to_cpu(ce->val[3]);
> +    }
>      return 0;
>  }

[...]

> +static inline int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +                                             VTDContextEntry *ce,
> +                                             VTDPASIDEntry *pe)
> +{
> +    uint32_t pasid;
> +    dma_addr_t pasid_dir_base;
> +    int ret = 0;
> +
> +    pasid = VTD_CE_GET_RID2PASID(ce);
> +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
> +    ret = vtd_get_pasid_entry_from_pasid(s, pasid_dir_base, pasid, pe);
> +
> +    return ret;
> +}
> +
> +static inline int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
> +                                       VTDContextEntry *ce,
> +                                       bool *pe_fpd_set)

Many functions are defined as inlined (even some functions that may
not be that short IMHO) and many of them are not.  Could you share how
you decide which function should be inlined?

[...]

>  static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
>                                       VTDContextEntry *ce)
>  {
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(x86_iommu);
> +
> +    if (s->root_scalable) {
> +        /*
> +         * Translation Type locates in context entry only when VTD is in
> +         * legacy mode. For scalable mode, need to return true to avoid
> +         * unnecessary fault.
> +         */
> +        return true;
> +    }

Do you think we can move this check directly into caller of
vtd_ce_type_check() which is vtd_dev_to_context_entry()?  Then:

  if (scalable_mode)
    vtd_ce_rid2pasid_check()
  else
    vtd_ce_type_check()

You can comment on function vtd_ce_type_check() that this only checks
legacy context entries, since calling vtd_ce_type_check() upon an
scalable mode context entry does not make much sense itself already.

[...]

>  /* Return whether the device is using IOMMU translation. */
> @@ -1322,9 +1636,24 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>                                 cc_entry->context_cache_gen);
>          ce = cc_entry->context_entry;
>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> +        if (!is_fpd_set && s->root_scalable) {
> +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> +            if (ret_fr) {
> +                ret_fr = -ret_fr;
> +                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {

I noticed that I can't find how's vtd_qualified_faults defined and how
that was reflected in the vt-d spec...  Do you know?  Meanwhile, do
you need to add the new error VTD_FR_PASID_TABLE_INV into the table?

Also, this pattern repeated for times.  Maybe it's time to introduce a
new helper.  Your call on this:

        if (ret_fr) {
            ret_fr = -ret_fr;
            if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
                trace_vtd_fault_disabled();
            } else {
                vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
            }
            goto error;
        }

> +                    trace_vtd_fault_disabled();
> +                } else {
> +                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> +                }
> +                goto error;
> +            }
> +        }

[...]

> @@ -2629,6 +2958,7 @@ static const VMStateDescription vtd_vmstate = {
>          VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
>          VMSTATE_UINT8(iq_last_desc_type, IntelIOMMUState),
>          VMSTATE_BOOL(root_extended, IntelIOMMUState),
> +        VMSTATE_BOOL(root_scalable, IntelIOMMUState),
>          VMSTATE_BOOL(dmar_enabled, IntelIOMMUState),
>          VMSTATE_BOOL(qi_enabled, IntelIOMMUState),
>          VMSTATE_BOOL(intr_enabled, IntelIOMMUState),
> @@ -3098,9 +3428,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      vtd_address_space_unmap(vtd_as, n);
>  
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> -        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> +        trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" : "",

"scalable mode" : "legacy mode"?

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation
  2019-03-01  6:52   ` Peter Xu
@ 2019-03-01  7:51     ` Yi Sun
  0 siblings, 0 replies; 13+ messages in thread
From: Yi Sun @ 2019-03-01  7:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On 19-03-01 14:52:19, Peter Xu wrote:
> On Thu, Feb 28, 2019 at 09:47:55PM +0800, Yi Sun wrote:
> > From: "Liu, Yi L" <yi.l.liu@intel.com>
> > 
> > Intel(R) VT-d 3.0 spec introduces scalable mode address translation to
> > replace extended context mode. This patch extends current emulator to
> > support Scalable Mode which includes root table, context table and new
> > pasid table format change. Now intel_iommu emulates both legacy mode
> > and scalable mode (with legacy-equivalent capability set).
> > 
> > The key points are below:
> > 1. Extend root table operations to support both legacy mode and scalable
> >    mode.
> > 2. Extend context table operations to support both legacy mode and
> >    scalable mode.
> > 3. Add pasid tabled operations to support scalable mode.
> > 
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > [Yi Sun is co-developer to contribute much to refine the whole commit.]
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> 
> Hi, Yi,
> 
> The Patch looks very good to me already, though I still have some
> trivial comments below.
> 
Thanks for the review!

> [...]
> 
> > -static inline bool vtd_ce_present(VTDContextEntry *context)
> > +static inline bool vtd_ce_present(VTDContextEntry *ce)
> >  {
> > -    return context->lo & VTD_CONTEXT_ENTRY_P;
> > +    return ce->lo & VTD_CONTEXT_ENTRY_P;
> 
> The renaming seems not needed.
> 
Ok.

> >  }
> >  
> > -static int vtd_get_context_entry_from_root(VTDRootEntry *root, uint8_t index,
> > +static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
> > +                                           VTDRootEntry *re,
> > +                                           uint8_t index,
> >                                             VTDContextEntry *ce)
> >  {
> > -    dma_addr_t addr;
> > +    dma_addr_t addr, ce_size;
> >  
> >      /* we have checked that root entry is present */
> > -    addr = (root->val & VTD_ROOT_ENTRY_CTP) + index * sizeof(*ce);
> > -    if (dma_memory_read(&address_space_memory, addr, ce, sizeof(*ce))) {
> > +    ce_size = s->root_scalable ? VTD_CTX_ENTRY_SCALABLE_SIZE :
> > +              VTD_CTX_ENTRY_LEGACY_SIZE;
> > +
> > +    if (s->root_scalable && index > UINT8_MAX / 2) {
> > +        index = index & (~VTD_DEVFN_CHECK_MASK);
> > +        addr = re->hi & VTD_ROOT_ENTRY_CTP;
> > +    } else {
> > +        addr = re->lo & VTD_ROOT_ENTRY_CTP;
> > +    }
> > +
> > +    addr = addr + index * ce_size;
> > +    if (dma_memory_read(&address_space_memory, addr, ce, ce_size)) {
> >          return -VTD_FR_CONTEXT_TABLE_INV;
> >      }
> > +
> >      ce->lo = le64_to_cpu(ce->lo);
> >      ce->hi = le64_to_cpu(ce->hi);
> > +    if (s->root_scalable) {
> 
> (or use ce_size which might be more obvious)
> 
Ok, thanks.

> > +        ce->val[2] = le64_to_cpu(ce->val[2]);
> > +        ce->val[3] = le64_to_cpu(ce->val[3]);
> > +    }
> >      return 0;
> >  }
> 
> [...]
> 
> > +static inline int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> > +                                             VTDContextEntry *ce,
> > +                                             VTDPASIDEntry *pe)
> > +{
> > +    uint32_t pasid;
> > +    dma_addr_t pasid_dir_base;
> > +    int ret = 0;
> > +
> > +    pasid = VTD_CE_GET_RID2PASID(ce);
> > +    pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
> > +    ret = vtd_get_pasid_entry_from_pasid(s, pasid_dir_base, pasid, pe);
> > +
> > +    return ret;
> > +}
> > +
> > +static inline int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
> > +                                       VTDContextEntry *ce,
> > +                                       bool *pe_fpd_set)
> 
> Many functions are defined as inlined (even some functions that may
> not be that short IMHO) and many of them are not.  Could you share how
> you decide which function should be inlined?
> 
Sorry for that. It is caused by some historical reasons and I forgot
to refine them. I will go through the codes and remove unnecessary
'inline'.

For me, the function with simple flow (a few codes, without complex
calling stack, without circle) should be declared as 'inline' to
improve performance. 

> [...]
> 
> >  static inline bool vtd_ce_type_check(X86IOMMUState *x86_iommu,
> >                                       VTDContextEntry *ce)
> >  {
> > +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(x86_iommu);
> > +
> > +    if (s->root_scalable) {
> > +        /*
> > +         * Translation Type locates in context entry only when VTD is in
> > +         * legacy mode. For scalable mode, need to return true to avoid
> > +         * unnecessary fault.
> > +         */
> > +        return true;
> > +    }
> 
> Do you think we can move this check directly into caller of
> vtd_ce_type_check() which is vtd_dev_to_context_entry()?  Then:
> 
>   if (scalable_mode)
>     vtd_ce_rid2pasid_check()
>   else
>     vtd_ce_type_check()
> 
> You can comment on function vtd_ce_type_check() that this only checks
> legacy context entries, since calling vtd_ce_type_check() upon an
> scalable mode context entry does not make much sense itself already.
> 
A good suggestion, thanks!

> [...]
> 
> >  /* Return whether the device is using IOMMU translation. */
> > @@ -1322,9 +1636,24 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> >                                 cc_entry->context_cache_gen);
> >          ce = cc_entry->context_entry;
> >          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> > +        if (!is_fpd_set && s->root_scalable) {
> > +            ret_fr = vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set);
> > +            if (ret_fr) {
> > +                ret_fr = -ret_fr;
> > +                if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
> 
> I noticed that I can't find how's vtd_qualified_faults defined and how
> that was reflected in the vt-d spec...  Do you know?  Meanwhile, do
> you need to add the new error VTD_FR_PASID_TABLE_INV into the table?
> 
The faults are defined in spec 7.2.3, the table 25. FYR.

Yes, I should add VTD_FR_PASID_TABLE_INV into table. Thanks!

> Also, this pattern repeated for times.  Maybe it's time to introduce a
> new helper.  Your call on this:
> 
Ok, I will consider it.

>         if (ret_fr) {
>             ret_fr = -ret_fr;
>             if (is_fpd_set && vtd_is_qualified_fault(ret_fr)) {
>                 trace_vtd_fault_disabled();
>             } else {
>                 vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
>             }
>             goto error;
>         }
> 
> > +                    trace_vtd_fault_disabled();
> > +                } else {
> > +                    vtd_report_dmar_fault(s, source_id, addr, ret_fr, is_write);
> > +                }
> > +                goto error;
> > +            }
> > +        }
> 
> [...]
> 
> > @@ -2629,6 +2958,7 @@ static const VMStateDescription vtd_vmstate = {
> >          VMSTATE_UINT8_ARRAY(csr, IntelIOMMUState, DMAR_REG_SIZE),
> >          VMSTATE_UINT8(iq_last_desc_type, IntelIOMMUState),
> >          VMSTATE_BOOL(root_extended, IntelIOMMUState),
> > +        VMSTATE_BOOL(root_scalable, IntelIOMMUState),
> >          VMSTATE_BOOL(dmar_enabled, IntelIOMMUState),
> >          VMSTATE_BOOL(qi_enabled, IntelIOMMUState),
> >          VMSTATE_BOOL(intr_enabled, IntelIOMMUState),
> > @@ -3098,9 +3428,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> >      vtd_address_space_unmap(vtd_as, n);
> >  
> >      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> > -        trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> > +        trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" : "",
> 
> "scalable mode" : "legacy mode"?
> 
Sure.

> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-28 13:47 [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Yi Sun
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation Yi Sun
@ 2019-02-28 13:47 ` Yi Sun
  2019-03-01  6:59   ` Peter Xu
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
  2019-03-01  7:07 ` [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Peter Xu
  3 siblings, 1 reply; 13+ messages in thread
From: Yi Sun @ 2019-02-28 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

From: "Liu, Yi L" <yi.l.liu@intel.com>

Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
Mode. This patch adds emulation of 256bits qi_desc.

Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
[Yi Sun is co-developer to rebase and refine the patch.]
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
v2:
    - modify s-o-b position.
    - remove unnecessary macros.
    - change 'iq_dw' type to bool.
    - remove initialization to 'inv_desc->val[]'.
    - modify 'VTDInvDesc' to add a union 'val[4]' to be compatible
      with both legacy mode and scalable mode.
---
 hw/i386/intel_iommu.c          | 40 +++++++++++++++++++++++++++++-----------
 hw/i386/intel_iommu_internal.h |  9 ++++++++-
 include/hw/i386/intel_iommu.h  |  1 +
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 109fdbc..d1eb0c5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2028,7 +2028,7 @@ static void vtd_handle_gcmd_qie(IntelIOMMUState *s, bool en)
     if (en) {
         s->iq = iqa_val & VTD_IQA_IQA_MASK(s->aw_bits);
         /* 2^(x+8) entries */
-        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8);
+        s->iq_size = 1UL << ((iqa_val & VTD_IQA_QS) + 8 - (s->iq_dw ? 1 : 0));
         s->qi_enabled = true;
         trace_vtd_inv_qi_setup(s->iq, s->iq_size);
         /* Ok - report back to driver */
@@ -2195,19 +2195,24 @@ static void vtd_handle_iotlb_write(IntelIOMMUState *s)
 }
 
 /* Fetch an Invalidation Descriptor from the Invalidation Queue */
-static bool vtd_get_inv_desc(dma_addr_t base_addr, uint32_t offset,
+static bool vtd_get_inv_desc(IntelIOMMUState *s,
                              VTDInvDesc *inv_desc)
 {
-    dma_addr_t addr = base_addr + offset * sizeof(*inv_desc);
-    if (dma_memory_read(&address_space_memory, addr, inv_desc,
-        sizeof(*inv_desc))) {
-        error_report_once("Read INV DESC failed");
-        inv_desc->lo = 0;
-        inv_desc->hi = 0;
+    dma_addr_t base_addr = s->iq;
+    uint32_t offset = s->iq_head;
+    uint32_t dw = s->iq_dw ? 32 : 16;
+    dma_addr_t addr = base_addr + offset * dw;
+
+    if (dma_memory_read(&address_space_memory, addr, inv_desc, dw)) {
+        error_report_once("Read INV DESC failed.");
         return false;
     }
     inv_desc->lo = le64_to_cpu(inv_desc->lo);
     inv_desc->hi = le64_to_cpu(inv_desc->hi);
+    if (dw == 32) {
+        inv_desc->val[2] = le64_to_cpu(inv_desc->val[2]);
+        inv_desc->val[3] = le64_to_cpu(inv_desc->val[3]);
+    }
     return true;
 }
 
@@ -2413,10 +2418,11 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
     uint8_t desc_type;
 
     trace_vtd_inv_qi_head(s->iq_head);
-    if (!vtd_get_inv_desc(s->iq, s->iq_head, &inv_desc)) {
+    if (!vtd_get_inv_desc(s, &inv_desc)) {
         s->iq_last_desc_type = VTD_INV_DESC_NONE;
         return false;
     }
+
     desc_type = inv_desc.lo & VTD_INV_DESC_TYPE;
     /* FIXME: should update at first or at last? */
     s->iq_last_desc_type = desc_type;
@@ -2501,7 +2507,12 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
 {
     uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
 
-    s->iq_tail = VTD_IQT_QT(val);
+    if (s->iq_dw && val & VTD_IQT_QT_256_RSV_BIT) {
+        error_report_once("%s: RSV bit is set: val=0x%"PRIx64,
+                          __func__, val);
+        return;
+    }
+    s->iq_tail = VTD_IQT_QT(s->iq_dw, val);
     trace_vtd_inv_qi_tail(s->iq_tail);
 
     if (s->qi_enabled && !(vtd_get_long_raw(s, DMAR_FSTS_REG) & VTD_FSTS_IQE)) {
@@ -2770,6 +2781,12 @@ static void vtd_mem_write(void *opaque, hwaddr addr,
         } else {
             vtd_set_quad(s, addr, val);
         }
+        if (s->ecap & VTD_ECAP_SMTS &&
+            val & VTD_IQA_DW_MASK) {
+            s->iq_dw = true;
+        } else {
+            s->iq_dw = false;
+        }
         break;
 
     case DMAR_IQA_REG_HI:
@@ -3477,6 +3494,7 @@ static void vtd_init(IntelIOMMUState *s)
     s->iq_size = 0;
     s->qi_enabled = false;
     s->iq_last_desc_type = VTD_INV_DESC_NONE;
+    s->iq_dw = false;
     s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
@@ -3554,7 +3572,7 @@ static void vtd_init(IntelIOMMUState *s)
 
     vtd_define_quad(s, DMAR_IQH_REG, 0, 0, 0);
     vtd_define_quad(s, DMAR_IQT_REG, 0, 0x7fff0ULL, 0);
-    vtd_define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff007ULL, 0);
+    vtd_define_quad(s, DMAR_IQA_REG, 0, 0xfffffffffffff807ULL, 0);
     vtd_define_long(s, DMAR_ICS_REG, 0, 0, 0x1UL);
     vtd_define_long(s, DMAR_IECTL_REG, 0x80000000UL, 0x80000000UL, 0);
     vtd_define_long(s, DMAR_IEDATA_REG, 0, 0xffffffffUL, 0);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index fe72bc3..016fa4c 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,6 +190,7 @@
 #define VTD_ECAP_EIM                (1ULL << 4)
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_SMTS               (1ULL << 43)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -218,11 +219,14 @@
 #define VTD_CAP_SAGAW_48bit         (0x4ULL << VTD_CAP_SAGAW_SHIFT)
 
 /* IQT_REG */
-#define VTD_IQT_QT(val)             (((val) >> 4) & 0x7fffULL)
+#define VTD_IQT_QT(dw_bit, val)     (dw_bit ? (((val) >> 5) & 0x3fffULL) : \
+                                     (((val) >> 4) & 0x7fffULL))
+#define VTD_IQT_QT_256_RSV_BIT      0x10
 
 /* IQA_REG */
 #define VTD_IQA_IQA_MASK(aw)        (VTD_HAW_MASK(aw) ^ 0xfffULL)
 #define VTD_IQA_QS                  0x7ULL
+#define VTD_IQA_DW_MASK             0x800
 
 /* IQH_REG */
 #define VTD_IQH_QH_SHIFT            4
@@ -324,6 +328,9 @@ union VTDInvDesc {
         uint64_t lo;
         uint64_t hi;
     };
+    struct {
+        uint64_t val[4];
+    };
     union {
         VTDInvDescIEC iec;
     };
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 72c5ca6..2877c94 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -238,6 +238,7 @@ struct IntelIOMMUState {
     uint16_t iq_tail;               /* Current invalidation queue tail */
     dma_addr_t iq;                  /* Current invalidation queue pointer */
     uint16_t iq_size;               /* IQ Size in number of entries */
+    bool iq_dw;                     /* IQ descriptor width 256bit or not */
     bool qi_enabled;                /* Set if the QI is enabled */
     uint8_t iq_last_desc_type;      /* The type of last completed descriptor */
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
@ 2019-03-01  6:59   ` Peter Xu
  2019-03-01  7:51     ` Yi Sun
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-03-01  6:59 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 28, 2019 at 09:47:56PM +0800, Yi Sun wrote:
> From: "Liu, Yi L" <yi.l.liu@intel.com>
> 
> Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
> Mode. This patch adds emulation of 256bits qi_desc.
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> [Yi Sun is co-developer to rebase and refine the patch.]
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>

[...]

> @@ -2501,7 +2507,12 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
>  {
>      uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
>  
> -    s->iq_tail = VTD_IQT_QT(val);
> +    if (s->iq_dw && val & VTD_IQT_QT_256_RSV_BIT) {

Nit: Let's do (val & VTD_IQT_QT_256_RSV_BIT) to be clear.  With that:

Reviewed-by: Peter Xu <peterx@redhat.com>

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support
  2019-03-01  6:59   ` Peter Xu
@ 2019-03-01  7:51     ` Yi Sun
  0 siblings, 0 replies; 13+ messages in thread
From: Yi Sun @ 2019-03-01  7:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On 19-03-01 14:59:00, Peter Xu wrote:
> On Thu, Feb 28, 2019 at 09:47:56PM +0800, Yi Sun wrote:
> > From: "Liu, Yi L" <yi.l.liu@intel.com>
> > 
> > Per Intel(R) VT-d 3.0, the qi_desc is 256 bits in Scalable
> > Mode. This patch adds emulation of 256bits qi_desc.
> > 
> > Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> > [Yi Sun is co-developer to rebase and refine the patch.]
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> 
> [...]
> 
> > @@ -2501,7 +2507,12 @@ static void vtd_handle_iqt_write(IntelIOMMUState *s)
> >  {
> >      uint64_t val = vtd_get_quad_raw(s, DMAR_IQT_REG);
> >  
> > -    s->iq_tail = VTD_IQT_QT(val);
> > +    if (s->iq_dw && val & VTD_IQT_QT_256_RSV_BIT) {
> 
> Nit: Let's do (val & VTD_IQT_QT_256_RSV_BIT) to be clear.  With that:
> 
Sure. Thanks!

> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Regards,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-28 13:47 [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Yi Sun
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation Yi Sun
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
@ 2019-02-28 13:47 ` Yi Sun
  2019-03-01  7:04   ` Peter Xu
  2019-03-01  7:07 ` [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Peter Xu
  3 siblings, 1 reply; 13+ messages in thread
From: Yi Sun @ 2019-02-28 13:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: pbonzini, rth, ehabkost, mst, marcel.apfelbaum, peterx, jasowang,
	kevin.tian, yi.l.liu, yi.y.sun, Yi Sun

This patch adds an option to provide flexibility for user to expose
Scalable Mode to guest. User could expose Scalable Mode to guest by
the config as below:

"-device intel-iommu,caching-mode=on,scalable-mode=on"

The Linux iommu driver has supported scalable mode. Please refer below
patch set:

    https://www.spinics.net/lists/kernel/msg2985279.html

Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
---
v2:
    - rename "scalable-mode" to "x-scalable-mode".
    - remove caching_mode check when scalable_mode is set.
    - check dma_drain check when scalable_mode is set. This is requested
      by spec.
    - remove redundant macros.
---
 hw/i386/intel_iommu.c          | 24 ++++++++++++++++++++++++
 hw/i386/intel_iommu_internal.h |  4 ++++
 include/hw/i386/intel_iommu.h  |  3 ++-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d1eb0c5..ec7722d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1733,6 +1733,9 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
 {
     s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
     s->root_extended = s->root & VTD_RTADDR_RTT;
+    if (s->scalable_mode) {
+        s->root_scalable = s->root & VTD_RTADDR_SMT;
+    }
     s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
 
     trace_vtd_reg_dmar_root(s->root, s->root_extended);
@@ -2442,6 +2445,17 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    /*
+     * TODO: the entity of below two cases will be implemented in future series.
+     * To make guest (which integrates scalable mode support patch set in
+     * iommu driver) work, just return true is enough so far.
+     */
+    case VTD_INV_DESC_PC:
+        break;
+
+    case VTD_INV_DESC_PIOTLB:
+        break;
+
     case VTD_INV_DESC_WAIT:
         trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
         if (!vtd_process_wait_desc(s, &inv_desc)) {
@@ -3006,6 +3020,7 @@ static Property vtd_properties[] = {
     DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
+    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -3540,6 +3555,15 @@ static void vtd_init(IntelIOMMUState *s)
         s->cap |= VTD_CAP_CM;
     }
 
+    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
+    if (s->scalable_mode) {
+        if (!s->dma_drain) {
+            error_report("Need to set dma_drain for scalable mode");
+            exit(1);
+        }
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+    }
+
     vtd_reset_caches(s);
 
     /* Define registers with default values and bit semantics */
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 016fa4c..1160618 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,7 +190,9 @@
 #define VTD_ECAP_EIM                (1ULL << 4)
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_SMTS               (1ULL << 43)
+#define VTD_ECAP_SLTS               (1ULL << 46)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -345,6 +347,8 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_IEC                0x4 /* Interrupt Entry Cache
                                                Invalidate Descriptor */
 #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
+#define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
+#define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
 #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
 
 /* Masks for Invalidation Wait Descriptor*/
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 2877c94..c11e3d5 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -227,7 +227,8 @@ struct IntelIOMMUState {
     uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
     uint32_t version;
 
-    bool caching_mode;          /* RO - is cap CM enabled? */
+    bool caching_mode;              /* RO - is cap CM enabled? */
+    bool scalable_mode;             /* RO - is Scalable Mode supported? */
 
     dma_addr_t root;                /* Current root table pointer */
     bool root_extended;             /* Type of root table (extended or not) */
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
@ 2019-03-01  7:04   ` Peter Xu
  2019-03-01  7:54     ` Yi Sun
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-03-01  7:04 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 28, 2019 at 09:47:57PM +0800, Yi Sun wrote:
> This patch adds an option to provide flexibility for user to expose
> Scalable Mode to guest. User could expose Scalable Mode to guest by
> the config as below:
> 
> "-device intel-iommu,caching-mode=on,scalable-mode=on"
> 
> The Linux iommu driver has supported scalable mode. Please refer below
> patch set:
> 
>     https://www.spinics.net/lists/kernel/msg2985279.html
> 
> Signed-off-by: Liu, Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> ---
> v2:
>     - rename "scalable-mode" to "x-scalable-mode".
>     - remove caching_mode check when scalable_mode is set.
>     - check dma_drain check when scalable_mode is set. This is requested
>       by spec.
>     - remove redundant macros.
> ---
>  hw/i386/intel_iommu.c          | 24 ++++++++++++++++++++++++
>  hw/i386/intel_iommu_internal.h |  4 ++++
>  include/hw/i386/intel_iommu.h  |  3 ++-
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d1eb0c5..ec7722d 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1733,6 +1733,9 @@ static void vtd_root_table_setup(IntelIOMMUState *s)
>  {
>      s->root = vtd_get_quad_raw(s, DMAR_RTADDR_REG);
>      s->root_extended = s->root & VTD_RTADDR_RTT;
> +    if (s->scalable_mode) {
> +        s->root_scalable = s->root & VTD_RTADDR_SMT;
> +    }
>      s->root &= VTD_RTADDR_ADDR_MASK(s->aw_bits);
>  
>      trace_vtd_reg_dmar_root(s->root, s->root_extended);
> @@ -2442,6 +2445,17 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> +    /*
> +     * TODO: the entity of below two cases will be implemented in future series.
> +     * To make guest (which integrates scalable mode support patch set in
> +     * iommu driver) work, just return true is enough so far.
> +     */
> +    case VTD_INV_DESC_PC:
> +        break;
> +
> +    case VTD_INV_DESC_PIOTLB:
> +        break;
> +
>      case VTD_INV_DESC_WAIT:
>          trace_vtd_inv_desc("wait", inv_desc.hi, inv_desc.lo);
>          if (!vtd_process_wait_desc(s, &inv_desc)) {
> @@ -3006,6 +3020,7 @@ static Property vtd_properties[] = {
>      DEFINE_PROP_UINT8("aw-bits", IntelIOMMUState, aw_bits,
>                        VTD_HOST_ADDRESS_WIDTH),
>      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
> +    DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -3540,6 +3555,15 @@ static void vtd_init(IntelIOMMUState *s)
>          s->cap |= VTD_CAP_CM;
>      }
>  
> +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> +    if (s->scalable_mode) {
> +        if (!s->dma_drain) {
> +            error_report("Need to set dma_drain for scalable mode");
> +            exit(1);
> +        }

This patch looks mostly good to me, only that can we move this check
to vtd_decide_config()?  That's where most similar checks are done.

> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +    }
> +
>      vtd_reset_caches(s);
>  
>      /* Define registers with default values and bit semantics */
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 016fa4c..1160618 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,7 +190,9 @@
>  #define VTD_ECAP_EIM                (1ULL << 4)
>  #define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
> +#define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> +#define VTD_ECAP_SLTS               (1ULL << 46)
>  
>  /* CAP_REG */
>  /* (offset >> 4) << 24 */
> @@ -345,6 +347,8 @@ typedef union VTDInvDesc VTDInvDesc;
>  #define VTD_INV_DESC_IEC                0x4 /* Interrupt Entry Cache
>                                                 Invalidate Descriptor */
>  #define VTD_INV_DESC_WAIT               0x5 /* Invalidation Wait Descriptor */
> +#define VTD_INV_DESC_PIOTLB             0x6 /* PASID-IOTLB Invalidate Desc */
> +#define VTD_INV_DESC_PC                 0x7 /* PASID-cache Invalidate Desc */
>  #define VTD_INV_DESC_NONE               0   /* Not an Invalidate Descriptor */
>  
>  /* Masks for Invalidation Wait Descriptor*/
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 2877c94..c11e3d5 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -227,7 +227,8 @@ struct IntelIOMMUState {
>      uint8_t womask[DMAR_REG_SIZE];  /* WO (write only - read returns 0) */
>      uint32_t version;
>  
> -    bool caching_mode;          /* RO - is cap CM enabled? */
> +    bool caching_mode;              /* RO - is cap CM enabled? */
> +    bool scalable_mode;             /* RO - is Scalable Mode supported? */
>  
>      dma_addr_t root;                /* Current root table pointer */
>      bool root_extended;             /* Type of root table (extended or not) */
> -- 
> 1.9.1
> 

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work
  2019-03-01  7:04   ` Peter Xu
@ 2019-03-01  7:54     ` Yi Sun
  0 siblings, 0 replies; 13+ messages in thread
From: Yi Sun @ 2019-03-01  7:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On 19-03-01 15:04:14, Peter Xu wrote:

[...]

> > @@ -3540,6 +3555,15 @@ static void vtd_init(IntelIOMMUState *s)
> >          s->cap |= VTD_CAP_CM;
> >      }
> >  
> > +    /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> > +    if (s->scalable_mode) {
> > +        if (!s->dma_drain) {
> > +            error_report("Need to set dma_drain for scalable mode");
> > +            exit(1);
> > +        }
> 
> This patch looks mostly good to me, only that can we move this check
> to vtd_decide_config()?  That's where most similar checks are done.
>
I think that is fine. Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode
  2019-02-28 13:47 [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Yi Sun
                   ` (2 preceding siblings ...)
  2019-02-28 13:47 ` [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
@ 2019-03-01  7:07 ` Peter Xu
  2019-03-01  7:13   ` Yi Sun
  3 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-03-01  7:07 UTC (permalink / raw)
  To: Yi Sun
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On Thu, Feb 28, 2019 at 09:47:54PM +0800, Yi Sun wrote:
> Intel vt-d rev3.0 [1] introduces a new translation mode called
> 'scalable mode', which enables PASID-granular translations for
> first level, second level, nested and pass-through modes. The
> vt-d scalable mode is the key ingredient to enable Scalable I/O
> Virtualization (Scalable IOV) [2] [3], which allows sharing a
> device in minimal possible granularity (ADI - Assignable Device
> Interface). As a result, previous Extended Context (ECS) mode
> is deprecated (no production ever implements ECS).
> 
> This patch set emulates a minimal capability set of VT-d scalable
> mode, equivalent to what is available in VT-d legacy mode today:
>     1. Scalable mode root entry, context entry and PASID table
>     2. Seconds level translation under scalable mode
>     3. Queued invalidation (with 256 bits descriptor)
>     4. Pass-through mode
> 
> Corresponding intel-iommu driver support will be included in
> kernel 5.0:
>     https://www.spinics.net/lists/kernel/msg2985279.html
> 
> We will add emulation of full scalable mode capability along with
> guest iommu driver progress later, e.g.:
>     1. First level translation
>     2. Nested translation
>     3. Per-PASID invalidation descriptors
>     4. Page request services for handling recoverable faults
> 
> To verify the patches, below cases were tested according to Peter Xu's
> suggestions.
>     +---------+----------------------------------------------------------------+----------------------------------------------------------------+
>     |         |                      w/ Device Passthr                         |                     w/o Device Passthr                         |
>     |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
>     |         | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      |
>     |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
>     |         | netperf | kernel bld | data cp| netperf | kernel bld | data cp | netperf | kernel bld | data cp| netperf | kernel bld | data cp |
>     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
>     | Legacy  | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
>     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
>     | Scalable| Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
>     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+

Hi, Yi,

Thanks very much for the thorough test matrix!

The last thing I'd like to confirm is have you tested device
assignment with v2?  And note that when you test with virtio devices
you should not need caching-mode=on (but caching-mode=on should not
break anyone though).

I've still got some comments here and there but it looks very good at
least to me overall.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode
  2019-03-01  7:07 ` [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Peter Xu
@ 2019-03-01  7:13   ` Yi Sun
  2019-03-01  7:30     ` Tian, Kevin
  0 siblings, 1 reply; 13+ messages in thread
From: Yi Sun @ 2019-03-01  7:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, pbonzini, rth, ehabkost, mst, marcel.apfelbaum,
	jasowang, kevin.tian, yi.l.liu, yi.y.sun

On 19-03-01 15:07:34, Peter Xu wrote:
> On Thu, Feb 28, 2019 at 09:47:54PM +0800, Yi Sun wrote:
> > Intel vt-d rev3.0 [1] introduces a new translation mode called
> > 'scalable mode', which enables PASID-granular translations for
> > first level, second level, nested and pass-through modes. The
> > vt-d scalable mode is the key ingredient to enable Scalable I/O
> > Virtualization (Scalable IOV) [2] [3], which allows sharing a
> > device in minimal possible granularity (ADI - Assignable Device
> > Interface). As a result, previous Extended Context (ECS) mode
> > is deprecated (no production ever implements ECS).
> > 
> > This patch set emulates a minimal capability set of VT-d scalable
> > mode, equivalent to what is available in VT-d legacy mode today:
> >     1. Scalable mode root entry, context entry and PASID table
> >     2. Seconds level translation under scalable mode
> >     3. Queued invalidation (with 256 bits descriptor)
> >     4. Pass-through mode
> > 
> > Corresponding intel-iommu driver support will be included in
> > kernel 5.0:
> >     https://www.spinics.net/lists/kernel/msg2985279.html
> > 
> > We will add emulation of full scalable mode capability along with
> > guest iommu driver progress later, e.g.:
> >     1. First level translation
> >     2. Nested translation
> >     3. Per-PASID invalidation descriptors
> >     4. Page request services for handling recoverable faults
> > 
> > To verify the patches, below cases were tested according to Peter Xu's
> > suggestions.
> >     +---------+----------------------------------------------------------------+----------------------------------------------------------------+
> >     |         |                      w/ Device Passthr                         |                     w/o Device Passthr                         |
> >     |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
> >     |         | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      |
> >     |         +-------------------------------+--------------------------------+-------------------------------+--------------------------------+
> >     |         | netperf | kernel bld | data cp| netperf | kernel bld | data cp | netperf | kernel bld | data cp| netperf | kernel bld | data cp |
> >     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
> >     | Legacy  | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
> >     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
> >     | Scalable| Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    |
> >     +---------+-------------------------------+--------------------------------+-------------------------------+--------------------------------+
> 
> Hi, Yi,
> 
> Thanks very much for the thorough test matrix!
> 
Thanks for the review and comments! :)

> The last thing I'd like to confirm is have you tested device
> assignment with v2?  And note that when you test with virtio devices

Yes, I tested a MDEV assignment which can walk the Scalable Mode
patches flows (both kernel and qemu).

> you should not need caching-mode=on (but caching-mode=on should not
> break anyone though).
> 
For virtio-net-pci without device assignment, I did not use
"caching-mode=on".
 
> I've still got some comments here and there but it looks very good at
> least to me overall.
> 
> Thanks,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode
  2019-03-01  7:13   ` Yi Sun
@ 2019-03-01  7:30     ` Tian, Kevin
  0 siblings, 0 replies; 13+ messages in thread
From: Tian, Kevin @ 2019-03-01  7:30 UTC (permalink / raw)
  To: Yi Sun, Peter Xu
  Cc: qemu-devel@nongnu.org, pbonzini@redhat.com, rth@twiddle.net,
	ehabkost@redhat.com, mst@redhat.com, marcel.apfelbaum@gmail.com,
	jasowang@redhat.com, Liu, Yi L, Sun, Yi Y

> From: Yi Sun [mailto:yi.y.sun@linux.intel.com]
> Sent: Friday, March 1, 2019 3:13 PM
> 
> On 19-03-01 15:07:34, Peter Xu wrote:
> > On Thu, Feb 28, 2019 at 09:47:54PM +0800, Yi Sun wrote:
> > > Intel vt-d rev3.0 [1] introduces a new translation mode called
> > > 'scalable mode', which enables PASID-granular translations for
> > > first level, second level, nested and pass-through modes. The
> > > vt-d scalable mode is the key ingredient to enable Scalable I/O
> > > Virtualization (Scalable IOV) [2] [3], which allows sharing a
> > > device in minimal possible granularity (ADI - Assignable Device
> > > Interface). As a result, previous Extended Context (ECS) mode
> > > is deprecated (no production ever implements ECS).
> > >
> > > This patch set emulates a minimal capability set of VT-d scalable
> > > mode, equivalent to what is available in VT-d legacy mode today:
> > >     1. Scalable mode root entry, context entry and PASID table
> > >     2. Seconds level translation under scalable mode
> > >     3. Queued invalidation (with 256 bits descriptor)
> > >     4. Pass-through mode
> > >
> > > Corresponding intel-iommu driver support will be included in
> > > kernel 5.0:
> > >     https://www.spinics.net/lists/kernel/msg2985279.html
> > >
> > > We will add emulation of full scalable mode capability along with
> > > guest iommu driver progress later, e.g.:
> > >     1. First level translation
> > >     2. Nested translation
> > >     3. Per-PASID invalidation descriptors
> > >     4. Page request services for handling recoverable faults
> > >
> > > To verify the patches, below cases were tested according to Peter Xu's
> > > suggestions.
> > >     +---------+----------------------------------------------------------------+-----------------------
> -----------------------------------------+
> > >     |         |                      w/ Device Passthr                         |                     w/o Device
> Passthr                         |
> > >     |         +-------------------------------+--------------------------------+-------------------------
> ------+--------------------------------+
> > >     |         | virtio-net-pci, vhost=on      | virtio-net-pci, vhost=off      | virtio-
> net-pci, vhost=on      | virtio-net-pci, vhost=off      |
> > >     |         +-------------------------------+--------------------------------+-------------------------
> ------+--------------------------------+
> > >     |         | netperf | kernel bld | data cp| netperf | kernel bld | data cp |
> netperf | kernel bld | data cp| netperf | kernel bld | data cp |
> > >     +---------+-------------------------------+--------------------------------+----------------------
> ---------+--------------------------------+
> > >     | Legacy  | Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    |
> Pass       | Pass   | Pass    | Pass       | Pass    |
> > >     +---------+-------------------------------+--------------------------------+----------------------
> ---------+--------------------------------+
> > >     | Scalable| Pass    | Pass       | Pass   | Pass    | Pass       | Pass    | Pass    |
> Pass       | Pass   | Pass    | Pass       | Pass    |
> > >     +---------+-------------------------------+--------------------------------+----------------------
> ---------+--------------------------------+
> >
> > Hi, Yi,
> >
> > Thanks very much for the thorough test matrix!
> >
> Thanks for the review and comments! :)
> 
> > The last thing I'd like to confirm is have you tested device
> > assignment with v2?  And note that when you test with virtio devices
> 
> Yes, I tested a MDEV assignment which can walk the Scalable Mode
> patches flows (both kernel and qemu).

not just MDEV. You should also try physical PCI endpoint device.

> 
> > you should not need caching-mode=on (but caching-mode=on should not
> > break anyone though).
> >
> For virtio-net-pci without device assignment, I did not use
> "caching-mode=on".
> 
> > I've still got some comments here and there but it looks very good at
> > least to me overall.
> >
> > Thanks,
> >
> > --
> > Peter Xu

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-03-01  7:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-02-28 13:47 [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Yi Sun
2019-02-28 13:47 ` [Qemu-devel] [RFC v2 1/3] intel_iommu: scalable mode emulation Yi Sun
2019-03-01  6:52   ` Peter Xu
2019-03-01  7:51     ` Yi Sun
2019-02-28 13:47 ` [Qemu-devel] [RFC v2 2/3] intel_iommu: add 256 bits qi_desc support Yi Sun
2019-03-01  6:59   ` Peter Xu
2019-03-01  7:51     ` Yi Sun
2019-02-28 13:47 ` [Qemu-devel] [RFC v2 3/3] intel_iommu: add scalable-mode option to make scalable mode work Yi Sun
2019-03-01  7:04   ` Peter Xu
2019-03-01  7:54     ` Yi Sun
2019-03-01  7:07 ` [Qemu-devel] [RFC v2 0/3] intel_iommu: support scalable mode Peter Xu
2019-03-01  7:13   ` Yi Sun
2019-03-01  7:30     ` Tian, Kevin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).