* [PATCH 0/8] x86/EPT: various adjustments
@ 2014-03-26 15:23 Jan Beulich
2014-03-26 15:31 ` [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s Jan Beulich
` (8 more replies)
0 siblings, 9 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:23 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
1: p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s
2: EPT: simplification and cleanup
3: EPT: also dump permissions and memory types
4: EPT: relax treatment of APIC MFN
5: vMTRR: pass domain to mtrr_*_msr_set()
6: EPT: force re-evaluation of memory type as necessary
7: EPT: split super pages upon mismatching memory types
8: EPT: IOMMU snoop capability should not affect memory type selection
Signed-off-by: Jan Beulich <jbeulich@suse.com>
* [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
@ 2014-03-26 15:31 ` Jan Beulich
2014-03-27 12:52 ` Tim Deegan
2014-03-26 15:33 ` [PATCH 2/8] x86/EPT: simplification and cleanup Jan Beulich
` (7 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:31 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
- don't reset access permissions
- don't shatter super pages
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Should p2m_change_type() also behave consistently?
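For illustration, a minimal standalone sketch of the order selection done by
the loop below (change_order() is a hypothetical helper, not part of the
patch): an entry is rewritten at its existing order only if its current type
matches the old one, gfn is naturally aligned to that order, and the whole
superpage lies inside the range being changed; otherwise the order is stepped
down 1G -> 2M -> 4K.
/*
 * Hypothetical helper, not part of the patch: mirrors the order
 * selection done by the loop in p2m_change_type_range() below.
 */
static unsigned int change_order(unsigned int mapping_order, p2m_type_t pt,
                                 p2m_type_t ot, unsigned long gfn,
                                 unsigned long end)
{
    unsigned int order = mapping_order;
    while ( order > PAGE_ORDER_4K )
    {
        unsigned long mask = (1UL << order) - 1;
        /* Type doesn't match: nothing will be written, the order only
         * steers how far gfn advances. */
        if ( pt != ot )
            break;
        /* gfn aligned and superpage fully inside [gfn, end): keep it. */
        if ( !(gfn & mask) && end > (gfn | mask) )
            break;
        order = order == PAGE_ORDER_1G ? PAGE_ORDER_2M : PAGE_ORDER_4K;
    }
    return order;
}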
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -693,8 +693,7 @@ p2m_type_t p2m_change_type(struct domain
return pt;
}
-/* Modify the p2m type of a range of gfns from ot to nt.
- * Resets the access permissions. */
+/* Modify the p2m type of a range of gfns from ot to nt. */
void p2m_change_type_range(struct domain *d,
unsigned long start, unsigned long end,
p2m_type_t ot, p2m_type_t nt)
@@ -710,11 +709,29 @@ void p2m_change_type_range(struct domain
p2m_lock(p2m);
p2m->defer_nested_flush = 1;
- for ( gfn = start; gfn < end; gfn++ )
+ for ( gfn = start; gfn < end; )
{
- mfn = p2m->get_entry(p2m, gfn, &pt, &a, 0, NULL);
+ unsigned int order;
+
+ mfn = p2m->get_entry(p2m, gfn, &pt, &a, 0, &order);
+ while ( order > PAGE_ORDER_4K )
+ {
+ if ( pt != ot )
+ break;
+ if ( !(gfn & ((1UL << order) - 1)) &&
+ end > (gfn | ((1UL << order) - 1)) )
+ break;
+ if ( order == PAGE_ORDER_1G )
+ order = PAGE_ORDER_2M;
+ else
+ order = PAGE_ORDER_4K;
+ }
if ( pt == ot )
- set_p2m_entry(p2m, gfn, mfn, PAGE_ORDER_4K, nt, p2m->default_access);
+ set_p2m_entry(p2m, gfn, mfn, order, nt, a);
+ gfn += 1UL << order;
+ gfn &= -1UL << order;
+ if ( !gfn )
+ break;
}
p2m->defer_nested_flush = 0;
* [PATCH 2/8] x86/EPT: simplification and cleanup
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
2014-03-26 15:31 ` [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s Jan Beulich
@ 2014-03-26 15:33 ` Jan Beulich
2014-03-27 12:52 ` Tim Deegan
2014-03-26 15:34 ` [PATCH 3/8] x86/EPT: also dump permissions and memory types Jan Beulich
` (6 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:33 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
- drop rsvd*_ prefixes from fields not really reserved anymore
- replace odd uses of <expr> ? 1 : 0
- drop pointless variables from ept_set_entry()
- streamline IOMMU mirroring code in ept_set_entry()
- don't open code is_epte_valid() (and properly use it when dumping)
- streamline entry cloning in ept_split_super_page()
- compact dumping code and output
Signed-off-by: Jan Beulich <jbeulich@suse.com>
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -182,14 +182,13 @@ static int ept_split_super_page(struct p
{
ept_entry_t *epte = table + i;
- epte->epte = 0;
- epte->emt = ept_entry->emt;
- epte->ipat = ept_entry->ipat;
- epte->sp = (level > 1) ? 1 : 0;
- epte->access = ept_entry->access;
- epte->sa_p2mt = ept_entry->sa_p2mt;
- epte->mfn = ept_entry->mfn + i * trunk;
- epte->rsvd2_snp = ( iommu_enabled && iommu_snoop ) ? 1 : 0;
+ *epte = *ept_entry;
+ epte->sp = (level > 1);
+ epte->mfn += i * trunk;
+ epte->snp = (iommu_enabled && iommu_snoop);
+ ASSERT(!epte->rsvd1);
+ ASSERT(!epte->avail1);
+ ASSERT(!epte->avail3);
ept_p2m_type_to_flags(epte, epte->sa_p2mt, epte->access);
@@ -281,8 +280,6 @@ ept_set_entry(struct p2m_domain *p2m, un
{
ept_entry_t *table, *ept_entry = NULL;
unsigned long gfn_remainder = gfn;
- unsigned long offset = 0;
- u32 index;
int i, target = order / EPT_TABLE_ORDER;
int rv = 0;
int ret = 0;
@@ -324,13 +321,10 @@ ept_set_entry(struct p2m_domain *p2m, un
ASSERT(ret != GUEST_TABLE_POD_PAGE || i != target);
- index = gfn_remainder >> (i * EPT_TABLE_ORDER);
- offset = gfn_remainder & ((1UL << (i * EPT_TABLE_ORDER)) - 1);
-
- ept_entry = table + index;
+ ept_entry = table + (gfn_remainder >> (i * EPT_TABLE_ORDER));
/* In case VT-d uses same page table, this flag is needed by VT-d */
- vtd_pte_present = is_epte_present(ept_entry) ? 1 : 0;
+ vtd_pte_present = is_epte_present(ept_entry);
/*
* If we're here with i > target, we must be at a leaf node, and
@@ -364,7 +358,7 @@ ept_set_entry(struct p2m_domain *p2m, un
direct_mmio);
new_entry.ipat = ipat;
- new_entry.sp = order ? 1 : 0;
+ new_entry.sp = !!order;
new_entry.sa_p2mt = p2mt;
new_entry.access = p2ma;
new_entry.rsvd2_snp = (iommu_enabled && iommu_snoop);
@@ -406,17 +400,14 @@ ept_set_entry(struct p2m_domain *p2m, un
/* We just installed the pages we need. */
ASSERT(i == target);
- index = gfn_remainder >> (i * EPT_TABLE_ORDER);
- offset = gfn_remainder & ((1UL << (i * EPT_TABLE_ORDER)) - 1);
-
- ept_entry = table + index;
+ ept_entry = table + (gfn_remainder >> (i * EPT_TABLE_ORDER));
new_entry.emt = epte_get_entry_emt(d, gfn, mfn, &ipat, direct_mmio);
new_entry.ipat = ipat;
- new_entry.sp = i ? 1 : 0;
+ new_entry.sp = !!i;
new_entry.sa_p2mt = p2mt;
new_entry.access = p2ma;
- new_entry.rsvd2_snp = (iommu_enabled && iommu_snoop);
+ new_entry.snp = (iommu_enabled && iommu_snoop);
/* the caller should take care of the previous page */
new_entry.mfn = mfn_x(mfn);
@@ -445,36 +436,20 @@ out:
ept_sync_domain(p2m);
/* For non-nested p2m, may need to change VT-d page table.*/
- if ( rv && !p2m_is_nestedp2m(p2m) && iommu_enabled &&
- need_iommu(p2m->domain) && need_modify_vtd_table )
+ if ( rv && !p2m_is_nestedp2m(p2m) && need_iommu(d) &&
+ need_modify_vtd_table )
{
if ( iommu_hap_pt_share )
- iommu_pte_flush(d, gfn, (u64*)ept_entry, order, vtd_pte_present);
+ iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present);
else
{
if ( p2mt == p2m_ram_rw )
- {
- if ( order > 0 )
- {
- for ( i = 0; i < (1 << order); i++ )
- iommu_map_page(
- p2m->domain, gfn - offset + i, mfn_x(mfn) - offset + i,
- IOMMUF_readable | IOMMUF_writable);
- }
- else if ( !order )
- iommu_map_page(
- p2m->domain, gfn, mfn_x(mfn), IOMMUF_readable | IOMMUF_writable);
- }
+ for ( i = 0; i < (1 << order); i++ )
+ iommu_map_page(d, gfn + i, mfn_x(mfn) + i,
+ IOMMUF_readable | IOMMUF_writable);
else
- {
- if ( order > 0 )
- {
- for ( i = 0; i < (1 << order); i++ )
- iommu_unmap_page(p2m->domain, gfn - offset + i);
- }
- else if ( !order )
- iommu_unmap_page(p2m->domain, gfn);
- }
+ for ( i = 0; i < (1 << order); i++ )
+ iommu_unmap_page(d, gfn + i);
}
}
@@ -558,9 +533,7 @@ static mfn_t ept_get_entry(struct p2m_do
goto out;
}
- /* Need to check for all-zeroes because typecode 0 is p2m_ram and an
- * entirely empty entry shouldn't have RAM type. */
- if ( ept_entry->epte != 0 && ept_entry->sa_p2mt != p2m_invalid )
+ if ( is_epte_valid(ept_entry) )
{
*t = ept_entry->sa_p2mt;
*a = ept_entry->access;
@@ -750,12 +723,9 @@ static void ept_dump_p2m_table(unsigned
{
struct domain *d;
ept_entry_t *table, *ept_entry;
- mfn_t mfn;
int order;
int i;
- int is_pod;
int ret = 0;
- unsigned long index;
unsigned long gfn, gfn_remainder;
unsigned long record_counter = 0;
struct p2m_domain *p2m;
@@ -768,12 +738,11 @@ static void ept_dump_p2m_table(unsigned
p2m = p2m_get_hostp2m(d);
ept = &p2m->ept;
- printk("\ndomain%d EPT p2m table: \n", d->domain_id);
+ printk("\ndomain%d EPT p2m table:\n", d->domain_id);
- for ( gfn = 0; gfn <= p2m->max_mapped_pfn; gfn += (1 << order) )
+ for ( gfn = 0; gfn <= p2m->max_mapped_pfn; gfn += 1UL << order )
{
gfn_remainder = gfn;
- mfn = _mfn(INVALID_MFN);
table = map_domain_page(pagetable_get_pfn(p2m_get_pagetable(p2m)));
for ( i = ept_get_wl(ept); i > 0; i-- )
@@ -784,25 +753,18 @@ static void ept_dump_p2m_table(unsigned
}
order = i * EPT_TABLE_ORDER;
-
- if ( ret == GUEST_TABLE_MAP_FAILED )
- goto out;
-
- index = gfn_remainder >> order;
- ept_entry = table + index;
- if ( ept_entry->sa_p2mt != p2m_invalid )
+ ept_entry = table + (gfn_remainder >> order);
+ if ( ret != GUEST_TABLE_MAP_FAILED && is_epte_valid(ept_entry) )
{
- ( ept_entry->sa_p2mt == p2m_populate_on_demand ) ?
- ( mfn = _mfn(INVALID_MFN), is_pod = 1 ) :
- ( mfn = _mfn(ept_entry->mfn), is_pod = 0 );
-
- printk("gfn: %-16lx mfn: %-16lx order: %2d is_pod: %d\n",
- gfn, mfn_x(mfn), order, is_pod);
+ if ( ept_entry->sa_p2mt == p2m_populate_on_demand )
+ printk("gfn: %13lx order: %2d PoD\n", gfn, order);
+ else
+ printk("gfn: %13lx order: %2d mfn: %13lx\n",
+ gfn, order, ept_entry->mfn + 0UL);
if ( !(record_counter++ % 100) )
process_pending_softirqs();
}
-out:
unmap_domain_page(table);
}
}
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -39,13 +39,13 @@ typedef union {
sp : 1, /* bit 7 - Is this a superpage? */
rsvd1 : 2, /* bits 9:8 - Reserved for future use */
avail1 : 1, /* bit 10 - Software available 1 */
- rsvd2_snp : 1, /* bit 11 - Used for VT-d snoop control
- in shared EPT/VT-d usage */
+ snp : 1, /* bit 11 - VT-d snoop control in shared
+ EPT/VT-d usage */
mfn : 40, /* bits 51:12 - Machine physical frame number */
sa_p2mt : 6, /* bits 57:52 - Software available 2 */
access : 4, /* bits 61:58 - p2m_access_t */
- rsvd3_tm : 1, /* bit 62 - Used for VT-d transient-mapping
- hint in shared EPT/VT-d usage */
+ tm : 1, /* bit 62 - VT-d transient-mapping hint in
+ shared EPT/VT-d usage */
avail3 : 1; /* bit 63 - Software available 3 */
};
u64 epte;
* [PATCH 3/8] x86/EPT: also dump permissions and memory types
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
2014-03-26 15:31 ` [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s Jan Beulich
2014-03-26 15:33 ` [PATCH 2/8] x86/EPT: simplification and cleanup Jan Beulich
@ 2014-03-26 15:34 ` Jan Beulich
2014-03-27 12:54 ` Tim Deegan
2014-03-27 13:06 ` Andrew Cooper
2014-03-26 15:34 ` [PATCH 4/8] x86/EPT: relax treatment of APIC MFN Jan Beulich
` (5 subsequent siblings)
8 siblings, 2 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:34 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
Signed-off-by: Jan Beulich <jbeulich@suse.com>
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -730,6 +730,14 @@ static void ept_dump_p2m_table(unsigned
unsigned long record_counter = 0;
struct p2m_domain *p2m;
struct ept_data *ept;
+ static const char memoryTypes[8][2] = {
+ [0 ... 7] = "?",
+ [MTRR_TYPE_UNCACHABLE] = "UC",
+ [MTRR_TYPE_WRCOMB] = "WC",
+ [MTRR_TYPE_WRTHROUGH] = "WT",
+ [MTRR_TYPE_WRPROT] = "WP",
+ [MTRR_TYPE_WRBACK] = "WB",
+ };
for_each_domain(d)
{
@@ -759,8 +767,15 @@ static void ept_dump_p2m_table(unsigned
if ( ept_entry->sa_p2mt == p2m_populate_on_demand )
printk("gfn: %13lx order: %2d PoD\n", gfn, order);
else
- printk("gfn: %13lx order: %2d mfn: %13lx\n",
- gfn, order, ept_entry->mfn + 0UL);
+ printk("gfn: %13lx order: %2d mfn: %13lx %c%c%c %c%c%c\n",
+ gfn, order, ept_entry->mfn + 0UL,
+ ept_entry->r ? 'r' : ' ',
+ ept_entry->w ? 'w' : ' ',
+ ept_entry->x ? 'x' : ' ',
+ memoryTypes[ept_entry->emt][0],
+ memoryTypes[ept_entry->emt][1]
+ ?: ept_entry->emt + '0',
+ ept_entry->ipat ? '!' : ' ');
if ( !(record_counter++ % 100) )
process_pending_softirqs();
* [PATCH 4/8] x86/EPT: relax treatment of APIC MFN
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (2 preceding siblings ...)
2014-03-26 15:34 ` [PATCH 3/8] x86/EPT: also dump permissions and memory types Jan Beulich
@ 2014-03-26 15:34 ` Jan Beulich
2014-03-27 12:55 ` Tim Deegan
2014-03-26 15:35 ` [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set() Jan Beulich
` (4 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:34 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
There's no point in this page ending up mapped UC just because the guest uses
a corresponding (UC) PAT index for it - set the ignore-PAT flag to true.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
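For background, a rough sketch of why setting ipat helps
(effective_memory_type() is a made-up helper; the real combination is done by
hardware as described in the SDM): with the ignore-PAT bit clear the EPT
memory type gets combined with the guest's PAT type, so a UC PAT index forces
the access uncacheable, while with ipat set the EPT type - WB here - is used
regardless of the guest PAT.
/* Illustrative only - helper name and simplification are made up. */
static uint8_t effective_memory_type(uint8_t emt, uint8_t guest_pat_type,
                                     bool_t ipat)
{
    if ( ipat )
        return emt;                      /* guest PAT ignored: stays WB */
    /* With ipat clear, emt is combined with the guest PAT type; in
     * particular a UC PAT index makes the access uncacheable, which is
     * what happened for the APIC access MFN before this change. */
    return guest_pat_type == MTRR_TYPE_UNCACHABLE ? MTRR_TYPE_UNCACHABLE
                                                  : emt /* simplified */;
}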
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -719,8 +719,12 @@ uint8_t epte_get_entry_emt(struct domain
}
if ( direct_mmio )
- return mfn_x(mfn) != d->arch.hvm_domain.vmx.apic_access_mfn
- ? MTRR_TYPE_UNCACHABLE : MTRR_TYPE_WRBACK;
+ {
+ if ( mfn_x(mfn) != d->arch.hvm_domain.vmx.apic_access_mfn )
+ return MTRR_TYPE_UNCACHABLE;
+ *ipat = 1;
+ return MTRR_TYPE_WRBACK;
+ }
if ( iommu_snoop )
{
* [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set()
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (3 preceding siblings ...)
2014-03-26 15:34 ` [PATCH 4/8] x86/EPT: relax treatment of APIC MFN Jan Beulich
@ 2014-03-26 15:35 ` Jan Beulich
2014-03-27 12:55 ` Tim Deegan
2014-03-26 15:36 ` [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary Jan Beulich
` (3 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:35 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
This is in preparation for the next patch, and brings mtrr_def_type_msr_set()
and mtrr_fix_range_msr_set() in sync with mtrr_var_range_msr_set() in
this regard.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3297,13 +3297,15 @@ int hvm_msr_write_intercept(unsigned int
case MSR_MTRRdefType:
if ( !mtrr )
goto gp_fault;
- if ( !mtrr_def_type_msr_set(&v->arch.hvm_vcpu.mtrr, msr_content) )
+ if ( !mtrr_def_type_msr_set(v->domain, &v->arch.hvm_vcpu.mtrr,
+ msr_content) )
goto gp_fault;
break;
case MSR_MTRRfix64K_00000:
if ( !mtrr )
goto gp_fault;
- if ( !mtrr_fix_range_msr_set(&v->arch.hvm_vcpu.mtrr, 0, msr_content) )
+ if ( !mtrr_fix_range_msr_set(v->domain, &v->arch.hvm_vcpu.mtrr, 0,
+ msr_content) )
goto gp_fault;
break;
case MSR_MTRRfix16K_80000:
@@ -3311,7 +3313,7 @@ int hvm_msr_write_intercept(unsigned int
if ( !mtrr )
goto gp_fault;
index = msr - MSR_MTRRfix16K_80000 + 1;
- if ( !mtrr_fix_range_msr_set(&v->arch.hvm_vcpu.mtrr,
+ if ( !mtrr_fix_range_msr_set(v->domain, &v->arch.hvm_vcpu.mtrr,
index, msr_content) )
goto gp_fault;
break;
@@ -3319,7 +3321,7 @@ int hvm_msr_write_intercept(unsigned int
if ( !mtrr )
goto gp_fault;
index = msr - MSR_MTRRfix4K_C0000 + 3;
- if ( !mtrr_fix_range_msr_set(&v->arch.hvm_vcpu.mtrr,
+ if ( !mtrr_fix_range_msr_set(v->domain, &v->arch.hvm_vcpu.mtrr,
index, msr_content) )
goto gp_fault;
break;
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -412,7 +412,8 @@ static inline bool_t valid_mtrr_type(uin
return 0;
}
-bool_t mtrr_def_type_msr_set(struct mtrr_state *m, uint64_t msr_content)
+bool_t mtrr_def_type_msr_set(struct domain *d, struct mtrr_state *m,
+ uint64_t msr_content)
{
uint8_t def_type = msr_content & 0xff;
uint8_t enabled = (msr_content >> 10) & 0x3;
@@ -436,8 +437,8 @@ bool_t mtrr_def_type_msr_set(struct mtrr
return 1;
}
-bool_t mtrr_fix_range_msr_set(struct mtrr_state *m, uint32_t row,
- uint64_t msr_content)
+bool_t mtrr_fix_range_msr_set(struct domain *d, struct mtrr_state *m,
+ uint32_t row, uint64_t msr_content)
{
uint64_t *fixed_range_base = (uint64_t *)m->fixed_ranges;
@@ -669,7 +670,7 @@ static int hvm_load_mtrr_msr(struct doma
mtrr_state->mtrr_cap = hw_mtrr.msr_mtrr_cap;
for ( i = 0; i < NUM_FIXED_MSR; i++ )
- mtrr_fix_range_msr_set(mtrr_state, i, hw_mtrr.msr_mtrr_fixed[i]);
+ mtrr_fix_range_msr_set(d, mtrr_state, i, hw_mtrr.msr_mtrr_fixed[i]);
for ( i = 0; i < MTRR_VCNT; i++ )
{
@@ -681,7 +682,7 @@ static int hvm_load_mtrr_msr(struct doma
hw_mtrr.msr_mtrr_var[i * 2 + 1]);
}
- mtrr_def_type_msr_set(mtrr_state, hw_mtrr.msr_mtrr_def_type);
+ mtrr_def_type_msr_set(d, mtrr_state, hw_mtrr.msr_mtrr_def_type);
return 0;
}
--- a/xen/include/asm-x86/mtrr.h
+++ b/xen/include/asm-x86/mtrr.h
@@ -82,12 +82,12 @@ extern void mtrr_aps_sync_begin(void);
extern void mtrr_aps_sync_end(void);
extern void mtrr_bp_restore(void);
-extern bool_t mtrr_var_range_msr_set(
- struct domain *d, struct mtrr_state *m,
- uint32_t msr, uint64_t msr_content);
-extern bool_t mtrr_fix_range_msr_set(struct mtrr_state *v,
- uint32_t row, uint64_t msr_content);
-extern bool_t mtrr_def_type_msr_set(struct mtrr_state *v, uint64_t msr_content);
+extern bool_t mtrr_var_range_msr_set(struct domain *, struct mtrr_state *,
+ uint32_t msr, uint64_t msr_content);
+extern bool_t mtrr_fix_range_msr_set(struct domain *, struct mtrr_state *,
+ uint32_t row, uint64_t msr_content);
+extern bool_t mtrr_def_type_msr_set(struct domain *, struct mtrr_state *,
+ uint64_t msr_content);
extern bool_t pat_msr_set(uint64_t *pat, uint64_t msr);
bool_t is_var_mtrr_overlapped(struct mtrr_state *m);
* [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (4 preceding siblings ...)
2014-03-26 15:35 ` [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set() Jan Beulich
@ 2014-03-26 15:36 ` Jan Beulich
2014-03-27 13:08 ` [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary Tim Deegan
2014-03-26 15:36 ` [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types Jan Beulich
` (2 subsequent siblings)
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:36 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
The main goal here is to drop the bogus dependency of
epte_get_entry_emt() on d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT].
Any change to state influencing epte_get_entry_emt()'s decision needs
to result in re-calculation. Do this by using the EPT_MISCONFIG VM
exit, storing an invalid memory type into EPT's emt field (leaving the
IOMMU, which doesn't care about memory types, unaffected).
This is being done in a hierarchical manner to keep execution time
down: Initially only the top level directory gets invalidated this way.
Upon access, the involved intermediate page table levels get cleared
back to zero, and the leaf entry gets its field properly set. For 4k
leaves all other entries in the same directory also get processed to
amortize the cost of the extra VM exit (which halved the number of
these VM exits in my testing).
This restoring can result in spurious EPT_MISCONFIG VM exits (since
two vCPU-s may access addresses involving identical page table
structures). Rather than simply returning in such cases (and risking
that such a VM exit results from a real mis-configuration, which
would then result in an endless loop rather than killing the VM), a
per-vCPU flag is being introduced indicating when such a spurious VM
exit might validly happen - if another one occurs right after VM re-
entry, the flag would generally end up being clear, causing the VM
to be killed as before on such VM exits.
Note that putting a reserved memory type value in the EPT structures
isn't formally sanctioned by the specification. Intel isn't willing to
adjust the specification to make this or a similar use of the
EPT_MISCONFIG VM exit formally possible, but they have indicated that
our use of it is low risk with respect to forward compatibility.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
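The spurious-exit tolerance described above boils down to roughly the
following (a condensed sketch with a made-up helper name; the real logic
lives in ept_handle_misconfig() in the patch below):
/* Condensed sketch (not the patch itself) of the tolerance logic: */
static bool_t misconfig_outcome(struct vcpu *curr, bool_t repaired_leaf)
{
    bool_t okay = curr->arch.hvm_vmx.ept_spurious_misconfig;
    struct vcpu *v;
    if ( repaired_leaf )
    {
        /* Other vCPUs may now take a spurious exit on the same structures. */
        for_each_vcpu ( curr->domain, v )
            v->arch.hvm_vmx.ept_spurious_misconfig = 1;
        okay = 1;
    }
    /* Re-cleared for the current vCPU: a second exit with nothing to
     * repair right after VM re-entry crashes the domain as before. */
    curr->arch.hvm_vmx.ept_spurious_misconfig = 0;
    return okay;   /* 0 => vmx_vmexit_handler() crashes the domain */
}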
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -84,6 +84,8 @@ long arch_do_domctl(
ret = ioports_permit_access(d, fp, fp + np - 1);
else
ret = ioports_deny_access(d, fp, fp + np - 1);
+ if ( !ret )
+ memory_type_changed(d);
}
break;
@@ -705,6 +707,8 @@ long arch_do_domctl(
ret, add ? "removing" : "denying", d->domain_id,
mfn, mfn + nr_mfns - 1);
}
+ /* Do this unconditionally to cover errors on above failure paths. */
+ memory_type_changed(d);
}
break;
@@ -791,6 +795,8 @@ long arch_do_domctl(
"ioport_map: error %ld denying dom%d access to [%x,%x]\n",
ret, d->domain_id, fmp, fmp + np - 1);
}
+ if ( !ret )
+ memory_type_changed(d);
}
break;
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -252,6 +252,9 @@ int hvm_set_guest_pat(struct vcpu *v, u6
if ( !hvm_funcs.set_guest_pat(v, guest_pat) )
v->arch.hvm_vcpu.pat_cr = guest_pat;
+
+ memory_type_changed(v->domain);
+
return 1;
}
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -431,8 +431,12 @@ bool_t mtrr_def_type_msr_set(struct doma
return 0;
}
- m->enabled = enabled;
- m->def_type = def_type;
+ if ( m->enabled != enabled || m->def_type != def_type )
+ {
+ m->enabled = enabled;
+ m->def_type = def_type;
+ memory_type_changed(d);
+ }
return 1;
}
@@ -452,6 +456,7 @@ bool_t mtrr_fix_range_msr_set(struct dom
return 0;
fixed_range_base[row] = msr_content;
+ memory_type_changed(d);
}
return 1;
@@ -496,6 +501,8 @@ bool_t mtrr_var_range_msr_set(
m->overlapped = is_var_mtrr_overlapped(m);
+ memory_type_changed(d);
+
return 1;
}
@@ -690,6 +697,12 @@ static int hvm_load_mtrr_msr(struct doma
HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save_mtrr_msr, hvm_load_mtrr_msr,
1, HVMSR_PER_VCPU);
+void memory_type_changed(struct domain *d)
+{
+ if ( iommu_enabled && !iommu_snoop && d->vcpu && d->vcpu[0] )
+ p2m_memory_type_changed(d);
+}
+
uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
uint8_t *ipat, bool_t direct_mmio)
{
@@ -733,8 +746,7 @@ uint8_t epte_get_entry_emt(struct domain
return MTRR_TYPE_WRBACK;
}
- gmtrr_mtype = is_hvm_domain(d) && v &&
- d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] ?
+ gmtrr_mtype = is_hvm_domain(d) && v ?
get_mtrr_type(&v->arch.hvm_vcpu.mtrr, (gfn << PAGE_SHIFT)) :
MTRR_TYPE_WRBACK;
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3011,6 +3011,16 @@ void vmx_vmexit_handler(struct cpu_user_
break;
}
+ case EXIT_REASON_EPT_MISCONFIG:
+ {
+ paddr_t gpa;
+
+ __vmread(GUEST_PHYSICAL_ADDRESS, &gpa);
+ if ( !ept_handle_misconfig(gpa) )
+ goto exit_and_crash;
+ break;
+ }
+
case EXIT_REASON_MONITOR_TRAP_FLAG:
v->arch.hvm_vmx.exec_control &= ~CPU_BASED_MONITOR_TRAP_FLAG;
vmx_update_cpu_exec_control(v);
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -200,6 +200,18 @@ void p2m_change_entry_type_global(struct
p2m_unlock(p2m);
}
+void p2m_memory_type_changed(struct domain *d)
+{
+ struct p2m_domain *p2m = p2m_get_hostp2m(d);
+
+ if ( p2m->memory_type_changed )
+ {
+ p2m_lock(p2m);
+ p2m->memory_type_changed(p2m);
+ p2m_unlock(p2m);
+ }
+}
+
mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
unsigned int *page_order, bool_t locked)
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -660,6 +660,130 @@ static void ept_change_entry_type_global
ept_sync_domain(p2m);
}
+static bool_t ept_invalidate_emt(mfn_t mfn)
+{
+ ept_entry_t *epte = map_domain_page(mfn_x(mfn));
+ unsigned int i;
+ bool_t changed = 0;
+
+ for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ )
+ {
+ ept_entry_t e = atomic_read_ept_entry(&epte[i]);
+
+ if ( !is_epte_valid(&e) || !is_epte_present(&e) ||
+ e.emt == MTRR_NUM_TYPES )
+ continue;
+
+ e.emt = MTRR_NUM_TYPES;
+ atomic_write_ept_entry(&epte[i], e);
+ changed = 1;
+ }
+
+ unmap_domain_page(epte);
+
+ return changed;
+}
+
+static void ept_memory_type_changed(struct p2m_domain *p2m)
+{
+ unsigned long mfn = ept_get_asr(&p2m->ept);
+
+ if ( !mfn )
+ return;
+
+ if ( ept_invalidate_emt(_mfn(mfn)) )
+ ept_sync_domain(p2m);
+}
+
+bool_t ept_handle_misconfig(uint64_t gpa)
+{
+ struct vcpu *curr = current;
+ struct p2m_domain *p2m = p2m_get_hostp2m(curr->domain);
+ struct ept_data *ept = &p2m->ept;
+ unsigned int level = ept_get_wl(ept);
+ unsigned long gfn = PFN_DOWN(gpa);
+ unsigned long mfn = ept_get_asr(ept);
+ ept_entry_t *epte;
+ bool_t okay;
+
+ if ( !mfn )
+ return 0;
+
+ p2m_lock(p2m);
+
+ for ( okay = curr->arch.hvm_vmx.ept_spurious_misconfig; ; --level) {
+ ept_entry_t e;
+ unsigned int i;
+
+ epte = map_domain_page(mfn);
+ i = (gfn >> (level * EPT_TABLE_ORDER)) & (EPT_PAGETABLE_ENTRIES - 1);
+ e = atomic_read_ept_entry(&epte[i]);
+
+ if ( level == 0 || is_epte_superpage(&e) )
+ {
+ uint8_t ipat = 0;
+ struct vcpu *v;
+
+ if ( e.emt != MTRR_NUM_TYPES )
+ break;
+
+ if ( level == 0 )
+ {
+ for ( gfn -= i, i = 0; i < EPT_PAGETABLE_ENTRIES; ++i )
+ {
+ e = atomic_read_ept_entry(&epte[i]);
+ if ( e.emt == MTRR_NUM_TYPES )
+ e.emt = 0;
+ if ( !is_epte_valid(&e) || !is_epte_present(&e) )
+ continue;
+ e.emt = epte_get_entry_emt(p2m->domain, gfn + i,
+ _mfn(e.mfn), &ipat,
+ e.sa_p2mt == p2m_mmio_direct);
+ e.ipat = ipat;
+ atomic_write_ept_entry(&epte[i], e);
+ }
+ }
+ else
+ {
+ e.emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
+ &ipat,
+ e.sa_p2mt == p2m_mmio_direct);
+ e.ipat = ipat;
+ atomic_write_ept_entry(&epte[i], e);
+ }
+
+ for_each_vcpu ( curr->domain, v )
+ v->arch.hvm_vmx.ept_spurious_misconfig = 1;
+ okay = 1;
+ break;
+ }
+
+ if ( e.emt == MTRR_NUM_TYPES )
+ {
+ ASSERT(is_epte_present(&e));
+ e.emt = 0;
+ atomic_write_ept_entry(&epte[i], e);
+ unmap_domain_page(epte);
+
+ ept_invalidate_emt(_mfn(e.mfn));
+ okay = 1;
+ }
+ else if ( is_epte_present(&e) && !e.emt )
+ unmap_domain_page(epte);
+ else
+ break;
+
+ mfn = e.mfn;
+ }
+
+ unmap_domain_page(epte);
+ curr->arch.hvm_vmx.ept_spurious_misconfig = 0;
+ ept_sync_domain(p2m);
+ p2m_unlock(p2m);
+
+ return okay;
+}
+
static void __ept_sync_domain(void *info)
{
struct ept_data *ept = &((struct p2m_domain *)info)->ept;
@@ -697,6 +821,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
p2m->set_entry = ept_set_entry;
p2m->get_entry = ept_get_entry;
p2m->change_entry_type_global = ept_change_entry_type_global;
+ p2m->memory_type_changed = ept_memory_type_changed;
p2m->audit_p2m = NULL;
/* Set the memory type used when accessing EPT paging structures. */
@@ -737,6 +862,7 @@ static void ept_dump_p2m_table(unsigned
[MTRR_TYPE_WRTHROUGH] = "WT",
[MTRR_TYPE_WRPROT] = "WP",
[MTRR_TYPE_WRBACK] = "WB",
+ [MTRR_NUM_TYPES] = "??"
};
for_each_domain(d)
@@ -750,11 +876,16 @@ static void ept_dump_p2m_table(unsigned
for ( gfn = 0; gfn <= p2m->max_mapped_pfn; gfn += 1UL << order )
{
+ char c = 0;
+
gfn_remainder = gfn;
table = map_domain_page(pagetable_get_pfn(p2m_get_pagetable(p2m)));
for ( i = ept_get_wl(ept); i > 0; i-- )
{
+ ept_entry = table + (gfn_remainder >> (i * EPT_TABLE_ORDER));
+ if ( ept_entry->emt == MTRR_NUM_TYPES )
+ c = '?';
ret = ept_next_level(p2m, 1, &table, &gfn_remainder, i);
if ( ret != GUEST_TABLE_NORMAL_PAGE )
break;
@@ -775,7 +906,7 @@ static void ept_dump_p2m_table(unsigned
memoryTypes[ept_entry->emt][0],
memoryTypes[ept_entry->emt][1]
?: ept_entry->emt + '0',
- ept_entry->ipat ? '!' : ' ');
+ c ?: ept_entry->ipat ? '!' : ' ');
if ( !(record_counter++ % 100) )
process_pending_softirqs();
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -815,6 +815,10 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1);
else
ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1);
+#ifdef CONFIG_X86
+ if ( !ret )
+ memory_type_changed(d);
+#endif
}
break;
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -124,6 +124,9 @@ struct arch_vmx_struct {
unsigned long host_cr0;
+ /* Do we need to tolerate a spurious EPT_MISCONFIG VM exit? */
+ bool_t ept_spurious_misconfig;
+
/* Is the guest in real mode? */
uint8_t vmx_realmode;
/* Are we emulating rather than VMENTERing? */
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -520,6 +520,7 @@ int ept_p2m_init(struct p2m_domain *p2m)
void ept_p2m_uninit(struct p2m_domain *p2m);
void ept_walk_table(struct domain *d, unsigned long gfn);
+bool_t ept_handle_misconfig(uint64_t gpa);
void setup_ept_dump(void);
void update_guest_eip(void);
--- a/xen/include/asm-x86/mtrr.h
+++ b/xen/include/asm-x86/mtrr.h
@@ -88,6 +88,7 @@ extern bool_t mtrr_fix_range_msr_set(str
uint32_t row, uint64_t msr_content);
extern bool_t mtrr_def_type_msr_set(struct domain *, struct mtrr_state *,
uint64_t msr_content);
+extern void memory_type_changed(struct domain *);
extern bool_t pat_msr_set(uint64_t *pat, uint64_t msr);
bool_t is_var_mtrr_overlapped(struct mtrr_state *m);
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -233,6 +233,7 @@ struct p2m_domain {
void (*change_entry_type_global)(struct p2m_domain *p2m,
p2m_type_t ot,
p2m_type_t nt);
+ void (*memory_type_changed)(struct p2m_domain *p2m);
void (*write_p2m_entry)(struct p2m_domain *p2m,
unsigned long gfn, l1_pgentry_t *p,
@@ -506,6 +507,9 @@ void p2m_change_type_range(struct domain
p2m_type_t p2m_change_type(struct domain *d, unsigned long gfn,
p2m_type_t ot, p2m_type_t nt);
+/* Report a change affecting memory types. */
+void p2m_memory_type_changed(struct domain *d);
+
/* Set mmio addresses in the p2m table (for pass-through) */
int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn);
int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn);
* [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (5 preceding siblings ...)
2014-03-26 15:36 ` [PATCH 6/8] x86/EPT: force re‑evaluation of memory type as necessary Jan Beulich
@ 2014-03-26 15:36 ` Jan Beulich
2014-03-27 15:04 ` Tim Deegan
2014-03-26 15:37 ` [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection Jan Beulich
2014-03-26 16:25 ` [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:36 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
[-- Attachment #1: Type: text/plain, Size: 12990 bytes --]
... between constituent pages. To indicate such, the page order is
being passed down to the vMTRR routines, with a negative return value
(possible only on order-non-zero pages) indicating such collisions.
Some code redundancy reduction is being done to ept_set_entry() along
the way, allowing the new handling to be centralized to a single place
there.
In order to keep ept_set_entry() fast and simple, the actual splitting
is being deferred to the EPT_MISCONFIG VM exit handler.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
One open question is whether it is okay for memory allocation (for an
intermediate page table) failure to be fatal to the domain. If it
isn't, then deferring the splitting work from ept_set_entry() to
ept_handle_misconfig() is not an option. But even with that eliminated
there's then still potential for ept_handle_misconfig() needing to
split pages, and it would then need to be determined how to gracefully
handle that.
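For reference, the caller-side contract described above boils down to the
following sketch; it merely restates the ept_set_entry() hunk further down,
with the surrounding logic omitted:
    int emt = epte_get_entry_emt(d, gfn, mfn, order, &ipat, direct_mmio);
    if ( emt >= 0 )
        new_entry.emt = emt;            /* uniform type across the range */
    else
        new_entry.emt = MTRR_NUM_TYPES; /* mixed types: mark the entry so a
                                         * later EPT_MISCONFIG exit splits it */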
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -222,30 +222,40 @@ void hvm_vcpu_cacheattr_destroy(struct v
/*
* Get MTRR memory type for physical address pa.
+ *
+ * May return a negative value when order > 0, indicating to the caller
+ * that the respective mapping needs splitting.
*/
-static uint8_t get_mtrr_type(struct mtrr_state *m, paddr_t pa)
+static int get_mtrr_type(const struct mtrr_state *m,
+ paddr_t pa, unsigned int order)
{
- int32_t addr, seg, index;
uint8_t overlap_mtrr = 0;
uint8_t overlap_mtrr_pos = 0;
- uint64_t phys_base;
- uint64_t phys_mask;
- uint8_t num_var_ranges = m->mtrr_cap & 0xff;
+ uint64_t mask = -(uint64_t)PAGE_SIZE << order;
+ unsigned int seg, num_var_ranges = m->mtrr_cap & 0xff;
if ( unlikely(!(m->enabled & 0x2)) )
return MTRR_TYPE_UNCACHABLE;
+ pa &= mask;
if ( (pa < 0x100000) && (m->enabled & 1) )
{
- /* Fixed range MTRR takes effective */
- addr = (uint32_t) pa;
+ /* Fixed range MTRR takes effect. */
+ uint32_t addr = (uint32_t)pa, index;
+
if ( addr < 0x80000 )
{
+ /* 0x00000 ... 0x7FFFF in 64k steps */
+ if ( order > 4 )
+ return -1;
seg = (addr >> 16);
return m->fixed_ranges[seg];
}
else if ( addr < 0xc0000 )
{
+ /* 0x80000 ... 0xBFFFF in 16k steps */
+ if ( order > 2 )
+ return -1;
seg = (addr - 0x80000) >> 14;
index = (seg >> 3) + 1;
seg &= 7; /* select 0-7 segments */
@@ -253,7 +263,9 @@ static uint8_t get_mtrr_type(struct mtrr
}
else
{
- /* 0xC0000 --- 0x100000 */
+ /* 0xC0000 ... 0xFFFFF in 4k steps */
+ if ( order )
+ return -1;
seg = (addr - 0xc0000) >> 12;
index = (seg >> 3) + 3;
seg &= 7; /* select 0-7 segments */
@@ -264,14 +276,15 @@ static uint8_t get_mtrr_type(struct mtrr
/* Match with variable MTRRs. */
for ( seg = 0; seg < num_var_ranges; seg++ )
{
- phys_base = ((uint64_t*)m->var_ranges)[seg*2];
- phys_mask = ((uint64_t*)m->var_ranges)[seg*2 + 1];
+ uint64_t phys_base = m->var_ranges[seg].base;
+ uint64_t phys_mask = m->var_ranges[seg].mask;
+
if ( phys_mask & MTRR_PHYSMASK_VALID )
{
- if ( ((uint64_t) pa & phys_mask) >> MTRR_PHYSMASK_SHIFT ==
- (phys_base & phys_mask) >> MTRR_PHYSMASK_SHIFT )
+ phys_mask &= mask;
+ if ( (pa & phys_mask) == (phys_base & phys_mask) )
{
- if ( unlikely(m->overlapped) )
+ if ( unlikely(m->overlapped) || order )
{
overlap_mtrr |= 1 << (phys_base & MTRR_PHYSBASE_TYPE_MASK);
overlap_mtrr_pos = phys_base & MTRR_PHYSBASE_TYPE_MASK;
@@ -285,23 +298,24 @@ static uint8_t get_mtrr_type(struct mtrr
}
}
- /* Overlapped or not found. */
+ /* Not found? */
if ( unlikely(overlap_mtrr == 0) )
return m->def_type;
- if ( likely(!(overlap_mtrr & ~( ((uint8_t)1) << overlap_mtrr_pos ))) )
- /* Covers both one variable memory range matches and
- * two or more identical match.
- */
+ /* One match, or multiple identical ones? */
+ if ( likely(overlap_mtrr == (1 << overlap_mtrr_pos)) )
return overlap_mtrr_pos;
+ if ( order )
+ return -1;
+
+ /* Two or more matches, one being UC? */
if ( overlap_mtrr & (1 << MTRR_TYPE_UNCACHABLE) )
- /* Two or more match, one is UC. */
return MTRR_TYPE_UNCACHABLE;
- if ( !(overlap_mtrr &
- ~((1 << MTRR_TYPE_WRTHROUGH) | (1 << MTRR_TYPE_WRBACK))) )
- /* Two or more match, WT and WB. */
+ /* Two or more matches, all of them WT and WB? */
+ if ( overlap_mtrr ==
+ ((1 << MTRR_TYPE_WRTHROUGH) | (1 << MTRR_TYPE_WRBACK)) )
return MTRR_TYPE_WRTHROUGH;
/* Behaviour is undefined, but return the last overlapped type. */
@@ -341,7 +355,7 @@ static uint8_t effective_mm_type(struct
* just use it
*/
if ( gmtrr_mtype == NO_HARDCODE_MEM_TYPE )
- mtrr_mtype = get_mtrr_type(m, gpa);
+ mtrr_mtype = get_mtrr_type(m, gpa, 0);
else
mtrr_mtype = gmtrr_mtype;
@@ -370,7 +384,7 @@ uint32_t get_pat_flags(struct vcpu *v,
guest_eff_mm_type = effective_mm_type(g, pat, gpaddr,
gl1e_flags, gmtrr_mtype);
/* 2. Get the memory type of host physical address, with MTRR */
- shadow_mtrr_type = get_mtrr_type(&mtrr_state, spaddr);
+ shadow_mtrr_type = get_mtrr_type(&mtrr_state, spaddr, 0);
/* 3. Find the memory type in PAT, with host MTRR memory type
* and guest effective memory type.
@@ -703,10 +717,10 @@ void memory_type_changed(struct domain *
p2m_memory_type_changed(d);
}
-uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
- uint8_t *ipat, bool_t direct_mmio)
+int epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
+ unsigned int order, uint8_t *ipat, bool_t direct_mmio)
{
- uint8_t gmtrr_mtype, hmtrr_mtype;
+ int gmtrr_mtype, hmtrr_mtype;
uint32_t type;
struct vcpu *v = current;
@@ -747,10 +761,12 @@ uint8_t epte_get_entry_emt(struct domain
}
gmtrr_mtype = is_hvm_domain(d) && v ?
- get_mtrr_type(&v->arch.hvm_vcpu.mtrr, (gfn << PAGE_SHIFT)) :
+ get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
+ gfn << PAGE_SHIFT, order) :
MTRR_TYPE_WRBACK;
-
- hmtrr_mtype = get_mtrr_type(&mtrr_state, (mfn_x(mfn) << PAGE_SHIFT));
+ hmtrr_mtype = get_mtrr_type(&mtrr_state, mfn_x(mfn) << PAGE_SHIFT, order);
+ if ( gmtrr_mtype < 0 || hmtrr_mtype < 0 )
+ return -1;
/* If both types match we're fine. */
if ( likely(gmtrr_mtype == hmtrr_mtype) )
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -289,6 +289,7 @@ ept_set_entry(struct p2m_domain *p2m, un
int vtd_pte_present = 0;
int needs_sync = 1;
ept_entry_t old_entry = { .epte = 0 };
+ ept_entry_t new_entry = { .epte = 0 };
struct ept_data *ept = &p2m->ept;
struct domain *d = p2m->domain;
@@ -338,7 +339,6 @@ ept_set_entry(struct p2m_domain *p2m, un
if ( i == target )
{
/* We reached the target level. */
- ept_entry_t new_entry = { .epte = 0 };
/* No need to flush if the old entry wasn't valid */
if ( !is_epte_present(ept_entry) )
@@ -349,35 +349,11 @@ ept_set_entry(struct p2m_domain *p2m, un
*
* Read-then-write is OK because we hold the p2m lock. */
old_entry = *ept_entry;
-
- if ( mfn_valid(mfn_x(mfn)) || direct_mmio || p2m_is_paged(p2mt) ||
- (p2mt == p2m_ram_paging_in) )
- {
- /* Construct the new entry, and then write it once */
- new_entry.emt = epte_get_entry_emt(p2m->domain, gfn, mfn, &ipat,
- direct_mmio);
-
- new_entry.ipat = ipat;
- new_entry.sp = !!order;
- new_entry.sa_p2mt = p2mt;
- new_entry.access = p2ma;
- new_entry.rsvd2_snp = (iommu_enabled && iommu_snoop);
-
- new_entry.mfn = mfn_x(mfn);
-
- if ( old_entry.mfn == new_entry.mfn )
- need_modify_vtd_table = 0;
-
- ept_p2m_type_to_flags(&new_entry, p2mt, p2ma);
- }
-
- atomic_write_ept_entry(ept_entry, new_entry);
}
else
{
/* We need to split the original page. */
ept_entry_t split_ept_entry;
- ept_entry_t new_entry = { .epte = 0 };
ASSERT(is_epte_superpage(ept_entry));
@@ -401,8 +377,19 @@ ept_set_entry(struct p2m_domain *p2m, un
ASSERT(i == target);
ept_entry = table + (gfn_remainder >> (i * EPT_TABLE_ORDER));
+ }
+
+ if ( mfn_valid(mfn_x(mfn)) || direct_mmio || p2m_is_paged(p2mt) ||
+ (p2mt == p2m_ram_paging_in) )
+ {
+ int emt = epte_get_entry_emt(p2m->domain, gfn, mfn,
+ i * EPT_TABLE_ORDER, &ipat, direct_mmio);
+
+ if ( emt >= 0 )
+ new_entry.emt = emt;
+ else /* ept_handle_misconfig() will need to take care of this. */
+ new_entry.emt = MTRR_NUM_TYPES;
- new_entry.emt = epte_get_entry_emt(d, gfn, mfn, &ipat, direct_mmio);
new_entry.ipat = ipat;
new_entry.sp = !!i;
new_entry.sa_p2mt = p2mt;
@@ -417,10 +404,10 @@ ept_set_entry(struct p2m_domain *p2m, un
need_modify_vtd_table = 0;
ept_p2m_type_to_flags(&new_entry, p2mt, p2ma);
-
- atomic_write_ept_entry(ept_entry, new_entry);
}
+ atomic_write_ept_entry(ept_entry, new_entry);
+
/* Track the highest gfn for which we have ever had a valid mapping */
if ( p2mt != p2m_invalid &&
(gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
@@ -737,7 +724,7 @@ bool_t ept_handle_misconfig(uint64_t gpa
if ( !is_epte_valid(&e) || !is_epte_present(&e) )
continue;
e.emt = epte_get_entry_emt(p2m->domain, gfn + i,
- _mfn(e.mfn), &ipat,
+ _mfn(e.mfn), 0, &ipat,
e.sa_p2mt == p2m_mmio_direct);
e.ipat = ipat;
atomic_write_ept_entry(&epte[i], e);
@@ -745,9 +732,22 @@ bool_t ept_handle_misconfig(uint64_t gpa
}
else
{
- e.emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
- &ipat,
- e.sa_p2mt == p2m_mmio_direct);
+ int emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
+ level * EPT_TABLE_ORDER, &ipat,
+ e.sa_p2mt == p2m_mmio_direct);
+ if ( unlikely(emt < 0) )
+ {
+ unmap_domain_page(epte);
+ if ( ept_split_super_page(p2m, &e, level, level - 1) )
+ {
+ mfn = e.mfn;
+ continue;
+ }
+ ept_free_entry(p2m, &e, level);
+ okay = 0;
+ break;
+ }
+ e.emt = emt;
e.ipat = ipat;
atomic_write_ept_entry(&epte[i], e);
}
--- a/xen/include/asm-x86/mtrr.h
+++ b/xen/include/asm-x86/mtrr.h
@@ -72,8 +72,9 @@ extern int mtrr_del_page(int reg, unsign
extern void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi);
extern u32 get_pat_flags(struct vcpu *v, u32 gl1e_flags, paddr_t gpaddr,
paddr_t spaddr, uint8_t gmtrr_mtype);
-extern uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn,
- mfn_t mfn, uint8_t *ipat, bool_t direct_mmio);
+extern int epte_get_entry_emt(struct domain *, unsigned long gfn, mfn_t mfn,
+ unsigned int order, uint8_t *ipat,
+ bool_t direct_mmio);
extern void ept_change_entry_emt_with_range(
struct domain *d, unsigned long start_gfn, unsigned long end_gfn);
extern unsigned char pat_type_2_pte_flags(unsigned char pat_type);
[-- Attachment #2: EPT-split-on-mixed-memory-types.patch --]
[-- Type: text/plain, Size: 13046 bytes --]
x86/EPT: split super pages upon mismatching memory types
... between constituent pages. To indicate such, the page order is
being passed down to the vMTRR routines, with a negative return value
(possible only on order-non-zero pages) indicating such collisions.
Some code redundancy reduction is being done to ept_set_entry() along
the way, allowing the new handling to be centralized to a single place
there.
In order to keep ept_set_entry() fast and simple, the actual splitting
is being deferred to the EPT_MISCONFIG VM exit handler.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
One open question is whether it is okay for memory allocation (for an
intermediate page table) failure to be fatal to the domain. If it
isn't, then deferring the splitting work from ept_set_entry() to
ept_handle_misconfig() is not an option. But even with that eliminated
there's then still potential for ept_handle_misconfig() needing to
split pages, and it would then need to be determined how to gracefully
handle that.
--- a/xen/arch/x86/hvm/mtrr.c
+++ b/xen/arch/x86/hvm/mtrr.c
@@ -222,30 +222,40 @@ void hvm_vcpu_cacheattr_destroy(struct v
/*
* Get MTRR memory type for physical address pa.
+ *
+ * May return a negative value when order > 0, indicating to the caller
+ * that the respective mapping needs splitting.
*/
-static uint8_t get_mtrr_type(struct mtrr_state *m, paddr_t pa)
+static int get_mtrr_type(const struct mtrr_state *m,
+ paddr_t pa, unsigned int order)
{
- int32_t addr, seg, index;
uint8_t overlap_mtrr = 0;
uint8_t overlap_mtrr_pos = 0;
- uint64_t phys_base;
- uint64_t phys_mask;
- uint8_t num_var_ranges = m->mtrr_cap & 0xff;
+ uint64_t mask = -(uint64_t)PAGE_SIZE << order;
+ unsigned int seg, num_var_ranges = m->mtrr_cap & 0xff;
if ( unlikely(!(m->enabled & 0x2)) )
return MTRR_TYPE_UNCACHABLE;
+ pa &= mask;
if ( (pa < 0x100000) && (m->enabled & 1) )
{
- /* Fixed range MTRR takes effective */
- addr = (uint32_t) pa;
+ /* Fixed range MTRR takes effect. */
+ uint32_t addr = (uint32_t)pa, index;
+
if ( addr < 0x80000 )
{
+ /* 0x00000 ... 0x7FFFF in 64k steps */
+ if ( order > 4 )
+ return -1;
seg = (addr >> 16);
return m->fixed_ranges[seg];
}
else if ( addr < 0xc0000 )
{
+ /* 0x80000 ... 0xBFFFF in 16k steps */
+ if ( order > 2 )
+ return -1;
seg = (addr - 0x80000) >> 14;
index = (seg >> 3) + 1;
seg &= 7; /* select 0-7 segments */
@@ -253,7 +263,9 @@ static uint8_t get_mtrr_type(struct mtrr
}
else
{
- /* 0xC0000 --- 0x100000 */
+ /* 0xC0000 ... 0xFFFFF in 4k steps */
+ if ( order )
+ return -1;
seg = (addr - 0xc0000) >> 12;
index = (seg >> 3) + 3;
seg &= 7; /* select 0-7 segments */
@@ -264,14 +276,15 @@ static uint8_t get_mtrr_type(struct mtrr
/* Match with variable MTRRs. */
for ( seg = 0; seg < num_var_ranges; seg++ )
{
- phys_base = ((uint64_t*)m->var_ranges)[seg*2];
- phys_mask = ((uint64_t*)m->var_ranges)[seg*2 + 1];
+ uint64_t phys_base = m->var_ranges[seg].base;
+ uint64_t phys_mask = m->var_ranges[seg].mask;
+
if ( phys_mask & MTRR_PHYSMASK_VALID )
{
- if ( ((uint64_t) pa & phys_mask) >> MTRR_PHYSMASK_SHIFT ==
- (phys_base & phys_mask) >> MTRR_PHYSMASK_SHIFT )
+ phys_mask &= mask;
+ if ( (pa & phys_mask) == (phys_base & phys_mask) )
{
- if ( unlikely(m->overlapped) )
+ if ( unlikely(m->overlapped) || order )
{
overlap_mtrr |= 1 << (phys_base & MTRR_PHYSBASE_TYPE_MASK);
overlap_mtrr_pos = phys_base & MTRR_PHYSBASE_TYPE_MASK;
@@ -285,23 +298,24 @@ static uint8_t get_mtrr_type(struct mtrr
}
}
- /* Overlapped or not found. */
+ /* Not found? */
if ( unlikely(overlap_mtrr == 0) )
return m->def_type;
- if ( likely(!(overlap_mtrr & ~( ((uint8_t)1) << overlap_mtrr_pos ))) )
- /* Covers both one variable memory range matches and
- * two or more identical match.
- */
+ /* One match, or multiple identical ones? */
+ if ( likely(overlap_mtrr == (1 << overlap_mtrr_pos)) )
return overlap_mtrr_pos;
+ if ( order )
+ return -1;
+
+ /* Two or more matches, one being UC? */
if ( overlap_mtrr & (1 << MTRR_TYPE_UNCACHABLE) )
- /* Two or more match, one is UC. */
return MTRR_TYPE_UNCACHABLE;
- if ( !(overlap_mtrr &
- ~((1 << MTRR_TYPE_WRTHROUGH) | (1 << MTRR_TYPE_WRBACK))) )
- /* Two or more match, WT and WB. */
+ /* Two or more matches, all of them WT and WB? */
+ if ( overlap_mtrr ==
+ ((1 << MTRR_TYPE_WRTHROUGH) | (1 << MTRR_TYPE_WRBACK)) )
return MTRR_TYPE_WRTHROUGH;
/* Behaviour is undefined, but return the last overlapped type. */
@@ -341,7 +355,7 @@ static uint8_t effective_mm_type(struct
* just use it
*/
if ( gmtrr_mtype == NO_HARDCODE_MEM_TYPE )
- mtrr_mtype = get_mtrr_type(m, gpa);
+ mtrr_mtype = get_mtrr_type(m, gpa, 0);
else
mtrr_mtype = gmtrr_mtype;
@@ -370,7 +384,7 @@ uint32_t get_pat_flags(struct vcpu *v,
guest_eff_mm_type = effective_mm_type(g, pat, gpaddr,
gl1e_flags, gmtrr_mtype);
/* 2. Get the memory type of host physical address, with MTRR */
- shadow_mtrr_type = get_mtrr_type(&mtrr_state, spaddr);
+ shadow_mtrr_type = get_mtrr_type(&mtrr_state, spaddr, 0);
/* 3. Find the memory type in PAT, with host MTRR memory type
* and guest effective memory type.
@@ -703,10 +717,10 @@ void memory_type_changed(struct domain *
p2m_memory_type_changed(d);
}
-uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
- uint8_t *ipat, bool_t direct_mmio)
+int epte_get_entry_emt(struct domain *d, unsigned long gfn, mfn_t mfn,
+ unsigned int order, uint8_t *ipat, bool_t direct_mmio)
{
- uint8_t gmtrr_mtype, hmtrr_mtype;
+ int gmtrr_mtype, hmtrr_mtype;
uint32_t type;
struct vcpu *v = current;
@@ -747,10 +761,12 @@ uint8_t epte_get_entry_emt(struct domain
}
gmtrr_mtype = is_hvm_domain(d) && v ?
- get_mtrr_type(&v->arch.hvm_vcpu.mtrr, (gfn << PAGE_SHIFT)) :
+ get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
+ gfn << PAGE_SHIFT, order) :
MTRR_TYPE_WRBACK;
-
- hmtrr_mtype = get_mtrr_type(&mtrr_state, (mfn_x(mfn) << PAGE_SHIFT));
+ hmtrr_mtype = get_mtrr_type(&mtrr_state, mfn_x(mfn) << PAGE_SHIFT, order);
+ if ( gmtrr_mtype < 0 || hmtrr_mtype < 0 )
+ return -1;
/* If both types match we're fine. */
if ( likely(gmtrr_mtype == hmtrr_mtype) )
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -289,6 +289,7 @@ ept_set_entry(struct p2m_domain *p2m, un
int vtd_pte_present = 0;
int needs_sync = 1;
ept_entry_t old_entry = { .epte = 0 };
+ ept_entry_t new_entry = { .epte = 0 };
struct ept_data *ept = &p2m->ept;
struct domain *d = p2m->domain;
@@ -338,7 +339,6 @@ ept_set_entry(struct p2m_domain *p2m, un
if ( i == target )
{
/* We reached the target level. */
- ept_entry_t new_entry = { .epte = 0 };
/* No need to flush if the old entry wasn't valid */
if ( !is_epte_present(ept_entry) )
@@ -349,35 +349,11 @@ ept_set_entry(struct p2m_domain *p2m, un
*
* Read-then-write is OK because we hold the p2m lock. */
old_entry = *ept_entry;
-
- if ( mfn_valid(mfn_x(mfn)) || direct_mmio || p2m_is_paged(p2mt) ||
- (p2mt == p2m_ram_paging_in) )
- {
- /* Construct the new entry, and then write it once */
- new_entry.emt = epte_get_entry_emt(p2m->domain, gfn, mfn, &ipat,
- direct_mmio);
-
- new_entry.ipat = ipat;
- new_entry.sp = !!order;
- new_entry.sa_p2mt = p2mt;
- new_entry.access = p2ma;
- new_entry.rsvd2_snp = (iommu_enabled && iommu_snoop);
-
- new_entry.mfn = mfn_x(mfn);
-
- if ( old_entry.mfn == new_entry.mfn )
- need_modify_vtd_table = 0;
-
- ept_p2m_type_to_flags(&new_entry, p2mt, p2ma);
- }
-
- atomic_write_ept_entry(ept_entry, new_entry);
}
else
{
/* We need to split the original page. */
ept_entry_t split_ept_entry;
- ept_entry_t new_entry = { .epte = 0 };
ASSERT(is_epte_superpage(ept_entry));
@@ -401,8 +377,19 @@ ept_set_entry(struct p2m_domain *p2m, un
ASSERT(i == target);
ept_entry = table + (gfn_remainder >> (i * EPT_TABLE_ORDER));
+ }
+
+ if ( mfn_valid(mfn_x(mfn)) || direct_mmio || p2m_is_paged(p2mt) ||
+ (p2mt == p2m_ram_paging_in) )
+ {
+ int emt = epte_get_entry_emt(p2m->domain, gfn, mfn,
+ i * EPT_TABLE_ORDER, &ipat, direct_mmio);
+
+ if ( emt >= 0 )
+ new_entry.emt = emt;
+ else /* ept_handle_misconfig() will need to take care of this. */
+ new_entry.emt = MTRR_NUM_TYPES;
- new_entry.emt = epte_get_entry_emt(d, gfn, mfn, &ipat, direct_mmio);
new_entry.ipat = ipat;
new_entry.sp = !!i;
new_entry.sa_p2mt = p2mt;
@@ -417,10 +404,10 @@ ept_set_entry(struct p2m_domain *p2m, un
need_modify_vtd_table = 0;
ept_p2m_type_to_flags(&new_entry, p2mt, p2ma);
-
- atomic_write_ept_entry(ept_entry, new_entry);
}
+ atomic_write_ept_entry(ept_entry, new_entry);
+
/* Track the highest gfn for which we have ever had a valid mapping */
if ( p2mt != p2m_invalid &&
(gfn + (1UL << order) - 1 > p2m->max_mapped_pfn) )
@@ -737,7 +724,7 @@ bool_t ept_handle_misconfig(uint64_t gpa
if ( !is_epte_valid(&e) || !is_epte_present(&e) )
continue;
e.emt = epte_get_entry_emt(p2m->domain, gfn + i,
- _mfn(e.mfn), &ipat,
+ _mfn(e.mfn), 0, &ipat,
e.sa_p2mt == p2m_mmio_direct);
e.ipat = ipat;
atomic_write_ept_entry(&epte[i], e);
@@ -745,9 +732,22 @@ bool_t ept_handle_misconfig(uint64_t gpa
}
else
{
- e.emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
- &ipat,
- e.sa_p2mt == p2m_mmio_direct);
+ int emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
+ level * EPT_TABLE_ORDER, &ipat,
+ e.sa_p2mt == p2m_mmio_direct);
+ if ( unlikely(emt < 0) )
+ {
+ unmap_domain_page(epte);
+ if ( ept_split_super_page(p2m, &e, level, level - 1) )
+ {
+ mfn = e.mfn;
+ continue;
+ }
+ ept_free_entry(p2m, &e, level);
+ okay = 0;
+ break;
+ }
+ e.emt = emt;
e.ipat = ipat;
atomic_write_ept_entry(&epte[i], e);
}
--- a/xen/include/asm-x86/mtrr.h
+++ b/xen/include/asm-x86/mtrr.h
@@ -72,8 +72,9 @@ extern int mtrr_del_page(int reg, unsign
extern void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi);
extern u32 get_pat_flags(struct vcpu *v, u32 gl1e_flags, paddr_t gpaddr,
paddr_t spaddr, uint8_t gmtrr_mtype);
-extern uint8_t epte_get_entry_emt(struct domain *d, unsigned long gfn,
- mfn_t mfn, uint8_t *ipat, bool_t direct_mmio);
+extern int epte_get_entry_emt(struct domain *, unsigned long gfn, mfn_t mfn,
+ unsigned int order, uint8_t *ipat,
+ bool_t direct_mmio);
extern void ept_change_entry_emt_with_range(
struct domain *d, unsigned long start_gfn, unsigned long end_gfn);
extern unsigned char pat_type_2_pte_flags(unsigned char pat_type);
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (6 preceding siblings ...)
2014-03-26 15:36 ` [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types Jan Beulich
@ 2014-03-26 15:37 ` Jan Beulich
2014-03-27 15:12 ` Tim Deegan
2014-03-26 16:25 ` [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
8 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 15:37 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang
[-- Attachment #1: Type: text/plain, Size: 4095 bytes --]
This capability solely makes a statement on cache coherency guarantees
by the IOMMU. It specifically does not imply any of the further
guarantees associated with certain memory types (cachability, ordering).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
This already drops a few bits of the changes done for XSA-60. I think
that if the earlier patches (and particularly the use of the
EPT_MISCONFIG VM exit) are being accepted, we should switch that whole
mechanism, avoiding the need to play games with the guest's PAT MSR.
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1950,6 +1950,7 @@ static void hvm_update_cr(struct vcpu *v
int hvm_set_cr0(unsigned long value)
{
struct vcpu *v = current;
+ struct domain *d = v->domain;
unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
struct page_info *page;
@@ -1973,8 +1974,8 @@ int hvm_set_cr0(unsigned long value)
goto gpf;
/* A pvh is not expected to change to real mode. */
- if ( is_pvh_vcpu(v)
- && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
+ if ( is_pvh_domain(d) &&
+ (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
{
printk(XENLOG_G_WARNING
"PVH attempting to turn off PE/PG. CR0:%lx\n", value);
@@ -1996,16 +1997,16 @@ int hvm_set_cr0(unsigned long value)
hvm_update_guest_efer(v);
}
- if ( !paging_mode_hap(v->domain) )
+ if ( !paging_mode_hap(d) )
{
/* The guest CR3 must be pointing to the guest physical. */
gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
- page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+ page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
if ( !page )
{
gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
v->arch.hvm_vcpu.guest_cr[3]);
- domain_crash(v->domain);
+ domain_crash(d);
return X86EMUL_UNHANDLEABLE;
}
@@ -2032,24 +2033,18 @@ int hvm_set_cr0(unsigned long value)
hvm_update_guest_efer(v);
}
- if ( !paging_mode_hap(v->domain) )
+ if ( !paging_mode_hap(d) )
{
put_page(pagetable_get_page(v->arch.guest_table));
v->arch.guest_table = pagetable_null();
}
}
- /*
- * When cr0.cd setting
- * 1. For guest w/o VT-d, and for guest with VT-d but snooped, Xen need not
- * do anything, since hardware snoop mechanism has ensured cache coherency;
- * 2. For guest with VT-d but non-snooped, cache coherency cannot be
- * guaranteed by h/w so need emulate UC memory type to guest.
- */
if ( ((value ^ old_value) & X86_CR0_CD) &&
- has_arch_pdevs(v->domain) &&
- iommu_enabled && !iommu_snoop &&
- hvm_funcs.handle_cd )
+ iommu_enabled && hvm_funcs.handle_cd &&
+ (!rangeset_is_empty(d->iomem_caps) ||
+ !rangeset_is_empty(d->arch.ioport_caps) ||
+ has_arch_pdevs(d)) )
hvm_funcs.handle_cd(v, value);
hvm_update_cr(v, 0, value);
--- 2014-03-17.orig/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:05.000000000 +0100
+++ 2014-03-17/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:09.000000000 +0100
@@ -713,7 +713,7 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save
void memory_type_changed(struct domain *d)
{
- if ( iommu_enabled && !iommu_snoop && d->vcpu && d->vcpu[0] )
+ if ( iommu_enabled && d->vcpu && d->vcpu[0] )
p2m_memory_type_changed(d);
}
@@ -754,12 +754,6 @@ int epte_get_entry_emt(struct domain *d,
return MTRR_TYPE_WRBACK;
}
- if ( iommu_snoop )
- {
- *ipat = 1;
- return MTRR_TYPE_WRBACK;
- }
-
gmtrr_mtype = is_hvm_domain(d) && v ?
get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
gfn << PAGE_SHIFT, order) :
[-- Attachment #2: EPT-drop-snoop-dependency.patch --]
[-- Type: text/plain, Size: 4164 bytes --]
x86/EPT: IOMMU snoop capability should not affect memory type selection
This capability solely makes a statement on cache coherency guarantees
by the IOMMU. It specifically does not imply any of the further
guarantees associated with certain memory types (cachability, ordering).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
This already drops a few bits of the changes done for XSA-60. I think
that if the earlier patches (and particularly the use of the
EPT_MISCONFIG VM exit) are being accepted, we should switch that whole
mechanism, avoiding the need to play games with the guest's PAT MSR.
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1950,6 +1950,7 @@ static void hvm_update_cr(struct vcpu *v
int hvm_set_cr0(unsigned long value)
{
struct vcpu *v = current;
+ struct domain *d = v->domain;
unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
struct page_info *page;
@@ -1973,8 +1974,8 @@ int hvm_set_cr0(unsigned long value)
goto gpf;
/* A pvh is not expected to change to real mode. */
- if ( is_pvh_vcpu(v)
- && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
+ if ( is_pvh_domain(d) &&
+ (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
{
printk(XENLOG_G_WARNING
"PVH attempting to turn off PE/PG. CR0:%lx\n", value);
@@ -1996,16 +1997,16 @@ int hvm_set_cr0(unsigned long value)
hvm_update_guest_efer(v);
}
- if ( !paging_mode_hap(v->domain) )
+ if ( !paging_mode_hap(d) )
{
/* The guest CR3 must be pointing to the guest physical. */
gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
- page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
+ page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
if ( !page )
{
gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
v->arch.hvm_vcpu.guest_cr[3]);
- domain_crash(v->domain);
+ domain_crash(d);
return X86EMUL_UNHANDLEABLE;
}
@@ -2032,24 +2033,18 @@ int hvm_set_cr0(unsigned long value)
hvm_update_guest_efer(v);
}
- if ( !paging_mode_hap(v->domain) )
+ if ( !paging_mode_hap(d) )
{
put_page(pagetable_get_page(v->arch.guest_table));
v->arch.guest_table = pagetable_null();
}
}
- /*
- * When cr0.cd setting
- * 1. For guest w/o VT-d, and for guest with VT-d but snooped, Xen need not
- * do anything, since hardware snoop mechanism has ensured cache coherency;
- * 2. For guest with VT-d but non-snooped, cache coherency cannot be
- * guaranteed by h/w so need emulate UC memory type to guest.
- */
if ( ((value ^ old_value) & X86_CR0_CD) &&
- has_arch_pdevs(v->domain) &&
- iommu_enabled && !iommu_snoop &&
- hvm_funcs.handle_cd )
+ iommu_enabled && hvm_funcs.handle_cd &&
+ (!rangeset_is_empty(d->iomem_caps) ||
+ !rangeset_is_empty(d->arch.ioport_caps) ||
+ has_arch_pdevs(d)) )
hvm_funcs.handle_cd(v, value);
hvm_update_cr(v, 0, value);
--- 2014-03-17.orig/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:05.000000000 +0100
+++ 2014-03-17/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:09.000000000 +0100
@@ -713,7 +713,7 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save
void memory_type_changed(struct domain *d)
{
- if ( iommu_enabled && !iommu_snoop && d->vcpu && d->vcpu[0] )
+ if ( iommu_enabled && d->vcpu && d->vcpu[0] )
p2m_memory_type_changed(d);
}
@@ -754,12 +754,6 @@ int epte_get_entry_emt(struct domain *d,
return MTRR_TYPE_WRBACK;
}
- if ( iommu_snoop )
- {
- *ipat = 1;
- return MTRR_TYPE_WRBACK;
- }
-
gmtrr_mtype = is_hvm_domain(d) && v ?
get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
gfn << PAGE_SHIFT, order) :
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/8] x86/EPT: various adjustments
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
` (7 preceding siblings ...)
2014-03-26 15:37 ` [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection Jan Beulich
@ 2014-03-26 16:25 ` Jan Beulich
2014-03-27 15:25 ` Tim Deegan
2014-03-28 9:15 ` Tian, Kevin
8 siblings, 2 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-26 16:25 UTC (permalink / raw)
To: xen-devel
Cc: Kevin Tian, Keir Fraser, suravee.suthikulpanit, Tim Deegan,
Eddie Dong, Jun Nakajima, Sherry Hurwitz, Yang Z Zhang
>>> On 26.03.14 at 16:23, <JBeulich@suse.com> wrote:
> 1: p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s
> 2: EPT: simplification and cleanup
> 3: EPT: also dump permissions and memory types
> 4: EPT: relax treatment of APIC MFN
> 5: vMTRR: pass domain to mtrr_*_msr_set()
> 6: EPT: force re-evaluation of memory type as necessary
> 7: EPT: split super pages upon mismatching memory types
> 8: EPT: IOMMU snoop capability should not affect memory type selection
Beyond this series comes the question of how to further leverage
this: One aspect (re-doing the XSA-60 fix) was already mentioned
in one of the patches. The other is that of avoiding the long-running
nature of ept_change_entry_type_global(). That function is being
used for log dirty {en,dis}abling and as a safeguard in the PCI MSI-X
handling code. Whether to do so depends on the answers to a
couple of questions:
1) There's a behavioral discrepancy between shared EPT/IOMMU page
tables and separate ones. This was to some degree already discussed
in the context of what became commit 077fc1c0 ("When enabling log
dirty mode, it sets all guest's memory to readonly", see
http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg00756.html)
with the promise to work on a better solution
(http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg01230.html)
not having seen any action so far.
When the tables are shared, a switch to log-dirty mode will result in
IOMMU faults on DMA writes, whereas with separate tables no faults
would occur, but the log-dirty logic wouldn't know of the dirtied pages.
In the discussion above it was being said that migration with passed
through devices isn't supported anyway, but it seems to me that this
shouldn't be a long term statement of ours.
The question is whether it would be tolerable to make this more
"undefined" by also using the EPT_MISCONFIG VM exit here: From
the IOMMU perspective, some (initially all) of the pages would remain
writable, while with the propagation logic in the VM exit handler some
could then become readonly.
2) The other discrepancy is that of which pages get entered into the
page tables in the first place: Obviously for the CPU side (EPT) all
guest visible memory must be mapped. Otoh in the separate IOMMU
page tables case only p2m_ram_rw pages would ever get mapped
into those page tables, resulting in IOMMU faults should a device be
programmed to do DMA to/from any other region.
This particularly matters for the MSI-X fallback code, which gets into
action only when running with a Dom0 not [properly] making use of
PHYSDEVOP_prepare_msix. Considering that, a first option might be
to rip that fallback out altogether, and kill the affected guest (even
if it's innocent) instead.
The other alternative, keeping the code, could again entertain the
EPT_MISCONFIG VM exit: Considering that in the separate tables
case DMA to p2m_mmio_direct pages faults anyway, IOMMU faults
due to the write protection that's being enforced here wouldn't
matter much. The problem, however, is that the EPT_MISCONFIG
approach specifically leaves the pages accessible by the IOMMU until
a particular page gets its memory type (and for the purpose here
then also access permissions) fixed up. Which imo is another
argument for doing away with the shared page tables.
3) As validly pointed out by Tim already, all this only deals with the
long-runningness on Intel systems, yet to fully address the issue
we'd also need some solution on the AMD side. So the question
regarding possible solutions there goes to AMD.
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s
2014-03-26 15:31 ` [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s Jan Beulich
@ 2014-03-27 12:52 ` Tim Deegan
0 siblings, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 12:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:31 +0000 on 26 Mar (1395844319), Jan Beulich wrote:
> - don't reset access permissions
> - don't shatter super pages
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
> ---
> Should p2m_change_type() also behave consistently?
Well, I suppose all these operations ought to match. It would be
nice to hear from anyone who's actually using the access types whether
it would be better to reset on type change or not. My instinct is to
leave the access permissions alone.
Cheers,
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/8] x86/EPT: simplification and cleanup
2014-03-26 15:33 ` [PATCH 2/8] x86/EPT: simplification and cleanup Jan Beulich
@ 2014-03-27 12:52 ` Tim Deegan
0 siblings, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 12:52 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:33 +0000 on 26 Mar (1395844399), Jan Beulich wrote:
> - drop rsvd*_ prefixes from fields not really reserved anymore
> - replace odd uses of <expr> ? 1 : 0
> - drop pointless variables from ept_set_entry()
> - streamline IOMMU mirroring code in ept_set_entry()
> - don't open code is_epte_valid() (and properly use it when dumping)
> - streamline entry cloning in ept_split_super_page()
> - compact dumping code and output
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/8] x86/EPT: also dump permissions and memory types
2014-03-26 15:34 ` [PATCH 3/8] x86/EPT: also dump permissions and memory types Jan Beulich
@ 2014-03-27 12:54 ` Tim Deegan
2014-03-27 15:16 ` Jan Beulich
2014-03-27 13:06 ` Andrew Cooper
1 sibling, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 12:54 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:34 +0000 on 26 Mar (1395844445), Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -730,6 +730,14 @@ static void ept_dump_p2m_table(unsigned
> unsigned long record_counter = 0;
> struct p2m_domain *p2m;
> struct ept_data *ept;
> + static const char memoryTypes[8][2] = {
s/memoryTypes/memory_types/ would be Xen style.
With that, Reviewed-by: Tim Deegan <tim@xen.org>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 4/8] x86/EPT: relax treatment of APIC MFN
2014-03-26 15:34 ` [PATCH 4/8] x86/EPT: relax treatment of APIC MFN Jan Beulich
@ 2014-03-27 12:55 ` Tim Deegan
0 siblings, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 12:55 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:34 +0000 on 26 Mar (1395844486), Jan Beulich wrote:
> There's no point in this being mapped UC by the guest due to using a
> respective PAT index - set the ignore-PAT flag to true.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set()
2014-03-26 15:35 ` [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set() Jan Beulich
@ 2014-03-27 12:55 ` Tim Deegan
0 siblings, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 12:55 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:35 +0000 on 26 Mar (1395844523), Jan Beulich wrote:
> This is in preparation for the next patch, and mtrr_def_type_msr_set()
> and mtrr_fix_range_msr_set() in sync with mtrr_var_range_msr_set() in
> this regard.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/8] x86/EPT: also dump permissions and memory types
2014-03-26 15:34 ` [PATCH 3/8] x86/EPT: also dump permissions and memory types Jan Beulich
2014-03-27 12:54 ` Tim Deegan
@ 2014-03-27 13:06 ` Andrew Cooper
2014-03-27 15:18 ` Jan Beulich
1 sibling, 1 reply; 36+ messages in thread
From: Andrew Cooper @ 2014-03-27 13:06 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang, xen-devel
[-- Attachment #1.1: Type: text/plain, Size: 1924 bytes --]
On 26/03/14 15:34, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> --- a/xen/arch/x86/mm/p2m-ept.c
> +++ b/xen/arch/x86/mm/p2m-ept.c
> @@ -730,6 +730,14 @@ static void ept_dump_p2m_table(unsigned
> unsigned long record_counter = 0;
> struct p2m_domain *p2m;
> struct ept_data *ept;
> + static const char memoryTypes[8][2] = {
> + [0 ... 7] = "?",
> + [MTRR_TYPE_UNCACHABLE] = "UC",
> + [MTRR_TYPE_WRCOMB] = "WC",
> + [MTRR_TYPE_WRTHROUGH] = "WT",
> + [MTRR_TYPE_WRPROT] = "WP",
> + [MTRR_TYPE_WRBACK] = "WB",
> + };
Don't we have an identical structure somewhere else in Xen already?
~Andrew
>
> for_each_domain(d)
> {
> @@ -759,8 +767,15 @@ static void ept_dump_p2m_table(unsigned
> if ( ept_entry->sa_p2mt == p2m_populate_on_demand )
> printk("gfn: %13lx order: %2d PoD\n", gfn, order);
> else
> - printk("gfn: %13lx order: %2d mfn: %13lx\n",
> - gfn, order, ept_entry->mfn + 0UL);
> + printk("gfn: %13lx order: %2d mfn: %13lx %c%c%c %c%c%c\n",
> + gfn, order, ept_entry->mfn + 0UL,
> + ept_entry->r ? 'r' : ' ',
> + ept_entry->w ? 'w' : ' ',
> + ept_entry->x ? 'x' : ' ',
> + memoryTypes[ept_entry->emt][0],
> + memoryTypes[ept_entry->emt][1]
> + ?: ept_entry->emt + '0',
> + ept_entry->ipat ? '!' : ' ');
>
> if ( !(record_counter++ % 100) )
> process_pending_softirqs();
>
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
[-- Attachment #1.2: Type: text/html, Size: 2736 bytes --]
[-- Attachment #2: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary
2014-03-26 15:36 ` [PATCH 6/8] x86/EPT: force re‑evaluation of memory type as necessary Jan Beulich
@ 2014-03-27 13:08 ` Tim Deegan
2014-03-27 15:23 ` Jan Beulich
0 siblings, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 13:08 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:36 +0000 on 26 Mar (1395844584), Jan Beulich wrote:
> +bool_t ept_handle_misconfig(uint64_t gpa)
> +{
> + struct vcpu *curr = current;
> + struct p2m_domain *p2m = p2m_get_hostp2m(curr->domain);
> + struct ept_data *ept = &p2m->ept;
> + unsigned int level = ept_get_wl(ept);
> + unsigned long gfn = PFN_DOWN(gpa);
> + unsigned long mfn = ept_get_asr(ept);
> + ept_entry_t *epte;
> + bool_t okay;
> +
> + if ( !mfn )
> + return 0;
> +
> + p2m_lock(p2m);
> +
> + for ( okay = curr->arch.hvm_vmx.ept_spurious_misconfig; ; --level) {
I'd prefer a 'while (1)' with the --level at the bottom of the loop
beside where mfn gets set to the next level's mfn. If not that, can
you at least move the init of 'okay' out to its own line?
> + ept_entry_t e;
> + unsigned int i;
> +
> + epte = map_domain_page(mfn);
> + i = (gfn >> (level * EPT_TABLE_ORDER)) & (EPT_PAGETABLE_ENTRIES - 1);
> + e = atomic_read_ept_entry(&epte[i]);
> +
> + if ( level == 0 || is_epte_superpage(&e) )
> + {
> + uint8_t ipat = 0;
> + struct vcpu *v;
> +
> + if ( e.emt != MTRR_NUM_TYPES )
> + break;
> +
> + if ( level == 0 )
> + {
> + for ( gfn -= i, i = 0; i < EPT_PAGETABLE_ENTRIES; ++i )
> + {
> + e = atomic_read_ept_entry(&epte[i]);
> + if ( e.emt == MTRR_NUM_TYPES )
> + e.emt = 0;
> + if ( !is_epte_valid(&e) || !is_epte_present(&e) )
> + continue;
> + e.emt = epte_get_entry_emt(p2m->domain, gfn + i,
> + _mfn(e.mfn), &ipat,
> + e.sa_p2mt == p2m_mmio_direct);
> + e.ipat = ipat;
> + atomic_write_ept_entry(&epte[i], e);
> + }
> + }
> + else
> + {
> + e.emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
> + &ipat,
> + e.sa_p2mt == p2m_mmio_direct);
> + e.ipat = ipat;
> + atomic_write_ept_entry(&epte[i], e);
> + }
> +
> + for_each_vcpu ( curr->domain, v )
> + v->arch.hvm_vmx.ept_spurious_misconfig = 1;
> + okay = 1;
> + break;
> + }
> +
> + if ( e.emt == MTRR_NUM_TYPES )
> + {
> + ASSERT(is_epte_present(&e));
> + e.emt = 0;
> + atomic_write_ept_entry(&epte[i], e);
> + unmap_domain_page(epte);
> +
> + ept_invalidate_emt(_mfn(e.mfn));
> + okay = 1;
Do you need to also set ept_spurious_misconfig flags here? At the
moment you set them when a present entry gets reset, but what about a
walk that hits a not-present entry (e.g. mmio_dm)? There, the first
CPU to take a fault will clear up the intermediate nodes, but not
> make any change at level 0 or set the ept_spurious_misconfig flags.
Cheers,
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types
2014-03-26 15:36 ` [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types Jan Beulich
@ 2014-03-27 15:04 ` Tim Deegan
2014-03-27 15:25 ` Jan Beulich
0 siblings, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:04 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:36 +0000 on 26 Mar (1395844618), Jan Beulich wrote:
> ... between constituent pages. To indicate such, the page order is
> being passed down to the vMTRR routines, with a negative return value
> (possible only on order-non-zero pages) indicating such collisions.
>
> Some code redundancy reduction is being done to ept_set_entry() along
> the way, allowing the new handling to be centralized to a single place
> there.
>
> In order to keep ept_set_entry() fast and simple, the actual splitting
> is being deferred to the EPT_MISCONFIG VM exit handler.
You expect that to be rare, right? We're not going to incur a lot of
extra pagefaults?
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> One open question is whether it is okay for memory allocation (for an
> intermediate page table) failure to be fatal to the domain. If it
> isn't, then deferring the splitting work from ept_set_entry() to
> ept_handle_misconfig() is not an option. But even with that eliminated
> there's then still potential for ept_handle_misconfig() needing to
> split pages, and it would then need to be determined how to gracefully
> handle that.
Hmmm, true. Since we can't tell in advance how much memory will be
needed, it's not possible to detect all the errors up front. And I
can't think of a nice way to fail later on that doesn't go back to
having large loops over many gfns. So I think we have to live with it.
Reviewed-by: Tim Deegan <tim@xen.org>
Cheers,
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-03-26 15:37 ` [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection Jan Beulich
@ 2014-03-27 15:12 ` Tim Deegan
2014-03-27 15:32 ` Jan Beulich
0 siblings, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:12 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
> This capability solely makes a statement on cache coherency guarantees
> by the IOMMU. It specifically does not imply any of the further
> guarantees associated with certain memory types (cachability, ordering).
Can you give some examples of what this is protecting against?
Cachability is irrelevant unless there's some other form of direct
access that's not covered by the IOMMU, and x86 ordering is pretty
strict.
Cheers,
Tim.
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> This already drops a few bits of the changes done for XSA-60. I think
> that if the earlier patches (and particularly the use of the
> EPT_MISCONFIG VM exit) are being accepted, we should switch that whole
> mechanism, avoiding the need to play games with the guest's PAT MSR.
>
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1950,6 +1950,7 @@ static void hvm_update_cr(struct vcpu *v
> int hvm_set_cr0(unsigned long value)
> {
> struct vcpu *v = current;
> + struct domain *d = v->domain;
> unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
> struct page_info *page;
>
> @@ -1973,8 +1974,8 @@ int hvm_set_cr0(unsigned long value)
> goto gpf;
>
> /* A pvh is not expected to change to real mode. */
> - if ( is_pvh_vcpu(v)
> - && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
> + if ( is_pvh_domain(d) &&
> + (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
> {
> printk(XENLOG_G_WARNING
> "PVH attempting to turn off PE/PG. CR0:%lx\n", value);
> @@ -1996,16 +1997,16 @@ int hvm_set_cr0(unsigned long value)
> hvm_update_guest_efer(v);
> }
>
> - if ( !paging_mode_hap(v->domain) )
> + if ( !paging_mode_hap(d) )
> {
> /* The guest CR3 must be pointing to the guest physical. */
> gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
> - page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
> + page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
> if ( !page )
> {
> gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
> v->arch.hvm_vcpu.guest_cr[3]);
> - domain_crash(v->domain);
> + domain_crash(d);
> return X86EMUL_UNHANDLEABLE;
> }
>
> @@ -2032,24 +2033,18 @@ int hvm_set_cr0(unsigned long value)
> hvm_update_guest_efer(v);
> }
>
> - if ( !paging_mode_hap(v->domain) )
> + if ( !paging_mode_hap(d) )
> {
> put_page(pagetable_get_page(v->arch.guest_table));
> v->arch.guest_table = pagetable_null();
> }
> }
>
> - /*
> - * When cr0.cd setting
> - * 1. For guest w/o VT-d, and for guest with VT-d but snooped, Xen need not
> - * do anything, since hardware snoop mechanism has ensured cache coherency;
> - * 2. For guest with VT-d but non-snooped, cache coherency cannot be
> - * guaranteed by h/w so need emulate UC memory type to guest.
> - */
> if ( ((value ^ old_value) & X86_CR0_CD) &&
> - has_arch_pdevs(v->domain) &&
> - iommu_enabled && !iommu_snoop &&
> - hvm_funcs.handle_cd )
> + iommu_enabled && hvm_funcs.handle_cd &&
> + (!rangeset_is_empty(d->iomem_caps) ||
> + !rangeset_is_empty(d->arch.ioport_caps) ||
> + has_arch_pdevs(d)) )
> hvm_funcs.handle_cd(v, value);
>
> hvm_update_cr(v, 0, value);
> --- 2014-03-17.orig/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:05.000000000 +0100
> +++ 2014-03-17/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:09.000000000 +0100
> @@ -713,7 +713,7 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save
>
> void memory_type_changed(struct domain *d)
> {
> - if ( iommu_enabled && !iommu_snoop && d->vcpu && d->vcpu[0] )
> + if ( iommu_enabled && d->vcpu && d->vcpu[0] )
> p2m_memory_type_changed(d);
> }
>
> @@ -754,12 +754,6 @@ int epte_get_entry_emt(struct domain *d,
> return MTRR_TYPE_WRBACK;
> }
>
> - if ( iommu_snoop )
> - {
> - *ipat = 1;
> - return MTRR_TYPE_WRBACK;
> - }
> -
> gmtrr_mtype = is_hvm_domain(d) && v ?
> get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
> gfn << PAGE_SHIFT, order) :
>
>
>
> x86/EPT: IOMMU snoop capability should not affect memory type selection
>
> This capability solely makes a statement on cache coherency guarantees
> by the IOMMU. It does specifically not imply any further guarantees
> implied by certain memory types (cachability, ordering).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> This already drops a few bits of the changes done for XSA-60. I think
> that if the earlier patches (and particularly the use of the
> EPT_MISCONFIG VM exit) are being accepted, we should switch that whole
> mechanism, avoiding the need to play games with the guest's PAT MSR.
>
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -1950,6 +1950,7 @@ static void hvm_update_cr(struct vcpu *v
> int hvm_set_cr0(unsigned long value)
> {
> struct vcpu *v = current;
> + struct domain *d = v->domain;
> unsigned long gfn, old_value = v->arch.hvm_vcpu.guest_cr[0];
> struct page_info *page;
>
> @@ -1973,8 +1974,8 @@ int hvm_set_cr0(unsigned long value)
> goto gpf;
>
> /* A pvh is not expected to change to real mode. */
> - if ( is_pvh_vcpu(v)
> - && (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
> + if ( is_pvh_domain(d) &&
> + (value & (X86_CR0_PE | X86_CR0_PG)) != (X86_CR0_PG | X86_CR0_PE) )
> {
> printk(XENLOG_G_WARNING
> "PVH attempting to turn off PE/PG. CR0:%lx\n", value);
> @@ -1996,16 +1997,16 @@ int hvm_set_cr0(unsigned long value)
> hvm_update_guest_efer(v);
> }
>
> - if ( !paging_mode_hap(v->domain) )
> + if ( !paging_mode_hap(d) )
> {
> /* The guest CR3 must be pointing to the guest physical. */
> gfn = v->arch.hvm_vcpu.guest_cr[3]>>PAGE_SHIFT;
> - page = get_page_from_gfn(v->domain, gfn, NULL, P2M_ALLOC);
> + page = get_page_from_gfn(d, gfn, NULL, P2M_ALLOC);
> if ( !page )
> {
> gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx\n",
> v->arch.hvm_vcpu.guest_cr[3]);
> - domain_crash(v->domain);
> + domain_crash(d);
> return X86EMUL_UNHANDLEABLE;
> }
>
> @@ -2032,24 +2033,18 @@ int hvm_set_cr0(unsigned long value)
> hvm_update_guest_efer(v);
> }
>
> - if ( !paging_mode_hap(v->domain) )
> + if ( !paging_mode_hap(d) )
> {
> put_page(pagetable_get_page(v->arch.guest_table));
> v->arch.guest_table = pagetable_null();
> }
> }
>
> - /*
> - * When cr0.cd setting
> - * 1. For guest w/o VT-d, and for guest with VT-d but snooped, Xen need not
> - * do anything, since hardware snoop mechanism has ensured cache coherency;
> - * 2. For guest with VT-d but non-snooped, cache coherency cannot be
> - * guaranteed by h/w so need emulate UC memory type to guest.
> - */
> if ( ((value ^ old_value) & X86_CR0_CD) &&
> - has_arch_pdevs(v->domain) &&
> - iommu_enabled && !iommu_snoop &&
> - hvm_funcs.handle_cd )
> + iommu_enabled && hvm_funcs.handle_cd &&
> + (!rangeset_is_empty(d->iomem_caps) ||
> + !rangeset_is_empty(d->arch.ioport_caps) ||
> + has_arch_pdevs(d)) )
> hvm_funcs.handle_cd(v, value);
>
> hvm_update_cr(v, 0, value);
> --- 2014-03-17.orig/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:05.000000000 +0100
> +++ 2014-03-17/xen/arch/x86/hvm/mtrr.c 2014-03-26 12:25:09.000000000 +0100
> @@ -713,7 +713,7 @@ HVM_REGISTER_SAVE_RESTORE(MTRR, hvm_save
>
> void memory_type_changed(struct domain *d)
> {
> - if ( iommu_enabled && !iommu_snoop && d->vcpu && d->vcpu[0] )
> + if ( iommu_enabled && d->vcpu && d->vcpu[0] )
> p2m_memory_type_changed(d);
> }
>
> @@ -754,12 +754,6 @@ int epte_get_entry_emt(struct domain *d,
> return MTRR_TYPE_WRBACK;
> }
>
> - if ( iommu_snoop )
> - {
> - *ipat = 1;
> - return MTRR_TYPE_WRBACK;
> - }
> -
> gmtrr_mtype = is_hvm_domain(d) && v ?
> get_mtrr_type(&v->arch.hvm_vcpu.mtrr,
> gfn << PAGE_SHIFT, order) :
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/8] x86/EPT: also dump permissions and memory types
2014-03-27 12:54 ` Tim Deegan
@ 2014-03-27 15:16 ` Jan Beulich
0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:16 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 27.03.14 at 13:54, <tim@xen.org> wrote:
> At 15:34 +0000 on 26 Mar (1395844445), Jan Beulich wrote:
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -730,6 +730,14 @@ static void ept_dump_p2m_table(unsigned
>> unsigned long record_counter = 0;
>> struct p2m_domain *p2m;
>> struct ept_data *ept;
>> + static const char memoryTypes[8][2] = {
>
> s/memoryTypes/memory_types/ would be Xen style.
Don't know what I was thinking...
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/8] x86/EPT: also dump permissions and memory types
2014-03-27 13:06 ` Andrew Cooper
@ 2014-03-27 15:18 ` Jan Beulich
0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:18 UTC (permalink / raw)
To: Andrew Cooper
Cc: Kevin Tian, Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima,
Yang Z Zhang, xen-devel
>>> On 27.03.14 at 14:06, <andrew.cooper3@citrix.com> wrote:
> On 26/03/14 15:34, Jan Beulich wrote:
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
>> --- a/xen/arch/x86/mm/p2m-ept.c
>> +++ b/xen/arch/x86/mm/p2m-ept.c
>> @@ -730,6 +730,14 @@ static void ept_dump_p2m_table(unsigned
>> unsigned long record_counter = 0;
>> struct p2m_domain *p2m;
>> struct ept_data *ept;
>> + static const char memoryTypes[8][2] = {
>> + [0 ... 7] = "?",
>> + [MTRR_TYPE_UNCACHABLE] = "UC",
>> + [MTRR_TYPE_WRCOMB] = "WC",
>> + [MTRR_TYPE_WRTHROUGH] = "WT",
>> + [MTRR_TYPE_WRPROT] = "WP",
>> + [MTRR_TYPE_WRBACK] = "WB",
>> + };
>
> Don't we have an identical structure somewhere else in Xen already?
I added a similar one recently, but I decidedly want only two characters
here (these dumps are already unreasonably large).
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary
2014-03-27 13:08 ` [PATCH 6/8] x86/EPT: force re-evaluation " Tim Deegan
@ 2014-03-27 15:23 ` Jan Beulich
2014-03-27 15:48 ` Tim Deegan
0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:23 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 27.03.14 at 14:08, <tim@xen.org> wrote:
> At 15:36 +0000 on 26 Mar (1395844584), Jan Beulich wrote:
>> +bool_t ept_handle_misconfig(uint64_t gpa)
>> +{
>> + struct vcpu *curr = current;
>> + struct p2m_domain *p2m = p2m_get_hostp2m(curr->domain);
>> + struct ept_data *ept = &p2m->ept;
>> + unsigned int level = ept_get_wl(ept);
>> + unsigned long gfn = PFN_DOWN(gpa);
>> + unsigned long mfn = ept_get_asr(ept);
>> + ept_entry_t *epte;
>> + bool_t okay;
>> +
>> + if ( !mfn )
>> + return 0;
>> +
>> + p2m_lock(p2m);
>> +
>> + for ( okay = curr->arch.hvm_vmx.ept_spurious_misconfig; ; --level) {
>
> I'd prefer a 'while (1)' with the --level at the bottom of the loop
> beside where mfn gets set to the next level's mfn. If not that, can
> you at least move the init of 'okay' out to its own line?
Moving it onto its own line is okay with me (albeit I think having it in the
for() is expressing more clearly that this is the loop start state).
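To make the two shapes being discussed explicit (nothing new here, just
restating the alternatives for the hunk quoted above):
    /* as posted */
    for ( okay = curr->arch.hvm_vmx.ept_spurious_misconfig; ; --level )
    /* with the initializer moved out, as suggested */
    okay = curr->arch.hvm_vmx.ept_spurious_misconfig;
    for ( ; ; --level )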
>> + ept_entry_t e;
>> + unsigned int i;
>> +
>> + epte = map_domain_page(mfn);
>> + i = (gfn >> (level * EPT_TABLE_ORDER)) & (EPT_PAGETABLE_ENTRIES - 1);
>> + e = atomic_read_ept_entry(&epte[i]);
>> +
>> + if ( level == 0 || is_epte_superpage(&e) )
>> + {
>> + uint8_t ipat = 0;
>> + struct vcpu *v;
>> +
>> + if ( e.emt != MTRR_NUM_TYPES )
>> + break;
>> +
>> + if ( level == 0 )
>> + {
>> + for ( gfn -= i, i = 0; i < EPT_PAGETABLE_ENTRIES; ++i )
>> + {
>> + e = atomic_read_ept_entry(&epte[i]);
>> + if ( e.emt == MTRR_NUM_TYPES )
>> + e.emt = 0;
>> + if ( !is_epte_valid(&e) || !is_epte_present(&e) )
>> + continue;
>> + e.emt = epte_get_entry_emt(p2m->domain, gfn + i,
>> + _mfn(e.mfn), &ipat,
>> + e.sa_p2mt == p2m_mmio_direct);
>> + e.ipat = ipat;
>> + atomic_write_ept_entry(&epte[i], e);
>> + }
>> + }
>> + else
>> + {
>> + e.emt = epte_get_entry_emt(p2m->domain, gfn, _mfn(e.mfn),
>> + &ipat,
>> + e.sa_p2mt == p2m_mmio_direct);
>> + e.ipat = ipat;
>> + atomic_write_ept_entry(&epte[i], e);
>> + }
>> +
>> + for_each_vcpu ( curr->domain, v )
>> + v->arch.hvm_vmx.ept_spurious_misconfig = 1;
>> + okay = 1;
>> + break;
>> + }
>> +
>> + if ( e.emt == MTRR_NUM_TYPES )
>> + {
>> + ASSERT(is_epte_present(&e));
>> + e.emt = 0;
>> + atomic_write_ept_entry(&epte[i], e);
>> + unmap_domain_page(epte);
>> +
>> + ept_invalidate_emt(_mfn(e.mfn));
>> + okay = 1;
>
> Do you need to also set ept_spurious_misconfig flags here? At the
> moment you set them when a present entry gets reset, but what about a
> walk that hits a not-present entry (e.g. mmio_dm)? There, the first
> CPU to take a fault will clear up the intermediate nodes, but not
> make any change at level 0 or set the ept_spurious_misconfig flags.
Actually I think I need to swap the order of operations instead:
first invalidate the next level, then clear .emt. Then the current
model of needing to set the flag only when making an entry valid
will be correct/consistent afaict.
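I.e. roughly this (untested, just to illustrate the intended ordering of
that branch, everything else unchanged):

    if ( e.emt == MTRR_NUM_TYPES )
    {
        ASSERT(is_epte_present(&e));
        /* Invalidate the next level first ... */
        ept_invalidate_emt(_mfn(e.mfn));
        /* ... and only then clear the sentinel EMT value on this entry,
           so no other CPU can complete a walk through here before the
           level below got marked for re-evaluation. */
        e.emt = 0;
        atomic_write_ept_entry(&epte[i], e);
        unmap_domain_page(epte);
        okay = 1;
    }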
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/8] x86/EPT: various adjustments
2014-03-26 16:25 ` [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
@ 2014-03-27 15:25 ` Tim Deegan
2014-03-27 15:43 ` Jan Beulich
2014-03-28 9:15 ` Tian, Kevin
1 sibling, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:25 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, suravee.suthikulpanit, Eddie Dong,
Jun Nakajima, Sherry Hurwitz, Yang Z Zhang, xen-devel
At 16:25 +0000 on 26 Mar (1395847547), Jan Beulich wrote:
> >>> On 26.03.14 at 16:23, <JBeulich@suse.com> wrote:
> > 1: p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s
> > 2: EPT: simplification and cleanup
> > 3: EPT: also dump permissions and memory types
> > 4: EPT: relax treatment of APIC MFN
> > 5: vMTRR: pass domain to mtrr_*_msr_set()
> > 6: EPT: force re-evaluation of memory type as necessary
> > 7: EPT: split super pages upon mismatching memory types
> > 8: EPT: IOMMU snoop capability should not affect memory type selection
>
> Beyond this series comes the question of how to further leverage
> this: One aspect (re-doing the XSA-60 fix) was already mentioned
> in one of the patches. The other is that of avoiding the long running
> nature of ept_change_entry_type_global(). That function is being
> used for log dirty {en,dis}abling and as a safeguard in the PCI MSI-X
> handling code. Whether to do so depends on the answers to a
> couple of questions:
>
> 1) There's a behavioral discrepancy between shared EPT/IOMMU page
> tables and separate ones. This was to some degree already discussed
> in the context of what became commit 077fc1c0 ("When enabling log
> dirty mode, it sets all guest's memory to readonly", see
> http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg00756.html)
> with the promise to work on a better solution
> (http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg01230.html)
> not having seen any action so far.
>
> When the tables are shared, a switch to log-dirty mode will result in
> IOMMU faults on DMA writes, whereas with separate tables no faults
> would occur, but the log-dirty logic wouldn't know of the dirtied pages.
> In the discussion above it was being said that migration with passed
> through devices isn't supported anyway, but it seems to me that this
> shouldn't be a long term statement of ours.
Well, there are other problems with migration with passed-through
devices. Until we support restartable DMA on IOMMU faults, we can't
even begin to go near that. And once we do, we should just feed those
IOMMU faults through the same path as CPU pagefaults and everything will
just come out in the wash.
> The question is whether it would be tolerable to make this more
> "undefined" by also using the EPT_MISCONFIG VM exit here: From
> the IOMMU perspective, some (initially all) of the pages would remain
> writable, while with the propagation logic in the VM exit handler some
> could then become readonly.
The proposed uses of EPT_MISCONFIG seem OK to me -- they're entirely
to enforce correct emulation of CPU/MMU behaviour, which has no
equivalent in DMA anyway.
> 2) The other discrepancy is that of which pages get entered into the
> page tables in the first place: Obviously for the CPU side (EPT) all
> guest visible memory must be mapped. Otoh in the separate IOMMU
> page tables case only p2m_ram_rw pages would ever get mapped
> into those page tables, resulting in IOMMU faults should a device be
> programmed to do DMA to/from any other region.
>
> This particularly matters for the MSI-X fallback code, which gets into
> action only when running with a Dom0 not [properly] making use of
> PHYSDEVOP_prepare_msix. Considering that a first option might be
> to rip that fallback out altogether, and kill the affected guest (even
> if it's innocent) instead.
So with the current fallback code, if we use shared tables it works OK
but with separate tables it doesn't (and never has) because the
mapping's not propagated?
But ISTR that on the AMD side, only p2m_ram_rw memory is visible via
the IOMMU even with shared tables.
I suspect that I don't understand the problem properly yet. :)
Tim.
> The other alternative, keeping the code, could again entertain the
> EPT_MISCONFIG VM exit: Considering that in the separate tables
> case DMA to p2m_mmio_direct pages faults anyway, IOMMU faults
> due to the write protection that's being enforced here wouldn't
> matter much. The problem, however, is that the EPT_MISCONFIG
> approach specifically leaves the pages accessible by the IOMMU until
> a particular page gets its memory type (and for the purpose here
> then also access permissions) fixed up. Which imo is another
> argument for doing away with the shared page tables.
>
> 3) As validly pointed out by Tim already, all this only deals with the
> long-runningness on Intel systems, yet to fully address the issue
> we'd also need some solution on the AMD side. So the question
> regarding possible solutions there goes to AMD.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types
2014-03-27 15:04 ` Tim Deegan
@ 2014-03-27 15:25 ` Jan Beulich
0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:25 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 27.03.14 at 16:04, <tim@xen.org> wrote:
> At 15:36 +0000 on 26 Mar (1395844618), Jan Beulich wrote:
>> ... between constituent pages. To indicate such, the page order is
>> being passed down to the vMTRR routines, with a negative return value
>> (possible only on order-non-zero pages) indicating such collisions.
>>
>> Some code redundancy reduction is being done to ept_set_entry() along
>> the way, allowing the new handling to be centralized to a single place
>> there.
>>
>> In order to keep ept_set_entry() fast and simple, the actual splitting
>> is being deferred to the EPT_MISCONFIG VM exit handler.
>
> You expect that to be rare, right? We're not going to incur a lot of
> extra pagefaults?
Yes, since type changes are rare (normally limited to guest boot and
passthrough device hotplug).
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-03-27 15:12 ` Tim Deegan
@ 2014-03-27 15:32 ` Jan Beulich
2014-03-27 15:56 ` Tim Deegan
2014-04-04 7:30 ` Tian, Kevin
0 siblings, 2 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:32 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
> At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
>> This capability solely makes a statement on cache coherency guarantees
>> by the IOMMU. It does specifically not imply any further guarantees
>> implied by certain memory types (cachability, ordering).
>
> Can you give some examples of what this is protecting against?
>
> Cachability is irrelevant unless there's some other form of direct
> access that's not covered by the IOMMU, and x86 ordering is pretty
> strict.
What the IOMMU gets to see already depends on cachability: Especially
for write buffers (WC) I don't think cache coherence really matters.
And my understanding of "snoop" also means that stuff held in caches
(WB) may not become visible as needed when some RAM page is
shared between CPU and some device (namely a GPU): The IOMMU
can certainly snoop any accesses that hit the bus, but I don't think it
can enforce write-back to happen for an area that's marked WB but
was supposed to be UC. But maybe I'm wrong with this...
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/8] x86/EPT: various adjustments
2014-03-27 15:25 ` Tim Deegan
@ 2014-03-27 15:43 ` Jan Beulich
2014-03-27 15:51 ` Tim Deegan
0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 15:43 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, suravee.suthikulpanit, Eddie Dong,
Jun Nakajima, Sherry Hurwitz, Yang Z Zhang, xen-devel
>>> On 27.03.14 at 16:25, <tim@xen.org> wrote:
> At 16:25 +0000 on 26 Mar (1395847547), Jan Beulich wrote:
>> 2) The other discrepancy is that of which pages get entered into the
>> page tables in the first place: Obviously for the CPU side (EPT) all
>> guest visible memory must be mapped. Otoh in the separate IOMMU
>> page tables case only p2m_ram_rw pages would ever get mapped
>> into those page tables, resulting in IOMMU faults should a device be
>> programmed to do DMA to/from any other region.
>>
>> This particularly matters for the MSI-X fallback code, which gets into
>> action only when running with a Dom0 not [properly] making use of
>> PHYSDEVOP_prepare_msix. Considering that a first option might be
>> to rip that fallback out altogether, and kill the affected guest (even
>> if it's innocent) instead.
>
> So with the current fallback code, if we use shared tables it works OK
> but with separate tables it doesn't (and never has) because the
> mapping's not propagated?
No, with separate tables we're fine because only p2m_ram_rw pages
ever get entered into the IOMMU ones. I.e. in that case DMA accesses
to that range would always fault. The problem is that if we were to
use the EPT_MISCONFIG approach here, that wouldn't - in the shared
tables case - affect the IOMMU until the respective page got a CPU side
access, causing its entry's flags to be corrected.
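(For reference, the propagation in the separate tables case amounts to
roughly the below; heavily simplified sketch ignoring page orders and
error handling, not the exact code:

    if ( p2mt == p2m_ram_rw )
        iommu_map_page(d, gfn, mfn_x(mfn),
                       IOMMUF_readable | IOMMUF_writable);
    else
        iommu_unmap_page(d, gfn);

i.e. anything other than p2m_ram_rw simply never becomes visible to the
IOMMU.)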
> But ISTR that on the AMD side, only p2m_ram_rw memory is visible via
> the IOMMU even with shared tables.
That would be good; is that because of using bits that the IOMMU treats
as reserved (the basic reason for p2m_ram_rw needing to be number 0)?
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary
2014-03-27 15:23 ` Jan Beulich
@ 2014-03-27 15:48 ` Tim Deegan
2014-03-27 16:16 ` Jan Beulich
0 siblings, 1 reply; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:48 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:23 +0000 on 27 Mar (1395930232), Jan Beulich wrote:
> >
> > Do you need to also set ept_spurious_misconfig flags here? At the
> > moment you set them when a present entry gets reset, but what about a
> > walk that hits a not-present entry (e.g. mmio_dm)? There, the first
> > CPU to take a fault will clear up the intermediate nodes, but not
> > make any change at level 0 or set the ept_spurious_misconfig flags.
>
> Actually I think I need to swap the order of operations instead:
> first invalidate the next level, then clear .emt.
Oh yes, good point.
> Then the current
> model of needing to set the flag only when making an entry valid
> will be correct/consistent afaict.
Let me work through an example: in a 2-level EPT table (for clarity),
the guest has an mmio-dm entry (or a log-dirty, or otherwise not
present). The EPTEs look like this:
L1: present
L0: not present, type mmio_dm
after an MTRR change it look like this:
L1: present, bad EMT
L0: not present, type mmio_dm
Two CPUs fault on it at the same time. The first one to the p2m lock
goes through this walk, clears the bad EMT on the L1 and
ept_invalidate_emt()s the whole L0 page. That _doesn't_ set bad EMT
on the L0 entry because the L0 entry is not present.
L1: present
L0: not present, type mmio_dm
The other CPU then takes the p2m lock, sees a perfectly good walk and
hasn't been told to expect a spurious fault --> crash.
So if we clear any intermediate EMTs we need to warn the other CPUs
to expect spurious faults.
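I.e. something like this alongside the intermediate-level invalidation
(just a sketch of where the flag-setting would also be needed, using the
names from your patch):

    ept_invalidate_emt(_mfn(e.mfn));
    for_each_vcpu ( curr->domain, v )
        v->arch.hvm_vmx.ept_spurious_misconfig = 1;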
Cheers,
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/8] x86/EPT: various adjustments
2014-03-27 15:43 ` Jan Beulich
@ 2014-03-27 15:51 ` Tim Deegan
0 siblings, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:51 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, suravee.suthikulpanit, Eddie Dong,
Jun Nakajima, Sherry Hurwitz, Yang Z Zhang, xen-devel
At 15:43 +0000 on 27 Mar (1395931432), Jan Beulich wrote:
> >>> On 27.03.14 at 16:25, <tim@xen.org> wrote:
> > At 16:25 +0000 on 26 Mar (1395847547), Jan Beulich wrote:
> >> 2) The other discrepancy is that of which pages get entered into the
> >> page tables in the first place: Obviously for the CPU side (EPT) all
> >> guest visible memory must be mapped. Otoh in the separate IOMMU
> >> page tables case only p2m_ram_rw pages would ever get mapped
> >> into those page tables, resulting in IOMMU faults should a device be
> >> programmed to do DMA to/from any other region.
> >>
> >> This particularly matters for the MSI-X fallback code, which gets into
> >> action only when running with a Dom0 not [properly] making use of
> >> PHYSDEVOP_prepare_msix. Considering that a first option might be
> >> to rip that fallback out altogether, and kill the affected guest (even
> >> if it's innocent) instead.
> >
> > So with the current fallback code, if we use shared tables it works OK
> > but with separate tables it doesn't (and never has) because the
> > mapping's not propagated?
>
> No, with separate tables we're fine because only p2m_ram_rw pages
> ever get entered into the IOMMU ones. I.e. in that case DMA accesses
> to that range would always fault. The problem is that if we were to
> use the EPT_MISCONFIG approach here, that wouldn't - in the shared
> tables case - affect the IOMMU until the respective page got a CPU side
> access, causing its entry's flags to be corrected.
Right, I see. So, as you say, another reason to go back to having
separate tables.
> > But ISTR that on the AMD side, only p2m_ram_rw memory is visible via
> > the IOMMU even with shared tables.
>
> That would be good; is that because of using bits that the IOMMU treats
> as reserved (the basic reason for p2m_ram_rw needing to be number 0)?
Yep.
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-03-27 15:32 ` Jan Beulich
@ 2014-03-27 15:56 ` Tim Deegan
2014-04-04 7:30 ` Tian, Kevin
1 sibling, 0 replies; 36+ messages in thread
From: Tim Deegan @ 2014-03-27 15:56 UTC (permalink / raw)
To: Jan Beulich
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
At 15:32 +0000 on 27 Mar (1395930772), Jan Beulich wrote:
> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
> >> This capability solely makes a statement on cache coherency guarantees
> >> by the IOMMU. It does specifically not imply any further guarantees
> >> implied by certain memory types (cachability, ordering).
> >
> > Can you give some examples of what this is protecting against?
> >
> > Cachability is irrelevant unless there's some other form of direct
> > access that's not covered by the IOMMU, and x86 ordering is pretty
> > strict.
>
> What the IOMMU gets to see already depends on cachability: Especially
> for write buffers (WC) I don't think cache coherence really matters.
Right. The x86 ordering rules, afaict, mean that this problem is
already solved for other cache-coherent memory writes, but maybe
it's not for OUTB, or for real direct MMIO?
At this point, I'll defer to anyone at Intel who understands how the
snooping actually works. :) I have 34 more patches to review
today...
Tim.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 6/8] x86/EPT: force re-evaluation of memory type as necessary
2014-03-27 15:48 ` Tim Deegan
@ 2014-03-27 16:16 ` Jan Beulich
0 siblings, 0 replies; 36+ messages in thread
From: Jan Beulich @ 2014-03-27 16:16 UTC (permalink / raw)
To: Tim Deegan
Cc: Kevin Tian, Keir Fraser, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 27.03.14 at 16:48, <tim@xen.org> wrote:
> Let me work through an example: in a 2-level EPT table (for clarity),
> the guest has an mmio-dm entry (or a log-dirty, or otherwise not
> present). The EPTEs look like this:
>
> L1: present
> L0: not present, type mmio_dm
>
> after an MTRR change it look like this:
>
> L1: present, bad EMT
> L0: not present, type mmio_dm
>
> Two CPUs fault on it at the same time. The first one to the p2m lock
> goes through this walk, clears the bad EMT on the L1 and
> ept_invalidate_emt()s the whole L0 page. That _doesn't_ set bad EMT
> on the L0 entry because the L0 entry is not present.
>
> L1: present
> L0: not present, type mmio_dm
>
> The other CPU then takes the p2m lock, sees a perfectly good walk and
> hasn't been told to expect a spurious fault --> crash.
>
> So if we clear any intermediate EMTs we need to warn the other CPUs
> to expect spurious faults.
Right - I didn't take non-present (but valid) leaf entries properly into
account here. Thanks for spotting this!
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 0/8] x86/EPT: various adjustments
2014-03-26 16:25 ` [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
2014-03-27 15:25 ` Tim Deegan
@ 2014-03-28 9:15 ` Tian, Kevin
1 sibling, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2014-03-28 9:15 UTC (permalink / raw)
To: Jan Beulich, xen-devel
Cc: Keir Fraser, suravee.suthikulpanit@amd.com, Tim Deegan,
Dong, Eddie, Nakajima, Jun, Sherry Hurwitz, Zhang, Yang Z
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 27, 2014 12:26 AM
>
> >>> On 26.03.14 at 16:23, <JBeulich@suse.com> wrote:
> > 1: p2m: make p2m_change_type() behavior match
> p2m_change_entry_type_global()'s
> > 2: EPT: simplification and cleanup
> > 3: EPT: also dump permissions and memory types
> > 4: EPT: relax treatment of APIC MFN
> > 5: vMTRR: pass domain to mtrr_*_msr_set()
> > 6: EPT: force re-evaluation of memory type as necessary
> > 7: EPT: split super pages upon mismatching memory types
> > 8: EPT: IOMMU snoop capability should not affect memory type selection
>
> Beyond this series comes the question of how to further leverage
> this: One aspect (re-doing the XSA-60 fix) was already mentioned
> in one of the patches. The other is that of avoiding the long running
> nature of ept_change_entry_type_global(). That function is being
> used for log dirty {en,dis}abling and as a safeguard in the PCI MSI-X
> handling code. Whether to do so depends on the answers to a
> couple of questions:
Joining this thread late; will spend some time understanding the whole
story first, before giving my comments.
Thanks
Kevin
>
> 1) There's a behavioral discrepancy between shared EPT/IOMMU page
> tables and separate ones. This was to some degree already discussed
> in the context of what became commit 077fc1c0 ("When enabling log
> dirty mode, it sets all guest's memory to readonly", see
> http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg00756.html)
> with the promise to work on a better solution
> (http://lists.xenproject.org/archives/html/xen-devel/2014-02/msg01230.html)
> not having seen any action so far.
>
> When the tables are shared, a switch to log-dirty mode will result in
> IOMMU faults on DMA writes, whereas with separate tables no faults
> would occur, but the log-dirty logic wouldn't know of the dirtied pages.
> In the discussion above it was being said that migration with passed
> through devices isn't supported anyway, but it seems to me that this
> shouldn't be a long term statement of ours.
>
> The question is whether it would be tolerable to make this more
> "undefined" by also using the EPT_MISCONFIG VM exit here: From
> the IOMMU perspective, some (initially all) of the pages would remain
> writable, while with the propagation logic in the VM exit handler some
> could then become readonly.
>
> 2) The other discrepancy is that of which pages get entered into the
> page tables in the first place: Obviously for the CPU side (EPT) all
> guest visible memory must be mapped. Otoh in the separate IOMMU
> page tables case only p2m_ram_rw pages would ever get mapped
> into those page tables, resulting in IOMMU faults should a device be
> programmed to do DMA to/from any other region.
>
> This particularly matters for the MSI-X fallback code, which gets into
> action only when running with a Dom0 not [properly] making use of
> PHYSDEVOP_prepare_msix. Considering that a first option might be
> to rip that fallback out altogether, and kill the affected guest (even
> if it's innocent) instead.
>
> The other alternative, keeping the code, could again entertain the
> EPT_MISCONFIG VM exit: Considering that in the separate tables
> case DMA to p2m_mmio_direct pages faults anyway, IOMMU faults
> due to the write protection that's being enforced here wouldn't
> matter much. The problem, however, is that the EPT_MISCONFIG
> approach specifically leaves the pages accessible by the IOMMU until
> a particular page gets its memory type (and for the purpose here
> then also access permissions) fixed up. Which imo is another
> argument for doing away with the shared page tables.
>
> 3) As validly pointed out by Tim already, all this only deals with the
> long-runningness on Intel systems, yet to fully address the issue
> we'd also need some solution on the AMD side. So the question
> regarding possible solutions there goes to AMD.
>
> Jan
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-03-27 15:32 ` Jan Beulich
2014-03-27 15:56 ` Tim Deegan
@ 2014-04-04 7:30 ` Tian, Kevin
2014-04-04 8:28 ` Jan Beulich
1 sibling, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2014-04-04 7:30 UTC (permalink / raw)
To: Jan Beulich, Tim Deegan
Cc: Zhang, Yang Z, xen-devel, Keir Fraser, Dong, Eddie, Nakajima, Jun
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 27, 2014 11:33 PM
>
> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
> >> This capability solely makes a statement on cache coherency guarantees
> >> by the IOMMU. It does specifically not imply any further guarantees
> >> implied by certain memory types (cachability, ordering).
> >
> > Can you give some examples of what this is protecting against?
> >
> > Cachability is irrelevant unless there's some other form of direct
> > access that's not covered by the IOMMU, and x86 ordering is pretty
> > strict.
>
> What the IOMMU gets to see already depends on cachability: Especially
> for write buffers (WC) I don't think cache coherence really matters.
>
> And my understanding of "snoop" also means that stuff held in caches
> (WB) may not become visible as needed when some RAM page is
> shared between CPU and some device (namely a GPU): The IOMMU
> can certainly snoop any accesses that hit the bus, but I don't think it
> can enforce write-back to happen for an area that's marked WB but
> was supposed to be UC. But maybe I'm wrong with this...
>
WC is essentially a UC type, with only difference on caching writes in
a store buffer. The out-of-order implication means that WC is only used
for I/O buffers w/o write order requirements, e.g. frame buffer. GPU
access to frame buffer is not snooped, or more specifically, it may use
a different physical address (through GPU page table), which is different
from the address written by CPU (typically a PCI bar). Using WB on pages
which are supposed to be WC doesn't work.
However, in virtualization WC may be used on memory pages, e.g. a
virtual frame buffer provided by Qemu. In that scenario WC must be
replaced with WB, otherwise there's a cache inconsistency problem when
Qemu accesses virtual frame buffer as WB while guest uses WC.
Thanks
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-04-04 7:30 ` Tian, Kevin
@ 2014-04-04 8:28 ` Jan Beulich
2014-04-04 9:21 ` Tian, Kevin
0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-04-04 8:28 UTC (permalink / raw)
To: Kevin Tian
Cc: Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 04.04.14 at 09:30, <kevin.tian@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 27, 2014 11:33 PM
>>
>> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
>> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
>> >> This capability solely makes a statement on cache coherency guarantees
>> >> by the IOMMU. It does specifically not imply any further guarantees
>> >> implied by certain memory types (cachability, ordering).
>> >
>> > Can you give some examples of what this is protecting against?
>> >
>> > Cachability is irrelevant unless there's some other form of direct
>> > access that's not covered by the IOMMU, and x86 ordering is pretty
>> > strict.
>>
>> What the IOMMU gets to see already depends on cachability: Especially
>> for write buffers (WC) I don't think cache coherence really matters.
>>
>> And my understanding of "snoop" also means that stuff held in caches
>> (WB) may not become visible as needed when some RAM page is
>> shared between CPU and some device (namely a GPU): The IOMMU
>> can certainly snoop any accesses that hit the bus, but I don't think it
>> can enforce write-back to happen for an area that's marked WB but
>> was supposed to be UC. But maybe I'm wrong with this...
>>
>
> WC is essentially a UC type, with only difference on caching writes in
> a store buffer. The out-of-order implication means that WC is only used
> for I/O buffers w/o write order requirements, e.g. frame buffer. GPU
> access to frame buffer is not snooped, or more specifically, it may use
> a different physical address (through GPU page table), which is different
> from the address written by CPU (typically a PCI bar). Using WB on pages
> which are supposed to be WC doesn't work.
Which isn't answering the original question (or at least I'm not able
to derive the answer from your statement above): Is this patch (or
parts thereof) correct?
> However, in virtualization WC may be used on memory pages, e.g. a
> virtual frame buffer provided by Qemu. In that scenario WC must be
> replaced with WB, otherwise there's a cache inconsistency problem when
> Qemu accesses virtual frame buffer as WB while guest uses WC.
But that's being taken care of already by qemu pinning the memory
attribute of the frame buffer to WB. I.e. we need only be concerned
about pass-through cards' frame buffers (and the like).
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-04-04 8:28 ` Jan Beulich
@ 2014-04-04 9:21 ` Tian, Kevin
2014-04-04 9:45 ` Jan Beulich
0 siblings, 1 reply; 36+ messages in thread
From: Tian, Kevin @ 2014-04-04 9:21 UTC (permalink / raw)
To: Jan Beulich
Cc: Keir Fraser, Tim Deegan, Dong, Eddie, Nakajima, Jun,
Zhang, Yang Z, xen-devel
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, April 04, 2014 4:28 PM
>
> >>> On 04.04.14 at 09:30, <kevin.tian@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, March 27, 2014 11:33 PM
> >>
> >> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
> >> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
> >> >> This capability solely makes a statement on cache coherency guarantees
> >> >> by the IOMMU. It does specifically not imply any further guarantees
> >> >> implied by certain memory types (cachability, ordering).
> >> >
> >> > Can you give some examples of what this is protecting against?
> >> >
> >> > Cachability is irrelevant unless there's some other form of direct
> >> > access that's not covered by the IOMMU, and x86 ordering is pretty
> >> > strict.
> >>
> >> What the IOMMU gets to see already depends on cachability: Especially
> >> for write buffers (WC) I don't think cache coherence really matters.
> >>
> >> And my understanding of "snoop" also means that stuff held in caches
> >> (WB) may not become visible as needed when some RAM page is
> >> shared between CPU and some device (namely a GPU): The IOMMU
> >> can certainly snoop any accesses that hit the bus, but I don't think it
> >> can enforce write-back to happen for an area that's marked WB but
> >> was supposed to be UC. But maybe I'm wrong with this...
> >>
> >
> > WC is essentially a UC type, with only difference on caching writes in
> > a store buffer. The out-of-order implication means that WC is only used
> > for I/O buffers w/o write order requirements, e.g. frame buffer. GPU
> > access to frame buffer is not snooped, or more specifically, it may use
> > a different physical address (through GPU page table), which is different
> > from the address written by CPU (typically a PCI bar). Using WB on pages
> > which are supposed to be WC doesn't work.
>
> Which isn't answering the original question (or at least I'm not able
> to derive the answer from your statement above): Is this patch (or
> parts thereof) correct?
OK with the change in epte_get_entry_emt, but I'm still thinking whether
there's a correctness issue with the cr0.cd handling in the original logic.
>
> > However, in virtualization WC may be used on memory pages, e.g. a
> > virtual frame buffer provided by Qemu. In that scenario WC must be
> > replaced with WB, otherwise there's a cache inconsistency problem when
> > Qemu accesses virtual frame buffer as WB while guest uses WC.
>
> But that's being taken care of already by qemu pinning the memory
> attribute of the frame buffer to WB. I.e. we need only be concerned
> about pass-through cards' frame buffers (and the like).
>
yes.
Thanks
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-04-04 9:21 ` Tian, Kevin
@ 2014-04-04 9:45 ` Jan Beulich
2014-04-07 0:58 ` Tian, Kevin
0 siblings, 1 reply; 36+ messages in thread
From: Jan Beulich @ 2014-04-04 9:45 UTC (permalink / raw)
To: Kevin Tian
Cc: Keir Fraser, Tim Deegan, Eddie Dong, Jun Nakajima, Yang Z Zhang,
xen-devel
>>> On 04.04.14 at 11:21, <kevin.tian@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, April 04, 2014 4:28 PM
>>
>> >>> On 04.04.14 at 09:30, <kevin.tian@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Thursday, March 27, 2014 11:33 PM
>> >>
>> >> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
>> >> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
>> >> >> This capability solely makes a statement on cache coherency guarantees
>> >> >> by the IOMMU. It does specifically not imply any further guarantees
>> >> >> implied by certain memory types (cachability, ordering).
>> >> >
>> >> > Can you give some examples of what this is protecting against?
>> >> >
>> >> > Cachability is irrelevant unless there's some other form of direct
>> >> > access that's not covered by the IOMMU, and x86 ordering is pretty
>> >> > strict.
>> >>
>> >> What the IOMMU gets to see already depends on cachability: Especially
>> >> for write buffers (WC) I don't think cache coherence really matters.
>> >>
>> >> And my understanding of "snoop" also means that stuff held in caches
>> >> (WB) may not become visible as needed when some RAM page is
>> >> shared between CPU and some device (namely a GPU): The IOMMU
>> >> can certainly snoop any accesses that hit the bus, but I don't think it
>> >> can enforce write-back to happen for an area that's marked WB but
>> >> was supposed to be UC. But maybe I'm wrong with this...
>> >>
>> >
>> > WC is essentially a UC type, with only difference on caching writes in
>> > a store buffer. The out-of-order implication means that WC is only used
>> > for I/O buffers w/o write order requirements, e.g. frame buffer. GPU
>> > access to frame buffer is not snooped, or more specifically, it may use
>> > a different physical address (through GPU page table), which is different
>> > from the address written by CPU (typically a PCI bar). Using WB on pages
>> > which are supposed to be WC doesn't work.
>>
>> Which isn't answering the original question (or at least I'm not able
>> to derive the answer from your statement above): Is this patch (or
>> parts thereof) correct?
>
> OK with the change in epte_get_entry_emt, but I'm still thinking whether
> there's a correctness issue with the cr0.cd handling in the original logic.
Okay, let us know of the outcome.
Jan
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection
2014-04-04 9:45 ` Jan Beulich
@ 2014-04-07 0:58 ` Tian, Kevin
0 siblings, 0 replies; 36+ messages in thread
From: Tian, Kevin @ 2014-04-07 0:58 UTC (permalink / raw)
To: Jan Beulich
Cc: Keir Fraser, Tim Deegan, Dong, Eddie, Nakajima, Jun,
Zhang, Yang Z, xen-devel
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, April 04, 2014 5:46 PM
>
> >>> On 04.04.14 at 11:21, <kevin.tian@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Friday, April 04, 2014 4:28 PM
> >>
> >> >>> On 04.04.14 at 09:30, <kevin.tian@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Thursday, March 27, 2014 11:33 PM
> >> >>
> >> >> >>> On 27.03.14 at 16:12, <tim@xen.org> wrote:
> >> >> > At 15:37 +0000 on 26 Mar (1395844658), Jan Beulich wrote:
> >> >> >> This capability solely makes a statement on cache coherency
> guarantees
> >> >> >> by the IOMMU. It does specifically not imply any further guarantees
> >> >> >> implied by certain memory types (cachability, ordering).
> >> >> >
> >> >> > Can you give some examples of what this is protecting against?
> >> >> >
> >> >> > Cachability is irrelevant unless there's some other form of direct
> >> >> > access that's not covered by the IOMMU, and x86 ordering is pretty
> >> >> > strict.
> >> >>
> >> >> What the IOMMU gets to see already depends on cachability: Especially
> >> >> for write buffers (WC) I don't think cache coherence really matters.
> >> >>
> >> >> And my understanding of "snoop" also means that stuff held in caches
> >> >> (WB) may not become visible as needed when some RAM page is
> >> >> shared between CPU and some device (namely a GPU): The IOMMU
> >> >> can certainly snoop any accesses that hit the bus, but I don't think it
> >> >> can enforce write-back to happen for an area that's marked WB but
> >> >> was supposed to be UC. But maybe I'm wrong with this...
> >> >>
> >> >
> >> > WC is essentially a UC type, with only difference on caching writes in
> >> > a store buffer. The out-of-order implication means that WC is only used
> >> > for I/O buffers w/o write order requirements, e.g. frame buffer. GPU
> >> > access to frame buffer is not snooped, or more specifically, it may use
> >> > a different physical address (through GPU page table), which is different
> >> > from the address written by CPU (typically a PCI bar). Using WB on pages
> >> > which are supposed to be WC doesn't work.
> >>
> >> Which isn't answering the original question (or at least I'm not able
> >> to derive the answer from your statement above): Is this patch (or
> >> parts thereof) correct?
> >
> > OK with the change in epte_get_entry_emt, but I'm still thinking whether
> > there's a correctness issue with the cr0.cd handling in the original logic.
>
> Okay, let us know of the outcome.
>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Thanks
Kevin
^ permalink raw reply [flat|nested] 36+ messages in thread
Thread overview: 36+ messages
2014-03-26 15:23 [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
2014-03-26 15:31 ` [PATCH 1/8] x86/p2m: make p2m_change_type() behavior match p2m_change_entry_type_global()'s Jan Beulich
2014-03-27 12:52 ` Tim Deegan
2014-03-26 15:33 ` [PATCH 2/8] x86/EPT: simplification and cleanup Jan Beulich
2014-03-27 12:52 ` Tim Deegan
2014-03-26 15:34 ` [PATCH 3/8] x86/EPT: also dump permissions and memory types Jan Beulich
2014-03-27 12:54 ` Tim Deegan
2014-03-27 15:16 ` Jan Beulich
2014-03-27 13:06 ` Andrew Cooper
2014-03-27 15:18 ` Jan Beulich
2014-03-26 15:34 ` [PATCH 4/8] x86/EPT: relax treatment of APIC MFN Jan Beulich
2014-03-27 12:55 ` Tim Deegan
2014-03-26 15:35 ` [PATCH 5/8] x86/vMTRR: pass domain to mtrr_*_msr_set() Jan Beulich
2014-03-27 12:55 ` Tim Deegan
2014-03-26 15:36 ` [PATCH 6/8] x86/EPT: force re‑evaluation of memory type as necessary Jan Beulich
2014-03-27 13:08 ` [PATCH 6/8] x86/EPT: force re-evaluation " Tim Deegan
2014-03-27 15:23 ` Jan Beulich
2014-03-27 15:48 ` Tim Deegan
2014-03-27 16:16 ` Jan Beulich
2014-03-26 15:36 ` [PATCH 7/8] x86/EPT: split super pages upon mismatching memory types Jan Beulich
2014-03-27 15:04 ` Tim Deegan
2014-03-27 15:25 ` Jan Beulich
2014-03-26 15:37 ` [PATCH 8/8] x86/EPT: IOMMU snoop capability should not affect memory type selection Jan Beulich
2014-03-27 15:12 ` Tim Deegan
2014-03-27 15:32 ` Jan Beulich
2014-03-27 15:56 ` Tim Deegan
2014-04-04 7:30 ` Tian, Kevin
2014-04-04 8:28 ` Jan Beulich
2014-04-04 9:21 ` Tian, Kevin
2014-04-04 9:45 ` Jan Beulich
2014-04-07 0:58 ` Tian, Kevin
2014-03-26 16:25 ` [PATCH 0/8] x86/EPT: various adjustments Jan Beulich
2014-03-27 15:25 ` Tim Deegan
2014-03-27 15:43 ` Jan Beulich
2014-03-27 15:51 ` Tim Deegan
2014-03-28 9:15 ` Tian, Kevin