[PATCH 00/22] x86: adventures in Address Space Isolation

* [PATCH 00/22] x86: adventures in Address Space Isolation
@ 2024-07-26 15:21 Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic() Roger Pau Monne
                   ` (21 more replies)
  0 siblings, 22 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel
  Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper,
	Tim Deegan, Julien Grall, Stefano Stabellini, Daniel P. Smith,
	Marek Marczykowski-Górecki

Hello,

The aim of this series is to introduce the functionality required to
create linear mappings visible to a single pCPU.

Doing so requires having a per-CPU root page-table (L4), and hence
requires changes to the HVM monitor tables and shadowing the
guest-selected L4 on PV guests.  As a follow-up (and partly to check
that the per-CPU mappings work) the CPU stacks are switched to use
per-CPU mappings, so that remote stack contents are not mapped by
default in all page-tables (note: for this to be true the directmap
entries for the stack pages would also need to be removed).
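
As a rough illustration of why a per-CPU L4 is needed, here is a toy
model, not code from the series.  Every name in it (NR_CPUS_MODEL,
PERCPU_SLOT, shared_l4, percpu_l4 and the helpers) is hypothetical; the
only hard fact is that an x86-64 root table has 512 entries.  Each CPU
gets its own copy of a shared template, and one slot is reserved for
mappings that only that CPU can see.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_MODEL 4    /* hypothetical CPU count, for illustration */
#define L4_ENTRIES    512  /* an x86-64 root page-table has 512 slots */
#define PERCPU_SLOT   511  /* hypothetical slot reserved for per-CPU use */

typedef uint64_t pte_t;

/* Shared template, playing the role of a common root table. */
static pte_t shared_l4[L4_ENTRIES];
/* One private root table per CPU. */
static pte_t percpu_l4[NR_CPUS_MODEL][L4_ENTRIES];

/* Clone the shared template into a CPU's private root table. */
static void init_percpu_l4(unsigned int cpu)
{
    memcpy(percpu_l4[cpu], shared_l4, sizeof(shared_l4));
    percpu_l4[cpu][PERCPU_SLOT] = 0; /* the private slot starts empty */
}

/* Install an entry that is visible only through this CPU's root table. */
static void map_percpu(unsigned int cpu, pte_t pte)
{
    percpu_l4[cpu][PERCPU_SLOT] = pte;
}

/* Does 'cpu' see 'pte' in 'slot' of its own root table? */
static int cpu_sees(unsigned int cpu, unsigned int slot, pte_t pte)
{
    return percpu_l4[cpu][slot] == pte;
}
```

With a single shared L4 the private slot could not differ between CPUs,
which is why the monitor tables (HVM) and the guest L4 (PV) need
per-CPU treatment.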

Patches before patch 12 are either small fixes or preparatory
non-functional changes in order to accommodate the rest of the series.

Patch 12 introduces a new 'asi' spec-ctrl option, that's used to enable
Address Space Isolation.

Patches 13-15 and 20 introduce logic to use per-CPU L4 on HVM and PV
guests.

Patches 16-18 add support for creating per-CPU mappings to the existing
page-table management functions, map_pages_to_xen() and related
functions.  Patch 19 introduces helpers for creating per-CPU mappings
using a fixmap interface.

Finally patches 21-22 add support for mapping the CPU stack in a per-CPU
fixmap region, and zeroing the stacks on guest context switch.

I've been testing the patches quite a lot using XenRT, and so far they
seem to not cause regressions (either with spec-ctrl=asi or without it),
but XenRT no longer tests shadow paging or 32bit PV guests.

This proposal is also missing an interface similar to map_domain_page()
in order to create per-CPU mappings that don't use a fixmap entry.  I
thought however that the current content was fair enough for a first
posting, and that I would like to get feedback on this before building
further functionality on top of it.

Note that none of the logic introduced in the series removes entries
from the directmap, so even when creating the per-CPU mappings the
underlying physical addresses remain fully accessible through their
linear directmap entries.

I also haven't done any benchmarking.  The series doesn't seem to
cripple performance to the point that XenRT jobs time out before
finishing; that's the only objective reference I can provide at the
moment.

It's likely to still have some rough edges; handle with care.

Thanks, Roger.

Roger Pau Monne (22):
  x86/mm: drop l{1,2,3,4}e_write_atomic()
  x86/mm: rename l{1,2,3,4}e_read_atomic()
  x86/dom0: only disable SMAP for the PV dom0 build
  x86/mm: ensure L4 idle_pg_table is not modified past boot
  x86/mm: make virt_to_xen_l1e() static
  x86/mm: introduce a local domain variable to write_ptbase()
  x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain()
  x86/mm: avoid passing a domain parameter to L4 init function
  x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
  x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
  x86/mm: split setup of the per-domain slot on context switch
  x86/spec-ctrl: introduce Address Space Isolation command line option
  x86/hvm: use a per-pCPU monitor table in HAP mode
  x86/hvm: use a per-pCPU monitor table in shadow mode
  x86/idle: allow using a per-pCPU L4
  x86/mm: introduce a per-CPU L3 table for the per-domain slot
  x86/mm: introduce support to populate a per-CPU page-table region
  x86/mm: allow modifying per-CPU entries of remote page-tables
  x86/mm: introduce a per-CPU fixmap area
  x86/pv: allow using a unique per-pCPU root page table (L4)
  x86/mm: switch to a per-CPU mapped stack when using ASI
  x86/mm: zero stack on stack switch or reset

 docs/misc/xen-command-line.pandoc      |  15 +-
 xen/arch/x86/boot/x86_64.S             |  11 +
 xen/arch/x86/domain.c                  |  75 +++-
 xen/arch/x86/domain_page.c             |   2 +-
 xen/arch/x86/flushtlb.c                |  18 +-
 xen/arch/x86/hvm/hvm.c                 |  67 ++++
 xen/arch/x86/hvm/svm/svm.c             |   5 +
 xen/arch/x86/hvm/vmx/vmcs.c            |   1 +
 xen/arch/x86/hvm/vmx/vmx.c             |   4 +
 xen/arch/x86/include/asm/config.h      |   4 +
 xen/arch/x86/include/asm/current.h     |  38 +-
 xen/arch/x86/include/asm/domain.h      |   7 +
 xen/arch/x86/include/asm/fixmap.h      |  50 +++
 xen/arch/x86/include/asm/flushtlb.h    |   3 +-
 xen/arch/x86/include/asm/hap.h         |   1 -
 xen/arch/x86/include/asm/hvm/hvm.h     |   8 +
 xen/arch/x86/include/asm/hvm/vcpu.h    |   6 +-
 xen/arch/x86/include/asm/mm.h          |  34 +-
 xen/arch/x86/include/asm/page.h        |  37 +-
 xen/arch/x86/include/asm/paging.h      |  18 +
 xen/arch/x86/include/asm/pv/mm.h       |   8 +
 xen/arch/x86/include/asm/setup.h       |   1 +
 xen/arch/x86/include/asm/smp.h         |  12 +
 xen/arch/x86/include/asm/spec_ctrl.h   |   2 +
 xen/arch/x86/include/asm/x86_64/page.h |   4 -
 xen/arch/x86/mm.c                      | 484 ++++++++++++++++++++-----
 xen/arch/x86/mm/hap/hap.c              |  74 ----
 xen/arch/x86/mm/paging.c               |   4 +-
 xen/arch/x86/mm/shadow/common.c        |  42 +--
 xen/arch/x86/mm/shadow/hvm.c           |  64 ++--
 xen/arch/x86/mm/shadow/multi.c         |  73 ++--
 xen/arch/x86/mm/shadow/private.h       |   4 +-
 xen/arch/x86/pv/dom0_build.c           |  16 +-
 xen/arch/x86/pv/domain.c               |  28 +-
 xen/arch/x86/pv/mm.c                   |  52 +++
 xen/arch/x86/setup.c                   |  55 +--
 xen/arch/x86/smp.c                     |  29 ++
 xen/arch/x86/smpboot.c                 |  78 +++-
 xen/arch/x86/spec_ctrl.c               |  78 +++-
 xen/arch/x86/traps.c                   |  14 +-
 xen/common/efi/runtime.c               |  12 +
 xen/common/smp.c                       |  10 +
 xen/include/xen/smp.h                  |   5 +
 43 files changed, 1198 insertions(+), 355 deletions(-)

-- 
2.45.2


* [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic()
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-29  7:52   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic() Roger Pau Monne
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

The l{1,2,3,4}e_write_atomic() and non _atomic suffixed helpers share the same
implementation, so it seems pointless and possibly confusing to have both.

Remove the l{1,2,3,4}e_write_atomic() helpers and switch their users to
l{1,2,3,4}e_write(), as that's also atomic.  While there also remove
pte_write{,_atomic}() and just use write_atomic() in the wrappers.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
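
A standalone sketch of the equivalence the commit message relies on,
for L1 entries only.  It is not Xen code: write_atomic64() is a
stand-in for Xen's write_atomic(), built on the GCC/Clang
__atomic_store_n() builtin, and the _old suffixed macros are
hypothetical names for the pre-patch helpers.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t intpte_t;
typedef struct { intpte_t l1; } l1_pgentry_t;

/* Stand-in for write_atomic(): one aligned 64-bit store. */
static inline void write_atomic64(volatile intpte_t *p, intpte_t val)
{
    __atomic_store_n(p, val, __ATOMIC_RELAXED);
}

#define l1e_get_intpte(x) ((x).l1)

/* Pre-patch shape: two names resolving to the same store. */
#define pte_write_atomic(ptep, pte) write_atomic64(ptep, pte)
#define pte_write(ptep, pte)        write_atomic64(ptep, pte)
#define l1e_write_atomic_old(l1ep, l1e) \
    pte_write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))
#define l1e_write_old(l1ep, l1e) \
    pte_write(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))

/* Post-patch shape: a single helper calling the atomic store directly. */
#define l1e_write(l1ep, l1e) \
    write_atomic64(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))
```

All three macros expand to the same store, so replacing one with
another at a call site does not change the generated access.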
 xen/arch/x86/include/asm/page.h        | 21 +++-----------
 xen/arch/x86/include/asm/x86_64/page.h |  2 --
 xen/arch/x86/mm.c                      | 39 +++++++++++---------------
 3 files changed, 20 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index 350d1fb1100f..3d20ee507a33 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -26,27 +26,14 @@
     l4e_from_intpte(pte_read_atomic(&l4e_get_intpte(*(l4ep))))
 
 /* Write a pte atomically to memory. */
-#define l1e_write_atomic(l1ep, l1e) \
-    pte_write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))
-#define l2e_write_atomic(l2ep, l2e) \
-    pte_write_atomic(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e))
-#define l3e_write_atomic(l3ep, l3e) \
-    pte_write_atomic(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e))
-#define l4e_write_atomic(l4ep, l4e) \
-    pte_write_atomic(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e))
-
-/*
- * Write a pte safely but non-atomically to memory.
- * The PTE may become temporarily not-present during the update.
- */
 #define l1e_write(l1ep, l1e) \
-    pte_write(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))
+    write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))
 #define l2e_write(l2ep, l2e) \
-    pte_write(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e))
+    write_atomic(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e))
 #define l3e_write(l3ep, l3e) \
-    pte_write(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e))
+    write_atomic(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e))
 #define l4e_write(l4ep, l4e) \
-    pte_write(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e))
+    write_atomic(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e))
 
 /* Get direct integer representation of a pte's contents (intpte_t). */
 #define l1e_get_intpte(x)          ((x).l1)
diff --git a/xen/arch/x86/include/asm/x86_64/page.h b/xen/arch/x86/include/asm/x86_64/page.h
index 19ca64d79223..03fcce61c052 100644
--- a/xen/arch/x86/include/asm/x86_64/page.h
+++ b/xen/arch/x86/include/asm/x86_64/page.h
@@ -70,8 +70,6 @@ typedef l4_pgentry_t root_pgentry_t;
 #endif /* !__ASSEMBLY__ */
 
 #define pte_read_atomic(ptep)       read_atomic(ptep)
-#define pte_write_atomic(ptep, pte) write_atomic(ptep, pte)
-#define pte_write(ptep, pte)        write_atomic(ptep, pte)
 
 /* Given a virtual address, get an entry offset into a linear page table. */
 #define l1_linear_offset(_a) (((_a) & VADDR_MASK) >> L1_PAGETABLE_SHIFT)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 95795567f2a5..fab2de5fae27 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5253,7 +5253,7 @@ int map_pages_to_xen(
              !(flags & (_PAGE_PAT | MAP_SMALL_PAGES)) )
         {
             /* 1GB-page mapping. */
-            l3e_write_atomic(pl3e, l3e_from_mfn(mfn, l1f_to_lNf(flags)));
+            l3e_write(pl3e, l3e_from_mfn(mfn, l1f_to_lNf(flags)));
 
             if ( (l3e_get_flags(ol3e) & _PAGE_PRESENT) )
             {
@@ -5353,8 +5353,7 @@ int map_pages_to_xen(
             if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
                  (l3e_get_flags(*pl3e) & _PAGE_PSE) )
             {
-                l3e_write_atomic(pl3e,
-                                 l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR));
+                l3e_write(pl3e, l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR));
                 l2mfn = INVALID_MFN;
             }
             if ( locking )
@@ -5375,7 +5374,7 @@ int map_pages_to_xen(
         {
             /* Super-page mapping. */
             ol2e = *pl2e;
-            l2e_write_atomic(pl2e, l2e_from_mfn(mfn, l1f_to_lNf(flags)));
+            l2e_write(pl2e, l2e_from_mfn(mfn, l1f_to_lNf(flags)));
 
             if ( (l2e_get_flags(ol2e) & _PAGE_PRESENT) )
             {
@@ -5457,8 +5456,7 @@ int map_pages_to_xen(
                 if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) &&
                      (l2e_get_flags(*pl2e) & _PAGE_PSE) )
                 {
-                    l2e_write_atomic(pl2e, l2e_from_mfn(l1mfn,
-                                                        __PAGE_HYPERVISOR));
+                    l2e_write(pl2e, l2e_from_mfn(l1mfn, __PAGE_HYPERVISOR));
                     l1mfn = INVALID_MFN;
                 }
                 if ( locking )
@@ -5471,7 +5469,7 @@ int map_pages_to_xen(
             if ( !pl1e )
                 pl1e = map_l1t_from_l2e(*pl2e) + l1_table_offset(virt);
             ol1e  = *pl1e;
-            l1e_write_atomic(pl1e, l1e_from_mfn(mfn, flags));
+            l1e_write(pl1e, l1e_from_mfn(mfn, flags));
             UNMAP_DOMAIN_PAGE(pl1e);
             if ( (l1e_get_flags(ol1e) & _PAGE_PRESENT) )
             {
@@ -5524,8 +5522,7 @@ int map_pages_to_xen(
                 UNMAP_DOMAIN_PAGE(l1t);
                 if ( i == L1_PAGETABLE_ENTRIES )
                 {
-                    l2e_write_atomic(pl2e, l2e_from_pfn(base_mfn,
-                                                        l1f_to_lNf(flags)));
+                    l2e_write(pl2e, l2e_from_pfn(base_mfn, l1f_to_lNf(flags)));
                     if ( locking )
                         spin_unlock(&map_pgdir_lock);
                     flush_area(virt - PAGE_SIZE,
@@ -5574,8 +5571,7 @@ int map_pages_to_xen(
             UNMAP_DOMAIN_PAGE(l2t);
             if ( i == L2_PAGETABLE_ENTRIES )
             {
-                l3e_write_atomic(pl3e, l3e_from_pfn(base_mfn,
-                                                    l1f_to_lNf(flags)));
+                l3e_write(pl3e, l3e_from_pfn(base_mfn, l1f_to_lNf(flags)));
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
                 flush_area(virt - PAGE_SIZE,
@@ -5674,7 +5670,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                     : l3e_from_pfn(l3e_get_pfn(*pl3e),
                                    (l3e_get_flags(*pl3e) & ~FLAGS_MASK) | nf);
 
-                l3e_write_atomic(pl3e, nl3e);
+                l3e_write(pl3e, nl3e);
                 v += 1UL << L3_PAGETABLE_SHIFT;
                 continue;
             }
@@ -5696,8 +5692,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) &&
                  (l3e_get_flags(*pl3e) & _PAGE_PSE) )
             {
-                l3e_write_atomic(pl3e,
-                                 l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR));
+                l3e_write(pl3e, l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR));
                 l2mfn = INVALID_MFN;
             }
             if ( locking )
@@ -5732,7 +5727,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                     : l2e_from_pfn(l2e_get_pfn(*pl2e),
                                    (l2e_get_flags(*pl2e) & ~FLAGS_MASK) | nf);
 
-                l2e_write_atomic(pl2e, nl2e);
+                l2e_write(pl2e, nl2e);
                 v += 1UL << L2_PAGETABLE_SHIFT;
             }
             else
@@ -5755,8 +5750,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) &&
                      (l2e_get_flags(*pl2e) & _PAGE_PSE) )
                 {
-                    l2e_write_atomic(pl2e, l2e_from_mfn(l1mfn,
-                                                        __PAGE_HYPERVISOR));
+                    l2e_write(pl2e, l2e_from_mfn(l1mfn, __PAGE_HYPERVISOR));
                     l1mfn = INVALID_MFN;
                 }
                 if ( locking )
@@ -5785,7 +5779,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 : l1e_from_pfn(l1e_get_pfn(*pl1e),
                                (l1e_get_flags(*pl1e) & ~FLAGS_MASK) | nf);
 
-            l1e_write_atomic(pl1e, nl1e);
+            l1e_write(pl1e, nl1e);
             UNMAP_DOMAIN_PAGE(pl1e);
             v += PAGE_SIZE;
 
@@ -5824,7 +5818,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             if ( i == L1_PAGETABLE_ENTRIES )
             {
                 /* Empty: zap the L2E and free the L1 page. */
-                l2e_write_atomic(pl2e, l2e_empty());
+                l2e_write(pl2e, l2e_empty());
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
                 flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */
@@ -5868,7 +5862,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             if ( i == L2_PAGETABLE_ENTRIES )
             {
                 /* Empty: zap the L3E and free the L2 page. */
-                l3e_write_atomic(pl3e, l3e_empty());
+                l3e_write(pl3e, l3e_empty());
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
                 flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */
@@ -5940,7 +5934,7 @@ void init_or_livepatch modify_xen_mappings_lite(
         {
             ASSERT(IS_ALIGNED(v, 1UL << L2_PAGETABLE_SHIFT));
 
-            l2e_write_atomic(pl2e, l2e_from_intpte((l2e.l2 & ~fm) | flags));
+            l2e_write(pl2e, l2e_from_intpte((l2e.l2 & ~fm) | flags));
 
             v += 1UL << L2_PAGETABLE_SHIFT;
             continue;
@@ -5958,8 +5952,7 @@ void init_or_livepatch modify_xen_mappings_lite(
 
                 ASSERT(l1f & _PAGE_PRESENT);
 
-                l1e_write_atomic(pl1e,
-                                 l1e_from_intpte((l1e.l1 & ~fm) | flags));
+                l1e_write(pl1e, l1e_from_intpte((l1e.l1 & ~fm) | flags));
 
                 v += 1UL << L1_PAGETABLE_SHIFT;
 
-- 
2.45.2




* [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic()
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic() Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-29  7:53   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

There's no l{1,2,3,4}e_read() implementation, so drop the _atomic suffix from
the read helpers.  This allows unifying the naming with the write helpers,
which are also atomic but don't have the suffix already: l{1,2,3,4}e_write().

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/page.h        | 16 ++++++++--------
 xen/arch/x86/include/asm/x86_64/page.h |  2 --
 xen/arch/x86/mm.c                      | 12 ++++++------
 xen/arch/x86/traps.c                   |  8 ++++----
 4 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h
index 3d20ee507a33..e48571de9332 100644
--- a/xen/arch/x86/include/asm/page.h
+++ b/xen/arch/x86/include/asm/page.h
@@ -16,14 +16,14 @@
 #include <asm/x86_64/page.h>
 
 /* Read a pte atomically from memory. */
-#define l1e_read_atomic(l1ep) \
-    l1e_from_intpte(pte_read_atomic(&l1e_get_intpte(*(l1ep))))
-#define l2e_read_atomic(l2ep) \
-    l2e_from_intpte(pte_read_atomic(&l2e_get_intpte(*(l2ep))))
-#define l3e_read_atomic(l3ep) \
-    l3e_from_intpte(pte_read_atomic(&l3e_get_intpte(*(l3ep))))
-#define l4e_read_atomic(l4ep) \
-    l4e_from_intpte(pte_read_atomic(&l4e_get_intpte(*(l4ep))))
+#define l1e_read(l1ep) \
+    l1e_from_intpte(read_atomic(&l1e_get_intpte(*(l1ep))))
+#define l2e_read(l2ep) \
+    l2e_from_intpte(read_atomic(&l2e_get_intpte(*(l2ep))))
+#define l3e_read(l3ep) \
+    l3e_from_intpte(read_atomic(&l3e_get_intpte(*(l3ep))))
+#define l4e_read(l4ep) \
+    l4e_from_intpte(read_atomic(&l4e_get_intpte(*(l4ep))))
 
 /* Write a pte atomically to memory. */
 #define l1e_write(l1ep, l1e) \
diff --git a/xen/arch/x86/include/asm/x86_64/page.h b/xen/arch/x86/include/asm/x86_64/page.h
index 03fcce61c052..465a70731214 100644
--- a/xen/arch/x86/include/asm/x86_64/page.h
+++ b/xen/arch/x86/include/asm/x86_64/page.h
@@ -69,8 +69,6 @@ typedef l4_pgentry_t root_pgentry_t;
 
 #endif /* !__ASSEMBLY__ */
 
-#define pte_read_atomic(ptep)       read_atomic(ptep)
-
 /* Given a virtual address, get an entry offset into a linear page table. */
 #define l1_linear_offset(_a) (((_a) & VADDR_MASK) >> L1_PAGETABLE_SHIFT)
 #define l2_linear_offset(_a) (((_a) & VADDR_MASK) >> L2_PAGETABLE_SHIFT)
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fab2de5fae27..6ffacab341ad 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2147,7 +2147,7 @@ static int mod_l1_entry(l1_pgentry_t *pl1e, l1_pgentry_t nl1e,
                         struct vcpu *pt_vcpu, struct domain *pg_dom)
 {
     bool preserve_ad = (cmd == MMU_PT_UPDATE_PRESERVE_AD);
-    l1_pgentry_t ol1e = l1e_read_atomic(pl1e);
+    l1_pgentry_t ol1e = l1e_read(pl1e);
     struct domain *pt_dom = pt_vcpu->domain;
     int rc = 0;
 
@@ -2270,7 +2270,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EPERM;
     }
 
-    ol2e = l2e_read_atomic(pl2e);
+    ol2e = l2e_read(pl2e);
 
     if ( l2e_get_flags(nl2e) & _PAGE_PRESENT )
     {
@@ -2332,7 +2332,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
     if ( pgentry_ptr_to_slot(pl3e) >= 3 && is_pv_32bit_domain(d) )
         return -EINVAL;
 
-    ol3e = l3e_read_atomic(pl3e);
+    ol3e = l3e_read(pl3e);
 
     if ( l3e_get_flags(nl3e) & _PAGE_PRESENT )
     {
@@ -2394,7 +2394,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EINVAL;
     }
 
-    ol4e = l4e_read_atomic(pl4e);
+    ol4e = l4e_read(pl4e);
 
     if ( l4e_get_flags(nl4e) & _PAGE_PRESENT )
     {
@@ -5925,7 +5925,7 @@ void init_or_livepatch modify_xen_mappings_lite(
     while ( v < e )
     {
         l2_pgentry_t *pl2e = &l2_xenmap[l2_table_offset(v)];
-        l2_pgentry_t l2e = l2e_read_atomic(pl2e);
+        l2_pgentry_t l2e = l2e_read(pl2e);
         unsigned int l2f = l2e_get_flags(l2e);
 
         ASSERT(l2f & _PAGE_PRESENT);
@@ -5947,7 +5947,7 @@ void init_or_livepatch modify_xen_mappings_lite(
             while ( v < e )
             {
                 l1_pgentry_t *pl1e = &pl1t[l1_table_offset(v)];
-                l1_pgentry_t l1e = l1e_read_atomic(pl1e);
+                l1_pgentry_t l1e = l1e_read(pl1e);
                 unsigned int l1f = l1e_get_flags(l1e);
 
                 ASSERT(l1f & _PAGE_PRESENT);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index ee91fc56b125..b4fb95917023 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1450,7 +1450,7 @@ static enum pf_type __page_fault_type(unsigned long addr,
     mfn = cr3 >> PAGE_SHIFT;
 
     l4t = map_domain_page(_mfn(mfn));
-    l4e = l4e_read_atomic(&l4t[l4_table_offset(addr)]);
+    l4e = l4e_read(&l4t[l4_table_offset(addr)]);
     mfn = l4e_get_pfn(l4e);
     unmap_domain_page(l4t);
     if ( ((l4e_get_flags(l4e) & required_flags) != required_flags) ||
@@ -1459,7 +1459,7 @@ static enum pf_type __page_fault_type(unsigned long addr,
     page_user &= l4e_get_flags(l4e);
 
     l3t  = map_domain_page(_mfn(mfn));
-    l3e = l3e_read_atomic(&l3t[l3_table_offset(addr)]);
+    l3e = l3e_read(&l3t[l3_table_offset(addr)]);
     mfn = l3e_get_pfn(l3e);
     unmap_domain_page(l3t);
     if ( ((l3e_get_flags(l3e) & required_flags) != required_flags) ||
@@ -1470,7 +1470,7 @@ static enum pf_type __page_fault_type(unsigned long addr,
         goto leaf;
 
     l2t = map_domain_page(_mfn(mfn));
-    l2e = l2e_read_atomic(&l2t[l2_table_offset(addr)]);
+    l2e = l2e_read(&l2t[l2_table_offset(addr)]);
     mfn = l2e_get_pfn(l2e);
     unmap_domain_page(l2t);
     if ( ((l2e_get_flags(l2e) & required_flags) != required_flags) ||
@@ -1481,7 +1481,7 @@ static enum pf_type __page_fault_type(unsigned long addr,
         goto leaf;
 
     l1t = map_domain_page(_mfn(mfn));
-    l1e = l1e_read_atomic(&l1t[l1_table_offset(addr)]);
+    l1e = l1e_read(&l1t[l1_table_offset(addr)]);
     mfn = l1e_get_pfn(l1e);
     unmap_domain_page(l1t);
     if ( ((l1e_get_flags(l1e) & required_flags) != required_flags) ||
-- 
2.45.2




* [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic() Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic() Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-29  8:17   ` Roger Pau Monné
                     ` (2 more replies)
  2024-07-26 15:21 ` [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot Roger Pau Monne
                   ` (18 subsequent siblings)
  21 siblings, 3 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

The PVH dom0 builder doesn't switch page tables and has no need to run with
SMAP disabled.

Put the SMAP disabling close to the code region where it's necessary, as it
then becomes obvious why switch_cr3_cr4() is required instead of
write_ptbase().

Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
guest context, and hence updating the value of cr4_pv32_mask is not relevant.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
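
A userspace sketch of the cache/clear/restore pattern used in the hunks
below.  CR3 and CR4 are simulated with plain variables, so read_cr4()
and switch_cr3_cr4() here are stand-ins with simplified signatures, and
run_on_dom0_tables() and probe_smap() are hypothetical helpers.  The
only architectural fact assumed is that CR4.SMAP is bit 21.

```c
#include <assert.h>

#define X86_CR4_SMAP (1UL << 21) /* CR4.SMAP is bit 21 */

/* Simulated control registers. */
static unsigned long fake_cr3, fake_cr4;

static unsigned long read_cr4(void)
{
    return fake_cr4;
}

static void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
{
    fake_cr3 = cr3;
    fake_cr4 = cr4;
}

/* Records whether SMAP was set while the work callback ran. */
static int smap_during_work = -1;

static int probe_smap(void)
{
    smap_during_work = !!(read_cr4() & X86_CR4_SMAP);
    return 0;
}

/* Run 'work' on the dom0 page-tables with SMAP cleared, then restore. */
static int run_on_dom0_tables(unsigned long dom0_cr3, unsigned long idle_cr3,
                              int (*work)(void))
{
    unsigned long cr4 = read_cr4(); /* cache the value to restore later */
    int rc;

    switch_cr3_cr4(dom0_cr3, cr4 & ~X86_CR4_SMAP);
    rc = work();
    switch_cr3_cr4(idle_cr3, cr4);  /* same restore on success and error */

    return rc;
}
```

Caching CR4 once means every exit path restores the original value,
rather than re-reading a register that has SMAP cleared.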
 xen/arch/x86/pv/dom0_build.c | 13 ++++++++++---
 xen/arch/x86/setup.c         | 17 -----------------
 2 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index d8043fa58a27..41772dbe80bf 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -370,6 +370,7 @@ int __init dom0_construct_pv(struct domain *d,
     unsigned long alloc_epfn;
     unsigned long initrd_pfn = -1, initrd_mfn = 0;
     unsigned long count;
+    unsigned long cr4;
     struct page_info *page = NULL;
     unsigned int flush_flags = 0;
     start_info_t *si;
@@ -814,8 +815,14 @@ int __init dom0_construct_pv(struct domain *d,
     /* Set up CR3 value for switch_cr3_cr4(). */
     update_cr3(v);
 
+    /*
+     * Temporarily clear SMAP in CR4 to allow user-accesses when running with
+     * the dom0 page-tables.  Cache the value of CR4 so it can be restored.
+     */
+    cr4 = read_cr4();
+
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
+    switch_cr3_cr4(cr3_pa(v->arch.cr3), cr4 & ~X86_CR4_SMAP);
     mapcache_override_current(v);
 
     /* Copy the OS image and free temporary buffer. */
@@ -836,7 +843,7 @@ int __init dom0_construct_pv(struct domain *d,
              (parms.virt_hypercall >= v_end) )
         {
             mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current->arch.cr3, cr4);
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -978,7 +985,7 @@ int __init dom0_construct_pv(struct domain *d,
 
     /* Return to idle domain's page tables. */
     mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current->arch.cr3, cr4);
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index eee20bb1753c..bc387d96b519 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image,
         }
     }
 
-    /*
-     * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0().
-     * This saves a large number of corner cases interactions with
-     * copy_from_user().
-     */
-    if ( cpu_has_smap )
-    {
-        cr4_pv32_mask &= ~X86_CR4_SMAP;
-        write_cr4(read_cr4() & ~X86_CR4_SMAP);
-    }
-
     if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 )
         panic("Could not construct domain 0\n");
 
-    if ( cpu_has_smap )
-    {
-        write_cr4(read_cr4() | X86_CR4_SMAP);
-        cr4_pv32_mask |= X86_CR4_SMAP;
-    }
-
     return d;
 }
 
-- 
2.45.2




* [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-08-13 15:54   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static Roger Pau Monne
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

The idle_pg_table L4 is cloned to create all the other L4s Xen uses, and hence
it shouldn't be modified once further L4s are created.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
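
A small model of the ordering check added below.  The enum is
hypothetical and only loosely mirrors the state names used in the
patch; the point is that an ordered enum lets a single comparison cover
smp_boot and every later state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ordered boot states; later states compare greater. */
enum system_state_model {
    STATE_early_boot,
    STATE_boot,
    STATE_smp_boot,
    STATE_active,
};

/*
 * New L3 tables may be hooked into the template L4 only while no clones
 * of it exist yet, i.e. strictly before the smp_boot state.
 */
static bool may_populate_template_l4(enum system_state_model state)
{
    return state < STATE_smp_boot;
}
```

The patch expresses the inverse condition with BUG_ON(), so a late
modification crashes instead of silently leaving the clones out of sync.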
 xen/arch/x86/mm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6ffacab341ad..01380fd82c9d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5023,6 +5023,12 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
         mfn_t l3mfn;
         l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn);
 
+        /*
+         * dom0 is built at smp_boot, at which point we already create new L4s
+         * based on idle_pg_table.
+         */
+        BUG_ON(system_state >= SYS_STATE_smp_boot);
+
         if ( !l3t )
             return NULL;
         UNMAP_DOMAIN_PAGE(l3t);
-- 
2.45.2




* [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-30 13:12   ` Andrew Cooper
  2024-07-26 15:21 ` [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase() Roger Pau Monne
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

There are no callers outside the translation unit where it's defined, so make
the function static.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/mm.h | 2 --
 xen/arch/x86/mm.c             | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 98b66edaca5e..b3853ae734fa 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -567,8 +567,6 @@ mfn_t alloc_xen_pagetable(void);
 void free_xen_pagetable(mfn_t mfn);
 void *alloc_mapped_pagetable(mfn_t *pmfn);
 
-l1_pgentry_t *virt_to_xen_l1e(unsigned long v);
-
 int __sync_local_execstate(void);
 
 /* Arch-specific portion of memory_op hypercall. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 01380fd82c9d..ca3d116b0e05 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5087,7 +5087,7 @@ static l2_pgentry_t *virt_to_xen_l2e(unsigned long v)
     return map_l2t_from_l3e(l3e) + l2_table_offset(v);
 }
 
-l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
+static l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
 {
     l2_pgentry_t *pl2e, l2e;
 
-- 
2.45.2




* [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase()
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-30 13:19   ` Andrew Cooper
  2024-07-26 15:21 ` [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain() Roger Pau Monne
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

This reduces the repeated accessing of v->domain.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/mm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ca3d116b0e05..a792a300a866 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -517,13 +517,14 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
 
 void write_ptbase(struct vcpu *v)
 {
+    const struct domain *d = v->domain;
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
-    new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
+    new_cr4 = (is_pv_domain(d) && !is_idle_domain(d))
               ? pv_make_cr4(v) : mmu_cr4_features;
 
-    if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
+    if ( is_pv_domain(d) && d->arch.pv.xpti )
     {
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain()
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase() Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-08-14  9:47   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function Roger Pau Monne
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

XPTI is a speculation mitigation, so it fits better to initialize it in
spec_ctrl_init_domain().

No functional change intended, although the call to spec_ctrl_init_domain() in
arch_domain_create() needs to be moved ahead of pv_domain_initialise() for
d->arch.pv.xpti to be correctly set.

Move it ahead of most of the initialization functions, since
spec_ctrl_init_domain() doesn't depend on any member in the struct domain being
set.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c    | 4 ++--
 xen/arch/x86/pv/domain.c | 2 --
 xen/arch/x86/spec_ctrl.c | 4 ++++
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ccadfe0c9e70..3d3c14dbb5ae 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -842,6 +842,8 @@ int arch_domain_create(struct domain *d,
         is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u;
 #endif
 
+    spec_ctrl_init_domain(d);
+
     if ( (rc = paging_domain_init(d)) != 0 )
         goto fail;
     paging_initialised = true;
@@ -908,8 +910,6 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
-    spec_ctrl_init_domain(d);
-
     return 0;
 
  fail:
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b99..86b74fb372d5 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -383,8 +383,6 @@ int pv_domain_initialise(struct domain *d)
 
     d->arch.ctxt_switch = &pv_csw;
 
-    d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom : opt_xpti_domu;
-
     if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid )
         switch ( ACCESS_ONCE(opt_pcid) )
         {
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 40f6ae017010..5dc7a17b9354 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -1769,6 +1769,10 @@ void spec_ctrl_init_domain(struct domain *d)
         (ibpb   ? SCF_entry_ibpb   : 0) |
         (bhb    ? SCF_entry_bhb    : 0) |
         0;
+
+    if ( pv )
+        d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom
+                                                : opt_xpti_domu;
 }
 
 void __init init_speculation_mitigations(void)
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain() Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-29 13:36   ` Alejandro Vallejo
  2024-08-14 10:24   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 09/22] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel
  Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper,
	Tim Deegan

In preparation for the function being called from contexts where no domain is
present.

No functional change intended.
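
As a rough illustration of the interface change (a minimal sketch, not the Xen
code: the toy_* names, struct layouts and slot enum are hypothetical
stand-ins), the callee receives only the values it needs, and a NULL
per-domain L3 means "no domain":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real page and domain types. */
struct l3_page { int id; };
struct toy_domain {
    struct l3_page *perdomain_l3;
    bool is_pv_64bit;
};

/* Slot numbers follow the comments in the diff (260 and 4). */
enum { SLOT_PERDOMAIN_ALT = 4, SLOT_PERDOMAIN = 260, NR_SLOTS = 512 };

/* No domain pointer: perdomain_l3 may be NULL when there is no domain. */
static void toy_init_l4_slots(const struct l3_page **l4t,
                              const struct l3_page *perdomain_l3,
                              bool maybe_compat)
{
    if ( perdomain_l3 )
        l4t[SLOT_PERDOMAIN] = perdomain_l3;

    /* The mirror slot is only filled when a per-domain L3 exists. */
    if ( perdomain_l3 && maybe_compat )
        l4t[SLOT_PERDOMAIN_ALT] = l4t[SLOT_PERDOMAIN];
}

/* Callers that do have a domain derive the arguments from it. */
static void toy_init_for_domain(const struct l3_page **l4t,
                                const struct toy_domain *d)
{
    assert(d != NULL);
    toy_init_l4_slots(l4t, d->perdomain_l3, !d->is_pv_64bit);
}
```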

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/mm.h  |  4 +++-
 xen/arch/x86/mm.c              | 24 +++++++++++++-----------
 xen/arch/x86/mm/hap/hap.c      |  3 ++-
 xen/arch/x86/mm/shadow/hvm.c   |  3 ++-
 xen/arch/x86/mm/shadow/multi.c |  7 +++++--
 xen/arch/x86/pv/dom0_build.c   |  3 ++-
 xen/arch/x86/pv/domain.c       |  3 ++-
 7 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index b3853ae734fa..076e7009dc99 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -375,7 +375,9 @@ int devalidate_page(struct page_info *page, unsigned long type,
 
 void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d);
 void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
-                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt);
+                       mfn_t sl4mfn, const struct page_info *perdomain_l3,
+                       bool ro_mpt, bool maybe_compat, bool short_directmap);
+
 bool fill_ro_mpt(mfn_t mfn);
 void zap_ro_mpt(mfn_t mfn);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a792a300a866..c01b6712143e 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1645,14 +1645,9 @@ static int promote_l3_table(struct page_info *page)
  * extended directmap.
  */
 void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
-                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt)
+                       mfn_t sl4mfn, const struct page_info *perdomain_l3,
+                       bool ro_mpt, bool maybe_compat, bool short_directmap)
 {
-    /*
-     * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
-     * directmap.
-     */
-    bool short_directmap = !paging_mode_external(d);
-
     /* Slot 256: RO M2P (if applicable). */
     l4t[l4_table_offset(RO_MPT_VIRT_START)] =
         ro_mpt ? idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)]
@@ -1673,13 +1668,14 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
         l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
 
     /* Slot 260: Per-domain mappings. */
-    l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+    if ( perdomain_l3 )
+        l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
+            l4e_from_page(perdomain_l3, __PAGE_HYPERVISOR_RW);
 
     /* Slot 4: Per-domain mappings mirror. */
     BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) &&
                  !l4_table_offset(PERDOMAIN_ALT_VIRT_START));
-    if ( !is_pv_64bit_domain(d) )
+    if ( perdomain_l3 && maybe_compat )
         l4t[l4_table_offset(PERDOMAIN_ALT_VIRT_START)] =
             l4t[l4_table_offset(PERDOMAIN_VIRT_START)];
 
@@ -1710,6 +1706,10 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
     else
 #endif
     {
+        /*
+         * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
+         * directmap.
+         */
         unsigned int slots = (short_directmap
                               ? ROOT_PAGETABLE_PV_XEN_SLOTS
                               : ROOT_PAGETABLE_XEN_SLOTS);
@@ -1830,7 +1830,9 @@ static int promote_l4_table(struct page_info *page)
     if ( !rc )
     {
         init_xen_l4_slots(pl4e, l4mfn,
-                          d, INVALID_MFN, VM_ASSIST(d, m2p_strict));
+                          INVALID_MFN, d->arch.perdomain_l3_pg,
+                          VM_ASSIST(d, m2p_strict), !is_pv_64bit_domain(d),
+                          true);
         atomic_inc(&d->arch.pv.nr_l4_pages);
     }
     unmap_domain_page(pl4e);
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index d2011fde2462..c8514ca0e917 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -402,7 +402,8 @@ static mfn_t hap_make_monitor_table(struct vcpu *v)
     m4mfn = page_to_mfn(pg);
     l4e = map_domain_page(m4mfn);
 
-    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
+    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
+                      false, true, false);
     unmap_domain_page(l4e);
 
     return m4mfn;
diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c
index c16f3b3adf32..93922a71e511 100644
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -758,7 +758,8 @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels)
      * shadow-linear mapping will either be inserted below when creating
      * lower level monitor tables, or later in sh_update_cr3().
      */
-    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
+    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
+                      false, true, false);
 
     if ( shadow_levels < 4 )
     {
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 376f6823cd44..0def0c073ca8 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -973,8 +973,11 @@ sh_make_shadow(struct vcpu *v, mfn_t gmfn, u32 shadow_type)
 
             BUILD_BUG_ON(sizeof(l4_pgentry_t) != sizeof(shadow_l4e_t));
 
-            init_xen_l4_slots(l4t, gmfn, d, smfn, (!is_pv_32bit_domain(d) &&
-                                                   VM_ASSIST(d, m2p_strict)));
+            init_xen_l4_slots(l4t, gmfn, smfn,
+                              d->arch.perdomain_l3_pg,
+                              (!is_pv_32bit_domain(d) &&
+                               VM_ASSIST(d, m2p_strict)),
+                              !is_pv_64bit_domain(d), true);
             unmap_domain_page(l4t);
         }
         break;
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 41772dbe80bf..6a6689f402bb 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -711,7 +711,8 @@ int __init dom0_construct_pv(struct domain *d,
         l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
         clear_page(l4tab);
         init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-                          d, INVALID_MFN, true);
+                          INVALID_MFN, d->arch.perdomain_l3_pg,
+                          true, !is_pv_64bit_domain(d), true);
         v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
     }
     else
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 86b74fb372d5..6ff71f14a2f2 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -124,7 +124,8 @@ static int setup_compat_l4(struct vcpu *v)
     mfn = page_to_mfn(pg);
     l4tab = map_domain_page(mfn);
     clear_page(l4tab);
-    init_xen_l4_slots(l4tab, mfn, v->domain, INVALID_MFN, false);
+    init_xen_l4_slots(l4tab, mfn, INVALID_MFN, v->domain->arch.perdomain_l3_pg,
+                      false, true, true);
     unmap_domain_page(l4tab);
 
     /* This page needs to look like a pagetable so that it can be shadowed */
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 09/22] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (7 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 10/22] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

The current logic gates issuing TLB flush requests with the FLUSH_ROOT_PGTBL
flag on XPTI being enabled.

In preparation for FLUSH_ROOT_PGTBL also being needed when not using XPTI,
untie it from the xpti domain boolean and instead introduce a new flush_root_pt
field.

No functional change intended, as flush_root_pt == xpti.
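
The decoupling can be sketched as follows (illustrative only: the toy_* names
and the reduced struct are hypothetical, not the Xen definitions).  The flush
decision reads the new field, which for now simply mirrors xpti:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced view of the PV domain state. */
struct toy_pv_domain {
    bool xpti;
    bool flush_root_pt;
};

static void toy_pv_init(struct toy_pv_domain *pv, bool xpti)
{
    assert(pv);
    pv->xpti = xpti;
    /* For now both fields carry the same value. */
    pv->flush_root_pt = pv->xpti;
}

/* Users consult flush_root_pt and no longer look at xpti directly. */
static bool toy_needs_root_flush(const struct toy_pv_domain *pv)
{
    assert(pv);
    return pv->flush_root_pt;
}
```

A later change could then set flush_root_pt from a different condition without
touching the users of the field.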

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/domain.h   | 2 ++
 xen/arch/x86/include/asm/flushtlb.h | 2 +-
 xen/arch/x86/mm.c                   | 2 +-
 xen/arch/x86/pv/domain.c            | 2 ++
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index f5daeb182baa..9dd2e047f4de 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -283,6 +283,8 @@ struct pv_domain
     bool pcid;
     /* Mitigate L1TF with shadow/crashing? */
     bool check_l1tf;
+    /* Issue FLUSH_ROOT_PGTBL for root page-table changes. */
+    bool flush_root_pt;
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index bb0ad58db49b..1b98d03decdc 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -177,7 +177,7 @@ void flush_area_mask(const cpumask_t *mask, const void *va,
 
 #define flush_root_pgtbl_domain(d)                                       \
 {                                                                        \
-    if ( is_pv_domain(d) && (d)->arch.pv.xpti )                          \
+    if ( is_pv_domain(d) && (d)->arch.pv.flush_root_pt )                 \
         flush_mask((d)->dirty_cpumask, FLUSH_ROOT_PGTBL);                \
 }
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c01b6712143e..a1ac7bdc5b44 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4167,7 +4167,7 @@ long do_mmu_update(
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                     if ( !rc )
                         flush_linear_pt = true;
-                    if ( !rc && pt_owner->arch.pv.xpti )
+                    if ( !rc && pt_owner->arch.pv.flush_root_pt )
                     {
                         bool local_in_use = false;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 6ff71f14a2f2..46ee10a8a4c2 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -384,6 +384,8 @@ int pv_domain_initialise(struct domain *d)
 
     d->arch.ctxt_switch = &pv_csw;
 
+    d->arch.pv.flush_root_pt = d->arch.pv.xpti;
+
     if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid )
         switch ( ACCESS_ONCE(opt_pcid) )
         {
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 10/22] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (8 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 09/22] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 11/22] x86/mm: split setup of the per-domain slot on context switch Roger Pau Monne
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Move the handling of FLUSH_ROOT_PGTBL in flush_area_local() ahead of the logic
that does the TLB flushing, in preparation for further changes requiring the
TLB flush to be strictly done after having handled FLUSH_ROOT_PGTBL.

No functional change intended.
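
A toy model of the resulting order (a sketch under assumptions: the toy_*
names, flag values and bookkeeping fields are made up for illustration and do
not match the Xen definitions):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FLUSH_TLB        (1u << 0)
#define TOY_FLUSH_ROOT_PGTBL (1u << 1)

/* Hypothetical per-CPU state; only records what the sketch needs. */
struct toy_cpu {
    bool root_pgt_changed;
    bool root_seen_at_tlb_flush; /* root_pgt_changed when the TLB was flushed */
    unsigned int tlb_flushes;
};

static unsigned int toy_flush_local(struct toy_cpu *cpu, unsigned int flags)
{
    assert(cpu);

    /* Root page-table handling comes first... */
    if ( flags & TOY_FLUSH_ROOT_PGTBL )
        cpu->root_pgt_changed = true;

    /* ...so the TLB flush step can already observe it. */
    if ( flags & TOY_FLUSH_TLB )
    {
        cpu->root_seen_at_tlb_flush = cpu->root_pgt_changed;
        cpu->tlb_flushes++;
    }

    return flags;
}
```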

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/flushtlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 18748b2bc805..fd5ed16ffb57 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -191,6 +191,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
 {
     unsigned int order = (flags - 1) & FLUSH_ORDER_MASK;
 
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
+
     if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) )
     {
         if ( order == 0 )
@@ -254,9 +257,6 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
         }
     }
 
-    if ( flags & FLUSH_ROOT_PGTBL )
-        get_cpu_info()->root_pgt_changed = true;
-
     return flags;
 }
 
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 11/22] x86/mm: split setup of the per-domain slot on context switch
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (9 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 10/22] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

It's currently only used for XPTI.  Move the code to a separate helper in
preparation for it gaining more logic.

While there, switch to using l4e_write(): in the current context the L4 is
not active when modified, but that could change.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c         | 4 +---
 xen/arch/x86/include/asm/mm.h | 3 +++
 xen/arch/x86/mm.c             | 7 +++++++
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3d3c14dbb5ae..9cfcf0dc63f3 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1918,9 +1918,7 @@ void cf_check paravirt_ctxt_switch_to(struct vcpu *v)
     root_pgentry_t *root_pgt = this_cpu(root_pgt);
 
     if ( root_pgt )
-        root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
-            l4e_from_page(v->domain->arch.perdomain_l3_pg,
-                          __PAGE_HYPERVISOR_RW);
+        setup_perdomain_slot(v, root_pgt);
 
     if ( unlikely(v->arch.dr7 & DR7_ACTIVE_MASK) )
         activate_debugregs(v);
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 076e7009dc99..2c309f7b1444 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -630,4 +630,7 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
     return (mfn + nr) <= (virt_to_mfn(eva - 1) + 1);
 }
 
+/* Setup the per-domain slot in the root page table pointer. */
+void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt);
+
 #endif /* __ASM_X86_MM_H__ */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a1ac7bdc5b44..35e929057d21 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6362,6 +6362,13 @@ unsigned long get_upper_mfn_bound(void)
     return min(max_mfn, 1UL << (paddr_bits - PAGE_SHIFT)) - 1;
 }
 
+void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt)
+{
+    l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
+              l4e_from_page(v->domain->arch.perdomain_l3_pg,
+                            __PAGE_HYPERVISOR_RW));
+}
+
 static void __init __maybe_unused build_assertions(void)
 {
     /*
-- 
2.45.2



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (10 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 11/22] x86/mm: split setup of the per-domain slot on context switch Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-08-14 10:10   ` Jan Beulich
  2024-07-26 15:21 ` [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode Roger Pau Monne
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel
  Cc: alejandro.vallejo, Roger Pau Monne, Andrew Cooper, Jan Beulich,
	Julien Grall, Stefano Stabellini

No functional change, as the option is not used.

The option is introduced now so that newly added functionality can be keyed on
it being enabled, even while the feature is non-functional.
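
The tri-state handling of the option can be sketched like this (illustrative
only: the toy_* names and the struct are hypothetical; -1 stands for "not set
on the command line", as with the opt_asi_* variables in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical grouping of the three settings: -1 unset, 0 off, 1 on. */
struct toy_asi_opts {
    int8_t pv, hwdom, hvm;
};

/* Anything left unset defaults to disabled. */
static void toy_asi_apply_defaults(struct toy_asi_opts *o)
{
    assert(o);
    if ( o->pv == -1 )
        o->pv = 0;
    if ( o->hwdom == -1 )
        o->hwdom = 0;
    if ( o->hvm == -1 )
        o->hvm = 0;
}

/* Per-domain choice: the hardware domain setting wins over the guest type. */
static bool toy_domain_asi(const struct toy_asi_opts *o,
                           bool is_hwdom, bool is_pv)
{
    assert(o && o->pv >= 0 && o->hwdom >= 0 && o->hvm >= 0);
    return is_hwdom ? o->hwdom : is_pv ? o->pv : o->hvm;
}
```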

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 docs/misc/xen-command-line.pandoc    | 15 ++++--
 xen/arch/x86/include/asm/domain.h    |  3 ++
 xen/arch/x86/include/asm/spec_ctrl.h |  2 +
 xen/arch/x86/spec_ctrl.c             | 74 +++++++++++++++++++++++++---
 4 files changed, 81 insertions(+), 13 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 98a45211556b..0ddc330428d9 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
 
 ### spec-ctrl (x86)
 > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
+>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,
 >              bti-thunk=retpoline|lfence|jmp,bhb-seq=short|tsx|long,
 >              {ibrs,ibpb,ssbd,psfd,
 >              eager-fpu,l1d-flush,branch-harden,srb-lock,
@@ -2414,10 +2414,10 @@ in place for guests to use.
 
 Use of a positive boolean value for either of these options is invalid.
 
-The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=` and `bhb-entry=`
-options offer fine grained control over the primitives by Xen.  These impact
-Xen's ability to protect itself, and/or Xen's ability to virtualise support
-for guests to use.
+The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=`, `bhb-entry=` and
+`asi=` options offer fine grained control over the primitives by Xen.  These
+impact Xen's ability to protect itself, and/or Xen's ability to virtualise
+support for guests to use.
 
 * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests
   respectively.
@@ -2449,6 +2449,11 @@ for guests to use.
   is not available (see `bhi-dis-s`).  The choice of scrubbing sequence can be
   selected using the `bhb-seq=` option.  If it is necessary to protect dom0
   too, boot with `spec-ctrl=bhb-entry`.
+* `asi=` offers control over whether the hypervisor will engage in Address
+  Space Isolation, by not having sensitive information mapped in the VMM
+  page-tables.  Not having sensitive information on the page-tables avoids
+  having to perform some mitigations for speculative attacks when
+  context-switching to the hypervisor.
 
 If Xen was compiled with `CONFIG_INDIRECT_THUNK` support, `bti-thunk=` can be
 used to select which of the thunks gets patched into the
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 9dd2e047f4de..8c366be8c75f 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -458,6 +458,9 @@ struct arch_domain
     /* Don't unconditionally inject #GP for unhandled MSRs. */
     bool msr_relaxed;
 
+    /* Run the guest without sensitive information in the VMM page-tables. */
+    bool asi;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 72347ef2b959..39963c004312 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -88,6 +88,8 @@ extern uint8_t default_scf;
 
 extern int8_t opt_xpti_hwdom, opt_xpti_domu;
 
+extern int8_t opt_asi_pv, opt_asi_hwdom, opt_asi_hvm;
+
 extern bool cpu_has_bug_l1tf;
 extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu;
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 5dc7a17b9354..2e403aad791c 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -84,6 +84,11 @@ static bool __ro_after_init opt_verw_mmio;
 static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 
+/* Address Space Isolation for PV/HVM. */
+int8_t __ro_after_init opt_asi_pv = -1;
+int8_t __ro_after_init opt_asi_hwdom = -1;
+int8_t __ro_after_init opt_asi_hvm = -1;
+
 static int __init cf_check parse_spec_ctrl(const char *s)
 {
     const char *ss;
@@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
             opt_unpriv_mmio = false;
             opt_gds_mit = 0;
             opt_div_scrub = 0;
+
+            opt_asi_pv = 0;
+            opt_asi_hwdom = 0;
+            opt_asi_hvm = 0;
         }
         else if ( val > 0 )
             rc = -EINVAL;
@@ -162,6 +171,7 @@ static int __init cf_check parse_spec_ctrl(const char *s)
             opt_verw_pv = val;
             opt_ibpb_entry_pv = val;
             opt_bhb_entry_pv = val;
+            opt_asi_pv = val;
         }
         else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
         {
@@ -170,6 +180,7 @@ static int __init cf_check parse_spec_ctrl(const char *s)
             opt_verw_hvm = val;
             opt_ibpb_entry_hvm = val;
             opt_bhb_entry_hvm = val;
+            opt_asi_hvm = val;
         }
         else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 )
         {
@@ -279,6 +290,27 @@ static int __init cf_check parse_spec_ctrl(const char *s)
                 break;
             }
         }
+        else if ( (val = parse_boolean("asi", s, ss)) != -1 )
+        {
+            switch ( val )
+            {
+            case 0:
+            case 1:
+                opt_asi_pv = opt_asi_hwdom = opt_asi_hvm = val;
+                break;
+
+            case -2:
+                s += strlen("asi=");
+                if ( (val = parse_boolean("pv", s, ss)) >= 0 )
+                    opt_asi_pv = val;
+                else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
+                    opt_asi_hvm = val;
+                else
+            default:
+                    rc = -EINVAL;
+                break;
+            }
+        }
 
         /* Xen's speculative sidechannel mitigation settings. */
         else if ( !strncmp(s, "bti-thunk=", 10) )
@@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
 
 static __init void xpti_init_default(void)
 {
+    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
+    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
+    {
+        printk(XENLOG_ERR
+               "XPTI is incompatible with Address Space Isolation - disabling ASI\n");
+        opt_asi_pv = 0;
+    }
     if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) ||
          cpu_has_rdcl_no )
     {
@@ -389,9 +428,9 @@ static __init void xpti_init_default(void)
     else
     {
         if ( opt_xpti_hwdom < 0 )
-            opt_xpti_hwdom = 1;
+            opt_xpti_hwdom = !opt_asi_hwdom;
         if ( opt_xpti_domu < 0 )
-            opt_xpti_domu = 1;
+            opt_xpti_domu = !opt_asi_pv;
     }
 }
 
@@ -630,12 +669,13 @@ static void __init print_details(enum ind_thunk thunk)
      * mitigation support for guests.
      */
 #ifdef CONFIG_HVM
-    printk("  Support for HVM VMs:%s%s%s%s%s%s%s%s\n",
+    printk("  Support for HVM VMs:%s%s%s%s%s%s%s%s%s\n",
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_HVM) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ||
             opt_bhb_entry_hvm || amd_virt_spec_ctrl ||
-            opt_eager_fpu || opt_verw_hvm)           ? ""               : " None",
+            opt_eager_fpu || opt_verw_hvm ||
+            opt_asi_hvm)                             ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_HVM)      ? " MSR_SPEC_CTRL" : "",
            (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ||
             amd_virt_spec_ctrl)                      ? " MSR_VIRT_SPEC_CTRL" : "",
@@ -643,22 +683,24 @@ static void __init print_details(enum ind_thunk thunk)
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
            opt_verw_hvm                              ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "",
-           opt_bhb_entry_hvm                         ? " BHB-entry"     : "");
+           opt_bhb_entry_hvm                         ? " BHB-entry"     : "",
+           opt_asi_hvm                               ? " ASI"           : "");
 
 #endif
 #ifdef CONFIG_PV
-    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
+    printk("  Support for PV VMs:%s%s%s%s%s%s%s%s\n",
            (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
             boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
-            opt_bhb_entry_pv ||
+            opt_bhb_entry_pv || opt_asi_pv ||
             opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
            boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
            boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
            opt_eager_fpu                             ? " EAGER_FPU"     : "",
            opt_verw_pv                               ? " VERW"          : "",
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
-           opt_bhb_entry_pv                          ? " BHB-entry"     : "");
+           opt_bhb_entry_pv                          ? " BHB-entry"     : "",
+           opt_asi_pv                                ? " ASI"           : "");
 
     printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
            opt_xpti_hwdom ? "enabled" : "disabled",
@@ -1773,6 +1815,9 @@ void spec_ctrl_init_domain(struct domain *d)
     if ( pv )
         d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom
                                                 : opt_xpti_domu;
+
+    d->arch.asi = is_hardware_domain(d) ? opt_asi_hwdom
+                                        : pv ? opt_asi_pv : opt_asi_hvm;
 }
 
 void __init init_speculation_mitigations(void)
@@ -2069,6 +2114,19 @@ void __init init_speculation_mitigations(void)
          hw_smt_enabled && default_xen_spec_ctrl )
         setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE);
 
+    /* Disable ASI by default until feature is finished. */
+    if ( opt_asi_pv == -1 )
+        opt_asi_pv = 0;
+    if ( opt_asi_hwdom == -1 )
+        opt_asi_hwdom = 0;
+    if ( opt_asi_hvm == -1 )
+        opt_asi_hvm = 0;
+
+    if ( opt_asi_pv || opt_asi_hvm )
+        warning_add(
+            "Address Space Isolation is not functional, this option is\n"
+            "intended to be used only for development purposes.\n");
+
     xpti_init_default();
 
     l1tf_calculations();
-- 
2.45.2




* [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (11 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-08-16 18:02   ` Alejandro Vallejo
  2024-07-26 15:21 ` [PATCH 14/22] x86/hvm: use a per-pCPU monitor table in shadow mode Roger Pau Monne
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Instead of allocating a monitor table for each vCPU when running in HVM HAP
mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
guest context switch.

This limits the amount of memory used for HVM HAP monitor tables to the number
of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
allocation and teardown, since the monitor table handling is removed from
there.

Note the switch to using a per-CPU monitor table is done regardless of whether
Address Space Isolation is enabled.  This is partly for the reduction in memory
usage, and also because it simplifies the VM teardown path by not having to
clean up the per-vCPU monitor tables.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Note the monitor table is not made static because uses outside of the file
where it's defined will be added by further patches.
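
For illustration only (not part of the patch), the per-pCPU scheme described
above can be sketched as a self-contained toy model.  All toy_* names, the slot
index and the table size are assumptions made up for this example; they do not
match the Xen definitions, and the real code uses init_xen_l4_slots(),
setup_perdomain_slot() and make_cr3() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_PCPUS       4    /* hypothetical number of physical CPUs */
#define TOY_L4_ENTRIES     512  /* entries in an x86-64 root page table */
#define TOY_PERDOMAIN_SLOT 260  /* made-up index, not Xen's real slot */

typedef unsigned long toy_pte_t;

struct toy_domain {
    toy_pte_t perdomain_l3;     /* stands in for d->arch.perdomain_l3_pg */
};

struct toy_vcpu {
    struct toy_domain *domain;
    toy_pte_t *cr3;             /* root table the vCPU would run on */
};

/* One root table per pCPU, independent of how many vCPUs exist. */
static toy_pte_t *toy_monitor_pgt[TOY_NR_PCPUS];

/* Called once per pCPU when it is brought up. */
static int toy_allocate_cpu_monitor_table(unsigned int cpu)
{
    toy_pte_t *pgt = calloc(TOY_L4_ENTRIES, sizeof(*pgt));

    if ( !pgt )
        return -1;

    assert(!toy_monitor_pgt[cpu]);
    toy_monitor_pgt[cpu] = pgt;

    return 0;
}

/* Context switch to v on pCPU cpu: rewrite the per-domain slot. */
static void toy_set_cpu_monitor_table(struct toy_vcpu *v, unsigned int cpu)
{
    toy_pte_t *pgt = toy_monitor_pgt[cpu];

    assert(pgt);
    pgt[TOY_PERDOMAIN_SLOT] = v->domain->perdomain_l3;
    v->cr3 = pgt;
}

/* Context switch away from v: poison the root pointer. */
static void toy_clear_cpu_monitor_table(struct toy_vcpu *v)
{
    v->cr3 = NULL;
}
```

In this model two vCPUs from different domains scheduled back to back on the
same pCPU reuse a single root table, and only the per-domain slot changes
between them.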
---
 xen/arch/x86/hvm/hvm.c             | 60 ++++++++++++++++++++++++
 xen/arch/x86/hvm/svm/svm.c         |  5 ++
 xen/arch/x86/hvm/vmx/vmcs.c        |  1 +
 xen/arch/x86/hvm/vmx/vmx.c         |  4 ++
 xen/arch/x86/include/asm/hap.h     |  1 -
 xen/arch/x86/include/asm/hvm/hvm.h |  8 ++++
 xen/arch/x86/mm.c                  |  8 ++++
 xen/arch/x86/mm/hap/hap.c          | 75 ------------------------------
 xen/arch/x86/mm/paging.c           |  4 +-
 9 files changed, 87 insertions(+), 79 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7f4b627b1f5f..3f771bc65677 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
 static bool __initdata opt_altp2m_enabled;
 boolean_param("altp2m", opt_altp2m_enabled);
 
+DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
+
+static int allocate_cpu_monitor_table(unsigned int cpu)
+{
+    root_pgentry_t *pgt = alloc_xenheap_page();
+
+    if ( !pgt )
+        return -ENOMEM;
+
+    clear_page(pgt);
+
+    init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
+                      false, true, false);
+
+    ASSERT(!per_cpu(monitor_pgt, cpu));
+    per_cpu(monitor_pgt, cpu) = pgt;
+
+    return 0;
+}
+
+static void free_cpu_monitor_table(unsigned int cpu)
+{
+    root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
+
+    if ( !pgt )
+        return;
+
+    per_cpu(monitor_pgt, cpu) = NULL;
+    free_xenheap_page(pgt);
+}
+
+void hvm_set_cpu_monitor_table(struct vcpu *v)
+{
+    root_pgentry_t *pgt = this_cpu(monitor_pgt);
+
+    ASSERT(pgt);
+
+    setup_perdomain_slot(v, pgt);
+
+    make_cr3(v, _mfn(virt_to_mfn(pgt)));
+}
+
+void hvm_clear_cpu_monitor_table(struct vcpu *v)
+{
+    /* Poison %cr3, it will be updated when the vCPU is scheduled. */
+    make_cr3(v, INVALID_MFN);
+}
+
 static int cf_check cpu_callback(
     struct notifier_block *nfb, unsigned long action, void *hcpu)
 {
@@ -113,6 +161,9 @@ static int cf_check cpu_callback(
     switch ( action )
     {
     case CPU_UP_PREPARE:
+        rc = allocate_cpu_monitor_table(cpu);
+        if ( rc )
+            break;
         rc = alternative_call(hvm_funcs.cpu_up_prepare, cpu);
         break;
     case CPU_DYING:
@@ -121,6 +172,7 @@ static int cf_check cpu_callback(
     case CPU_UP_CANCELED:
     case CPU_DEAD:
         alternative_vcall(hvm_funcs.cpu_dead, cpu);
+        free_cpu_monitor_table(cpu);
         break;
     default:
         break;
@@ -154,6 +206,7 @@ static bool __init hap_supported(struct hvm_function_table *fns)
 static int __init cf_check hvm_enable(void)
 {
     const struct hvm_function_table *fns = NULL;
+    int rc;
 
     if ( cpu_has_vmx )
         fns = start_vmx();
@@ -205,6 +258,13 @@ static int __init cf_check hvm_enable(void)
 
     register_cpu_notifier(&cpu_nfb);
 
+    rc = allocate_cpu_monitor_table(0);
+    if ( rc )
+    {
+        printk(XENLOG_ERR "Error %d setting up HVM monitor page tables\n", rc);
+        return rc;
+    }
+
     return 0;
 }
 presmp_initcall(hvm_enable);
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 988250dbc154..a3fc033c0100 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -902,6 +902,8 @@ static void cf_check svm_ctxt_switch_from(struct vcpu *v)
     if ( unlikely((read_efer() & EFER_SVME) == 0) )
         return;
 
+    hvm_clear_cpu_monitor_table(v);
+
     if ( !v->arch.fully_eager_fpu )
         svm_fpu_leave(v);
 
@@ -957,6 +959,8 @@ static void cf_check svm_ctxt_switch_to(struct vcpu *v)
         ASSERT(v->domain->arch.cpuid->extd.virt_ssbd);
         amd_set_legacy_ssbd(true);
     }
+
+    hvm_set_cpu_monitor_table(v);
 }
 
 static void noreturn cf_check svm_do_resume(void)
@@ -990,6 +994,7 @@ static void noreturn cf_check svm_do_resume(void)
         hvm_migrate_pirqs(v);
         /* Migrating to another ASID domain.  Request a new ASID. */
         hvm_asid_flush_vcpu(v);
+        hvm_update_host_cr3(v);
     }
 
     if ( !vcpu_guestmode && !vlapic_hw_disabled(vlapic) )
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 9b6dc51f36ab..5d67c8157825 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -1957,6 +1957,7 @@ void cf_check vmx_do_resume(void)
         v->arch.hvm.vmx.hostenv_migrated = 1;
 
         hvm_asid_flush_vcpu(v);
+        hvm_update_host_cr3(v);
     }
 
     debug_state = v->domain->debugger_attached
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index cbe91c679807..5863c57b2d4a 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1153,6 +1153,8 @@ static void cf_check vmx_ctxt_switch_from(struct vcpu *v)
     if ( unlikely(!this_cpu(vmxon)) )
         return;
 
+    hvm_clear_cpu_monitor_table(v);
+
     if ( !v->is_running )
     {
         /*
@@ -1182,6 +1184,8 @@ static void cf_check vmx_ctxt_switch_to(struct vcpu *v)
 
     if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO )
         vmx_pi_switch_to(v);
+
+    hvm_set_cpu_monitor_table(v);
 }
 
 
diff --git a/xen/arch/x86/include/asm/hap.h b/xen/arch/x86/include/asm/hap.h
index f01ce73fb4f3..ae6760bc2bf5 100644
--- a/xen/arch/x86/include/asm/hap.h
+++ b/xen/arch/x86/include/asm/hap.h
@@ -24,7 +24,6 @@ int   hap_domctl(struct domain *d, struct xen_domctl_shadow_op *sc,
                  XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl);
 int   hap_enable(struct domain *d, u32 mode);
 void  hap_final_teardown(struct domain *d);
-void  hap_vcpu_teardown(struct vcpu *v);
 void  hap_teardown(struct domain *d, bool *preempted);
 void  hap_vcpu_init(struct vcpu *v);
 int   hap_track_dirty_vram(struct domain *d,
diff --git a/xen/arch/x86/include/asm/hvm/hvm.h b/xen/arch/x86/include/asm/hvm/hvm.h
index 1c01e22c8e62..6d9a1ae04feb 100644
--- a/xen/arch/x86/include/asm/hvm/hvm.h
+++ b/xen/arch/x86/include/asm/hvm/hvm.h
@@ -550,6 +550,14 @@ static inline void hvm_invlpg(struct vcpu *v, unsigned long linear)
                        (1U << X86_EXC_AC) | \
                        (1U << X86_EXC_MC))
 
+/*
+ * Setup the per-domain slots of the per-cpu monitor table and update the vCPU
+ * cr3 to use it.
+ */
+DECLARE_PER_CPU(root_pgentry_t *, monitor_pgt);
+void hvm_set_cpu_monitor_table(struct vcpu *v);
+void hvm_clear_cpu_monitor_table(struct vcpu *v);
+
 /* Called in boot/resume paths.  Must cope with no HVM support. */
 static inline int hvm_cpu_up(void)
 {
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 35e929057d21..7f2666adaef4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6367,6 +6367,14 @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt)
     l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
               l4e_from_page(v->domain->arch.perdomain_l3_pg,
                             __PAGE_HYPERVISOR_RW));
+
+    if ( !is_pv_64bit_vcpu(v) )
+        /*
+         * HVM guests always have the compatibility L4 per-domain area because
+         * bitness is not known, and can change at runtime.
+         */
+        l4e_write(&root_pgt[root_table_offset(PERDOMAIN_ALT_VIRT_START)],
+                  root_pgt[root_table_offset(PERDOMAIN_VIRT_START)]);
 }
 
 static void __init __maybe_unused build_assertions(void)
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index c8514ca0e917..3279aafcd7d8 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -387,46 +387,6 @@ int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
     return 0;
 }
 
-static mfn_t hap_make_monitor_table(struct vcpu *v)
-{
-    struct domain *d = v->domain;
-    struct page_info *pg;
-    l4_pgentry_t *l4e;
-    mfn_t m4mfn;
-
-    ASSERT(pagetable_get_pfn(v->arch.hvm.monitor_table) == 0);
-
-    if ( (pg = hap_alloc(d)) == NULL )
-        goto oom;
-
-    m4mfn = page_to_mfn(pg);
-    l4e = map_domain_page(m4mfn);
-
-    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
-                      false, true, false);
-    unmap_domain_page(l4e);
-
-    return m4mfn;
-
- oom:
-    if ( !d->is_dying &&
-         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
-    {
-        printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n",
-               d);
-        domain_crash(d);
-    }
-    return INVALID_MFN;
-}
-
-static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
-{
-    struct domain *d = v->domain;
-
-    /* Put the memory back in the pool */
-    hap_free(d, mmfn);
-}
-
 /************************************************/
 /*          HAP DOMAIN LEVEL FUNCTIONS          */
 /************************************************/
@@ -548,25 +508,6 @@ void hap_final_teardown(struct domain *d)
     }
 }
 
-void hap_vcpu_teardown(struct vcpu *v)
-{
-    struct domain *d = v->domain;
-    mfn_t mfn;
-
-    paging_lock(d);
-
-    if ( !paging_mode_hap(d) || !v->arch.paging.mode )
-        goto out;
-
-    mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
-    if ( mfn_x(mfn) )
-        hap_destroy_monitor_table(v, mfn);
-    v->arch.hvm.monitor_table = pagetable_null();
-
- out:
-    paging_unlock(d);
-}
-
 void hap_teardown(struct domain *d, bool *preempted)
 {
     struct vcpu *v;
@@ -575,10 +516,6 @@ void hap_teardown(struct domain *d, bool *preempted)
     ASSERT(d->is_dying);
     ASSERT(d != current->domain);
 
-    /* TODO - Remove when the teardown path is better structured. */
-    for_each_vcpu ( d, v )
-        hap_vcpu_teardown(v);
-
     /* Leave the root pt in case we get further attempts to modify the p2m. */
     if ( hvm_altp2m_supported() )
     {
@@ -782,21 +719,9 @@ static void cf_check hap_update_paging_modes(struct vcpu *v)
 
     v->arch.paging.mode = hap_paging_get_mode(v);
 
-    if ( pagetable_is_null(v->arch.hvm.monitor_table) )
-    {
-        mfn_t mmfn = hap_make_monitor_table(v);
-
-        if ( mfn_eq(mmfn, INVALID_MFN) )
-            goto unlock;
-        v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
-        make_cr3(v, mmfn);
-        hvm_update_host_cr3(v);
-    }
-
     /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */
     hap_update_cr3(v, false);
 
- unlock:
     paging_unlock(d);
     put_gfn(d, cr3_gfn);
 }
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index bca320fffabf..8ba105b5cb0c 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -794,9 +794,7 @@ long do_paging_domctl_cont(
 
 void paging_vcpu_teardown(struct vcpu *v)
 {
-    if ( hap_enabled(v->domain) )
-        hap_vcpu_teardown(v);
-    else
+    if ( !hap_enabled(v->domain) )
         shadow_vcpu_teardown(v);
 }
 
-- 
2.45.2




* [PATCH 14/22] x86/hvm: use a per-pCPU monitor table in shadow mode
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (12 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-07-26 15:21 ` [PATCH 15/22] x86/idle: allow using a per-pCPU L4 Roger Pau Monne
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel
  Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper,
	Tim Deegan

Instead of allocating a monitor table for each vCPU when running in HVM shadow
mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
guest context switch.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
I've tested this manually, but XenServer builds disable shadow support, so it
possibly hasn't been given the same level of testing as the rest of the
changes.
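
For illustration only (not part of the patch), the choice of what goes in the
shadow linear L4 slot when a vCPU is scheduled can be modelled as below.  The
toy_* names and the plain integer MFNs are assumptions for this sketch; the
real logic lives in set_cpu_monitor_table() in shadow/multi.c and operates on
mfn_t and l4e values.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_MFN UINT64_MAX  /* stands in for INVALID_MFN */

struct toy_shadow_vcpu {
    /* Linear L3 used with fewer than 4 shadow levels, else invalid. */
    uint64_t shadow_linear_l3;
    /* Top-level shadow table, stands in for shadow_table[0]. */
    uint64_t shadow_table0;
};

/* MFN to install in the shadow linear slot of the per-pCPU root table. */
static uint64_t toy_linear_slot_mfn(const struct toy_shadow_vcpu *v)
{
    if ( v->shadow_linear_l3 != TOY_INVALID_MFN )
        return v->shadow_linear_l3;

    return v->shadow_table0;
}
```

The per-vCPU state shrinks to the linear L3 reference; the root table itself
belongs to the pCPU and has the slot rewritten on each context switch.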
---
 xen/arch/x86/hvm/hvm.c              |  7 +++
 xen/arch/x86/include/asm/hvm/vcpu.h |  6 ++-
 xen/arch/x86/include/asm/paging.h   | 18 ++++++++
 xen/arch/x86/mm.c                   |  6 +++
 xen/arch/x86/mm/shadow/common.c     | 42 +++++++-----------
 xen/arch/x86/mm/shadow/hvm.c        | 65 ++++++++++++----------------
 xen/arch/x86/mm/shadow/multi.c      | 66 ++++++++++++++++++-----------
 xen/arch/x86/mm/shadow/private.h    |  4 +-
 8 files changed, 120 insertions(+), 94 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 3f771bc65677..419d78a79c51 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -141,6 +141,7 @@ void hvm_set_cpu_monitor_table(struct vcpu *v)
 
     ASSERT(pgt);
 
+    paging_set_cpu_monitor_table(v);
     setup_perdomain_slot(v, pgt);
 
     make_cr3(v, _mfn(virt_to_mfn(pgt)));
@@ -150,6 +151,8 @@ void hvm_clear_cpu_monitor_table(struct vcpu *v)
 {
     /* Poison %cr3, it will be updated when the vCPU is scheduled. */
     make_cr3(v, INVALID_MFN);
+
+    paging_clear_cpu_monitor_table(v);
 }
 
 static int cf_check cpu_callback(
@@ -1645,6 +1648,10 @@ int hvm_vcpu_initialise(struct vcpu *v)
     int rc;
     struct domain *d = v->domain;
 
+#ifdef CONFIG_SHADOW_PAGING
+    v->arch.hvm.shadow_linear_l3 = INVALID_MFN;
+#endif
+
     hvm_asid_flush_vcpu(v);
 
     spin_lock_init(&v->arch.hvm.tm_lock);
diff --git a/xen/arch/x86/include/asm/hvm/vcpu.h b/xen/arch/x86/include/asm/hvm/vcpu.h
index 64c7a6fedea9..f7faaaa21521 100644
--- a/xen/arch/x86/include/asm/hvm/vcpu.h
+++ b/xen/arch/x86/include/asm/hvm/vcpu.h
@@ -149,8 +149,10 @@ struct hvm_vcpu {
         uint16_t p2midx;
     } fast_single_step;
 
-    /* (MFN) hypervisor page table */
-    pagetable_t         monitor_table;
+#ifdef CONFIG_SHADOW_PAGING
+    /* Reference to the linear L3 page table. */
+    mfn_t shadow_linear_l3;
+#endif
 
     struct hvm_vcpu_asid n1asid;
 
diff --git a/xen/arch/x86/include/asm/paging.h b/xen/arch/x86/include/asm/paging.h
index 8a2a0af40874..c1e188bcd3c0 100644
--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -117,6 +117,8 @@ struct paging_mode {
                                             unsigned long cr3,
                                             paddr_t ga, uint32_t *pfec,
                                             unsigned int *page_order);
+    void          (*set_cpu_monitor_table  )(struct vcpu *v);
+    void          (*clear_cpu_monitor_table)(struct vcpu *v);
 #endif
     pagetable_t   (*update_cr3            )(struct vcpu *v, bool noflush);
 
@@ -288,6 +290,22 @@ static inline bool paging_flush_tlb(const unsigned long *vcpu_bitmap)
     return current->domain->arch.paging.flush_tlb(vcpu_bitmap);
 }
 
+static inline void paging_set_cpu_monitor_table(struct vcpu *v)
+{
+    const struct paging_mode *mode = paging_get_hostmode(v);
+
+    if ( mode->set_cpu_monitor_table )
+        mode->set_cpu_monitor_table(v);
+}
+
+static inline void paging_clear_cpu_monitor_table(struct vcpu *v)
+{
+    const struct paging_mode *mode = paging_get_hostmode(v);
+
+    if ( mode->clear_cpu_monitor_table )
+        mode->clear_cpu_monitor_table(v);
+}
+
 #endif /* CONFIG_HVM */
 
 /* Update all the things that are derived from the guest's CR3.
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 7f2666adaef4..13aa15f4db22 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -534,6 +534,12 @@ void write_ptbase(struct vcpu *v)
     }
     else
     {
+        ASSERT(!is_hvm_domain(d) || !d->arch.asi
+#ifdef CONFIG_HVM
+               || mfn_eq(maddr_to_mfn(v->arch.cr3),
+                         virt_to_mfn(this_cpu(monitor_pgt)))
+#endif
+               );
         /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 0176e33bc9c7..d31c1db8a1ab 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2413,16 +2413,12 @@ static void sh_update_paging_modes(struct vcpu *v)
                 &SHADOW_INTERNAL_NAME(sh_paging_mode, 2);
         }
 
-        if ( pagetable_is_null(v->arch.hvm.monitor_table) )
+        if ( mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) )
         {
-            mfn_t mmfn = sh_make_monitor_table(
-                             v, v->arch.paging.mode->shadow.shadow_levels);
-
-            if ( mfn_eq(mmfn, INVALID_MFN) )
+            if ( sh_update_monitor_table(
+                     v, v->arch.paging.mode->shadow.shadow_levels) )
                 return;
 
-            v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
-            make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
         }
 
@@ -2440,8 +2436,8 @@ static void sh_update_paging_modes(struct vcpu *v)
                  (v->arch.paging.mode->shadow.shadow_levels !=
                   old_mode->shadow.shadow_levels) )
             {
-                /* Need to make a new monitor table for the new mode */
-                mfn_t new_mfn, old_mfn;
+                /* Might need to make a new L3 linear table for the new mode */
+                mfn_t old_mfn;
 
                 if ( v != current && vcpu_runnable(v) )
                 {
@@ -2455,24 +2451,21 @@ static void sh_update_paging_modes(struct vcpu *v)
                     return;
                 }
 
-                old_mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
-                v->arch.hvm.monitor_table = pagetable_null();
-                new_mfn = sh_make_monitor_table(
-                              v, v->arch.paging.mode->shadow.shadow_levels);
-                if ( mfn_eq(new_mfn, INVALID_MFN) )
+                old_mfn = v->arch.hvm.shadow_linear_l3;
+                v->arch.hvm.shadow_linear_l3 = INVALID_MFN;
+                if ( sh_update_monitor_table(
+                         v, v->arch.paging.mode->shadow.shadow_levels) )
                 {
                     sh_destroy_monitor_table(v, old_mfn,
                                              old_mode->shadow.shadow_levels);
                     return;
                 }
-                v->arch.hvm.monitor_table = pagetable_from_mfn(new_mfn);
-                SHADOW_PRINTK("new monitor table %"PRI_mfn "\n",
-                               mfn_x(new_mfn));
+                SHADOW_PRINTK("new L3 linear table %"PRI_mfn "\n",
+                               mfn_x(v->arch.hvm.shadow_linear_l3));
 
                 /* Don't be running on the old monitor table when we
                  * pull it down!  Switch CR3, and warn the HVM code that
                  * its host cr3 has changed. */
-                make_cr3(v, new_mfn);
                 if ( v == current )
                     write_ptbase(v);
                 hvm_update_host_cr3(v);
@@ -2781,16 +2774,13 @@ void shadow_vcpu_teardown(struct vcpu *v)
 
     sh_detach_old_tables(v);
 #ifdef CONFIG_HVM
-    if ( shadow_mode_external(d) )
+    if ( shadow_mode_external(d) &&
+         !mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) )
     {
-        mfn_t mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
-
-        if ( mfn_x(mfn) )
-            sh_destroy_monitor_table(
-                v, mfn,
+        sh_destroy_monitor_table(
+                v, v->arch.hvm.shadow_linear_l3,
                 v->arch.paging.mode->shadow.shadow_levels);
-
-        v->arch.hvm.monitor_table = pagetable_null();
+        v->arch.hvm.shadow_linear_l3 = INVALID_MFN;
     }
 #endif
 
diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c
index 93922a71e511..15c75cf766bb 100644
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -736,30 +736,15 @@ bool cf_check shadow_flush_tlb(const unsigned long *vcpu_bitmap)
     return true;
 }
 
-mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels)
+int sh_update_monitor_table(struct vcpu *v, unsigned int shadow_levels)
 {
     struct domain *d = v->domain;
-    mfn_t m4mfn;
-    l4_pgentry_t *l4e;
 
-    ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table));
+    ASSERT(mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN));
 
     /* Guarantee we can get the memory we need */
-    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
-        return INVALID_MFN;
-
-    m4mfn = shadow_alloc(d, SH_type_monitor_table, 0);
-    mfn_to_page(m4mfn)->shadow_flags = 4;
-
-    l4e = map_domain_page(m4mfn);
-
-    /*
-     * Create a self-linear mapping, but no shadow-linear mapping.  A
-     * shadow-linear mapping will either be inserted below when creating
-     * lower level monitor tables, or later in sh_update_cr3().
-     */
-    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
-                      false, true, false);
+    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS - 1) )
+        return -ENOMEM;
 
     if ( shadow_levels < 4 )
     {
@@ -773,52 +758,54 @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels)
          */
         m3mfn = shadow_alloc(d, SH_type_monitor_table, 0);
         mfn_to_page(m3mfn)->shadow_flags = 3;
-        l4e[l4_table_offset(SH_LINEAR_PT_VIRT_START)]
-            = l4e_from_mfn(m3mfn, __PAGE_HYPERVISOR_RW);
 
         m2mfn = shadow_alloc(d, SH_type_monitor_table, 0);
         mfn_to_page(m2mfn)->shadow_flags = 2;
         l3e = map_domain_page(m3mfn);
         l3e[0] = l3e_from_mfn(m2mfn, __PAGE_HYPERVISOR_RW);
         unmap_domain_page(l3e);
-    }
 
-    unmap_domain_page(l4e);
+        v->arch.hvm.shadow_linear_l3 = m3mfn;
+
+        /*
+         * If the vCPU is not the current one the L4 entry will be updated on
+         * context switch.
+         */
+        if ( v == current )
+            this_cpu(monitor_pgt)[l4_table_offset(SH_LINEAR_PT_VIRT_START)]
+                = l4e_from_mfn(m3mfn, __PAGE_HYPERVISOR_RW);
+    }
+    else if ( v == current )
+        /* The shadow linear mapping will be inserted in sh_update_cr3(). */
+        this_cpu(monitor_pgt)[l4_table_offset(SH_LINEAR_PT_VIRT_START)]
+            = l4e_empty();
 
-    return m4mfn;
+    return 0;
 }
 
-void sh_destroy_monitor_table(const struct vcpu *v, mfn_t mmfn,
+void sh_destroy_monitor_table(const struct vcpu *v, mfn_t m3mfn,
                               unsigned int shadow_levels)
 {
     struct domain *d = v->domain;
 
-    ASSERT(mfn_to_page(mmfn)->u.sh.type == SH_type_monitor_table);
-
     if ( shadow_levels < 4 )
     {
-        mfn_t m3mfn;
-        l4_pgentry_t *l4e = map_domain_page(mmfn);
-        l3_pgentry_t *l3e;
-        unsigned int linear_slot = l4_table_offset(SH_LINEAR_PT_VIRT_START);
+        l3_pgentry_t *l3e = map_domain_page(m3mfn);
+
+        ASSERT(!mfn_eq(m3mfn, INVALID_MFN));
+        ASSERT(mfn_to_page(m3mfn)->u.sh.type == SH_type_monitor_table);
 
         /*
          * Need to destroy the l3 and l2 monitor pages used
          * for the linear map.
          */
-        ASSERT(l4e_get_flags(l4e[linear_slot]) & _PAGE_PRESENT);
-        m3mfn = l4e_get_mfn(l4e[linear_slot]);
-        l3e = map_domain_page(m3mfn);
         ASSERT(l3e_get_flags(l3e[0]) & _PAGE_PRESENT);
         shadow_free(d, l3e_get_mfn(l3e[0]));
         unmap_domain_page(l3e);
         shadow_free(d, m3mfn);
-
-        unmap_domain_page(l4e);
     }
-
-    /* Put the memory back in the pool */
-    shadow_free(d, mmfn);
+    else
+        ASSERT(mfn_eq(m3mfn, INVALID_MFN));
 }
 
 /**************************************************************************/
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 0def0c073ca8..68c59233794f 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3007,6 +3007,32 @@ static unsigned long cf_check sh_gva_to_gfn(
     return gfn_x(gfn);
 }
 
+static void cf_check set_cpu_monitor_table(struct vcpu *v)
+{
+    root_pgentry_t *pgt = this_cpu(monitor_pgt);
+
+    virt_to_page(pgt)->shadow_flags = 4;
+
+    /* Setup linear L3 entry. */
+    if ( !mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) )
+        pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
+            l4e_from_mfn(v->arch.hvm.shadow_linear_l3, __PAGE_HYPERVISOR_RW);
+    else
+        pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
+            l4e_from_pfn(
+                pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]),
+                __PAGE_HYPERVISOR_RW);
+}
+
+static void cf_check clear_cpu_monitor_table(struct vcpu *v)
+{
+    root_pgentry_t *pgt = this_cpu(monitor_pgt);
+
+    virt_to_page(pgt)->shadow_flags = 0;
+
+    pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] = l4e_empty();
+}
+
 #endif /* CONFIG_HVM */
 
 static inline void
@@ -3033,8 +3059,11 @@ sh_update_linear_entries(struct vcpu *v)
      */
 
     /* Don't try to update the monitor table if it doesn't exist */
-    if ( !shadow_mode_external(d) ||
-         pagetable_get_pfn(v->arch.hvm.monitor_table) == 0 )
+    if ( !shadow_mode_external(d)
+#if SHADOW_PAGING_LEVELS == 3
+         || mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN)
+#endif
+       )
         return;
 
 #if !defined(CONFIG_HVM)
@@ -3051,17 +3080,6 @@ sh_update_linear_entries(struct vcpu *v)
                 pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]),
                 __PAGE_HYPERVISOR_RW);
     }
-    else
-    {
-        l4_pgentry_t *ml4e;
-
-        ml4e = map_domain_page(pagetable_get_mfn(v->arch.hvm.monitor_table));
-        ml4e[l4_table_offset(SH_LINEAR_PT_VIRT_START)] =
-            l4e_from_pfn(
-                pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]),
-                __PAGE_HYPERVISOR_RW);
-        unmap_domain_page(ml4e);
-    }
 
 #elif SHADOW_PAGING_LEVELS == 3
 
@@ -3087,16 +3105,8 @@ sh_update_linear_entries(struct vcpu *v)
                 + l2_linear_offset(SH_LINEAR_PT_VIRT_START);
         else
         {
-            mfn_t l3mfn, l2mfn;
-            l4_pgentry_t *ml4e;
-            l3_pgentry_t *ml3e;
-            int linear_slot = shadow_l4_table_offset(SH_LINEAR_PT_VIRT_START);
-            ml4e = map_domain_page(pagetable_get_mfn(v->arch.hvm.monitor_table));
-
-            ASSERT(l4e_get_flags(ml4e[linear_slot]) & _PAGE_PRESENT);
-            l3mfn = l4e_get_mfn(ml4e[linear_slot]);
-            ml3e = map_domain_page(l3mfn);
-            unmap_domain_page(ml4e);
+            mfn_t l2mfn;
+            l3_pgentry_t *ml3e = map_domain_page(v->arch.hvm.shadow_linear_l3);
 
             ASSERT(l3e_get_flags(ml3e[0]) & _PAGE_PRESENT);
             l2mfn = l3e_get_mfn(ml3e[0]);
@@ -3341,9 +3351,13 @@ static pagetable_t cf_check sh_update_cr3(struct vcpu *v, bool noflush)
     ///
     /// v->arch.cr3
     ///
-    if ( shadow_mode_external(d) )
+    if ( shadow_mode_external(d) && v == current )
     {
-        make_cr3(v, pagetable_get_mfn(v->arch.hvm.monitor_table));
+#ifdef CONFIG_HVM
+        make_cr3(v, _mfn(virt_to_mfn(this_cpu(monitor_pgt))));
+#else
+        ASSERT_UNREACHABLE();
+#endif
     }
 #if SHADOW_PAGING_LEVELS == 4
     else // not shadow_mode_external...
@@ -4106,6 +4120,8 @@ const struct paging_mode sh_paging_mode = {
     .invlpg                        = sh_invlpg,
 #ifdef CONFIG_HVM
     .gva_to_gfn                    = sh_gva_to_gfn,
+    .set_cpu_monitor_table         = set_cpu_monitor_table,
+    .clear_cpu_monitor_table       = clear_cpu_monitor_table,
 #endif
     .update_cr3                    = sh_update_cr3,
     .guest_levels                  = GUEST_PAGING_LEVELS,
diff --git a/xen/arch/x86/mm/shadow/private.h b/xen/arch/x86/mm/shadow/private.h
index a5fc3a7676eb..6743aeefe12e 100644
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -420,8 +420,8 @@ void shadow_unhook_mappings(struct domain *d, mfn_t smfn, int user_only);
  * sh_{make,destroy}_monitor_table() depend only on the number of shadow
  * levels.
  */
-mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels);
-void sh_destroy_monitor_table(const struct vcpu *v, mfn_t mmfn,
+int sh_update_monitor_table(struct vcpu *v, unsigned int shadow_levels);
+void sh_destroy_monitor_table(const struct vcpu *v, mfn_t m3mfn,
                               unsigned int shadow_levels);
 
 /* VRAM dirty tracking helpers. */
-- 
2.45.2




* [PATCH 15/22] x86/idle: allow using a per-pCPU L4
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (13 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 14/22] x86/hvm: use a per-pCPU monitor table in shadow mode Roger Pau Monne
@ 2024-07-26 15:21 ` Roger Pau Monne
  2024-08-21 16:42   ` Alejandro Vallejo
  2024-07-26 15:22 ` [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot Roger Pau Monne
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Introduce support for possibly using a different L4 across the idle vCPUs.

This change only introduces support for loading a per-pCPU idle L4, but even
with the per-CPU idle page-table enabled it should still be a clone of
idle_pg_table, hence no functional change is expected.

Note the idle L4 is not changed after Xen has reached the SYS_STATE_smp_boot
state, hence there is no need to synchronize the contents of the L4 once the
CPUs are started.

Using a per-CPU idle page-table is not strictly required for the Address Space
Isolation work, as idle page tables are never used when running guests.
However it simplifies memory management of the per-CPU mappings, as creating
per-CPU mappings only requires using the idle page-table of the CPU where the
mappings should be created.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
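For illustration only (not part of the patch), the idle root table selection
described in the commit message above can be sketched as a toy model.  The
toy_* names, the table size and the single asi_enabled flag are assumptions
for this example; the patch itself checks opt_asi_pv and opt_asi_hvm and uses
alloc_xenheap_page() and copy_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_L4_ENTRIES 512  /* entries in an x86-64 root page table */

typedef unsigned long toy_pte_t;

/* Stands in for the shared idle_pg_table. */
static toy_pte_t toy_idle_pg_table[TOY_L4_ENTRIES];

/*
 * Root table for an idle vCPU.  Idle vCPU 0 (the BSP) always uses the shared
 * table; other idle vCPUs get a private clone when isolation is enabled.
 * Returns NULL on allocation failure.
 */
static toy_pte_t *toy_idle_root(unsigned int vcpu_id, bool asi_enabled)
{
    toy_pte_t *pgt;

    if ( !asi_enabled || vcpu_id == 0 )
        return toy_idle_pg_table;

    pgt = malloc(sizeof(toy_idle_pg_table));
    if ( !pgt )
        return NULL;

    memcpy(pgt, toy_idle_pg_table, sizeof(toy_idle_pg_table));

    return pgt;
}
```

Since the clone starts out identical to the shared table, switching an AP to
it changes which page is loaded in %cr3 but not the mappings it sees.
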
 xen/arch/x86/boot/x86_64.S       | 11 +++++++++++
 xen/arch/x86/domain.c            | 20 +++++++++++++++++++-
 xen/arch/x86/domain_page.c       |  2 +-
 xen/arch/x86/include/asm/setup.h |  1 +
 xen/arch/x86/setup.c             |  3 +++
 xen/arch/x86/smpboot.c           |  7 +++++++
 6 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S
index 04bb62ae8680..af7854820185 100644
--- a/xen/arch/x86/boot/x86_64.S
+++ b/xen/arch/x86/boot/x86_64.S
@@ -15,6 +15,17 @@ ENTRY(__high_start)
         mov     $XEN_MINIMAL_CR4,%rcx
         mov     %rcx,%cr4
 
+        /*
+         * Possibly switch to the per-CPU idle page-tables. Note we cannot
+         * switch earlier as the per-CPU page-tables might be above 4G, and
+         * hence need to load them from 64bit code.
+         */
+        mov     ap_cr3(%rip), %rax
+        test    %rax, %rax
+        jz      .L_skip_cr3
+        mov     %rax, %cr3
+.L_skip_cr3:
+
         mov     stack_start(%rip),%rsp
 
         /* Reset EFLAGS (subsumes CLI and CLD). */
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 9cfcf0dc63f3..b62c4311da6c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -555,6 +555,7 @@ void arch_vcpu_regs_init(struct vcpu *v)
 int arch_vcpu_create(struct vcpu *v)
 {
     struct domain *d = v->domain;
+    root_pgentry_t *pgt = NULL;
     int rc;
 
     v->arch.flags = TF_kernel_mode;
@@ -589,7 +590,23 @@ int arch_vcpu_create(struct vcpu *v)
     else
     {
         /* Idle domain */
-        v->arch.cr3 = __pa(idle_pg_table);
+        if ( (opt_asi_pv || opt_asi_hvm) && v->vcpu_id )
+        {
+            pgt = alloc_xenheap_page();
+
+            /*
+             * For the idle vCPU 0 (the BSP idle vCPU) use idle_pg_table
+             * directly, there's no need to create yet another copy.
+             */
+            rc = -ENOMEM;
+            if ( !pgt )
+                goto fail;
+
+            copy_page(pgt, idle_pg_table);
+            v->arch.cr3 = __pa(pgt);
+        }
+        else
+            v->arch.cr3 = __pa(idle_pg_table);
         rc = 0;
         v->arch.msrs = ZERO_BLOCK_PTR; /* Catch stray misuses */
     }
@@ -611,6 +628,7 @@ int arch_vcpu_create(struct vcpu *v)
     vcpu_destroy_fpu(v);
     xfree(v->arch.msrs);
     v->arch.msrs = NULL;
+    free_xenheap_page(pgt);
 
     return rc;
 }
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..99b78af90fd3 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -51,7 +51,7 @@ static inline struct vcpu *mapcache_current_vcpu(void)
         if ( (v = idle_vcpu[smp_processor_id()]) == current )
             sync_local_execstate();
         /* We must now be running on the idle page table. */
-        ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
+        ASSERT(cr3_pa(read_cr3()) == cr3_pa(v->arch.cr3));
     }
 
     return v;
diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h
index d75589178b91..a8452fce8f05 100644
--- a/xen/arch/x86/include/asm/setup.h
+++ b/xen/arch/x86/include/asm/setup.h
@@ -14,6 +14,7 @@ extern unsigned long xenheap_initial_phys_start;
 extern uint64_t boot_tsc_stamp;
 
 extern void *stack_start;
+extern unsigned long ap_cr3;
 
 void early_cpu_init(bool verbose);
 void early_time_init(void);
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index bc387d96b519..c5a13b30daf4 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -158,6 +158,9 @@ char asmlinkage __section(".init.bss.stack_aligned") __aligned(STACK_SIZE)
 /* Used by the BSP/AP paths to find the higher half stack mapping to use. */
 void *stack_start = cpu0_stack + STACK_SIZE - sizeof(struct cpu_info);
 
+/* cr3 value for the AP to load on boot. */
+unsigned long ap_cr3;
+
 /* Used by the boot asm to stash the relocated multiboot info pointer. */
 unsigned int asmlinkage __initdata multiboot_ptr;
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 8aa621533f3d..e07add36b1b6 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -581,6 +581,13 @@ static int do_boot_cpu(int apicid, int cpu)
 
     stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
 
+    /*
+     * If per-CPU idle root page table has been allocated, switch to it as
+     * part of the AP bringup trampoline.
+     */
+    ap_cr3 = idle_vcpu[cpu]->arch.cr3 != __pa(idle_pg_table) ?
+             idle_vcpu[cpu]->arch.cr3 : 0;
+
     /* This grunge runs the startup process for the targeted processor. */
 
     set_cpu_state(CPU_STATE_INIT);
-- 
2.45.2




* [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (14 preceding siblings ...)
  2024-07-26 15:21 ` [PATCH 15/22] x86/idle: allow using a per-pCPU L4 Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-08-16 18:40   ` Alejandro Vallejo
  2024-07-26 15:22 ` [PATCH 17/22] x86/mm: introduce support to populate a per-CPU page-table region Roger Pau Monne
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

So far L4 slot 260 has always been per-domain, in other words: all vCPUs of a
domain share the same L3 entry.  Currently only 3 slots are used in that L3
table, which leaves plenty of room.

Introduce a per-CPU L3 that's used when the domain has Address Space Isolation
enabled.  Such a per-CPU L3 is currently populated using the same L3 entries
present on the per-domain L3 (d->arch.perdomain_l3_pg).

No functional change expected, as the per-CPU L3 is always a copy of the
contents of d->arch.perdomain_l3_pg.

Note that all the per-domain L3 entries are populated at domain create, and
hence there's no need to sync the state of the per-CPU L3 as the domain won't
yet be running when the L3 is modified.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
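Illustrative note (not part of the patch): a small user-space model of how a
per-CPU L3 can be refilled from a domain's recorded L2 pages each time the
per-domain slot is set up.  All model_* names and the entry encoding (frame
number shifted by 12, bit 0 as "present") are assumptions made for the
example, not Xen's real l3e_from_page() encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PERDOMAIN_SLOTS 3
#define MODEL_L3_ENTRIES      512
#define MODEL_PRESENT         0x1u

/* Per-domain record of the L2 frames backing each slot; 0 means unset. */
struct model_domain {
    uint64_t l2_frames[MODEL_PERDOMAIN_SLOTS];
};

/*
 * Overwrite the per-domain slots of a per-CPU L3 with the entries of
 * domain d.  Slots the domain does not use are explicitly emptied so that
 * entries left behind by the previously running domain do not leak.
 */
static void model_populate_perdomain(const struct model_domain *d,
                                     uint64_t *l3)
{
    for ( size_t i = 0; i < MODEL_PERDOMAIN_SLOTS; i++ )
        l3[i] = d->l2_frames[i] ? ((d->l2_frames[i] << 12) | MODEL_PRESENT)
                                : 0;
}
```

Writing an empty entry for unused slots is the important detail: the same
L3 page is reused for every domain that runs on the CPU.
---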
 xen/arch/x86/include/asm/domain.h |  2 +
 xen/arch/x86/include/asm/mm.h     |  4 ++
 xen/arch/x86/mm.c                 | 80 +++++++++++++++++++++++++++++--
 xen/arch/x86/setup.c              |  8 ++++
 xen/arch/x86/smpboot.c            |  4 ++
 5 files changed, 95 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 8c366be8c75f..7620a352b9e3 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -313,6 +313,8 @@ struct arch_domain
 {
     struct page_info *perdomain_l3_pg;
 
+    struct page_info *perdomain_l2_pgs[PERDOMAIN_SLOTS];
+
 #ifdef CONFIG_PV32
     unsigned int hv_compat_vstart;
 #endif
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 2c309f7b1444..34407fb0af06 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -633,4 +633,8 @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr)
 /* Setup the per-domain slot in the root page table pointer. */
 void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt);
 
+/* Allocate a per-CPU local L3 table to use in the per-domain slot. */
+int allocate_perdomain_local_l3(unsigned int cpu);
+void free_perdomain_local_l3(unsigned int cpu);
+
 #endif /* __ASM_X86_MM_H__ */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 13aa15f4db22..1367f3361ffe 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6079,6 +6079,12 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
         l2tab = __map_domain_page(pg);
         clear_page(l2tab);
         l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW);
+        /*
+         * Keep a reference to the per-domain L3 entries in case a per-CPU L3
+         * is in use (as opposed to using perdomain_l3_pg).
+         */
+        ASSERT(!d->creation_finished);
+        d->arch.perdomain_l2_pgs[l3_table_offset(va)] = pg;
     }
     else
         l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
@@ -6368,11 +6374,79 @@ unsigned long get_upper_mfn_bound(void)
     return min(max_mfn, 1UL << (paddr_bits - PAGE_SHIFT)) - 1;
 }
 
+static DEFINE_PER_CPU(l3_pgentry_t *, local_l3);
+
+static void populate_perdomain(const struct domain *d, l4_pgentry_t *l4,
+                               l3_pgentry_t *l3)
+{
+    unsigned int i;
+
+    /* Populate the per-CPU L3 with the per-domain entries. */
+    for ( i = 0; i < ARRAY_SIZE(d->arch.perdomain_l2_pgs); i++ )
+    {
+        const struct page_info *pg = d->arch.perdomain_l2_pgs[i];
+
+        BUILD_BUG_ON(ARRAY_SIZE(d->arch.perdomain_l2_pgs) >
+                     L3_PAGETABLE_ENTRIES);
+        l3e_write(&l3[i], pg ? l3e_from_page(pg, __PAGE_HYPERVISOR_RW)
+                             : l3e_empty());
+    }
+
+    l4e_write(&l4[l4_table_offset(PERDOMAIN_VIRT_START)],
+              l4e_from_mfn(virt_to_mfn(l3), __PAGE_HYPERVISOR_RW));
+}
+
+int allocate_perdomain_local_l3(unsigned int cpu)
+{
+    const struct domain *d = idle_vcpu[cpu]->domain;
+    l3_pgentry_t *l3;
+    root_pgentry_t *root_pgt = maddr_to_virt(idle_vcpu[cpu]->arch.cr3);
+
+    ASSERT(!per_cpu(local_l3, cpu));
+
+    if ( !opt_asi_pv && !opt_asi_hvm )
+        return 0;
+
+    l3 = alloc_xenheap_page();
+    if ( !l3 )
+        return -ENOMEM;
+
+    clear_page(l3);
+
+    /* Setup the idle domain slots (current domain) in the L3. */
+    populate_perdomain(d, root_pgt, l3);
+
+    per_cpu(local_l3, cpu) = l3;
+
+    return 0;
+}
+
+void free_perdomain_local_l3(unsigned int cpu)
+{
+    l3_pgentry_t *l3 = per_cpu(local_l3, cpu);
+
+    if ( !l3 )
+        return;
+
+    per_cpu(local_l3, cpu) = NULL;
+    free_xenheap_page(l3);
+}
+
 void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt)
 {
-    l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
-              l4e_from_page(v->domain->arch.perdomain_l3_pg,
-                            __PAGE_HYPERVISOR_RW));
+    const struct domain *d = v->domain;
+
+    if ( d->arch.asi )
+    {
+        l3_pgentry_t *l3 = this_cpu(local_l3);
+
+        ASSERT(l3);
+        populate_perdomain(d, root_pgt, l3);
+    }
+    else if ( is_hvm_domain(d) || d->arch.pv.xpti )
+        l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
+                  l4e_from_page(v->domain->arch.perdomain_l3_pg,
+                                __PAGE_HYPERVISOR_RW));
 
     if ( !is_pv_64bit_vcpu(v) )
         /*
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index c5a13b30daf4..5bf81b81b46f 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1961,6 +1961,14 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 
     alternative_branches();
 
+    /*
+     * Setup the local per-domain L3 for the BSP also, so it matches the state
+     * of the APs.
+     */
+    ret = allocate_perdomain_local_l3(0);
+    if ( ret )
+        panic("Error %d setting up local per-domain L3\n", ret);
+
     /*
      * NB: when running as a PV shim VCPUOP_up/down is wired to the shim
      * physical cpu_add/remove functions, so launch the guest with only
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index e07add36b1b6..40cc14799252 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -986,6 +986,7 @@ static void cpu_smpboot_free(unsigned int cpu, bool remove)
     }
 
     cleanup_cpu_root_pgt(cpu);
+    free_perdomain_local_l3(cpu);
 
     if ( per_cpu(stubs.addr, cpu) )
     {
@@ -1100,6 +1101,9 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     per_cpu(stubs.addr, cpu) = stub_page + STUB_BUF_CPU_OFFS(cpu);
 
     rc = setup_cpu_root_pgt(cpu);
+    if ( rc )
+        goto out;
+    rc = allocate_perdomain_local_l3(cpu);
     if ( rc )
         goto out;
     rc = -ENOMEM;
-- 
2.45.2




* [PATCH 17/22] x86/mm: introduce support to populate a per-CPU page-table region
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (15 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 18/22] x86/mm: allow modifying per-CPU entries of remote page-tables Roger Pau Monne
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Add logic in map_pages_to_xen() and modify_xen_mappings() so that TLB flushes
are only performed locally when dealing with entries in the per-CPU area of the
page-tables.

No functional change intended, as there are no callers added that create or
modify per-CPU mappings, nor is the per-CPU area properly set up in the
page-tables yet.

Note that the removed flush_area() ended up calling flush_area_mask() through
the flush_area_all() alias.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
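Illustrative note (not part of the patch): a user-space sketch of the
global versus per-CPU classification and the resulting flush target.  The
constants are assumptions for the example: the per-domain area starts at L4
slot 260, each slot spans 1GiB, and the per-CPU area is the single slot
following the 3 per-domain ones.  CPU masks are modelled as plain 64-bit
bitmaps instead of cpumask_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SLOT_BYTES      (1ULL << 30)
#define MODEL_PERDOMAIN_START 0xffff820000000000ULL /* L4 slot 260 */
#define MODEL_PERDOMAIN_SLOTS 3
#define MODEL_PERCPU_SLOTS    1
#define MODEL_PERCPU_START \
    (MODEL_PERDOMAIN_START + MODEL_PERDOMAIN_SLOTS * MODEL_SLOT_BYTES)
#define MODEL_PERCPU_END \
    (MODEL_PERCPU_START + MODEL_PERCPU_SLOTS * MODEL_SLOT_BYTES)

/* An address is global unless it falls inside the per-CPU area. */
static bool model_is_global(uint64_t virt)
{
    return virt < MODEL_PERCPU_START || virt >= MODEL_PERCPU_END;
}

/*
 * Global mappings are visible to every CPU, so all online CPUs must be
 * flushed.  Per-CPU mappings are only reachable from the local root
 * page-table, so flushing the local CPU is sufficient.
 */
static uint64_t model_flush_mask(uint64_t virt, uint64_t online_mask,
                                 unsigned int this_cpu)
{
    return model_is_global(virt) ? online_mask : (1ULL << this_cpu);
}
```

The same classification also decides whether map_pgdir_lock is taken in
the patch: per-CPU entries are not shared, hence need no global lock.
---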
 xen/arch/x86/include/asm/config.h   |  4 ++
 xen/arch/x86/include/asm/flushtlb.h |  1 -
 xen/arch/x86/mm.c                   | 64 +++++++++++++++++++----------
 3 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h
index 2a260a2581fd..c24d735a0cee 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -204,6 +204,10 @@ extern unsigned char boot_edid_info[128];
 #define PERDOMAIN_SLOTS         3
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
                                  (PERDOMAIN_SLOT_MBYTES << 20))
+#define PERCPU_VIRT_START       PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS)
+#define PERCPU_SLOTS            1
+#define PERCPU_VIRT_SLOT(s)     (PERCPU_VIRT_START + (s) * \
+                                 (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
 #define PERDOMAIN_ALT_VIRT_START PML4_ADDR(4)
 /* Slot 261: machine-to-phys conversion table (256GB). */
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index 1b98d03decdc..affe944d1a5b 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -146,7 +146,6 @@ void flush_area_mask(const cpumask_t *mask, const void *va,
 #define flush_mask(mask, flags) flush_area_mask(mask, NULL, flags)
 
 /* Flush all CPUs' TLBs/caches */
-#define flush_area_all(va, flags) flush_area_mask(&cpu_online_map, va, flags)
 #define flush_all(flags) flush_mask(&cpu_online_map, flags)
 
 /* Flush local TLBs */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1367f3361ffe..c468b46a9d1b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5023,9 +5023,13 @@ static DEFINE_SPINLOCK(map_pgdir_lock);
  */
 static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
 {
+    unsigned int cpu = smp_processor_id();
+    /* Called before idle_vcpu is populated, fallback to idle_pg_table. */
+    root_pgentry_t *root_pgt = idle_vcpu[cpu] ?
+        maddr_to_virt(idle_vcpu[cpu]->arch.cr3) : idle_pg_table;
     l4_pgentry_t *pl4e;
 
-    pl4e = &idle_pg_table[l4_table_offset(v)];
+    pl4e = &root_pgt[l4_table_offset(v)];
     if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) )
     {
         bool locking = system_state > SYS_STATE_boot;
@@ -5138,8 +5142,8 @@ static l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
 #define l1f_to_lNf(f) (((f) & _PAGE_PRESENT) ? ((f) |  _PAGE_PSE) : (f))
 #define lNf_to_l1f(f) (((f) & _PAGE_PRESENT) ? ((f) & ~_PAGE_PSE) : (f))
 
-/* flush_area_all() can be used prior to any other CPU being online.  */
-#define flush_area(v, f) flush_area_all((const void *)(v), f)
+/* flush_area_mask() can be used prior to any other CPU being online.  */
+#define flush_area_mask(m, v, f) flush_area_mask(m, (const void *)(v), f)
 
 #define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
 
@@ -5222,7 +5226,11 @@ int map_pages_to_xen(
     unsigned long nr_mfns,
     unsigned int flags)
 {
-    bool locking = system_state > SYS_STATE_boot;
+    bool global = virt < PERCPU_VIRT_START ||
+                  virt >= PERCPU_VIRT_SLOT(PERCPU_SLOTS);
+    bool locking = system_state > SYS_STATE_boot && global;
+    const cpumask_t *flush_mask = global ? &cpu_online_map
+                                         : cpumask_of(smp_processor_id());
     l3_pgentry_t *pl3e = NULL, ol3e;
     l2_pgentry_t *pl2e = NULL, ol2e;
     l1_pgentry_t *pl1e, ol1e;
@@ -5244,6 +5252,11 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    /* Ensure it's a global mapping or it's only modifying the per-CPU area. */
+    ASSERT(global ||
+           (virt + nr_mfns * PAGE_SIZE >= PERCPU_VIRT_START &&
+            virt + nr_mfns * PAGE_SIZE <  PERCPU_VIRT_SLOT(PERCPU_SLOTS)));
+
     L3T_INIT(current_l3page);
 
     while ( nr_mfns != 0 )
@@ -5278,7 +5291,7 @@ int map_pages_to_xen(
                 if ( l3e_get_flags(ol3e) & _PAGE_PSE )
                 {
                     flush_flags(lNf_to_l1f(l3e_get_flags(ol3e)));
-                    flush_area(virt, flush_flags);
+                    flush_area_mask(flush_mask, virt, flush_flags);
                 }
                 else
                 {
@@ -5301,7 +5314,7 @@ int map_pages_to_xen(
                             unmap_domain_page(l1t);
                         }
                     }
-                    flush_area(virt, flush_flags);
+                    flush_area_mask(flush_mask, virt, flush_flags);
                     for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                     {
                         ol2e = l2t[i];
@@ -5373,7 +5386,7 @@ int map_pages_to_xen(
             }
             if ( locking )
                 spin_unlock(&map_pgdir_lock);
-            flush_area(virt, flush_flags);
+            flush_area_mask(flush_mask, virt, flush_flags);
 
             free_xen_pagetable(l2mfn);
         }
@@ -5399,7 +5412,7 @@ int map_pages_to_xen(
                 if ( l2e_get_flags(ol2e) & _PAGE_PSE )
                 {
                     flush_flags(lNf_to_l1f(l2e_get_flags(ol2e)));
-                    flush_area(virt, flush_flags);
+                    flush_area_mask(flush_mask, virt, flush_flags);
                 }
                 else
                 {
@@ -5407,7 +5420,7 @@ int map_pages_to_xen(
 
                     for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                         flush_flags(l1e_get_flags(l1t[i]));
-                    flush_area(virt, flush_flags);
+                    flush_area_mask(flush_mask, virt, flush_flags);
                     unmap_domain_page(l1t);
                     free_xen_pagetable(l2e_get_mfn(ol2e));
                 }
@@ -5476,7 +5489,7 @@ int map_pages_to_xen(
                 }
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
-                flush_area(virt, flush_flags);
+                flush_area_mask(flush_mask, virt, flush_flags);
 
                 free_xen_pagetable(l1mfn);
             }
@@ -5491,7 +5504,7 @@ int map_pages_to_xen(
                 unsigned int flush_flags = FLUSH_TLB | FLUSH_ORDER(0);
 
                 flush_flags(l1e_get_flags(ol1e));
-                flush_area(virt, flush_flags);
+                flush_area_mask(flush_mask, virt, flush_flags);
             }
 
             virt    += 1UL << L1_PAGETABLE_SHIFT;
@@ -5540,9 +5553,9 @@ int map_pages_to_xen(
                     l2e_write(pl2e, l2e_from_pfn(base_mfn, l1f_to_lNf(flags)));
                     if ( locking )
                         spin_unlock(&map_pgdir_lock);
-                    flush_area(virt - PAGE_SIZE,
-                               FLUSH_TLB_GLOBAL |
-                               FLUSH_ORDER(PAGETABLE_ORDER));
+                    flush_area_mask(flush_mask, virt - PAGE_SIZE,
+                                    FLUSH_TLB_GLOBAL |
+                                    FLUSH_ORDER(PAGETABLE_ORDER));
                     free_xen_pagetable(l2e_get_mfn(ol2e));
                 }
                 else if ( locking )
@@ -5589,9 +5602,9 @@ int map_pages_to_xen(
                 l3e_write(pl3e, l3e_from_pfn(base_mfn, l1f_to_lNf(flags)));
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
-                flush_area(virt - PAGE_SIZE,
-                           FLUSH_TLB_GLOBAL |
-                           FLUSH_ORDER(2*PAGETABLE_ORDER));
+                flush_area_mask(flush_mask, virt - PAGE_SIZE,
+                                FLUSH_TLB_GLOBAL |
+                                FLUSH_ORDER(2*PAGETABLE_ORDER));
                 free_xen_pagetable(l3e_get_mfn(ol3e));
             }
             else if ( locking )
@@ -5629,7 +5642,11 @@ int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns)
  */
 int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
 {
-    bool locking = system_state > SYS_STATE_boot;
+    bool global = s < PERCPU_VIRT_START ||
+                  s >= PERCPU_VIRT_SLOT(PERCPU_SLOTS);
+    bool locking = system_state > SYS_STATE_boot && global;
+    const cpumask_t *flush_mask = global ? &cpu_online_map
+                                         : cpumask_of(smp_processor_id());
     l3_pgentry_t *pl3e = NULL;
     l2_pgentry_t *pl2e = NULL;
     l1_pgentry_t *pl1e;
@@ -5638,6 +5655,9 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     int rc = -ENOMEM;
     struct page_info *current_l3page;
 
+    ASSERT(global ||
+           (e >= PERCPU_VIRT_START && e < PERCPU_VIRT_SLOT(PERCPU_SLOTS)));
+
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
     nf &= FLAGS_MASK;
@@ -5836,7 +5856,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 l2e_write(pl2e, l2e_empty());
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
-                flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */
+                /* flush before free */
+                flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL);
                 free_xen_pagetable(l1mfn);
             }
             else if ( locking )
@@ -5880,7 +5901,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 l3e_write(pl3e, l3e_empty());
                 if ( locking )
                     spin_unlock(&map_pgdir_lock);
-                flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */
+                /* flush before free */
+                flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL);
                 free_xen_pagetable(l2mfn);
             }
             else if ( locking )
@@ -5888,7 +5910,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
         }
     }
 
-    flush_area(NULL, FLUSH_TLB_GLOBAL);
+    flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
     rc = 0;
-- 
2.45.2




* [PATCH 18/22] x86/mm: allow modifying per-CPU entries of remote page-tables
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (16 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 17/22] x86/mm: introduce support to populate a per-CPU page-table region Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 19/22] x86/mm: introduce a per-CPU fixmap area Roger Pau Monne
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Add support for modifying the per-CPU page-table entries of remote CPUs.  This
will be required in order to set up the page-tables of CPUs before bringing
them up.  A restriction is added so that remote page-tables can only be
modified as long as the remote CPU is not yet online.

No functional change, as there's no user introduced that modifies remote
page-tables.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Can be merged with previous patch?
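
Illustrative note (not part of the patch): a user-space sketch of the
wrapper pattern and of the "local CPU, or remote but offline" restriction.
All model_* names are hypothetical.  Unlike the patch, which enforces the
restriction with an ASSERT(), the sketch returns an error so the rule can be
exercised without aborting.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

static bool model_cpu_online[MODEL_NR_CPUS];
static unsigned int model_current_cpu;
/* One per-CPU "page-table entry" per CPU, enough to show the rule. */
static unsigned long model_percpu_entry[MODEL_NR_CPUS];

/* Remote tables may only be touched while their CPU is not running. */
static bool model_may_modify(unsigned int cpu)
{
    return cpu == model_current_cpu || !model_cpu_online[cpu];
}

/* Explicit-CPU variant: returns 0 on success, -1 if not permitted. */
static int model_map_cpu(unsigned long val, unsigned int cpu)
{
    if ( cpu >= MODEL_NR_CPUS || !model_may_modify(cpu) )
        return -1;

    model_percpu_entry[cpu] = val;

    return 0;
}

/* Existing interface kept as a thin wrapper targeting the local CPU. */
static int model_map(unsigned long val)
{
    return model_map_cpu(val, model_current_cpu);
}
```

Keeping the old entry point as a wrapper means existing callers are
unaffected, which is why the patch carries no functional change.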
---
 xen/arch/x86/include/asm/mm.h | 15 ++++++++++
 xen/arch/x86/mm.c             | 55 ++++++++++++++++++++++++++---------
 2 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 34407fb0af06..f883468b1a7c 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -637,4 +637,19 @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt);
 int allocate_perdomain_local_l3(unsigned int cpu);
 void free_perdomain_local_l3(unsigned int cpu);
 
+/* Specify the CPU idle root page-table to use for modifications. */
+int map_pages_to_xen_cpu(
+    unsigned long virt,
+    mfn_t mfn,
+    unsigned long nr_mfns,
+    unsigned int flags,
+    unsigned int cpu);
+int modify_xen_mappings_cpu(unsigned long s, unsigned long e, unsigned int nf,
+                            unsigned int cpu);
+static inline int destroy_xen_mappings_cpu(unsigned long s, unsigned long e,
+                                           unsigned int cpu)
+{
+    return modify_xen_mappings_cpu(s, e, _PAGE_NONE, cpu);
+}
+
 #endif /* __ASM_X86_MM_H__ */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c468b46a9d1b..faf2d42745d1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5021,9 +5021,8 @@ static DEFINE_SPINLOCK(map_pgdir_lock);
  * For virt_to_xen_lXe() functions, they take a linear address and return a
  * pointer to Xen's LX entry. Caller needs to unmap the pointer.
  */
-static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
+static l3_pgentry_t *virt_to_xen_l3e_cpu(unsigned long v, unsigned int cpu)
 {
-    unsigned int cpu = smp_processor_id();
     /* Called before idle_vcpu is populated, fallback to idle_pg_table. */
     root_pgentry_t *root_pgt = idle_vcpu[cpu] ?
         maddr_to_virt(idle_vcpu[cpu]->arch.cr3) : idle_pg_table;
@@ -5062,11 +5061,16 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
     return map_l3t_from_l4e(*pl4e) + l3_table_offset(v);
 }
 
-static l2_pgentry_t *virt_to_xen_l2e(unsigned long v)
+static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
+{
+    return virt_to_xen_l3e_cpu(v, smp_processor_id());
+}
+
+static l2_pgentry_t *virt_to_xen_l2e_cpu(unsigned long v, unsigned int cpu)
 {
     l3_pgentry_t *pl3e, l3e;
 
-    pl3e = virt_to_xen_l3e(v);
+    pl3e = virt_to_xen_l3e_cpu(v, cpu);
     if ( !pl3e )
         return NULL;
 
@@ -5100,11 +5104,11 @@ static l2_pgentry_t *virt_to_xen_l2e(unsigned long v)
     return map_l2t_from_l3e(l3e) + l2_table_offset(v);
 }
 
-static l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
+static l1_pgentry_t *virt_to_xen_l1e_cpu(unsigned long v, unsigned int cpu)
 {
     l2_pgentry_t *pl2e, l2e;
 
-    pl2e = virt_to_xen_l2e(v);
+    pl2e = virt_to_xen_l2e_cpu(v, cpu);
     if ( !pl2e )
         return NULL;
 
@@ -5220,17 +5224,18 @@ mfn_t xen_map_to_mfn(unsigned long va)
     return ret;
 }
 
-int map_pages_to_xen(
+int map_pages_to_xen_cpu(
     unsigned long virt,
     mfn_t mfn,
     unsigned long nr_mfns,
-    unsigned int flags)
+    unsigned int flags,
+    unsigned int cpu)
 {
     bool global = virt < PERCPU_VIRT_START ||
                   virt >= PERCPU_VIRT_SLOT(PERCPU_SLOTS);
     bool locking = system_state > SYS_STATE_boot && global;
     const cpumask_t *flush_mask = global ? &cpu_online_map
-                                         : cpumask_of(smp_processor_id());
+                                         : cpumask_of(cpu);
     l3_pgentry_t *pl3e = NULL, ol3e;
     l2_pgentry_t *pl2e = NULL, ol2e;
     l1_pgentry_t *pl1e, ol1e;
@@ -5257,6 +5262,9 @@ int map_pages_to_xen(
            (virt + nr_mfns * PAGE_SIZE >= PERCPU_VIRT_START &&
             virt + nr_mfns * PAGE_SIZE <  PERCPU_VIRT_SLOT(PERCPU_SLOTS)));
 
+    /* Only allow modifying remote page-tables if the CPU is not online. */
+    ASSERT(cpu == smp_processor_id() || !cpu_online(cpu));
+
     L3T_INIT(current_l3page);
 
     while ( nr_mfns != 0 )
@@ -5266,7 +5274,7 @@ int map_pages_to_xen(
         UNMAP_DOMAIN_PAGE(pl3e);
         UNMAP_DOMAIN_PAGE(pl2e);
 
-        pl3e = virt_to_xen_l3e(virt);
+        pl3e = virt_to_xen_l3e_cpu(virt, cpu);
         if ( !pl3e )
             goto out;
 
@@ -5391,7 +5399,7 @@ int map_pages_to_xen(
             free_xen_pagetable(l2mfn);
         }
 
-        pl2e = virt_to_xen_l2e(virt);
+        pl2e = virt_to_xen_l2e_cpu(virt, cpu);
         if ( !pl2e )
             goto out;
 
@@ -5437,7 +5445,7 @@ int map_pages_to_xen(
             /* Normal page mapping. */
             if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
             {
-                pl1e = virt_to_xen_l1e(virt);
+                pl1e = virt_to_xen_l1e_cpu(virt, cpu);
                 if ( pl1e == NULL )
                     goto out;
             }
@@ -5623,6 +5631,16 @@ int map_pages_to_xen(
     return rc;
 }
 
+int map_pages_to_xen(
+    unsigned long virt,
+    mfn_t mfn,
+    unsigned long nr_mfns,
+    unsigned int flags)
+{
+    return map_pages_to_xen_cpu(virt, mfn, nr_mfns, flags, smp_processor_id());
+}
+
+
 int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns)
 {
     return map_pages_to_xen(virt, INVALID_MFN, nr_mfns, MAP_SMALL_PAGES);
@@ -5640,7 +5658,8 @@ int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns)
  *
  * It is an error to call with present flags over an unpopulated range.
  */
-int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
+int modify_xen_mappings_cpu(unsigned long s, unsigned long e, unsigned int nf,
+                            unsigned int cpu)
 {
     bool global = s < PERCPU_VIRT_START ||
                   s >= PERCPU_VIRT_SLOT(PERCPU_SLOTS);
@@ -5658,6 +5677,9 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(global ||
            (e >= PERCPU_VIRT_START && e < PERCPU_VIRT_SLOT(PERCPU_SLOTS)));
 
+    /* Only allow modifying remote page-tables if the CPU is not online. */
+    ASSERT(cpu == smp_processor_id() || !cpu_online(cpu));
+
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
     nf &= FLAGS_MASK;
@@ -5674,7 +5696,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
         UNMAP_DOMAIN_PAGE(pl2e);
         UNMAP_DOMAIN_PAGE(pl3e);
 
-        pl3e = virt_to_xen_l3e(v);
+        pl3e = virt_to_xen_l3e_cpu(v, cpu);
         if ( !pl3e )
             goto out;
 
@@ -5927,6 +5949,11 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
 
 #undef flush_area
 
+int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
+{
+    return modify_xen_mappings_cpu(s, e, nf, smp_processor_id());
+}
+
 int destroy_xen_mappings(unsigned long s, unsigned long e)
 {
     return modify_xen_mappings(s, e, _PAGE_NONE);
-- 
2.45.2




* [PATCH 19/22] x86/mm: introduce a per-CPU fixmap area
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (17 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 18/22] x86/mm: allow modifying per-CPU entries of remote page-tables Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 20/22] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

Introduce the logic to manage a per-CPU fixmap area.  This includes adding a
new set of helpers that are capable of creating mappings in the per-CPU
page-table regions by making use of map_pages_to_xen_cpu().

This per-CPU fixmap area is currently set to use one L3 slot: 1GiB of linear
address space.
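
As an illustration of the address arithmetic only, a minimal standalone sketch
follows.  The base address and the two enum entries are hypothetical
placeholders (the patch itself adds no entries yet); only the page shift and
the one-L3-slot (1GiB) bound are taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT          12
#define EX_L3_PAGETABLE_SHIFT  30 /* One L3 slot covers 1GiB. */

/* Hypothetical base of the per-CPU fixmap area, for illustration only. */
#define EX_PERCPU_FIXADDR UINT64_C(0xffff820000000000)

/* Hypothetical entries; each one reserves a single 4KiB page. */
enum ex_percpu_fixed_addresses {
    EX_FIX_A,
    EX_FIX_B,
    ex_end_of_percpu_fixed_addresses
};

#define EX_PERCPU_FIXADDR_SIZE \
    ((uint64_t)ex_end_of_percpu_fixed_addresses << EX_PAGE_SHIFT)

/* The whole area must fit in the single reserved L3 slot. */
_Static_assert(EX_PERCPU_FIXADDR_SIZE <=
               (UINT64_C(1) << EX_L3_PAGETABLE_SHIFT),
               "per-CPU fixmap exceeds one L3 slot");

/* Translate a fixmap index into its linear address. */
static inline uint64_t ex_percpu_fix_to_virt(unsigned int idx)
{
    assert(idx < ex_end_of_percpu_fixed_addresses);
    return EX_PERCPU_FIXADDR + ((uint64_t)idx << EX_PAGE_SHIFT);
}
```

The same index yields the same linear address on every CPU; what differs is
the page-table each CPU has loaded, and hence the frame found behind it.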

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/fixmap.h | 44 +++++++++++++++++++++++++++++++
 xen/arch/x86/mm.c                 | 16 ++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h
index 516ec3fa6c95..a456c65072d8 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -118,6 +118,50 @@ extern void __set_fixmap_x(
 #define __fix_x_to_virt(x) (FIXADDR_X_TOP - ((x) << PAGE_SHIFT))
 #define fix_x_to_virt(x)   ((void *)__fix_x_to_virt(x))
 
+/* per-CPU fixmap area. */
+enum percpu_fixed_addresses {
+    __end_of_percpu_fixed_addresses
+};
+
+#define PERCPU_FIXADDR_SIZE (__end_of_percpu_fixed_addresses << PAGE_SHIFT)
+#define PERCPU_FIXADDR PERCPU_VIRT_SLOT(0)
+
+static inline void *percpu_fix_to_virt(enum percpu_fixed_addresses idx)
+{
+    BUG_ON(idx >= __end_of_percpu_fixed_addresses);
+    return (void *)PERCPU_FIXADDR + (idx << PAGE_SHIFT);
+}
+
+static inline void percpu_set_fixmap_remote(
+    unsigned int cpu, enum percpu_fixed_addresses idx, mfn_t mfn,
+    unsigned long flags)
+{
+    map_pages_to_xen_cpu((unsigned long)percpu_fix_to_virt(idx), mfn, 1, flags,
+                         cpu);
+}
+
+static inline void percpu_clear_fixmap_remote(
+    unsigned int cpu, enum percpu_fixed_addresses idx)
+{
+    /*
+     * Use map_pages_to_xen_cpu() instead of destroy_xen_mappings_cpu() to
+     * avoid tearing down the intermediate page-tables if empty.
+     */
+    map_pages_to_xen_cpu((unsigned long)percpu_fix_to_virt(idx), INVALID_MFN, 1,
+                         0, cpu);
+}
+
+static inline void percpu_set_fixmap(enum percpu_fixed_addresses idx, mfn_t mfn,
+                                     unsigned long flags)
+{
+    percpu_set_fixmap_remote(smp_processor_id(), idx, mfn, flags);
+}
+
+static inline void percpu_clear_fixmap(enum percpu_fixed_addresses idx)
+{
+    percpu_clear_fixmap_remote(smp_processor_id(), idx);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index faf2d42745d1..937089d203cc 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6467,7 +6467,17 @@ int allocate_perdomain_local_l3(unsigned int cpu)
 
     per_cpu(local_l3, cpu) = l3;
 
-    return 0;
+    /*
+     * Pre-allocate the page-table structures for the per-cpu fixmap.  Some of
+     * the per-cpu fixmap calls might happen in contexts where memory
+     * allocation is not possible.
+     *
+     * Only one L3 slot is currently reserved for the per-CPU fixmap.
+     */
+    BUILD_BUG_ON(PERCPU_FIXADDR_SIZE > (1 << L3_PAGETABLE_SHIFT));
+    return map_pages_to_xen_cpu(PERCPU_VIRT_START, INVALID_MFN,
+                                PFN_DOWN(PERCPU_FIXADDR_SIZE), MAP_SMALL_PAGES,
+                                cpu);
 }
 
 void free_perdomain_local_l3(unsigned int cpu)
@@ -6478,6 +6488,10 @@ void free_perdomain_local_l3(unsigned int cpu)
         return;
 
     per_cpu(local_l3, cpu) = NULL;
+
+    destroy_xen_mappings_cpu(PERCPU_VIRT_START,
+                             PERCPU_VIRT_START + PERCPU_FIXADDR_SIZE, cpu);
+
     free_xenheap_page(l3);
 }
 
-- 
2.45.2




* [PATCH 20/22] x86/pv: allow using a unique per-pCPU root page table (L4)
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (18 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 19/22] x86/mm: introduce a per-CPU fixmap area Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 21/22] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 22/22] x86/mm: zero stack on stack switch or reset Roger Pau Monne
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

When running PV guests it's possible for the guest to use the same root page
table (L4) for all vCPUs, which in turn will result in Xen also using the same
root page table on all pCPUs that are running any of the domain's vCPUs.

When using XPTI Xen switches to a per-CPU shadow L4 when running in guest
context, switching to the fully populated L4 when in Xen context.

Take advantage of this existing shadowing and force the usage of a per-CPU L4
that shadows the guest selected L4 when Address Space Isolation is requested
for PV guests.

The guest L4 is mapped using a per-CPU fixmap entry, which however requires
that the currently loaded L4 has the per-CPU slot set up.  To ensure this,
first switch to the shadow per-CPU L4 with just the Xen slots populated, then
map the guest L4 and copy the contents of the guest controlled slots.
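
The ordering described above can be sketched in standalone form as below.
This is a simplified model, not the code in the patch: page-table entries are
plain integers, the Xen slot range (256-271) is an assumption made for the
example, and special cases such as 32-bit PV guests or the guest-selected
read-only M2P slot are left out.

```c
#include <assert.h>
#include <stdint.h>

#define EX_L4_ENTRIES      512
#define EX_FIRST_XEN_SLOT  256 /* Assumed for illustration. */
#define EX_LAST_XEN_SLOT   271 /* Assumed for illustration. */

typedef uint64_t ex_l4e_t;

/* Slots outside the Xen range are controlled by the guest. */
static int ex_is_guest_slot(unsigned int i)
{
    return i < EX_FIRST_XEN_SLOT || i > EX_LAST_XEN_SLOT;
}

/*
 * Step 1: zap the guest slots of the shadow L4 before loading it, so only
 * the Xen slots (including the per-CPU one) are populated.
 */
static void ex_clear_guest_slots(ex_l4e_t *shadow)
{
    unsigned int i;

    for ( i = 0; i < EX_L4_ENTRIES; i++ )
        if ( ex_is_guest_slot(i) )
            shadow[i] = 0;
}

/*
 * Step 2: once running on the shadow L4 the guest L4 can be mapped and its
 * guest controlled slots copied over.  The Xen slots are left untouched.
 */
static void ex_copy_guest_slots(ex_l4e_t *shadow, const ex_l4e_t *guest)
{
    unsigned int i;

    for ( i = 0; i < EX_L4_ENTRIES; i++ )
        if ( ex_is_guest_slot(i) )
            shadow[i] = guest[i];
}
```

Between the two steps the real code switches CR3 to the shadow L4; the sketch
only models the contents of the table on either side of that switch.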

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c              | 37 +++++++++++++++++++++
 xen/arch/x86/flushtlb.c            |  9 ++++++
 xen/arch/x86/include/asm/current.h | 15 ++++++---
 xen/arch/x86/include/asm/fixmap.h  |  1 +
 xen/arch/x86/include/asm/pv/mm.h   |  8 +++++
 xen/arch/x86/mm.c                  | 47 +++++++++++++++++++++++++++
 xen/arch/x86/pv/domain.c           | 25 ++++++++++++--
 xen/arch/x86/pv/mm.c               | 52 ++++++++++++++++++++++++++++++
 xen/arch/x86/smpboot.c             | 20 +++++++++++-
 9 files changed, 207 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b62c4311da6c..94a42ef29cd1 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -45,6 +45,7 @@
 #include <asm/io.h>
 #include <asm/processor.h>
 #include <asm/desc.h>
+#include <asm/fixmap.h>
 #include <asm/i387.h>
 #include <asm/xstate.h>
 #include <asm/cpuidle.h>
@@ -2110,11 +2111,47 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     local_irq_disable();
 
+    if ( is_pv_domain(prevd) && prevd->arch.asi )
+    {
+        /*
+         * Don't leak the L4 shadow mapping in the per-CPU area.  Can't be done
+         * in paravirt_ctxt_switch_from() because the lazy idle vCPU context
+         * switch would otherwise enter an infinite loop in
+         * mapcache_current_vcpu() with sync_local_execstate().
+         *
+         * Note clearing the fixmap must strictly be done ahead of changing the
+         * current vCPU and with interrupts disabled, so there's no window
+         * where current->domain->arch.asi == true and PCPU_FIX_PV_L4SHADOW is
+         * not mapped.
+         */
+        percpu_clear_fixmap(PCPU_FIX_PV_L4SHADOW);
+        get_cpu_info()->root_pgt_changed = false;
+    }
+
     set_current(next);
 
     if ( (per_cpu(curr_vcpu, cpu) == next) ||
          (is_idle_domain(nextd) && cpu_online(cpu)) )
     {
+        if ( is_pv_domain(nextd) && nextd->arch.asi )
+        {
+            /* Signal the fixmap entry must be mapped. */
+            get_cpu_info()->new_cr3 = true;
+            if ( get_cpu_info()->root_pgt_changed )
+            {
+                /*
+                 * Map and update the shadow L4 in case we received any
+                 * FLUSH_ROOT_PGTBL request while running on the idle vCPU.
+                 *
+                 * Do it before enabling interrupts so that no flush IPI can be
+                 * delivered without having PCPU_FIX_PV_L4SHADOW correctly
+                 * mapped.
+                 */
+                pv_update_shadow_l4(next, true);
+                get_cpu_info()->root_pgt_changed = false;
+            }
+        }
+
         local_irq_enable();
     }
     else
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index fd5ed16ffb57..b85ce232abbb 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -17,6 +17,7 @@
 #include <asm/nops.h>
 #include <asm/page.h>
 #include <asm/pv/domain.h>
+#include <asm/pv/mm.h>
 #include <asm/spec_ctrl.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
@@ -192,7 +193,15 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     unsigned int order = (flags - 1) & FLUSH_ORDER_MASK;
 
     if ( flags & FLUSH_ROOT_PGTBL )
+    {
+        const struct vcpu *curr = current;
+        const struct domain *curr_d = curr->domain;
+
         get_cpu_info()->root_pgt_changed = true;
+        if ( is_pv_domain(curr_d) && curr_d->arch.asi )
+            /* Update the shadow root page-table ahead of doing TLB flush. */
+            pv_update_shadow_l4(curr, false);
+    }
 
     if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) )
     {
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index bcec328c9875..6a021607a1a9 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -60,10 +60,14 @@ struct cpu_info {
     uint8_t      scf; /* SCF_* */
 
     /*
-     * The following field controls copying of the L4 page table of 64-bit
-     * PV guests to the per-cpu root page table on entering the guest context.
-     * If set the L4 page table is being copied to the root page table and
-     * the field will be reset.
+     * For XPTI the following field controls copying of the L4 page table of
+     * 64-bit PV guests to the per-cpu root page table on entering the guest
+     * context.  If set the L4 page table is being copied to the root page
+     * table and the field will be reset.
+     *
+     * For ASI the field is used to record whether a FLUSH_ROOT_PGTBL
+     * request has been received when running the idle vCPU on PV guest
+     * page-tables (a lazy context switch to the idle vCPU).
      */
     bool         root_pgt_changed;
 
@@ -74,6 +78,9 @@ struct cpu_info {
      */
     bool         use_pv_cr3;
 
+    /* For ASI: per-CPU fixmap of guest L4 is possibly out of sync. */
+    bool         new_cr3;
+
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h
index a456c65072d8..bc68a98568ae 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -120,6 +120,7 @@ extern void __set_fixmap_x(
 
 /* per-CPU fixmap area. */
 enum percpu_fixed_addresses {
+    PCPU_FIX_PV_L4SHADOW,
     __end_of_percpu_fixed_addresses
 };
 
diff --git a/xen/arch/x86/include/asm/pv/mm.h b/xen/arch/x86/include/asm/pv/mm.h
index 182764542c1f..a7c74898fce0 100644
--- a/xen/arch/x86/include/asm/pv/mm.h
+++ b/xen/arch/x86/include/asm/pv/mm.h
@@ -23,6 +23,9 @@ bool pv_destroy_ldt(struct vcpu *v);
 
 int validate_segdesc_page(struct page_info *page);
 
+void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt);
+void pv_update_shadow_l4(const struct vcpu *v, bool flush);
+
 #else
 
 #include <xen/errno.h>
@@ -44,6 +47,11 @@ static inline bool pv_map_ldt_shadow_page(unsigned int off) { return false; }
 static inline bool pv_destroy_ldt(struct vcpu *v)
 { ASSERT_UNREACHABLE(); return false; }
 
+static inline void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt)
+{ ASSERT_UNREACHABLE(); }
+static inline void pv_update_shadow_l4(const struct vcpu *v, bool flush)
+{ ASSERT_UNREACHABLE(); }
+
 #endif
 
 #endif /* __X86_PV_MM_H__ */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 937089d203cc..8fea7465a9df 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -513,6 +513,8 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
     v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
     if ( is_pv_domain(d) && d->arch.pv.pcid )
         v->arch.cr3 |= get_pcid_bits(v, false);
+    if ( is_pv_domain(d) && d->arch.asi )
+        get_cpu_info()->new_cr3 = true;
 }
 
 void write_ptbase(struct vcpu *v)
@@ -532,6 +534,40 @@ void write_ptbase(struct vcpu *v)
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
         switch_cr3_cr4(v->arch.cr3, new_cr4);
     }
+    else if ( is_pv_domain(d) && d->arch.asi )
+    {
+        root_pgentry_t *root_pgt = this_cpu(root_pgt);
+        unsigned long cr3 = __pa(root_pgt);
+
+        /*
+         * XPTI and ASI cannot be simultaneously used even by different
+         * domains at runtime.
+         */
+        ASSERT(!cpu_info->use_pv_cr3 && !cpu_info->xen_cr3 &&
+               !cpu_info->pv_cr3);
+
+        if ( new_cr4 & X86_CR4_PCIDE )
+            cr3 |= get_pcid_bits(v, false);
+
+        /*
+         * Zap guest L4 entries ahead of flushing the TLB, so that the CPU
+         * cannot speculatively populate the TLB with stale mappings.
+         */
+        pv_clear_l4_guest_entries(root_pgt);
+
+        /*
+         * Switch to the shadow L4 with just the Xen slots populated, the guest
+         * slots will be populated by pv_update_shadow_l4() once running on the
+         * shadow L4.
+         *
+         * The reason for switching to the per-CPU shadow L4 before updating
+         * the guest slots is that pv_update_shadow_l4() uses per-CPU mappings,
+         * and the in-use page-table previous to the switch_cr3_cr4() call
+         * might not support per-CPU mappings.
+         */
+        switch_cr3_cr4(cr3, new_cr4);
+        pv_update_shadow_l4(v, false);
+    }
     else
     {
         ASSERT(!is_hvm_domain(d) || !d->arch.asi
@@ -6505,6 +6541,17 @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt)
 
         ASSERT(l3);
         populate_perdomain(d, root_pgt, l3);
+
+        if ( is_pv_domain(d) )
+        {
+            /*
+             * Abuse the fact that this function is called on vCPU context
+             * switch and clean previous guest controlled slots from the shadow
+             * L4.
+             */
+            pv_clear_l4_guest_entries(root_pgt);
+            get_cpu_info()->new_cr3 = true;
+        }
     }
     else if ( is_hvm_domain(d) || d->arch.pv.xpti )
         l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 46ee10a8a4c2..80bf2bf934dd 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -15,6 +15,7 @@
 #include <asm/invpcid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/pv/domain.h>
+#include <asm/pv/mm.h>
 #include <asm/shadow.h>
 
 #ifdef CONFIG_PV32
@@ -384,7 +385,7 @@ int pv_domain_initialise(struct domain *d)
 
     d->arch.ctxt_switch = &pv_csw;
 
-    d->arch.pv.flush_root_pt = d->arch.pv.xpti;
+    d->arch.pv.flush_root_pt = d->arch.pv.xpti || d->arch.asi;
 
     if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid )
         switch ( ACCESS_ONCE(opt_pcid) )
@@ -446,7 +447,27 @@ static void _toggle_guest_pt(struct vcpu *v)
      * to release). Switch to the idle page tables in such an event; the
      * guest will have been crashed already.
      */
-    cr3 = v->arch.cr3;
+    if ( v->domain->arch.asi )
+    {
+        /*
+         * _toggle_guest_pt() might switch between user and kernel page tables,
+         * but doesn't use write_ptbase(), and hence needs an explicit call to
+         * sync the shadow L4.
+         */
+        cr3 = __pa(this_cpu(root_pgt));
+        if ( v->domain->arch.pv.pcid )
+            cr3 |= get_pcid_bits(v, false);
+        /*
+         * Ensure the current root page table is already the shadow L4, as
+         * guest user/kernel switches can only happen once the guest is
+         * running.
+         */
+        ASSERT(read_cr3() == cr3);
+        pv_update_shadow_l4(v, false);
+    }
+    else
+        cr3 = v->arch.cr3;
+
     if ( shadow_mode_enabled(v->domain) )
     {
         cr3 &= ~X86_CR3_NOFLUSH;
diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index 24f0d2e4ff7d..c20ce099ae27 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -11,6 +11,7 @@
 #include <xen/guest_access.h>
 
 #include <asm/current.h>
+#include <asm/fixmap.h>
 #include <asm/p2m.h>
 
 #include "mm.h"
@@ -103,6 +104,57 @@ void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d)
 }
 #endif
 
+void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt)
+{
+    unsigned int i;
+
+    for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ )
+        l4e_write(&root_pgt[i], l4e_empty());
+    for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1; i < L4_PAGETABLE_ENTRIES; i++ )
+        l4e_write(&root_pgt[i], l4e_empty());
+}
+
+void pv_update_shadow_l4(const struct vcpu *v, bool flush)
+{
+    const root_pgentry_t *guest_pgt = percpu_fix_to_virt(PCPU_FIX_PV_L4SHADOW);
+    root_pgentry_t *shadow_pgt = this_cpu(root_pgt);
+
+    ASSERT(!v->domain->arch.pv.xpti);
+    ASSERT(is_pv_vcpu(v));
+    ASSERT(!is_idle_vcpu(v));
+
+    if ( get_cpu_info()->new_cr3 )
+    {
+        percpu_set_fixmap(PCPU_FIX_PV_L4SHADOW, maddr_to_mfn(v->arch.cr3),
+                          __PAGE_HYPERVISOR_RO);
+        get_cpu_info()->new_cr3 = false;
+    }
+
+    if ( is_pv_32bit_vcpu(v) )
+    {
+        l4e_write(&shadow_pgt[0], guest_pgt[0]);
+        l4e_write(&shadow_pgt[root_table_offset(PERDOMAIN_ALT_VIRT_START)],
+            shadow_pgt[root_table_offset(PERDOMAIN_VIRT_START)]);
+    }
+    else
+    {
+        unsigned int i;
+
+        for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ )
+            l4e_write(&shadow_pgt[i], guest_pgt[i]);
+        for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1;
+              i < L4_PAGETABLE_ENTRIES; i++ )
+            l4e_write(&shadow_pgt[i], guest_pgt[i]);
+
+        /* The presence of this Xen slot is selected by the guest. */
+        l4e_write(&shadow_pgt[l4_table_offset(RO_MPT_VIRT_START)],
+            guest_pgt[l4_table_offset(RO_MPT_VIRT_START)]);
+    }
+
+    if ( flush )
+        flush_local(FLUSH_TLB_GLOBAL);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 40cc14799252..d9841ed3b663 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -829,7 +829,7 @@ int setup_cpu_root_pgt(unsigned int cpu)
     unsigned int off;
     int rc;
 
-    if ( !opt_xpti_hwdom && !opt_xpti_domu )
+    if ( !opt_xpti_hwdom && !opt_xpti_domu && !opt_asi_pv )
         return 0;
 
     rpt = alloc_xenheap_page();
@@ -839,6 +839,18 @@ int setup_cpu_root_pgt(unsigned int cpu)
     clear_page(rpt);
     per_cpu(root_pgt, cpu) = rpt;
 
+    if ( opt_asi_pv )
+    {
+        /*
+         * Populate the Xen slots, the guest ones will be copied from the guest
+         * root page-table.
+         */
+        init_xen_l4_slots(rpt, _mfn(virt_to_mfn(rpt)), INVALID_MFN, NULL,
+                          false, false, true);
+
+        return 0;
+    }
+
     rpt[root_table_offset(RO_MPT_VIRT_START)] =
         idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
     /* SH_LINEAR_PT inserted together with guest mappings. */
@@ -892,6 +904,12 @@ static void cleanup_cpu_root_pgt(unsigned int cpu)
 
     per_cpu(root_pgt, cpu) = NULL;
 
+    if ( opt_asi_pv )
+    {
+        free_xenheap_page(rpt);
+        return;
+    }
+
     for ( r = root_table_offset(DIRECTMAP_VIRT_START);
           r < root_table_offset(HYPERVISOR_VIRT_END); ++r )
     {
-- 
2.45.2




* [PATCH 21/22] x86/mm: switch to a per-CPU mapped stack when using ASI
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (19 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 20/22] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-26 15:22 ` [PATCH 22/22] x86/mm: zero stack on stack switch or reset Roger Pau Monne
  21 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel
  Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper,
	Julien Grall, Stefano Stabellini, Daniel P. Smith,
	Marek Marczykowski-Górecki

When using ASI the CPU stack is mapped using a range of fixmap entries in the
per-CPU region.  This ensures the stack is only accessible by the current CPU.
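
A minimal sketch of how a CPU number could translate into a stack address
within the per-CPU fixmap follows.  The base address and the stack order
(8 pages per stack) are assumptions made for the example; the layout, with
the stacks placed first in the area and one contiguous run of pages per CPU,
follows the PERCPU_STACK_IDX() helper added by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT   12
#define EX_STACK_ORDER  3 /* Assumed: 8 pages per CPU stack. */
#define EX_STACK_PAGES  (1U << EX_STACK_ORDER)

/* Hypothetical base of the per-CPU fixmap area, for illustration only. */
#define EX_PERCPU_FIXADDR UINT64_C(0xffff820000000000)

/* First fixmap index used by the stack of a given CPU. */
static unsigned int ex_stack_idx(unsigned int cpu)
{
    return cpu * EX_STACK_PAGES;
}

/* Linear address of the lowest page of the stack of a given CPU. */
static uint64_t ex_stack_addr(unsigned int cpu)
{
    return EX_PERCPU_FIXADDR + ((uint64_t)ex_stack_idx(cpu) << EX_PAGE_SHIFT);
}
```

Placing the stacks at the start of the area keeps every stack aligned to its
own size as long as the base is, which is what allows the CPU info block to
be located by masking the stack pointer.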

Note however there's further work required in order to allocate the stack from
domheap instead of xenheap, and ensure the stack is not part of the direct
map.

For domains not running with ASI enabled all the CPU stacks are mapped in the
per-domain L3, so that the stack is always at the same linear address,
regardless of whether ASI is enabled or not for the domain.

When calling UEFI runtime methods the current per-domain slot needs to be added
to the EFI L4, so that the stack is available in UEFI.

Finally, some users of callfunc IPIs pass parameters from the stack, so when
handling a callfunc IPI the stack of the caller CPU is mapped into the address
space of the CPU handling the IPI.  This needs further work to use a bounce
buffer in order to avoid having to map remote CPU stacks.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
There's also further work required in order to avoid mapping remote stacks when
handling callfunc IPIs.
---
 xen/arch/x86/domain.c              |  12 +++
 xen/arch/x86/include/asm/current.h |   5 ++
 xen/arch/x86/include/asm/fixmap.h  |   5 ++
 xen/arch/x86/include/asm/mm.h      |   6 +-
 xen/arch/x86/include/asm/smp.h     |  12 +++
 xen/arch/x86/mm.c                  | 125 +++++++++++++++++++++++++++--
 xen/arch/x86/setup.c               |  27 +++++--
 xen/arch/x86/smp.c                 |  29 +++++++
 xen/arch/x86/smpboot.c             |  47 ++++++++++-
 xen/arch/x86/traps.c               |   6 +-
 xen/common/efi/runtime.c           |  12 +++
 xen/common/smp.c                   |  10 +++
 xen/include/xen/smp.h              |   5 ++
 13 files changed, 281 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 94a42ef29cd1..d00ba415877f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -929,6 +929,18 @@ int arch_domain_create(struct domain *d,
 
     d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED;
 
+    if ( !d->arch.asi && (opt_asi_hvm || opt_asi_pv) )
+    {
+        /*
+         * This domain is not using ASI, but other domains on the system
+         * possibly are, hence the CPU stacks are on the per-CPU page-table
+         * region.  Add an L3 entry that has all the stacks mapped.
+         */
+        rc = map_all_stacks(d);
+        if ( rc )
+            goto fail;
+    }
+
     return 0;
 
  fail:
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 6a021607a1a9..75b9a341f814 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -24,6 +24,11 @@
  * 0 - IST Shadow Stacks (4x 1k, read-only)
  */
 
+static inline bool is_shstk_slot(unsigned int i)
+{
+    return (i == 0 || i == PRIMARY_SHSTK_SLOT);
+}
+
 /*
  * Identify which stack page the stack pointer is on.  Returns an index
  * as per the comment above.
diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h
index bc68a98568ae..d52c1886fcdd 100644
--- a/xen/arch/x86/include/asm/fixmap.h
+++ b/xen/arch/x86/include/asm/fixmap.h
@@ -120,6 +120,11 @@ extern void __set_fixmap_x(
 
 /* per-CPU fixmap area. */
 enum percpu_fixed_addresses {
+    /* For alignment reasons the per-CPU stacks must come first. */
+    PCPU_STACK_START,
+    PCPU_STACK_END = PCPU_STACK_START + NR_CPUS * (1U << STACK_ORDER) - 1,
+#define PERCPU_STACK_IDX(c) (PCPU_STACK_START + (c) * (1U << STACK_ORDER))
+#define PERCPU_STACK_ADDR(c) percpu_fix_to_virt(PERCPU_STACK_IDX(c))
     PCPU_FIX_PV_L4SHADOW,
     __end_of_percpu_fixed_addresses
 };
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index f883468b1a7c..b4f1e0399275 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -521,7 +521,7 @@ extern struct rangeset *mmio_ro_ranges;
 #define compat_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20))
 #define compat_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20))
 
-void memguard_guard_stack(void *p);
+void memguard_guard_stack(void *p, unsigned int cpu);
 void memguard_unguard_stack(void *p);
 
 struct mmio_ro_emulate_ctxt {
@@ -652,4 +652,8 @@ static inline int destroy_xen_mappings_cpu(unsigned long s, unsigned long e,
     return modify_xen_mappings_cpu(s, e, _PAGE_NONE, cpu);
 }
 
+/* Setup a per-domain slot that maps all pCPU stacks. */
+int map_all_stacks(struct domain *d);
+int add_stack(const void *stack, unsigned int cpu);
+
 #endif /* __ASM_X86_MM_H__ */
diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index c8c79601343d..a17c609da4b6 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -79,6 +79,18 @@ extern bool unaccounted_cpus;
 
 void *cpu_alloc_stack(unsigned int cpu);
 
+/*
+ * Setup the per-CPU area stack mappings.
+ *
+ * @dest_cpu:  CPU where the mappings are to appear.
+ * @stack_cpu: CPU whose stacks should be mapped.
+ */
+void cpu_set_stack_mappings(unsigned int dest_cpu, unsigned int stack_cpu);
+
+#define HAS_ARCH_SMP_CALLFUNC
+void arch_smp_pre_callfunc(unsigned int cpu);
+void arch_smp_post_callfunc(unsigned int cpu);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8fea7465a9df..67ffdebb595e 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -87,6 +87,7 @@
  * doing the final put_page(), and remove it from the iommu if so.
  */
 
+#include <xen/cpu.h>
 #include <xen/init.h>
 #include <xen/ioreq.h>
 #include <xen/kernel.h>
@@ -6352,31 +6353,40 @@ void free_perdomain_mappings(struct domain *d)
     d->arch.perdomain_l3_pg = NULL;
 }
 
-static void write_sss_token(unsigned long *ptr)
+static void write_sss_token(unsigned long *ptr, unsigned long va)
 {
     /*
      * A supervisor shadow stack token is its own linear address, with the
      * busy bit (0) clear.
      */
-    *ptr = (unsigned long)ptr;
+    *ptr = va;
 }
 
-void memguard_guard_stack(void *p)
+void memguard_guard_stack(void *p, unsigned int cpu)
 {
+    unsigned long va =
+        (opt_asi_hvm || opt_asi_pv) ? (unsigned long)PERCPU_STACK_ADDR(cpu)
+                                    : (unsigned long)p;
+
     /* IST Shadow stacks.  4x 1k in stack page 0. */
     if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
     {
-        write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_DB  * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_DF  * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8,
+                        va + (IST_MCE * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8,
+                        va + (IST_NMI * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_DB  * IST_SHSTK_SIZE) - 8,
+                        va + (IST_DB  * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_DF  * IST_SHSTK_SIZE) - 8,
+                        va + (IST_DF  * IST_SHSTK_SIZE) - 8);
     }
     map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
 
     /* Primary Shadow Stack.  1x 4k in stack page 5. */
     p += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
+    va += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
     if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
-        write_sss_token(p + PAGE_SIZE - 8);
+        write_sss_token(p + PAGE_SIZE - 8, va + PAGE_SIZE - 8);
 
     map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
 }
@@ -6567,6 +6577,105 @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt)
                   root_pgt[root_table_offset(PERDOMAIN_VIRT_START)]);
 }
 
+static struct page_info *l2_all_stacks;
+
+int add_stack(const void *stack, unsigned int cpu)
+{
+    unsigned long va = (unsigned long)PERCPU_STACK_ADDR(cpu);
+    struct page_info *pg;
+    l2_pgentry_t *l2tab = NULL;
+    l1_pgentry_t *l1tab = NULL;
+    unsigned int nr;
+    int rc = 0;
+
+    /*
+     * Assume CPU stack allocation is always serialized, either because it's
+     * done on the BSP during boot, or in case of hotplug, in stop machine
+     * context.
+     */
+    ASSERT(system_state < SYS_STATE_active || cpu_in_hotplug_context());
+
+    if ( !opt_asi_hvm && !opt_asi_pv )
+        return 0;
+
+    if ( !l2_all_stacks )
+    {
+        l2_all_stacks = alloc_domheap_page(NULL, MEMF_no_owner);
+        if ( !l2_all_stacks )
+            return -ENOMEM;
+        l2tab = __map_domain_page(l2_all_stacks);
+        clear_page(l2tab);
+    }
+    else
+        l2tab = __map_domain_page(l2_all_stacks);
+
+    /* Code assumes all the stacks can be mapped with a single L2. */
+    ASSERT(l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_END)) ==
+        l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_START)));
+    for ( nr = 0; nr < (1U << STACK_ORDER); nr++ )
+    {
+        l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
+
+        if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
+        {
+            pg = alloc_domheap_page(NULL, MEMF_no_owner);
+            if ( !pg )
+            {
+                rc = -ENOMEM;
+                break;
+            }
+            l1tab = __map_domain_page(pg);
+            clear_page(l1tab);
+            l2e_write(pl2e, l2e_from_page(pg, __PAGE_HYPERVISOR_RW));
+        }
+        else if ( !l1tab )
+            l1tab = map_l1t_from_l2e(*pl2e);
+
+        l1e_write(&l1tab[l1_table_offset(va)],
+                  l1e_from_mfn(virt_to_mfn(stack),
+                               is_shstk_slot(nr) ? __PAGE_HYPERVISOR_SHSTK
+                                                 : __PAGE_HYPERVISOR_RW));
+
+        va += PAGE_SIZE;
+        stack += PAGE_SIZE;
+
+        if ( !l1_table_offset(va) )
+        {
+            unmap_domain_page(l1tab);
+            l1tab = NULL;
+        }
+    }
+
+    unmap_domain_page(l1tab);
+    unmap_domain_page(l2tab);
+    /*
+     * Don't care to free the intermediate page-tables on failure, can be used
+     * to map other stacks.
+     */
+
+    return rc;
+}
+
+int map_all_stacks(struct domain *d)
+{
+    /*
+     * Create the per-domain L3.  Pass a dummy PERDOMAIN_VIRT_START, but note
+     * only the per-domain L3 is allocated when nr == 0.
+     */
+    int rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
+    l3_pgentry_t *l3tab;
+
+    if ( rc )
+        return rc;
+
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    l3tab[l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_START))]
+        = l3e_from_page(l2_all_stacks, __PAGE_HYPERVISOR_RW);
+    unmap_domain_page(l3tab);
+
+    return 0;
+}
+
 static void __init __maybe_unused build_assertions(void)
 {
     /*
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 5bf81b81b46f..76f7d71b8c1c 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -808,8 +808,6 @@ static void __init noreturn reinit_bsp_stack(void)
     /* Update SYSCALL trampolines */
     percpu_traps_init();
 
-    stack_base[0] = stack;
-
     rc = setup_cpu_root_pgt(0);
     if ( rc )
         panic("Error %d setting up PV root page table\n", rc);
@@ -1771,10 +1769,6 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 
     system_state = SYS_STATE_boot;
 
-    bsp_stack = cpu_alloc_stack(0);
-    if ( !bsp_stack )
-        panic("No memory for BSP stack\n");
-
     console_init_ring();
     vesa_init();
 
@@ -1961,6 +1955,16 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
 
     alternative_branches();
 
+    /*
+     * Alloc the BSP stack closer to the point where the AP ones also get
+     * allocated - and after the speculation mitigations have been initialized.
+     * In order to set up the shadow stack token correctly Xen needs to know
+     * whether per-CPU mapped stacks are being used.
+     */
+    bsp_stack = cpu_alloc_stack(0);
+    if ( !bsp_stack )
+        panic("No memory for BSP stack\n");
+
     /*
      * Setup the local per-domain L3 for the BSP also, so it matches the state
      * of the APs.
@@ -2065,8 +2069,17 @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p)
         info->last_spec_ctrl = default_xen_spec_ctrl;
     }
 
+    stack_base[0] = bsp_stack;
+
     /* Copy the cpu info block, and move onto the BSP stack. */
-    bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack);
+    if ( opt_asi_hvm || opt_asi_pv )
+    {
+        cpu_set_stack_mappings(0, 0);
+        bsp_info = get_cpu_info_from_stack((unsigned long)PERCPU_STACK_ADDR(0));
+    }
+    else
+        bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack);
+
     *bsp_info = *info;
 
     asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" ::
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 04c6a0572319..18a7196195cf 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -22,6 +22,7 @@
 #include <asm/hardirq.h>
 #include <asm/hpet.h>
 #include <asm/setup.h>
+#include <asm/spec_ctrl.h>
 #include <irq_vectors.h>
 #include <mach_apic.h>
 
@@ -433,3 +434,31 @@ long cf_check cpu_down_helper(void *data)
         ret = cpu_down(cpu);
     return ret;
 }
+
+void arch_smp_pre_callfunc(unsigned int cpu)
+{
+    if ( (!opt_asi_pv && !opt_asi_hvm) || cpu == smp_processor_id() ||
+         (!current->domain->arch.asi && !is_idle_vcpu(current)) ||
+        /*
+         * CPU#0 still runs on the .init stack when the APs are started, don't
+         * attempt to map such stack.
+         */
+         (!cpu && system_state < SYS_STATE_active) )
+        return;
+
+    cpu_set_stack_mappings(smp_processor_id(), cpu);
+}
+
+void arch_smp_post_callfunc(unsigned int cpu)
+{
+    unsigned int i;
+
+    if ( (!opt_asi_pv && !opt_asi_hvm) || cpu == smp_processor_id() ||
+         (!current->domain->arch.asi && !is_idle_vcpu(current)) )
+        return;
+
+    for ( i = 0; i < (1U << STACK_ORDER); i++ )
+        percpu_clear_fixmap(PERCPU_STACK_IDX(cpu) + i);
+
+    flush_area_local(PERCPU_STACK_ADDR(cpu), FLUSH_ORDER(STACK_ORDER));
+}
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index d9841ed3b663..548e3102101c 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -579,7 +579,20 @@ static int do_boot_cpu(int apicid, int cpu)
         printk("Booting processor %d/%d eip %lx\n",
                cpu, apicid, start_eip);
 
-    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
+    if ( opt_asi_hvm || opt_asi_pv )
+    {
+        /*
+         * Uniformly run with the stack mapping of the per-CPU area (including
+         * the idle vCPU) if ASI is enabled for any domain type.
+         */
+        cpu_set_stack_mappings(cpu, cpu);
+
+        ASSERT(IS_ALIGNED((unsigned long)PERCPU_STACK_ADDR(cpu), STACK_SIZE));
+
+        stack_start = PERCPU_STACK_ADDR(cpu) + STACK_SIZE - sizeof(struct cpu_info);
+    }
+    else
+        stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
 
     /*
      * If per-CPU idle root page table has been allocated, switch to it as
@@ -1053,11 +1066,41 @@ void *cpu_alloc_stack(unsigned int cpu)
     stack = alloc_xenheap_pages(STACK_ORDER, memflags);
 
     if ( stack )
-        memguard_guard_stack(stack);
+    {
+        int rc = add_stack(stack, cpu);
+
+        if ( rc )
+        {
+            printk(XENLOG_ERR "unable to map stack for CPU %u: %d\n", cpu, rc);
+            free_xenheap_pages(stack, STACK_ORDER);
+            return NULL;
+        }
+        memguard_guard_stack(stack, cpu);
+    }
 
     return stack;
 }
 
+void cpu_set_stack_mappings(unsigned int dest_cpu, unsigned int stack_cpu)
+{
+    unsigned int i;
+
+    for ( i = 0; i < (1U << STACK_ORDER); i++ )
+    {
+        unsigned int flags = (is_shstk_slot(i) ? __PAGE_HYPERVISOR_SHSTK
+                                               : __PAGE_HYPERVISOR_RW) |
+                             (dest_cpu == stack_cpu ? _PAGE_GLOBAL : 0);
+
+        if ( is_shstk_slot(i) && dest_cpu != stack_cpu )
+            continue;
+
+        percpu_set_fixmap_remote(dest_cpu, PERCPU_STACK_IDX(stack_cpu) + i,
+                                 _mfn(virt_to_mfn(stack_base[stack_cpu] +
+                                                  i * PAGE_SIZE)),
+                                 flags);
+    }
+}
+
 static int cpu_smpboot_alloc(unsigned int cpu)
 {
     struct cpu_info *info;
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index b4fb95917023..28513c0e3d6a 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -609,10 +609,12 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
     unsigned long esp = regs->rsp;
     unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1);
     unsigned long esp_top, esp_bottom;
+    const void *stack = current->domain->arch.asi ? PERCPU_STACK_ADDR(cpu)
+                                                  : stack_base[cpu];
 
-    if ( _p(curr_stack_base) != stack_base[cpu] )
+    if ( _p(curr_stack_base) != stack )
         printk("Current stack base %p differs from expected %p\n",
-               _p(curr_stack_base), stack_base[cpu]);
+               _p(curr_stack_base), stack);
 
     esp_bottom = (esp | (STACK_SIZE - 1)) + 1;
     esp_top    = esp_bottom - PRIMARY_STACK_SIZE;
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index d952c3ba785e..3a8233ed62ac 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -32,6 +32,7 @@ void efi_rs_leave(struct efi_rs_state *state);
 
 #ifndef CONFIG_ARM
 # include <asm/i387.h>
+# include <asm/spec_ctrl.h>
 # include <asm/xstate.h>
 # include <public/platform.h>
 #endif
@@ -85,6 +86,7 @@ struct efi_rs_state efi_rs_enter(void)
     static const u16 fcw = FCW_DEFAULT;
     static const u32 mxcsr = MXCSR_DEFAULT;
     struct efi_rs_state state = { .cr3 = 0 };
+    root_pgentry_t *efi_pgt, *idle_pgt;
 
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
@@ -98,6 +100,16 @@ struct efi_rs_state efi_rs_enter(void)
 
     efi_rs_on_cpu = smp_processor_id();
 
+    if ( opt_asi_pv || opt_asi_hvm )
+    {
+        /* Insert the idle per-domain slot for the stack mapping. */
+        efi_pgt = map_domain_page(efi_l4_mfn);
+        idle_pgt = maddr_to_virt(idle_vcpu[efi_rs_on_cpu]->arch.cr3);
+        efi_pgt[root_table_offset(PERDOMAIN_VIRT_START)].l4 =
+            idle_pgt[root_table_offset(PERDOMAIN_VIRT_START)].l4;
+        unmap_domain_page(efi_pgt);
+    }
+
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
diff --git a/xen/common/smp.c b/xen/common/smp.c
index a011f541f1ea..04f5aede0d3d 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -29,6 +29,7 @@ static struct call_data_struct {
     void (*func) (void *info);
     void *info;
     int wait;
+    unsigned int caller;
     cpumask_t selected;
 } call_data;
 
@@ -63,6 +64,7 @@ void on_selected_cpus(
     call_data.func = func;
     call_data.info = info;
     call_data.wait = wait;
+    call_data.caller = smp_processor_id();
 
     smp_send_call_function_mask(&call_data.selected);
 
@@ -82,6 +84,12 @@ void smp_call_function_interrupt(void)
     if ( !cpumask_test_cpu(cpu, &call_data.selected) )
         return;
 
+    /*
+     * TODO: use bounce buffers to pass callfunc data, so that when using ASI
+     * there's no need to map remote CPU stacks.
+     */
+    arch_smp_pre_callfunc(call_data.caller);
+
     irq_enter();
 
     if ( unlikely(!func) )
@@ -102,6 +110,8 @@ void smp_call_function_interrupt(void)
     }
 
     irq_exit();
+
+    arch_smp_post_callfunc(call_data.caller);
 }
 
 /*
diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h
index 2ca9ff1bfcc1..610c279ca24c 100644
--- a/xen/include/xen/smp.h
+++ b/xen/include/xen/smp.h
@@ -76,4 +76,9 @@ extern void *stack_base[NR_CPUS];
 void initialize_cpu_data(unsigned int cpu);
 int setup_cpu_root_pgt(unsigned int cpu);
 
+#ifndef HAS_ARCH_SMP_CALLFUNC
+static inline void arch_smp_pre_callfunc(unsigned int cpu) {}
+static inline void arch_smp_post_callfunc(unsigned int cpu) {}
+#endif
+
 #endif /* __XEN_SMP_H__ */
-- 
2.45.2




* [PATCH 22/22] x86/mm: zero stack on stack switch or reset
  2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (20 preceding siblings ...)
  2024-07-26 15:22 ` [PATCH 21/22] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
@ 2024-07-26 15:22 ` Roger Pau Monne
  2024-07-29 15:40   ` Andrew Cooper
  2024-08-13 13:16   ` Jan Beulich
  21 siblings, 2 replies; 64+ messages in thread
From: Roger Pau Monne @ 2024-07-26 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Roger Pau Monne, Jan Beulich, Andrew Cooper

With the stack mapped on a per-CPU basis there's no risk of other CPUs being
able to read the stack contents, but vCPUs running on the current pCPU could
read stack rubble from operations of previous vCPUs.

The #DF stack is not zeroed because handling of #DF results in a panic.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 75b9a341f814..02b4118b03ef 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define SHADOW_STACK_WORK ""
 #endif
 
+#define ZERO_STACK                                              \
+    "test %[stk_size], %[stk_size];"                            \
+    "jz .L_skip_zeroing.%=;"                                    \
+    "std;"                                                      \
+    "rep stosb;"                                                \
+    "cld;"                                                      \
+    ".L_skip_zeroing.%=:"
+
 #if __GNUC__ >= 9
 # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
 #else
@@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 #define switch_stack_and_jump(fn, instr, constr)                        \
     ({                                                                  \
         unsigned int tmp;                                               \
+        bool zero_stack = current->domain->arch.asi;                    \
         BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
+        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
+                          PRIMARY_STACK_SIZE +                          \
+                          sizeof(struct cpu_info), PAGE_SIZE));         \
+        if ( zero_stack )                                               \
+        {                                                               \
+            unsigned long stack_top = get_stack_bottom() &              \
+                                      ~(STACK_SIZE - 1);                \
+                                                                        \
+            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
+            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
+            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
+        }                                                               \
         __asm__ __volatile__ (                                          \
             SHADOW_STACK_WORK                                           \
             "mov %[stk], %%rsp;"                                        \
+            ZERO_STACK                                                  \
             CHECK_FOR_LIVEPATCH_WORK                                    \
             instr "[fun]"                                               \
             : [val] "=&r" (tmp),                                        \
@@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
               ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
               [stack_mask] "i" (STACK_SIZE - 1),                        \
               _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
-                                 __FILE__, NULL)                        \
+                                 __FILE__, NULL),                       \
+              /* For stack zeroing. */                                  \
+              "D" ((void *)guest_cpu_user_regs() - 1),                  \
+              [stk_size] "c"                                            \
+              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
+                          : 0),                                         \
+              "a" (0)                                                   \
             : "memory" );                                               \
         unreachable();                                                  \
     })
-- 
2.45.2




* Re: [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic()
  2024-07-26 15:21 ` [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic() Roger Pau Monne
@ 2024-07-29  7:52   ` Jan Beulich
  2024-07-29 12:53     ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-07-29  7:52 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> The l{1,2,3,4}e_write_atomic() and non _atomic suffixed helpers share the same
> implementation, so it seems pointless and possibly confusing to have both.
> 
> Remove the l{1,2,3,4}e_write_atomic() helpers and switch its user to
> l{1,2,3,4}e_write(), as that's also atomic.  While there also remove
> pte_write{,_atomic}() and just use write_atomic() in the wrappers.
> 
> No functional change intended.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

In the description, can we perhaps mention the historical aspect of why
these were there (and separate)? Happy to add a sentence when committing,
as long as you agree.

Jan



* Re: [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic()
  2024-07-26 15:21 ` [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic() Roger Pau Monne
@ 2024-07-29  7:53   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-07-29  7:53 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> There's no l{1,2,3,4}e_read() implementation, so drop the _atomic suffix from
> the read helpers.  This allows unifying the naming with the write helpers,
> which are also atomic but don't have the suffix already: l{1,2,3,4}e_write().
> 
> No functional change intended.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>





* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
@ 2024-07-29  8:17   ` Roger Pau Monné
  2024-07-29 11:53   ` Jan Beulich
  2024-07-29 15:59   ` Andrew Cooper
  2 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-29  8:17 UTC (permalink / raw)
  To: xen-devel; +Cc: alejandro.vallejo, Jan Beulich, Andrew Cooper

On Fri, Jul 26, 2024 at 05:21:47PM +0200, Roger Pau Monne wrote:
> The PVH dom0 builder doesn't switch page tables and has no need to run with
> SMAP disabled.

This should be reworded as:

"The PVH dom0 builder doesn't build guest page-tables, because PVH is
started in 32bit protected mode, hence has no need to run with SMAP
disabled."

Thanks, Roger.



* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
  2024-07-29  8:17   ` Roger Pau Monné
@ 2024-07-29 11:53   ` Jan Beulich
  2024-07-29 15:52     ` Andrew Cooper
  2024-07-29 15:59   ` Andrew Cooper
  2 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-07-29 11:53 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> The PVH dom0 builder doesn't switch page tables and has no need to run with
> SMAP disabled.
> 
> Put the SMAP disabling close to the code region where it's necessary, as it
> then becomes obvious why switch_cr3_cr4() is required instead of
> write_ptbase().
> 
> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
> guest context, and hence updating the value of cr4_pv32_mask is not relevant.

I'm okay-ish with that being dropped, but iirc the goal was to keep the
variable in sync with CPU state.

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>





* Re: [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic()
  2024-07-29  7:52   ` Jan Beulich
@ 2024-07-29 12:53     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-29 12:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On Mon, Jul 29, 2024 at 09:52:50AM +0200, Jan Beulich wrote:
> On 26.07.2024 17:21, Roger Pau Monne wrote:
> > The l{1,2,3,4}e_write_atomic() and non _atomic suffixed helpers share the same
> > implementation, so it seems pointless and possibly confusing to have both.
> > 
> > Remove the l{1,2,3,4}e_write_atomic() helpers and switch its user to
> > l{1,2,3,4}e_write(), as that's also atomic.  While there also remove
> > pte_write{,_atomic}() and just use write_atomic() in the wrappers.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> In the description, can we perhaps mention the historical aspect of why
> these were there (and separate)? Happy to add a sentence when committing,
> as long as you agree.

Sure:

"x86 32bit mode used to have a non-atomic PTE write that would split
the write in two halves, but with Xen only supporting x86 64bit
that's no longer present."

Would that be fine?  Possibly added after the first paragraph IMO.

Thanks, Roger.



* Re: [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function
  2024-07-26 15:21 ` [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function Roger Pau Monne
@ 2024-07-29 13:36   ` Alejandro Vallejo
  2024-07-29 13:43     ` Jan Beulich
  2024-07-29 14:18     ` Roger Pau Monné
  2024-08-14 10:24   ` Jan Beulich
  1 sibling, 2 replies; 64+ messages in thread
From: Alejandro Vallejo @ 2024-07-29 13:36 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper, Tim Deegan

On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> In preparation for the function being called from contexts where no domain is
> present.
>
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/mm.h  |  4 +++-
>  xen/arch/x86/mm.c              | 24 +++++++++++++-----------
>  xen/arch/x86/mm/hap/hap.c      |  3 ++-
>  xen/arch/x86/mm/shadow/hvm.c   |  3 ++-
>  xen/arch/x86/mm/shadow/multi.c |  7 +++++--
>  xen/arch/x86/pv/dom0_build.c   |  3 ++-
>  xen/arch/x86/pv/domain.c       |  3 ++-
>  7 files changed, 29 insertions(+), 18 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index b3853ae734fa..076e7009dc99 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -375,7 +375,9 @@ int devalidate_page(struct page_info *page, unsigned long type,
>  
>  void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d);
>  void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
> -                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt);
> +                       mfn_t sl4mfn, const struct page_info *perdomain_l3,
> +                       bool ro_mpt, bool maybe_compat, bool short_directmap);
> +

The comment currently in the .c file should probably be here instead, and
updated for the new arguments. That said, I'm skeptical that 3 booleans are
desirable. They add a lot of complexity at the call sites (which of the 8
forms of init_xen_l4_slots() do I need here?) and a lot of cognitive load.

I can't propose a solution because I'm still wrapping my head around how the
layout (esp. compat layout) fits together. Maybe the booleans can be mapped to
an enum? It would also help interpret the callsites as it'd no longer be a
sequence of contextless booleans, but a readable identifier.

>  bool fill_ro_mpt(mfn_t mfn);
>  void zap_ro_mpt(mfn_t mfn);
>  
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index a792a300a866..c01b6712143e 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -1645,14 +1645,9 @@ static int promote_l3_table(struct page_info *page)
>   * extended directmap.
>   */
>  void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
> -                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt)
> +                       mfn_t sl4mfn, const struct page_info *perdomain_l3,
> +                       bool ro_mpt, bool maybe_compat, bool short_directmap)
>  {
> -    /*
> -     * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
> -     * directmap.
> -     */
> -    bool short_directmap = !paging_mode_external(d);
> -
>      /* Slot 256: RO M2P (if applicable). */
>      l4t[l4_table_offset(RO_MPT_VIRT_START)] =
>          ro_mpt ? idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)]
> @@ -1673,13 +1668,14 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>          l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
>  
>      /* Slot 260: Per-domain mappings. */
> -    l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> -        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
> +    if ( perdomain_l3 )
> +        l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> +            l4e_from_page(perdomain_l3, __PAGE_HYPERVISOR_RW);
>  
>      /* Slot 4: Per-domain mappings mirror. */
>      BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) &&
>                   !l4_table_offset(PERDOMAIN_ALT_VIRT_START));
> -    if ( !is_pv_64bit_domain(d) )
> +    if ( perdomain_l3 && maybe_compat )
>          l4t[l4_table_offset(PERDOMAIN_ALT_VIRT_START)] =
>              l4t[l4_table_offset(PERDOMAIN_VIRT_START)];
>  
> @@ -1710,6 +1706,10 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>      else
>  #endif
>      {
> +        /*
> +         * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
> +         * directmap.
> +         */
>          unsigned int slots = (short_directmap
>                                ? ROOT_PAGETABLE_PV_XEN_SLOTS
>                                : ROOT_PAGETABLE_XEN_SLOTS);
> @@ -1830,7 +1830,9 @@ static int promote_l4_table(struct page_info *page)
>      if ( !rc )
>      {
>          init_xen_l4_slots(pl4e, l4mfn,
> -                          d, INVALID_MFN, VM_ASSIST(d, m2p_strict));
> +                          INVALID_MFN, d->arch.perdomain_l3_pg,
> +                          VM_ASSIST(d, m2p_strict), !is_pv_64bit_domain(d),
> +                          true);
>          atomic_inc(&d->arch.pv.nr_l4_pages);
>      }
>      unmap_domain_page(pl4e);
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index d2011fde2462..c8514ca0e917 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -402,7 +402,8 @@ static mfn_t hap_make_monitor_table(struct vcpu *v)
>      m4mfn = page_to_mfn(pg);
>      l4e = map_domain_page(m4mfn);
>  
> -    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
> +    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
> +                      false, true, false);

Out of ignorance: why is the compat area mapped on HVM monitor PTs? I thought
those were used exclusively in hypervisor context, and would hence always have
the 512 slots available.

>      unmap_domain_page(l4e);
>  
>      return m4mfn;
> diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c
> index c16f3b3adf32..93922a71e511 100644
> --- a/xen/arch/x86/mm/shadow/hvm.c
> +++ b/xen/arch/x86/mm/shadow/hvm.c
> @@ -758,7 +758,8 @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels)
>       * shadow-linear mapping will either be inserted below when creating
>       * lower level monitor tables, or later in sh_update_cr3().
>       */
> -    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
> +    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
> +                      false, true, false);
>  
>      if ( shadow_levels < 4 )
>      {
> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
> index 376f6823cd44..0def0c073ca8 100644
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -973,8 +973,11 @@ sh_make_shadow(struct vcpu *v, mfn_t gmfn, u32 shadow_type)
>  
>              BUILD_BUG_ON(sizeof(l4_pgentry_t) != sizeof(shadow_l4e_t));
>  
> -            init_xen_l4_slots(l4t, gmfn, d, smfn, (!is_pv_32bit_domain(d) &&
> -                                                   VM_ASSIST(d, m2p_strict)));
> +            init_xen_l4_slots(l4t, gmfn, smfn,
> +                              d->arch.perdomain_l3_pg,
> +                              (!is_pv_32bit_domain(d) &&
> +                               VM_ASSIST(d, m2p_strict)),
> +                              !is_pv_64bit_domain(d), true);
>              unmap_domain_page(l4t);
>          }
>          break;
> diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
> index 41772dbe80bf..6a6689f402bb 100644
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -711,7 +711,8 @@ int __init dom0_construct_pv(struct domain *d,
>          l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
>          clear_page(l4tab);
>          init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
> -                          d, INVALID_MFN, true);
> +                          INVALID_MFN, d->arch.perdomain_l3_pg,
> +                          true, !is_pv_64bit_domain(d), true);
>          v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
>      }
>      else
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 86b74fb372d5..6ff71f14a2f2 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -124,7 +124,8 @@ static int setup_compat_l4(struct vcpu *v)
>      mfn = page_to_mfn(pg);
>      l4tab = map_domain_page(mfn);
>      clear_page(l4tab);
> -    init_xen_l4_slots(l4tab, mfn, v->domain, INVALID_MFN, false);
> +    init_xen_l4_slots(l4tab, mfn, INVALID_MFN, v->domain->arch.perdomain_l3_pg,
> +                      false, true, true);
>      unmap_domain_page(l4tab);
>  
>      /* This page needs to look like a pagetable so that it can be shadowed */




* Re: [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function
  2024-07-29 13:36   ` Alejandro Vallejo
@ 2024-07-29 13:43     ` Jan Beulich
  2024-07-29 14:18     ` Roger Pau Monné
  1 sibling, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-07-29 13:43 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: Andrew Cooper, Tim Deegan, Roger Pau Monne, xen-devel

On 29.07.2024 15:36, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -402,7 +402,8 @@ static mfn_t hap_make_monitor_table(struct vcpu *v)
>>      m4mfn = page_to_mfn(pg);
>>      l4e = map_domain_page(m4mfn);
>>  
>> -    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
>> +    init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg,
>> +                      false, true, false);
> 
> Out of ignorance: why is the compat area mapped on HVM monitor PTs? I thought
> those were used exclusively in hypervisor context, and would hence always have
> the 512 slots available.

"compat area" is perhaps a misleading term. If you look at the function itself,
it's PERDOMAIN_ALT_VIRT_START that is mapped in that case. Which underlies the
compat-arg-xlat area, which both 32-bit PV and all HVM guests need, the latter
as they can, at any time, switch to 32-bit mode.

Jan



* Re: [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function
  2024-07-29 13:36   ` Alejandro Vallejo
  2024-07-29 13:43     ` Jan Beulich
@ 2024-07-29 14:18     ` Roger Pau Monné
  1 sibling, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-29 14:18 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper, Tim Deegan

On Mon, Jul 29, 2024 at 02:36:39PM +0100, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> > In preparation for the function being called from contexts where no domain is
> > present.
> >
> > No functional change intended.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/mm.h  |  4 +++-
> >  xen/arch/x86/mm.c              | 24 +++++++++++++-----------
> >  xen/arch/x86/mm/hap/hap.c      |  3 ++-
> >  xen/arch/x86/mm/shadow/hvm.c   |  3 ++-
> >  xen/arch/x86/mm/shadow/multi.c |  7 +++++--
> >  xen/arch/x86/pv/dom0_build.c   |  3 ++-
> >  xen/arch/x86/pv/domain.c       |  3 ++-
> >  7 files changed, 29 insertions(+), 18 deletions(-)
> >
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index b3853ae734fa..076e7009dc99 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -375,7 +375,9 @@ int devalidate_page(struct page_info *page, unsigned long type,
> >  
> >  void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d);
> >  void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
> > -                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt);
> > +                       mfn_t sl4mfn, const struct page_info *perdomain_l3,
> > +                       bool ro_mpt, bool maybe_compat, bool short_directmap);
> > +
> 
> The comment currently in the .c file should probably be here instead, and
> updated for the new arguments. That said, I'm skeptical that 3 booleans are
> desirable. They add a lot of complexity at the call sites (which of the 8
> forms of init_xen_l4_slots() do I need here?) and a lot of cognitive load.
> 
> I can't propose a solution because I'm still wrapping my head around how the
> layout (esp. compat layout) fits together. Maybe the booleans can be mapped to
> an enum? It would also help interpret the callsites as it'd no longer be a
> sequence of contextless booleans, but a readable identifier.

We have the following possible combinations:

          RO MPT  COMPAT XLAT  SHORT DMAP
PV64        ?         N             Y
PV32        N         Y             Y
HVM         N         Y             N


So we would need:

enum l4_domain_type {
    PV64,
    PV64_RO_MPT,
    PV32,
    HVM
};

I can see about replacing those last 3 booleans with the proposed
enum.
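
For illustration only, the table above could be decoded from such an enum
roughly as follows.  This is a standalone sketch, not the actual Xen change:
struct l4_slot_flags and l4_flags() are hypothetical names, and the flag
values are copied from the table, with the PV64 "?" for RO MPT split into the
two PV64 enumerators.

```c
#include <assert.h>
#include <stdbool.h>

/* Enumerators as proposed in the mail. */
enum l4_domain_type {
    PV64,
    PV64_RO_MPT,
    PV32,
    HVM,
};

/* Hypothetical container for the three booleans the enum would replace. */
struct l4_slot_flags {
    bool ro_mpt;
    bool maybe_compat;
    bool short_directmap;
};

/* Map each domain type to the row of the table it stands for. */
static struct l4_slot_flags l4_flags(enum l4_domain_type t)
{
    struct l4_slot_flags f = { false, false, false };

    switch ( t )
    {
    case PV64:
        f.short_directmap = true;
        break;

    case PV64_RO_MPT:
        f.ro_mpt = true;
        f.short_directmap = true;
        break;

    case PV32:
        f.maybe_compat = true;
        f.short_directmap = true;
        break;

    case HVM:
        f.maybe_compat = true;
        break;

    default:
        /* Not a valid domain type. */
        assert(!"unknown l4_domain_type");
        break;
    }

    return f;
}
```

A call site would then pass a single readable identifier such as PV32 rather
than three positional booleans.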

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
  2024-07-26 15:22 ` [PATCH 22/22] x86/mm: zero stack on stack switch or reset Roger Pau Monne
@ 2024-07-29 15:40   ` Andrew Cooper
  2024-07-30 10:49     ` Roger Pau Monné
  2024-08-13 13:16   ` Jan Beulich
  1 sibling, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2024-07-29 15:40 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: alejandro.vallejo, Jan Beulich

On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> able to read the stack contents, but vCPUs running on the current pCPU could
> read stack rubble from operations of previous vCPUs.
>
> The #DF stack is not zeroed because handling of #DF results in a panic.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> index 75b9a341f814..02b4118b03ef 100644
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define SHADOW_STACK_WORK ""
>  #endif
>  
> +#define ZERO_STACK                                              \
> +    "test %[stk_size], %[stk_size];"                            \
> +    "jz .L_skip_zeroing.%=;"                                    \
> +    "std;"                                                      \
> +    "rep stosb;"                                                \
> +    "cld;"                                                      \
> +    ".L_skip_zeroing.%=:"
> +
>  #if __GNUC__ >= 9
>  # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
>  #else
> @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  #define switch_stack_and_jump(fn, instr, constr)                        \
>      ({                                                                  \
>          unsigned int tmp;                                               \
> +        bool zero_stack = current->domain->arch.asi;                    \
>          BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
> +        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
> +                          PRIMARY_STACK_SIZE +                          \
> +                          sizeof(struct cpu_info), PAGE_SIZE));         \
> +        if ( zero_stack )                                               \
> +        {                                                               \
> +            unsigned long stack_top = get_stack_bottom() &              \
> +                                      ~(STACK_SIZE - 1);                \
> +                                                                        \
> +            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
> +            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
> +            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
> +        }                                                               \
>          __asm__ __volatile__ (                                          \
>              SHADOW_STACK_WORK                                           \
>              "mov %[stk], %%rsp;"                                        \
> +            ZERO_STACK                                                  \
>              CHECK_FOR_LIVEPATCH_WORK                                    \
>              instr "[fun]"                                               \
>              : [val] "=&r" (tmp),                                        \
> @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>                ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
>                [stack_mask] "i" (STACK_SIZE - 1),                        \
>                _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
> -                                 __FILE__, NULL)                        \
> +                                 __FILE__, NULL),                       \
> +              /* For stack zeroing. */                                  \
> +              "D" ((void *)guest_cpu_user_regs() - 1),                  \
> +              [stk_size] "c"                                            \
> +              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> +                          : 0),                                         \
> +              "a" (0)                                                   \
>              : "memory" );                                               \
>          unreachable();                                                  \
>      })

This looks very expensive.

For starters, switch_stack_and_jump() is used twice in a typical context
switch; once in the schedule tail, and again out of hvm_do_resume().

Furthermore, #MC happens never (to many many significant figures), #DB
happens never for HVM guests (but does happen for PV), and NMIs are
either ~never, or 2Hz which is far less often than the 30ms default
timeslice.

So, the overwhelming majority of the time, those 3 calls to clear_page()
will be re-zeroing blocks of zeroes.

This can probably be avoided by making use of ist_exit (held in %r12) to
only zero an IST stack when leaving it.  This leaves the IRET frame able
to be recovered, but with e.g. RFDS, you can do that irrespective, and
it's not terribly sensitive.
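
As a rough model of that idea (zero an IST stack only when leaving it, and
leave the IRET frame in place), here is a standalone userspace sketch.  It
does not use ist_exit or any real Xen entry code: every sim_* name and size
is made up for the illustration, and the "stack" is just a byte array whose
top 40 bytes stand in for the 5-word IRET frame.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_STACK_SIZE 256  /* stand-in for one page of IST stack */
#define SIM_IRET_FRAME 40   /* SS, RSP, RFLAGS, CS, RIP: 5 x 8 bytes */

enum sim_ist { SIM_IST_MCE, SIM_IST_NMI, SIM_IST_DB, SIM_IST_NR };

/* Static storage, so every simulated stack starts out zeroed. */
static unsigned char sim_ist_stack[SIM_IST_NR][SIM_STACK_SIZE];

/* Pretend handler: dirties the whole stack, IRET frame included. */
static void sim_handler(enum sim_ist ist)
{
    memset(sim_ist_stack[ist], 0xaa, SIM_STACK_SIZE);
}

/*
 * On exit, zero everything below the IRET frame.  The stack grows down, so
 * the frame occupies the highest SIM_IRET_FRAME bytes of the array.
 */
static void sim_ist_exit(enum sim_ist ist)
{
    assert(ist < SIM_IST_NR);
    memset(sim_ist_stack[ist], 0, SIM_STACK_SIZE - SIM_IRET_FRAME);
}

/* An exception on a given IST: run the handler, then clean up on the way out. */
static void sim_take_exception(enum sim_ist ist)
{
    sim_handler(ist);
    sim_ist_exit(ist);
}

/* Count bytes that still hold non-zero data. */
static size_t sim_dirty_bytes(enum sim_ist ist)
{
    size_t i, n = 0;

    for ( i = 0; i < SIM_STACK_SIZE; i++ )
        n += sim_ist_stack[ist][i] != 0;

    return n;
}
```

With this shape, stacks that were never entered cost nothing at context
switch time, and only the IRET frame survives on the ones that were.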


What about shadow stacks?  You're not zeroing those, and while they're
less sensitive than the data stack, there ought to be some reasoning
about them.

~Andrew


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-29 11:53   ` Jan Beulich
@ 2024-07-29 15:52     ` Andrew Cooper
  2024-07-29 16:18       ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2024-07-29 15:52 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: alejandro.vallejo, xen-devel

On 29/07/2024 12:53 pm, Jan Beulich wrote:
> On 26.07.2024 17:21, Roger Pau Monne wrote:
>> The PVH dom0 builder doesn't switch page tables and has no need to run with
>> SMAP disabled.
>>
>> Put the SMAP disabling close to the code region where it's necessary, as it
>> then becomes obvious why switch_cr3_cr4() is required instead of
>> write_ptbase().
>>
>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
>> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
> I'm okay-ish with that being dropped, but iirc the goal was to keep the
> variable in sync with CPU state.

Removing SMAP from cr4_pv32_mask is necessary.

Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.

This will probably only manifest in practice in a CONFIG_PV32=y build,
and with a poorly timed NMI.

~Andrew


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
  2024-07-29  8:17   ` Roger Pau Monné
  2024-07-29 11:53   ` Jan Beulich
@ 2024-07-29 15:59   ` Andrew Cooper
  2024-07-29 16:32     ` Roger Pau Monné
  2 siblings, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2024-07-29 15:59 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: alejandro.vallejo, Jan Beulich

On 26/07/2024 4:21 pm, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index eee20bb1753c..bc387d96b519 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image,
>          }
>      }
>  
> -    /*
> -     * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0().
> -     * This saves a large number of corner cases interactions with
> -     * copy_from_user().
> -     */
> -    if ( cpu_has_smap )
> -    {
> -        cr4_pv32_mask &= ~X86_CR4_SMAP;
> -        write_cr4(read_cr4() & ~X86_CR4_SMAP);
> -    }
> -
>      if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 )
>          panic("Could not construct domain 0\n");
>  
> -    if ( cpu_has_smap )
> -    {
> -        write_cr4(read_cr4() | X86_CR4_SMAP);
> -        cr4_pv32_mask |= X86_CR4_SMAP;
> -    }
> -

Hang on.  Isn't this (preexistingly) broken given the distinction
between cpu_has_smap and X86_FEATURE_XEN_SMAP ?

I'm very tempted to use this as a justification to remove opt_smap.

~Andrew


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-29 15:52     ` Andrew Cooper
@ 2024-07-29 16:18       ` Roger Pau Monné
  2024-07-29 17:51         ` Andrew Cooper
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-29 16:18 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Jan Beulich, alejandro.vallejo, xen-devel

On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote:
> On 29/07/2024 12:53 pm, Jan Beulich wrote:
> > On 26.07.2024 17:21, Roger Pau Monne wrote:
> >> The PVH dom0 builder doesn't switch page tables and has no need to run with
> >> SMAP disabled.
> >>
> >> Put the SMAP disabling close to the code region where it's necessary, as it
> >> then becomes obvious why switch_cr3_cr4() is required instead of
> >> write_ptbase().
> >>
> >> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
> >> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
> > I'm okay-ish with that being dropped, but iirc the goal was to keep the
> > variable in sync with CPU state.
> 
> Removing SMAP from cr4_pv32_mask is necessary.
> 
> Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.
> 
> This will probably only manifest in practice in a CONFIG_PV32=y build,

Sorry, I'm possibly missing some context here.  When running the dom0
builder we switch to the guest page-tables, but not to the guest vCPU,
(iow: current == idle) and hence the context is always the Xen
context.

Why would the return path of the IST use cr4_pv32_mask when the
context in which the IST happened was the Xen one, and the current
vCPU is the idle one (a 64bit PV guest from Xen's PoV)?

My understanding is that cr4_pv32_mask should only be used when the
current context is running a 32bit PV vCPU.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-29 15:59   ` Andrew Cooper
@ 2024-07-29 16:32     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-29 16:32 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, alejandro.vallejo, Jan Beulich

On Mon, Jul 29, 2024 at 04:59:09PM +0100, Andrew Cooper wrote:
> On 26/07/2024 4:21 pm, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> > index eee20bb1753c..bc387d96b519 100644
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -955,26 +955,9 @@ static struct domain *__init create_dom0(const module_t *image,
> >          }
> >      }
> >  
> > -    /*
> > -     * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0().
> > -     * This saves a large number of corner cases interactions with
> > -     * copy_from_user().
> > -     */
> > -    if ( cpu_has_smap )
> > -    {
> > -        cr4_pv32_mask &= ~X86_CR4_SMAP;
> > -        write_cr4(read_cr4() & ~X86_CR4_SMAP);
> > -    }
> > -
> >      if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 )
> >          panic("Could not construct domain 0\n");
> >  
> > -    if ( cpu_has_smap )
> > -    {
> > -        write_cr4(read_cr4() | X86_CR4_SMAP);
> > -        cr4_pv32_mask |= X86_CR4_SMAP;
> > -    }
> > -
> 
> Hang on.  Isn't this (preexistingly) broken given the distinction
> between cpu_has_smap and X86_FEATURE_XEN_SMAP ?

I see, looks like Xen will unconditionally enable SMAP if the user has
requested SMAP for HVM only.  Forcefully disabling SMAP for both PV and
HVM will result in the CPUID bit getting cleared, and hence
cpu_has_smap == false.

> I'm very tempted to use this as a justification to remove opt_smap.

Oh, so my change fixes that bug by caching the previous cr4 instead of
using cpu_has_smap.

It seems like opt_smap is useful for the PV shim: SMAP caused some
unnecessary performance degradation on AMD hardware, because AMD does
not allow selectively trapping accesses to CR4, so in pvshim mode it
gets disabled:

b05ec9263e56 x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-29 16:18       ` Roger Pau Monné
@ 2024-07-29 17:51         ` Andrew Cooper
  2024-07-30 10:55           ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2024-07-29 17:51 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Jan Beulich, alejandro.vallejo, xen-devel

On 29/07/2024 5:18 pm, Roger Pau Monné wrote:
> On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote:
>> On 29/07/2024 12:53 pm, Jan Beulich wrote:
>>> On 26.07.2024 17:21, Roger Pau Monne wrote:
>>>> The PVH dom0 builder doesn't switch page tables and has no need to run with
>>>> SMAP disabled.
>>>>
>>>> Put the SMAP disabling close to the code region where it's necessary, as it
>>>> then becomes obvious why switch_cr3_cr4() is required instead of
>>>> write_ptbase().
>>>>
>>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
>>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
>>> I'm okay-ish with that being dropped, but iirc the goal was to keep the
>>> variable in sync with CPU state.
>> Removing SMAP from cr4_pv32_mask is necessary.
>>
>> Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.
>>
>> This will probably only manifest in practice in a CONFIG_PV32=y build,
> Sorry, I'm possibly missing some context here.  When running the dom0
> builder we switch to the guest page-tables, but not to the guest vCPU,
> (iow: current == idle) and hence the context is always the Xen
> context.

Correct.

> Why would the return path of the IST use cr4_pv32_mask when the
> context in which the IST happened was the Xen one, and the current
> vCPU is the idle one (a 64bit PV guest from Xen's PoV)?
>
> My understanding is that cr4_pv32_mask should only be used when the
> current context is running a 32bit PV vCPU.

This logic is evil to follow, because you need to look at both
cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it.

Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4?

CR4_PV32_RESTORE is called from every entry path which *might* have come
from a 32bit PV guest, and it always results in Xen having SMEP/SMAP
active (as applicable).  This includes NMI.

The change is only undone in compat_restore_all_guest(), where
XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2.  This
is logic cunningly disguised in the use of the Parity flag.


Because the NMI handler does reactivate SMEP/SMAP (based on the value in
cr4_pv32_mask), and returning to Xen does not pass through
compat_restore_all_guest(), taking an NMI in the middle of the
dombuilder will reactivate SMAP behind your back.
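
To make the sequence concrete, here is a small userspace model of it.  This
is only a sketch: the sim_* variables are plain C stand-ins for %cr4 and
cr4_pv32_mask, not the real entry assembly, and the only architectural facts
used are the CR4 bit positions (SMEP is bit 20, SMAP is bit 21).

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_CR4_SMEP (1UL << 20)
#define SIM_CR4_SMAP (1UL << 21)

/* Stand-ins for the real %cr4 and cr4_pv32_mask. */
static unsigned long sim_cr4 = SIM_CR4_SMEP | SIM_CR4_SMAP;
static unsigned long sim_cr4_pv32_mask = SIM_CR4_SMEP | SIM_CR4_SMAP;

/* Entry path: only ever ORs the mask into CR4, never clears anything. */
static void sim_cr4_pv32_restore(void)
{
    sim_cr4 |= sim_cr4_pv32_mask;
}

/*
 * An NMI taken while in Xen context.  Entry runs the restore; the return to
 * Xen does not go through the compat exit path, so nothing undoes it.
 */
static void sim_nmi_in_xen_context(void)
{
    sim_cr4_pv32_restore();
    /* Every bit of the mask is now set in CR4. */
    assert((sim_cr4 & sim_cr4_pv32_mask) == sim_cr4_pv32_mask);
}

static bool sim_smap_active(void)
{
    return sim_cr4 & SIM_CR4_SMAP;
}
```

Clearing SMAP from sim_cr4 alone and then calling sim_nmi_in_xen_context()
leaves SMAP set again, which is the failure described above.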

~Andrew


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
  2024-07-29 15:40   ` Andrew Cooper
@ 2024-07-30 10:49     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-30 10:49 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, alejandro.vallejo, Jan Beulich

On Mon, Jul 29, 2024 at 04:40:24PM +0100, Andrew Cooper wrote:
> On 26/07/2024 4:22 pm, Roger Pau Monne wrote:
> > With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> > able to read the stack contents, but vCPUs running on the current pCPU could
> > read stack rubble from operations of previous vCPUs.
> >
> > The #DF stack is not zeroed because handling of #DF results in a panic.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> > index 75b9a341f814..02b4118b03ef 100644
> > --- a/xen/arch/x86/include/asm/current.h
> > +++ b/xen/arch/x86/include/asm/current.h
> > @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  # define SHADOW_STACK_WORK ""
> >  #endif
> >  
> > +#define ZERO_STACK                                              \
> > +    "test %[stk_size], %[stk_size];"                            \
> > +    "jz .L_skip_zeroing.%=;"                                    \
> > +    "std;"                                                      \
> > +    "rep stosb;"                                                \
> > +    "cld;"                                                      \
> > +    ".L_skip_zeroing.%=:"
> > +
> >  #if __GNUC__ >= 9
> >  # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
> >  #else
> > @@ -187,10 +195,24 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  #define switch_stack_and_jump(fn, instr, constr)                        \
> >      ({                                                                  \
> >          unsigned int tmp;                                               \
> > +        bool zero_stack = current->domain->arch.asi;                    \
> >          BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
> > +        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
> > +                          PRIMARY_STACK_SIZE +                          \
> > +                          sizeof(struct cpu_info), PAGE_SIZE));         \
> > +        if ( zero_stack )                                               \
> > +        {                                                               \
> > +            unsigned long stack_top = get_stack_bottom() &              \
> > +                                      ~(STACK_SIZE - 1);                \
> > +                                                                        \
> > +            clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);        \
> > +            clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);        \
> > +        }                                                               \
> >          __asm__ __volatile__ (                                          \
> >              SHADOW_STACK_WORK                                           \
> >              "mov %[stk], %%rsp;"                                        \
> > +            ZERO_STACK                                                  \
> >              CHECK_FOR_LIVEPATCH_WORK                                    \
> >              instr "[fun]"                                               \
> >              : [val] "=&r" (tmp),                                        \
> > @@ -201,7 +223,13 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >                ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
> >                [stack_mask] "i" (STACK_SIZE - 1),                        \
> >                _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
> > -                                 __FILE__, NULL)                        \
> > +                                 __FILE__, NULL),                       \
> > +              /* For stack zeroing. */                                  \
> > +              "D" ((void *)guest_cpu_user_regs() - 1),                  \
> > +              [stk_size] "c"                                            \
> > +              (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
> > +                          : 0),                                         \
> > +              "a" (0)                                                   \
> >              : "memory" );                                               \
> >          unreachable();                                                  \
> >      })
> 
> This looks very expensive.
> 
> For starters, switch_stack_and_jump() is used twice in a typical context
> switch; once in the schedule tail, and again out of hvm_do_resume().

Right, it's the reset_stack_and_call_ind() at the end of context
switch and then the reset_stack_and_jump() in the HVM tail context
switch handlers.

One option would be to only do the stack zeroing from the
reset_stack_and_call_ind() call in context_switch().

I've got no idea how expensive this is, I might try to run some
benchmarks to get some figures.  I was planning on running two VMs
with 1 vCPU each, both pinned to the same pCPU.

> 
> Furthermore, #MC happens never (to many many significant figures), #DB
> happens never for HVM guests (but does happen for PV), and NMIs are
> either ~never, or 2Hz which is far less often than the 30ms default
> timeslice.
> 
> So, the overwhelming majority of the time, those 3 calls to clear_page()
> will be re-zeroing blocks of zeroes.
> 
> This can probably be avoided by making use of ist_exit (held in %r12) to
> only zero an IST stack when leaving it.  This leaves the IRET frame able
> to be recovered, but with e.g. RFDS, you can do that irrespective, and
> it's not terribly sensitive.

I could look into that.  TBH I was borderline on clearing the IST
stacks, as I wasn't convinced there could be anything sensitive there,
but then again I couldn't convince myself that there's nothing
sensitive there now, nor that there can't be in the future.

> What about shadow stacks?  You're not zeroing those, and while they're
> less sensitive than the data stack, there ought to be some reasoning
> about them.

I've assumed that shadow stacks only contained the expected return
addresses, and hence won't be considered sensitive information, but
maybe I was too lax.

An attacker could get execution traces of the previous vCPU, and that
might be useful for some exploits?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-29 17:51         ` Andrew Cooper
@ 2024-07-30 10:55           ` Roger Pau Monné
  2024-07-30 11:06             ` Andrew Cooper
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-30 10:55 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Jan Beulich, alejandro.vallejo, xen-devel

On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote:
> On 29/07/2024 5:18 pm, Roger Pau Monné wrote:
> > On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote:
> >> On 29/07/2024 12:53 pm, Jan Beulich wrote:
> >>> On 26.07.2024 17:21, Roger Pau Monne wrote:
> >>>> The PVH dom0 builder doesn't switch page tables and has no need to run with
> >>>> SMAP disabled.
> >>>>
> >>>> Put the SMAP disabling close to the code region where it's necessary, as it
> >>>> then becomes obvious why switch_cr3_cr4() is required instead of
> >>>> write_ptbase().
> >>>>
> >>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
> >>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
> >>> I'm okay-ish with that being dropped, but iirc the goal was to keep the
> >>> variable in sync with CPU state.
> >> Removing SMAP from cr4_pv32_mask is necessary.
> >>
> >> Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.
> >>
> >> This will probably only manifest in practice in a CONFIG_PV32=y build,
> > Sorry, I'm possibly missing some context here.  When running the dom0
> > builder we switch to the guest page-tables, but not to the guest vCPU,
> > (iow: current == idle) and hence the context is always the Xen
> > context.
> 
> Correct.
> 
> > Why would the return path of the IST use cr4_pv32_mask when the
> > context in which the IST happened was the Xen one, and the current
> > vCPU is the idle one (a 64bit PV guest from Xen's PoV)?
> >
> > My understanding is that cr4_pv32_mask should only be used when the
> > current context is running a 32bit PV vCPU.
> 
> This logic is evil to follow, because you need to look at both
> cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it.
> 
> Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4?
> 
> CR4_PV32_RESTORE is called from every entry path which *might* have come
> from a 32bit PV guest, and it always results in Xen having SMEP/SMAP
> active (as applicable).  This includes NMI.
> 
> The change is only undone in compat_restore_all_guest(), where
> XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2.  This
> is logic cunningly disguised in the use of the Parity flag.
> 
> 
> Because the NMI handler does reactivate SMEP/SMAP (based on the value in
> cr4_pv32_mask), and returning to Xen does not pass through
> compat_restore_all_guest(), taking an NMI in the middle of the
> dombuilder will reactivate SMAP behind your back.

After further conversations with Andrew we believe the current
disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe.

Regardless of whether cr4_pv32_mask is properly adjusted, return to Xen
context from interrupt would be done with SMAP enabled if
X86_FEATURE_XEN_SMAP is set.

I will send a new patch that uses stac/clac in order to disable SMAP
(if required) around the dom0 builder code that switches to the guest
page-tables.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-30 10:55           ` Roger Pau Monné
@ 2024-07-30 11:06             ` Andrew Cooper
  2024-07-30 13:03               ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Andrew Cooper @ 2024-07-30 11:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Jan Beulich, alejandro.vallejo, xen-devel

On 30/07/2024 11:55 am, Roger Pau Monné wrote:
> On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote:
>> On 29/07/2024 5:18 pm, Roger Pau Monné wrote:
>>> On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote:
>>>> On 29/07/2024 12:53 pm, Jan Beulich wrote:
>>>>> On 26.07.2024 17:21, Roger Pau Monne wrote:
>>>>>> The PVH dom0 builder doesn't switch page tables and has no need to run with
>>>>>> SMAP disabled.
>>>>>>
>>>>>> Put the SMAP disabling close to the code region where it's necessary, as it
>>>>>> then becomes obvious why switch_cr3_cr4() is required instead of
>>>>>> write_ptbase().
>>>>>>
>>>>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
>>>>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
>>>>> I'm okay-ish with that being dropped, but iirc the goal was to keep the
>>>>> variable in sync with CPU state.
>>>> Removing SMAP from cr4_pv32_mask is necessary.
>>>>
>>>> Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.
>>>>
>>>> This will probably only manifest in practice in a CONFIG_PV32=y build,
>>> Sorry, I'm possibly missing some context here.  When running the dom0
>>> builder we switch to the guest page-tables, but not to the guest vCPU,
>>> (iow: current == idle) and hence the context is always the Xen
>>> context.
>> Correct.
>>
>>> Why would the return path of the IST use cr4_pv32_mask when the
>>> context in which the IST happened was the Xen one, and the current
>>> vCPU is the idle one (a 64bit PV guest from Xen's PoV)?
>>>
>>> My understanding is that cr4_pv32_mask should only be used when the
>>> current context is running a 32bit PV vCPU.
>> This logic is evil to follow, because you need to look at both
>> cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it.
>>
>> Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4?
>>
>> CR4_PV32_RESTORE is called from every entry path which *might* have come
>> from a 32bit PV guest, and it always results in Xen having SMEP/SMAP
>> active (as applicable).  This includes NMI.
>>
>> The change is only undone in compat_restore_all_guest(), where
>> XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2.  This
>> is logic cunningly disguised in the use of the Parity flag.
>>
>>
>> Because the NMI handler does reactivate SMEP/SMAP (based on the value in
>> cr4_pv32_mask), and returning to Xen does not pass through
>> compat_restore_all_guest(), taking an NMI in the middle of the
>> dombuilder will reactivate SMAP behind your back.
> After further conversations with Andrew we believe the current
> disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe.
>
> Regardless of whether cr4_pv32_mask is properly adjusted, return to Xen
> context from interrupt would be done with SMAP enabled if
> X86_FEATURE_XEN_SMAP is set.

Sorry - that's not what I intended to convey.

The logic prior to this patch is safe.  SMAP is cleared from
cr4_pv32_mask before clearing CR4.SMAP, and reinstated in the opposite
order.  Therefore, an NMI hitting the region won't reactivate SMAP
because it's not (instantaneously) set in cr4_pv32_mask.

Arguably it wants some barrier()'s for clarity, and an explanation of
why this works.
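
A sketch of why the ordering matters, again as a userspace model rather than
real Xen code: sim_cr4, sim_mask and sim_nmi() are invented stand-ins, and
sim_barrier() is a GCC/Clang-style compiler barrier that only marks where
barrier() would go.  The model injects an NMI between the two stores and once
more inside the window.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_CR4_SMAP (1UL << 21)

/* Compiler barrier (GCC/Clang syntax), marking where barrier() would sit. */
#define sim_barrier() __asm__ __volatile__ ( "" ::: "memory" )

static unsigned long sim_cr4;   /* stand-in for %cr4 */
static unsigned long sim_mask;  /* stand-in for cr4_pv32_mask */

/* NMI entry ORs the mask into CR4; returning to Xen does not undo it. */
static void sim_nmi(void)
{
    sim_cr4 |= sim_mask;
}

/*
 * Open the "SMAP off" window with an NMI injected after the first store and
 * again inside the window.  Returns true if SMAP is still off afterwards.
 */
static bool sim_open_window(bool mask_first)
{
    sim_cr4 = sim_mask = SIM_CR4_SMAP;

    if ( mask_first )
    {
        sim_mask &= ~SIM_CR4_SMAP;  /* 1: mask no longer contains SMAP */
        sim_barrier();
        sim_nmi();                  /* harmless: SMAP was still set anyway */
        sim_cr4 &= ~SIM_CR4_SMAP;   /* 2: now clear CR4.SMAP */
    }
    else
    {
        sim_cr4 &= ~SIM_CR4_SMAP;   /* 1: CR4.SMAP cleared first */
        sim_barrier();
        sim_nmi();                  /* stale mask sets SMAP again */
        sim_mask &= ~SIM_CR4_SMAP;  /* 2: too late */
    }

    sim_nmi();                      /* NMI in the middle of the window */
    assert(!(sim_mask & SIM_CR4_SMAP));

    return !(sim_cr4 & SIM_CR4_SMAP);
}
```

Closing the window would mirror this: set CR4.SMAP first, and only then put
SMAP back into the mask.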

The problem your patch has is that by not clearing SMAP from
cr4_pv32_mask, it becomes unsafe iff an NMI/#MC/#DB hits the region.

> I will send a new patch that uses stac/clac in order to disable SMAP
> (if required) around the dom0 builder code that switches to the guest
> page-tables.

Either way - this is a much cleaner solution.

~Andrew


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build
  2024-07-30 11:06             ` Andrew Cooper
@ 2024-07-30 13:03               ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-07-30 13:03 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Jan Beulich, alejandro.vallejo, xen-devel

On Tue, Jul 30, 2024 at 12:06:45PM +0100, Andrew Cooper wrote:
> On 30/07/2024 11:55 am, Roger Pau Monné wrote:
> > On Mon, Jul 29, 2024 at 06:51:31PM +0100, Andrew Cooper wrote:
> >> On 29/07/2024 5:18 pm, Roger Pau Monné wrote:
> >>> On Mon, Jul 29, 2024 at 04:52:22PM +0100, Andrew Cooper wrote:
> >>>> On 29/07/2024 12:53 pm, Jan Beulich wrote:
> >>>>> On 26.07.2024 17:21, Roger Pau Monne wrote:
> >>>>>> The PVH dom0 builder doesn't switch page tables and has no need to run with
> >>>>>> SMAP disabled.
> >>>>>>
> >>>>>> Put the SMAP disabling close to the code region where it's necessary, as it
> >>>>>> then becomes obvious why switch_cr3_cr4() is required instead of
> >>>>>> write_ptbase().
> >>>>>>
> >>>>>> Note removing SMAP from cr4_pv32_mask is not required, as we never jump into
> >>>>>> guest context, and hence updating the value of cr4_pv32_mask is not relevant.
> >>>>> I'm okay-ish with that being dropped, but iirc the goal was to keep the
> >>>>> variable in sync with CPU state.
> >>>> Removing SMAP from cr4_pv32_mask is necessary.
> >>>>
> >>>> Otherwise IST vectors will reactivate SMAP behind the back of the dombuilder.
> >>>>
> >>>> This will probably only manifest in practice in a CONFIG_PV32=y build,
> >>> Sorry, I'm possibly missing some context here.  When running the dom0
> >>> builder we switch to the guest page-tables, but not to the guest vCPU,
> >>> (iow: current == idle) and hence the context is always the Xen
> >>> context.
> >> Correct.
> >>
> >>> Why would the return path of the IST use cr4_pv32_mask when the
> >>> context in which the IST happened was the Xen one, and the current
> >>> vCPU is the idle one (a 64bit PV guest from Xen's PoV).
> >>>
> >>> My understanding is that cr4_pv32_mask should only be used when the
> >>> current context is running a 32bit PV vCPU.
> >> This logic is evil to follow, because you need to look at both
> >> cr4_pv32_mask and XEN_CR4_PV32_BITS to see both halves of it.
> >>
> >> Notice how cr4_pv32_restore() only ever OR's cr4_pv32_mask into %cr4?
> >>
> >> CR4_PV32_RESTORE is called from every entry path which *might* have come
> >> from a 32bit PV guest, and it always results in Xen having SMEP/SMAP
> >> active (as applicable).  This includes NMI.
> >>
> >> The change is only undone in compat_restore_all_guest(), where
> >> XEN_CR4_PV32_BITS is cleared from %cr4 iff returning to Ring1/2.  This
> >> is logic cunningly disguised in the use of the Parity flag.
> >>
> >>
> >> Because the NMI handler does reactivate SMEP/SMAP (based on the value in
> >> cr4_pv32_mask), and returning to Xen does not pass through
> >> compat_restore_all_guest(), taking an NMI in the middle of the
> >> dombuilder will reactivate SMAP behind your back.
> > After further conversations with Andrew we believe the current
> > disabling of X86_CR4_SMAP in %cr4 during dom0 build is not safe.
> >
> > Regardless of whether cr4_pv32_mask is properly adjusted return to Xen
> > context from interrupt would be done with SMAP enabled if
> > X86_FEATURE_XEN_SMAP is set.
> 
> Sorry - that's not what I intended to convey.
> 
> The logic prior to this patch is safe.  SMAP is cleared from
> cr4_pv32_mask before clearing CR4.SMAP, and reinstated in the opposite
> order.  Therefore, an NMI hitting the region won't reactivate SMAP
> because it's not (instantaneously) set in cr4_pv32_mask.

My bad, I was getting confused with the `clac` instructions in the
event entry points.

I think with my proposed change we would also hit the BUG in
cr4_pv32_restore on debug builds if Xen got an interrupt in the middle
of the SMAP disabled dom0 build region if SMEP is enabled, as
cr4_pv32_mask would differ from the current %cr4 value (not all bits
intended would be actually set).

> Arguably it wants some barrier()'s for clarity, and an explanation of
> why this works.
> 
> The problem your patch has is that by not clearing SMAP from
> cr4_pv32_mask, it becomes unsafe iff an NMI/#MC/#DB hits the region.

Won't such an issue also affect common_interrupt, and hence not be
limited to NMI/#MC/#DB only?

Regards, Roger.



* Re: [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static
  2024-07-26 15:21 ` [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static Roger Pau Monne
@ 2024-07-30 13:12   ` Andrew Cooper
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Cooper @ 2024-07-30 13:12 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: alejandro.vallejo, Jan Beulich

On 26/07/2024 4:21 pm, Roger Pau Monne wrote:
> There are no callers outside the translation unit where it's defined, so make
> the function static.
>
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>



* Re: [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase()
  2024-07-26 15:21 ` [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase() Roger Pau Monne
@ 2024-07-30 13:19   ` Andrew Cooper
  0 siblings, 0 replies; 64+ messages in thread
From: Andrew Cooper @ 2024-07-30 13:19 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: alejandro.vallejo, Jan Beulich

On 26/07/2024 4:21 pm, Roger Pau Monne wrote:
> This reduces the repeated accessing of v->domain.
>
> No functional change intended.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>



* Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
  2024-07-26 15:22 ` [PATCH 22/22] x86/mm: zero stack on stack switch or reset Roger Pau Monne
  2024-07-29 15:40   ` Andrew Cooper
@ 2024-08-13 13:16   ` Jan Beulich
  2024-09-27 10:22     ` Roger Pau Monné
  1 sibling, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-08-13 13:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:22, Roger Pau Monne wrote:
> With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> able to read the stack contents, but vCPUs running on the current pCPU could
> read stack rubble from operations of previous vCPUs.
> 
> The #DF stack is not zeroed because handling of #DF results in a panic.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> index 75b9a341f814..02b4118b03ef 100644
> --- a/xen/arch/x86/include/asm/current.h
> +++ b/xen/arch/x86/include/asm/current.h
> @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
>  # define SHADOW_STACK_WORK ""
>  #endif
>  
> +#define ZERO_STACK                                              \
> +    "test %[stk_size], %[stk_size];"                            \
> +    "jz .L_skip_zeroing.%=;"                                    \
> +    "std;"                                                      \
> +    "rep stosb;"                                                \
> +    "cld;"                                                      \

Is ERMS actually helping with backwards copies? I didn't think so, and hence
it may be that REP STOSQ might be more efficient here?
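
To make the difference concrete, here is a plain-C model of the two
backwards string stores.  It is a sketch of the instruction semantics only,
not the inline assembly of the patch, and the function names are
hypothetical.  Note that moving to the quadword form changes both the start
pointer (last quadword rather than last byte) and the count (bytes / 8), so
the size has to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models "std; rep stosb" with %al = 0: store one byte, step down by 1. */
static void rep_stosb_backwards(uint8_t *rdi, size_t rcx)
{
    for ( size_t i = 0; i < rcx; i++ )
        *(rdi - i) = 0;
}

/* Models "std; rep stosq" with %rax = 0: store 8 bytes, step down by 8. */
static void rep_stosq_backwards(uint64_t *rdi, size_t rcx)
{
    for ( size_t i = 0; i < rcx; i++ )
        *(rdi - i) = 0;
}
```

Zeroing the top 32 bytes of a 64-byte buffer would be
rep_stosb_backwards(buf + 63, 32) in the byte form and
rep_stosq_backwards((uint64_t *)(buf + 56), 4) in the quadword form,
assuming buf is 8-byte aligned.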

Jan



* Re: [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot
  2024-07-26 15:21 ` [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot Roger Pau Monne
@ 2024-08-13 15:54   ` Jan Beulich
  2024-09-10  8:54     ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-08-13 15:54 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> The idle_pg_table L4 is cloned to create all the other L4 Xen uses, and hence
> it shouldn't be modified once further L4 are created.

Yes, but the window between moving into SYS_STATE_smp_boot and Dom0 having
its initial page tables created is quite large. If the justification was
relative to AP bringup, that may be all fine. But when related to cloning,
I think that would then truly want keying to there being any non-system
domain(s).

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5023,6 +5023,12 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
>          mfn_t l3mfn;
>          l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn);
>  
> +        /*
> +         * dom0 is build at smp_boot, at which point we already create new L4s
> +         * based on idle_pg_table.
> +         */
> +        BUG_ON(system_state >= SYS_STATE_smp_boot);

Which effectively means most of this function could become __init (e.g. by
moving into a helper). We'd then hit the BUG_ON() prior to init_done()
destroying the .init.* mappings, and we'd simply #PF afterwards. That's
not so much for the space savings in .text, but to document the limited
lifetime of the (helper) function directly in its function head.

I further wonder whether in such a case the enclosing if() wouldn't want
to gain unlikely() at the same time.

Jan


* Re: [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain()
  2024-07-26 15:21 ` [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain() Roger Pau Monne
@ 2024-08-14  9:47   ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-08-14  9:47 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> XPTI being a speculation mitigation feels better to be initialized in
> spec_ctrl_init_domain().
> 
> No functional change intended, although the call to spec_ctrl_init_domain() in
> arch_domain_create() needs to be moved ahead of pv_domain_initialise() for
> d->arch.pv.xpti to be correctly set.
> 
> Move it ahead of most of the initialization functions, since
> spec_ctrl_init_domain() doesn't depend on any member in the struct domain being
> set.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>





* Re: [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-07-26 15:21 ` [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
@ 2024-08-14 10:10   ` Jan Beulich
  2024-09-25 13:31     ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-08-14 10:10 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: alejandro.vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>  
>  ### spec-ctrl (x86)
>  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
> ->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
> +>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,

Is it really appropriate to hide this underneath an x86-only option? Even
if other architectures won't support it right away, they surely will want
to down the road? In which case making as much of this common right away
is probably the best we can do. This goes along with the question whether,
like e.g. "xpti", this should be a top-level option.

> @@ -2414,10 +2414,10 @@ in place for guests to use.
>  
>  Use of a positive boolean value for either of these options is invalid.
>  
> -The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=` and `bhb-entry=`
> -options offer fine grained control over the primitives by Xen.  These impact
> -Xen's ability to protect itself, and/or Xen's ability to virtualise support
> -for guests to use.
> +The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=`, `bhb-entry=` and
> +`asi=` options offer fine grained control over the primitives by Xen.  These

Here, ahead of "by Xen", it looks like "used" was missing. Maybe a good
opportunity to add it?

> @@ -2449,6 +2449,11 @@ for guests to use.
>    is not available (see `bhi-dis-s`).  The choice of scrubbing sequence can be
>    selected using the `bhb-seq=` option.  If it is necessary to protect dom0
>    too, boot with `spec-ctrl=bhb-entry`.
> +* `asi=` offers control over whether the hypervisor will engage in Address
> +  Space Isolation, by not having sensitive information mapped in the VMM
> +  page-tables.  Not having sensitive information on the page-tables avoids
> +  having to perform some mitigations for speculative attacks when
> +  context-switching to the hypervisor.

Is "not having" and ...

> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -458,6 +458,9 @@ struct arch_domain
>      /* Don't unconditionally inject #GP for unhandled MSRs. */
>      bool msr_relaxed;
>  
> +    /* Run the guest without sensitive information in the VMM page-tables. */
> +    bool asi;

... "without" really going to be fully true? Wouldn't we better say "as little
as possible" or alike?

> @@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
>              opt_unpriv_mmio = false;
>              opt_gds_mit = 0;
>              opt_div_scrub = 0;
> +
> +            opt_asi_pv = 0;
> +            opt_asi_hwdom = 0;
> +            opt_asi_hvm = 0;
>          }
>          else if ( val > 0 )
>              rc = -EINVAL;

I'm frequently in trouble when deciding where the split between "=no" and
"=xen" should be. opt_xpti_* are cleared ahead of the disable_common label;
considering the similarity I wonder whether the same should be true for ASI
(as this is also or even mainly about protecting guests from one another),
or whether the XPTI placement is actually wrong.

> @@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
>  
>  static __init void xpti_init_default(void)
>  {
> +    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
> +    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )

There is a separate opt_asi_hwdom which isn't used here, but only ...

> +    {
> +        printk(XENLOG_ERR
> +               "XPTI is incompatible with Address Space Isolation - disabling ASI\n");
> +        opt_asi_pv = 0;
> +    }
>      if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) ||
>           cpu_has_rdcl_no )
>      {
> @@ -389,9 +428,9 @@ static __init void xpti_init_default(void)
>      else
>      {
>          if ( opt_xpti_hwdom < 0 )
> -            opt_xpti_hwdom = 1;
> +            opt_xpti_hwdom = !opt_asi_hwdom;
>          if ( opt_xpti_domu < 0 )
> -            opt_xpti_domu = 1;
> +            opt_xpti_domu = !opt_asi_pv;
>      }

... here?

It would further seem desirable to me if opt_asi_hwdom had its default set
later, when we know the kind of Dom0, such that it could be defaulted to
what opt_asi_{hvm,pv} are set to. This, however, wouldn't be compatible
with the use here. Perhaps the invocation of xpti_init_default() would
need deferring, too.
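
The asymmetry pointed out above can be seen in a reduced model of the quoted
hunks.  This is a sketch under stated assumptions: it only models the branch
taken when the CPU is neither AMD/Hygon nor RDCL_NO, it drops the printk()
and the ASSERT(), and the function name xpti_init_default_model() is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same names as in the patch; here they are plain test variables. */
static int8_t opt_xpti_hwdom = -1, opt_xpti_domu = -1;
static int8_t opt_asi_pv, opt_asi_hwdom;

/* Reduced model of the patched xpti_init_default(), "else" branch only. */
static void xpti_init_default_model(void)
{
    /* The conflict check looks at opt_asi_pv only ... */
    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
        opt_asi_pv = 0;

    /* ... while the hwdom default is derived from opt_asi_hwdom. */
    if ( opt_xpti_hwdom < 0 )
        opt_xpti_hwdom = !opt_asi_hwdom;
    if ( opt_xpti_domu < 0 )
        opt_xpti_domu = !opt_asi_pv;
}
```

With opt_xpti_hwdom = 1, opt_asi_hwdom = 1 and opt_asi_pv = 0 the model
returns with both XPTI and ASI still requested for the hardware domain,
because the conflict check never inspects opt_asi_hwdom.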

> @@ -643,22 +683,24 @@ static void __init print_details(enum ind_thunk thunk)
>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
>             opt_verw_hvm                              ? " VERW"          : "",
>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "",
> -           opt_bhb_entry_hvm                         ? " BHB-entry"     : "");
> +           opt_bhb_entry_hvm                         ? " BHB-entry"     : "",
> +           opt_asi_hvm                               ? " ASI"           : "");
>  
>  #endif
>  #ifdef CONFIG_PV
> -    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
> +    printk("  Support for PV VMs:%s%s%s%s%s%s%s%s\n",
>             (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
>              boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
>              boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
> -            opt_bhb_entry_pv ||
> +            opt_bhb_entry_pv || opt_asi_pv ||
>              opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
>             boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
>             boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
>             opt_verw_pv                               ? " VERW"          : "",
>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
> -           opt_bhb_entry_pv                          ? " BHB-entry"     : "");
> +           opt_bhb_entry_pv                          ? " BHB-entry"     : "",
> +           opt_asi_pv                                ? " ASI"           : "");
>  
>      printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
>             opt_xpti_hwdom ? "enabled" : "disabled",

Should this printk() perhaps be suppressed when ASI is in use?

Jan



* Re: [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function
  2024-07-26 15:21 ` [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function Roger Pau Monne
  2024-07-29 13:36   ` Alejandro Vallejo
@ 2024-08-14 10:24   ` Jan Beulich
  1 sibling, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-08-14 10:24 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: alejandro.vallejo, Andrew Cooper, Tim Deegan, xen-devel

On 26.07.2024 17:21, Roger Pau Monne wrote:
> @@ -1673,13 +1668,14 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
>          l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW);
>  
>      /* Slot 260: Per-domain mappings. */
> -    l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> -        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
> +    if ( perdomain_l3 )
> +        l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
> +            l4e_from_page(perdomain_l3, __PAGE_HYPERVISOR_RW);
>  
>      /* Slot 4: Per-domain mappings mirror. */
>      BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) &&
>                   !l4_table_offset(PERDOMAIN_ALT_VIRT_START));
> -    if ( !is_pv_64bit_domain(d) )
> +    if ( perdomain_l3 && maybe_compat )
>          l4t[l4_table_offset(PERDOMAIN_ALT_VIRT_START)] =
>              l4t[l4_table_offset(PERDOMAIN_VIRT_START)];

I think it would be nice if the description could clarify why we need checks
of perdomain_l3 twice here. I assume going forward you want to be able to
pass in NULL. Therefore, if the conditionals are required, I think it would
make sense to have just one, enclosing both (related) writes.
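
As a sketch of the suggested shape, using a simplified stand-in for the L4
table: the slot numbers come from the comments in the quoted hunk, while
l4e_model_t, init_perdomain_slots() and the convention that 0 means "no
per-domain L3" are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define L4_ENTRIES          512
#define PERDOMAIN_SLOT      260  /* "Slot 260: Per-domain mappings." */
#define PERDOMAIN_ALT_SLOT  4    /* "Slot 4: Per-domain mappings mirror." */

typedef uint64_t l4e_model_t;    /* stand-in for l4_pgentry_t */

/* One enclosing check covers both related writes. */
static void init_perdomain_slots(l4e_model_t *l4t, l4e_model_t perdomain_l3,
                                 bool maybe_compat)
{
    if ( perdomain_l3 )
    {
        l4t[PERDOMAIN_SLOT] = perdomain_l3;

        if ( maybe_compat )
            l4t[PERDOMAIN_ALT_SLOT] = l4t[PERDOMAIN_SLOT];
    }
}
```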

Jan



* Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode
  2024-07-26 15:21 ` [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode Roger Pau Monne
@ 2024-08-16 18:02   ` Alejandro Vallejo
  2024-08-19  8:29     ` Jan Beulich
                       ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Alejandro Vallejo @ 2024-08-16 18:02 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> Instead of allocating a monitor table for each vCPU when running in HVM HAP
> mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
> guest context switch.
>
> This limits the amount of memory used for HVM HAP monitor tables to the amount
> of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
> allocation and teardown, since the monitor table handling is removed from
> there.
>
> Note the switch to using a per-CPU monitor table is done regardless of whether

s/per-CPU/per-pCPU/

> Address Space Isolation is enabled or not.  Partly for the memory usage
> reduction, and also because it allows to simplify the VM tear down path by not
> having to cleanup the per-vCPU monitor tables.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Note the monitor table is not made static because uses outside of the file
> where it's defined will be added by further patches.
> ---
>  xen/arch/x86/hvm/hvm.c             | 60 ++++++++++++++++++++++++
>  xen/arch/x86/hvm/svm/svm.c         |  5 ++
>  xen/arch/x86/hvm/vmx/vmcs.c        |  1 +
>  xen/arch/x86/hvm/vmx/vmx.c         |  4 ++
>  xen/arch/x86/include/asm/hap.h     |  1 -
>  xen/arch/x86/include/asm/hvm/hvm.h |  8 ++++
>  xen/arch/x86/mm.c                  |  8 ++++
>  xen/arch/x86/mm/hap/hap.c          | 75 ------------------------------
>  xen/arch/x86/mm/paging.c           |  4 +-
>  9 files changed, 87 insertions(+), 79 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 7f4b627b1f5f..3f771bc65677 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
>  static bool __initdata opt_altp2m_enabled;
>  boolean_param("altp2m", opt_altp2m_enabled);
>  
> +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
> +
> +static int allocate_cpu_monitor_table(unsigned int cpu)

To avoid ambiguity, could we call these *_pcpu_*() instead?

> +{
> +    root_pgentry_t *pgt = alloc_xenheap_page();
> +
> +    if ( !pgt )
> +        return -ENOMEM;
> +
> +    clear_page(pgt);
> +
> +    init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
> +                      false, true, false);
> +
> +    ASSERT(!per_cpu(monitor_pgt, cpu));
> +    per_cpu(monitor_pgt, cpu) = pgt;
> +
> +    return 0;
> +}
> +
> +static void free_cpu_monitor_table(unsigned int cpu)
> +{
> +    root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
> +
> +    if ( !pgt )
> +        return;
> +
> +    per_cpu(monitor_pgt, cpu) = NULL;
> +    free_xenheap_page(pgt);
> +}
> +
> +void hvm_set_cpu_monitor_table(struct vcpu *v)
> +{
> +    root_pgentry_t *pgt = this_cpu(monitor_pgt);
> +
> +    ASSERT(pgt);
> +
> +    setup_perdomain_slot(v, pgt);

Why not modify them as part of write_ptbase() instead? As it stands, it appears
to be modifying the PTEs of what may very well be our current PT, which makes
the perdomain slot be in a $DEITY-knows-what state until the next flush
(presumably the write to cr3 in write_ptbase()?; assuming no PCIDs).

Setting the slot up right before the cr3 change should reduce the potential for
misuse.

> +
> +    make_cr3(v, _mfn(virt_to_mfn(pgt)));
> +}
> +
> +void hvm_clear_cpu_monitor_table(struct vcpu *v)
> +{
> +    /* Poison %cr3, it will be updated when the vCPU is scheduled. */
> +    make_cr3(v, INVALID_MFN);

I think this would benefit from more exposition in the comment. If I'm getting
this right, after descheduling this vCPU we can't assume it'll be rescheduled
on the same pCPU, and if it's not it'll end up using a different monitor table.
This poison value is meant to highlight forgetting to set cr3 in the
"ctxt_switch_to()" path. 

All of that can be deduced from what you wrote and sufficient headscratching
but seeing how this is invoked from the context switch path it's not incredibly
clear whether you meant the perdomain slot would be updated by the next vCPU or
what I stated in the previous paragraph.

Assuming it is as I mentioned, maybe hvm_forget_cpu_monitor_table() would
convey what it does better? i.e: the vCPU forgets/unbinds the monitor table
from its internal state.
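
The reading above can be sketched as a small C model.  It reflects only the
interpretation given in this mail, not anything confirmed by the patch
author, and struct vcpu_model, the table addresses and the function names
are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PCPUS     2
#define INVALID_CR3  UINT64_MAX   /* stand-in for the INVALID_MFN poison */

struct vcpu_model {
    uint64_t cr3;
};

/* One root table per pCPU; the addresses are arbitrary. */
static const uint64_t monitor_pgt[NR_PCPUS] = { 0x1000, 0x2000 };

/* Scheduling in: bind the vCPU to the table of the pCPU it now runs on. */
static void set_cpu_monitor_table(struct vcpu_model *v, unsigned int pcpu)
{
    v->cr3 = monitor_pgt[pcpu];
}

/*
 * Scheduling out: the vCPU may next run on a different pCPU, so forget the
 * table and leave a poison value that makes a missing "set" obvious.
 */
static void forget_cpu_monitor_table(struct vcpu_model *v)
{
    v->cr3 = INVALID_CR3;
}
```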

Cheers,
Alejandro



* Re: [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot
  2024-07-26 15:22 ` [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot Roger Pau Monne
@ 2024-08-16 18:40   ` Alejandro Vallejo
  2024-09-27  9:46     ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Alejandro Vallejo @ 2024-08-16 18:40 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Fri Jul 26, 2024 at 4:22 PM BST, Roger Pau Monne wrote:
> So far L4 slot 260 has always been per-domain, in other words: all vCPUs of a
> domain share the same L3 entry.  Currently only 3 slots are used in that L3
> table, which leaves plenty of room.
>
> Introduce a per-CPU L3 that's used when the domain has Address Space Isolation
> enabled.  Such per-CPU L3 gets currently populated using the same L3 entries
> present on the per-domain L3 (d->arch.perdomain_l3_pg).
>
> No functional change expected, as the per-CPU L3 is always a copy of the
> contents of d->arch.perdomain_l3_pg.
>
> Note that all the per-domain L3 entries are populated at domain create, and
> hence there's no need to sync the state of the per-CPU L3 as the domain won't
> yet be running when the L3 is modified.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Still scratching my head with the details on this, but in general I'm utterly
confused whenever I read per-CPU in the series because it's not obvious which
CPU (p or v) I should be thinking about. A general change that would help a lot
is to replace every instance of per-CPU with per-vCPU or per-pCPU as needed.

Cheers,
Alejandro



* Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode
  2024-08-16 18:02   ` Alejandro Vallejo
@ 2024-08-19  8:29     ` Jan Beulich
  2024-08-19 18:22     ` Alejandro Vallejo
  2024-09-25 16:19     ` Roger Pau Monné
  2 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-08-19  8:29 UTC (permalink / raw)
  To: Alejandro Vallejo, Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 16.08.2024 20:02, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
>> Instead of allocating a monitor table for each vCPU when running in HVM HAP
>> mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
>> guest context switch.
>>
>> This limits the amount of memory used for HVM HAP monitor tables to the amount
>> of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
>> allocation and teardown, since the monitor table handling is removed from
>> there.
>>
>> Note the switch to using a per-CPU monitor table is done regardless of whether
> 
> s/per-CPU/per-pCPU/

While this adjustment is probably fine (albeit I wouldn't insist), ...

>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
>>  static bool __initdata opt_altp2m_enabled;
>>  boolean_param("altp2m", opt_altp2m_enabled);
>>  
>> +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
>> +
>> +static int allocate_cpu_monitor_table(unsigned int cpu)
> 
> To avoid ambiguity, could we call these *_pcpu_*() instead?

... I can spot only very few functions with "pcpu" in their names, and I
think we're also pretty clear in distinguishing vcpu from cpu. Therefore
I'd rather not see any p-s added to function names.

Jan



* Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode
  2024-08-16 18:02   ` Alejandro Vallejo
  2024-08-19  8:29     ` Jan Beulich
@ 2024-08-19 18:22     ` Alejandro Vallejo
  2024-09-25 16:19     ` Roger Pau Monné
  2 siblings, 0 replies; 64+ messages in thread
From: Alejandro Vallejo @ 2024-08-19 18:22 UTC (permalink / raw)
  To: Alejandro Vallejo, Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Fri Aug 16, 2024 at 7:02 PM BST, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> > Instead of allocating a monitor table for each vCPU when running in HVM HAP
> > mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
> > guest context switch.
> >
> > This limits the amount of memory used for HVM HAP monitor tables to the amount
> > of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
> > allocation and teardown, since the monitor table handling is removed from
> > there.
> >
> > Note the switch to using a per-CPU monitor table is done regardless of whether
>
> s/per-CPU/per-pCPU/
>
> > Address Space Isolation is enabled or not.  Partly for the memory usage
> > reduction, and also because it allows to simplify the VM tear down path by not
> > having to cleanup the per-vCPU monitor tables.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Note the monitor table is not made static because uses outside of the file
> > where it's defined will be added by further patches.
> > ---
> >  xen/arch/x86/hvm/hvm.c             | 60 ++++++++++++++++++++++++
> >  xen/arch/x86/hvm/svm/svm.c         |  5 ++
> >  xen/arch/x86/hvm/vmx/vmcs.c        |  1 +
> >  xen/arch/x86/hvm/vmx/vmx.c         |  4 ++
> >  xen/arch/x86/include/asm/hap.h     |  1 -
> >  xen/arch/x86/include/asm/hvm/hvm.h |  8 ++++
> >  xen/arch/x86/mm.c                  |  8 ++++
> >  xen/arch/x86/mm/hap/hap.c          | 75 ------------------------------
> >  xen/arch/x86/mm/paging.c           |  4 +-
> >  9 files changed, 87 insertions(+), 79 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 7f4b627b1f5f..3f771bc65677 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
> >  static bool __initdata opt_altp2m_enabled;
> >  boolean_param("altp2m", opt_altp2m_enabled);
> >  
> > +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
> > +
> > +static int allocate_cpu_monitor_table(unsigned int cpu)
>
> To avoid ambiguity, could we call these *_pcpu_*() instead?
>
> > +{
> > +    root_pgentry_t *pgt = alloc_xenheap_page();
> > +
> > +    if ( !pgt )
> > +        return -ENOMEM;
> > +
> > +    clear_page(pgt);
> > +
> > +    init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
> > +                      false, true, false);
> > +
> > +    ASSERT(!per_cpu(monitor_pgt, cpu));
> > +    per_cpu(monitor_pgt, cpu) = pgt;
> > +
> > +    return 0;
> > +}
> > +
> > +static void free_cpu_monitor_table(unsigned int cpu)
> > +{
> > +    root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
> > +
> > +    if ( !pgt )
> > +        return;
> > +
> > +    per_cpu(monitor_pgt, cpu) = NULL;
> > +    free_xenheap_page(pgt);
> > +}
> > +
> > +void hvm_set_cpu_monitor_table(struct vcpu *v)
> > +{
> > +    root_pgentry_t *pgt = this_cpu(monitor_pgt);
> > +
> > +    ASSERT(pgt);
> > +
> > +    setup_perdomain_slot(v, pgt);
>
> Why not modify them as part of write_ptbase() instead? As it stands, it appears
> to be modifying the PTEs of what may very well be our current PT, which makes
> the perdomain slot be in a $DEITY-knows-what state until the next flush
> (presumably the write to cr3 in write_ptbase()?; assuming no PCIDs).
>
> Setting the slot up right before the cr3 change should reduce the potential for
> misuse.
>
> > +
> > +    make_cr3(v, _mfn(virt_to_mfn(pgt)));
> > +}
> > +
> > +void hvm_clear_cpu_monitor_table(struct vcpu *v)
> > +{
> > +    /* Poison %cr3, it will be updated when the vCPU is scheduled. */
> > +    make_cr3(v, INVALID_MFN);
>
> I think this would benefit from more exposition in the comment. If I'm getting
> this right, after descheduling this vCPU we can't assume it'll be rescheduled
> on the same pCPU, and if it's not it'll end up using a different monitor table.
> This poison value is meant to highlight forgetting to set cr3 in the
> "ctxt_switch_to()" path. 
>
> All of that can be deduced from what you wrote and sufficient headscratching
> but seeing how this is invoked from the context switch path it's not incredibly
> clear whether you meant the perdomain slot would be updated by the next vCPU or
> what I stated in the previous paragraph.
>
> Assuming it is as I mentioned, maybe hvm_forget_cpu_monitor_table() would
> convey what it does better? i.e: the vCPU forgets/unbinds the monitor table
> from its internal state.
>
> Cheers,
> Alejandro

After playing with the code for a while I'm becoming increasingly convinced
that we don't want to tie hvm_clear_cpu_monitor_table() to the ctx_switch_to
handlers at all. In __context_switch() we would ideally like to delay restoring
the state until after said state is available in the page tables (i.e: after
write_ptbase()).

With that division we can do saves and restores with far fewer headaches as we
can assume that the pcpu fixmap always contains the relevant data.

Cheers,
Alejandro





* Re: [PATCH 15/22] x86/idle: allow using a per-pCPU L4
  2024-07-26 15:21 ` [PATCH 15/22] x86/idle: allow using a per-pCPU L4 Roger Pau Monne
@ 2024-08-21 16:42   ` Alejandro Vallejo
  2024-09-27  9:29     ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Alejandro Vallejo @ 2024-08-21 16:42 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 9cfcf0dc63f3..b62c4311da6c 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -555,6 +555,7 @@ void arch_vcpu_regs_init(struct vcpu *v)
>  int arch_vcpu_create(struct vcpu *v)
>  {
>      struct domain *d = v->domain;
> +    root_pgentry_t *pgt = NULL;
>      int rc;
>  
>      v->arch.flags = TF_kernel_mode;
> @@ -589,7 +590,23 @@ int arch_vcpu_create(struct vcpu *v)
>      else
>      {
>          /* Idle domain */
> -        v->arch.cr3 = __pa(idle_pg_table);
> +        if ( (opt_asi_pv || opt_asi_hvm) && v->vcpu_id )
> +        {
> +            pgt = alloc_xenheap_page();
> +
> +            /*
> +             * For the idle vCPU 0 (the BSP idle vCPU) use idle_pg_table
> +             * directly, there's no need to create yet another copy.
> +             */

Shouldn't this comment be in the else branch instead? Or be reworded to refer
to non-0 vCPUs.

> +            rc = -ENOMEM;

While it's true rc is overridden later, I feel uneasy leaving it with -ENOMEM
after the check. Could we have it immediately before "goto fail"?
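
To illustrate the ordering meant here, a standalone model follows. It is
not the Xen code: model_vcpu_create() and fake_alloc() are hypothetical
stand-ins, and malloc()/free() replace alloc_xenheap_page() and
free_xenheap_page(). The only point is that rc receives its error value on
the failing path and nowhere else.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for alloc_xenheap_page(); can be forced to fail. */
static void *fake_alloc(bool fail)
{
    return fail ? NULL : malloc(4096);
}

/*
 * Model of the idle-vCPU branch: rc is only given an error value on the
 * path that actually fails, so no stale -ENOMEM survives the check.
 */
static int model_vcpu_create(bool alloc_fails)
{
    void *pgt;
    int rc;

    pgt = fake_alloc(alloc_fails);
    if ( !pgt )
    {
        rc = -ENOMEM;   /* set immediately before bailing out */
        goto fail;
    }

    rc = 0;
    free(pgt);          /* the model does not keep the page around */
    return rc;

 fail:
    free(pgt);          /* free(NULL) is a no-op */
    return rc;
}
```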

> +            if ( !pgt )
> +                goto fail;
> +
> +            copy_page(pgt, idle_pg_table);
> +            v->arch.cr3 = __pa(pgt);
> +        }
> +        else
> +            v->arch.cr3 = __pa(idle_pg_table);
>          rc = 0;
>          v->arch.msrs = ZERO_BLOCK_PTR; /* Catch stray misuses */
>      }
> @@ -611,6 +628,7 @@ int arch_vcpu_create(struct vcpu *v)
>      vcpu_destroy_fpu(v);
>      xfree(v->arch.msrs);
>      v->arch.msrs = NULL;
> +    free_xenheap_page(pgt);
>  
>      return rc;
>  }

I guess the idle domain has a forever lifetime and its vCPUs are kept around
forever too, right? Otherwise we'd need extra logic in vcpu_destroy() to free
the page table copies, should they exist.

Cheers,
Alejandro



* Re: [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot
  2024-08-13 15:54   ` Jan Beulich
@ 2024-09-10  8:54     ` Roger Pau Monné
  2024-09-10  9:00       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-10  8:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On Tue, Aug 13, 2024 at 05:54:54PM +0200, Jan Beulich wrote:
> On 26.07.2024 17:21, Roger Pau Monne wrote:
> > The idle_pg_table L4 is cloned to create all the other L4 Xen uses, and hence
> > it shouldn't be modified once further L4 are created.
> 
> Yes, but the window between moving into SYS_STATE_smp_boot and Dom0 having
> its initial page tables created is quite large. If the justification was
> relative to AP bringup, that may be all fine. But when related to cloning,
> I think that would then truly want keying to there being any non-system
> domain(s).

Further changes in this series will add a per-CPU idle page table, and
hence we need to ensure that by the time APs are started the BSP L4 idle
page directory is not changed, as otherwise the copies in the APs
would get out of sync.

The idle system domain is indeed tied to the idle page tables, but the
idle vCPU0 (the BSP) directly uses idle_pg_table (no copying), and
hence it's fine to allow modifications of the L4 idle page table
directory up to when APs are started (those will indeed make copies of
the idle L4).
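
As a rough illustration of why late writes are a problem, here is a
standalone model of the cloning. It is a simplification under stated
assumptions: an L4 is reduced to a plain array of 512 entries, and
l4_model_t, clone_l4() and late_write_desyncs() are made-up names rather
than Xen interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define L4_ENTRIES 512  /* entries in an x86-64 root page table */

/* Simplified model: an L4 is just an array of 64-bit entries. */
typedef struct { uint64_t e[L4_ENTRIES]; } l4_model_t;

/* AP bringup in the model: the AP takes a private copy of the BSP's L4. */
static void clone_l4(l4_model_t *ap, const l4_model_t *bsp)
{
    memcpy(ap, bsp, sizeof(*ap));
}

/* True if both tables hold identical entries. */
static bool l4_in_sync(const l4_model_t *a, const l4_model_t *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/*
 * Returns true if a write to the BSP L4 made after the AP copy was taken
 * leaves that copy stale.
 */
static bool late_write_desyncs(void)
{
    static l4_model_t bsp, ap;

    memset(&bsp, 0, sizeof(bsp));
    bsp.e[256] = 0x1000 | 1;   /* populated before AP bringup: gets copied */
    clone_l4(&ap, &bsp);
    if ( !l4_in_sync(&bsp, &ap) )
        return false;

    bsp.e[257] = 0x2000 | 1;   /* populated after the copy: AP never sees it */
    return !l4_in_sync(&bsp, &ap);
}
```

Entries written before the copy are shared by construction; the one written
afterwards exists only in the BSP table, which is the situation the BUG_ON()
is meant to catch.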

> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -5023,6 +5023,12 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
> >          mfn_t l3mfn;
> >          l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn);
> >  
> > +        /*
> > +         * dom0 is build at smp_boot, at which point we already create new L4s
> > +         * based on idle_pg_table.
> > +         */
> > +        BUG_ON(system_state >= SYS_STATE_smp_boot);
> 
> Which effectively means most of this function could become __init (e.g. by
> moving into a helper). We'd then hit the BUG_ON() prior to init_done()
> destroying the .init.* mappings, and we'd simply #PF afterwards. That's
> not so much for the space savings in .text, but to document the limited
> lifetime of the (helper) function directly in its function head.

IMO the BUG_ON() is clearer to debug, but I won't mind splitting the
logic inside the if body into a separate helper.

> I further wonder whether in such a case the enclosing if() wouldn't want
> to gain unlikely() at the same time.

Yes, I can certainly add that.
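
Roughly what I have in mind, as a standalone model rather than the real
virt_to_xen_l3e(): the enum is a reduced subset of states where only the
ordering matters, and l4_write_allowed() is a made-up helper that returns
false where the patch would BUG(). unlikely() is spelled out via the
GCC/Clang builtin so the snippet builds on its own.

```c
#include <stdbool.h>

/* Branch hint: mark the condition as rarely true (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Reduced, illustrative subset of boot states; only the ordering matters. */
enum sys_state_model {
    STATE_early_boot,
    STATE_boot,
    STATE_smp_boot,
    STATE_active,
};

/*
 * Model of the guarded path: returns true if populating a missing L4
 * entry is still permitted, false where the patch would BUG().
 */
static bool l4_write_allowed(bool l4e_present, enum sys_state_model state)
{
    if ( unlikely(!l4e_present) )
    {
        if ( state >= STATE_smp_boot )
            return false;   /* BUG_ON(system_state >= SYS_STATE_smp_boot) */
        /* ... allocate the L3 and hook it into the L4 here ... */
    }

    return true;
}
```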

Thanks, Roger.



* Re: [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot
  2024-09-10  8:54     ` Roger Pau Monné
@ 2024-09-10  9:00       ` Jan Beulich
  2024-09-10  9:32         ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-09-10  9:00 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On 10.09.2024 10:54, Roger Pau Monné wrote:
> On Tue, Aug 13, 2024 at 05:54:54PM +0200, Jan Beulich wrote:
>> On 26.07.2024 17:21, Roger Pau Monne wrote:
>>> The idle_pg_table L4 is cloned to create all the other L4 Xen uses, and hence
>>> it shouldn't be modified once further L4 are created.
>>
>> Yes, but the window between moving into SYS_STATE_smp_boot and Dom0 having
>> its initial page tables created is quite large. If the justification was
>> relative to AP bringup, that may be all fine. But when related to cloning,
>> I think that would then truly want keying to there being any non-system
>> domain(s).
> 
> Further changes in this series will add a per-CPU idle page table, and
> hence we need to ensure that by the time APs are started the BSP L4 idle
> page directory is not changed, as otherwise the copies in the APs
> would get out of sync.
> 
> The idle system domain is indeed tied to the idle page tables, but the
> idle vCPU0 (the BSP) directly uses idle_pg_table (no copying), and
> hence it's fine to allow modifications of the L4 idle page table
> directory up to when APs are started (those will indeed make copies of
> the idle L4).

Which may want at least mentioning in the description then. I take it
that ...

>>> --- a/xen/arch/x86/mm.c
>>> +++ b/xen/arch/x86/mm.c
>>> @@ -5023,6 +5023,12 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
>>>          mfn_t l3mfn;
>>>          l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn);
>>>  
>>> +        /*
>>> +         * dom0 is build at smp_boot, at which point we already create new L4s
>>> +         * based on idle_pg_table.
>>> +         */

... this comment is then refined by the later patches you refer to?

>>> +        BUG_ON(system_state >= SYS_STATE_smp_boot);
>>
>> Which effectively means most of this function could become __init (e.g. by
>> moving into a helper). We'd then hit the BUG_ON() prior to init_done()
>> destroying the .init.* mappings, and we'd simply #PF afterwards. That's
>> not so much for the space savings in .text, but to document the limited
>> lifetime of the (helper) function directly in its function head.
> 
> IMO the BUG_ON() is clearer to debug,

Fair point - it's indeed a balance between two possible goals. I guess ...

> but I won't mind splitting the
> logic inside the if body into a separate helper.

... simply keep it as you have it.

Jan



* Re: [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot
  2024-09-10  9:00       ` Jan Beulich
@ 2024-09-10  9:32         ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-10  9:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On Tue, Sep 10, 2024 at 11:00:27AM +0200, Jan Beulich wrote:
> On 10.09.2024 10:54, Roger Pau Monné wrote:
> > On Tue, Aug 13, 2024 at 05:54:54PM +0200, Jan Beulich wrote:
> >> On 26.07.2024 17:21, Roger Pau Monne wrote:
> >>> The idle_pg_table L4 is cloned to create all the other L4 Xen uses, and hence
> >>> it shouldn't be modified once further L4 are created.
> >>
> >> Yes, but the window between moving into SYS_STATE_smp_boot and Dom0 having
> >> its initial page tables created is quite large. If the justification was
> >> relative to AP bringup, that may be all fine. But when related to cloning,
> >> I think that would then truly want keying to there being any non-system
> >> domain(s).
> > 
> > Further changes in this series will add a per-CPU idle page table, and
> > hence we need to ensure that by the time APs are started the BSP L4 idle
> > page directory is not changed, as otherwise the copies in the APs
> > would get out of sync.
> > 
> > The idle system domain is indeed tied to the idle page tables, but the
> > idle vCPU0 (the BSP) directly uses idle_pg_table (no copying), and
> > hence it's fine to allow modifications of the L4 idle page table
> > directory up to when APs are started (those will indeed make copies of
> > the idle L4).
> 
> Which may want at least mentioning in the description then. I take it
> that ...
> 
> >>> --- a/xen/arch/x86/mm.c
> >>> +++ b/xen/arch/x86/mm.c
> >>> @@ -5023,6 +5023,12 @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v)
> >>>          mfn_t l3mfn;
> >>>          l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn);
> >>>  
> >>> +        /*
> >>> +         * dom0 is build at smp_boot, at which point we already create new L4s
> >>> +         * based on idle_pg_table.
> >>> +         */
> 
> ... this comment is then refined by the later patches you refer to?

Hm, I would have to double check, not sure I've updated it once the
idle_pg_table is cloned for AP bringup.

Will expand commit message and update the comment here if not done
already by later patches.

Thanks, Roger.



* Re: [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-08-14 10:10   ` Jan Beulich
@ 2024-09-25 13:31     ` Roger Pau Monné
  2024-09-25 14:03       ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-25 13:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: alejandro.vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On Wed, Aug 14, 2024 at 12:10:56PM +0200, Jan Beulich wrote:
> On 26.07.2024 17:21, Roger Pau Monne wrote:
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
> >  
> >  ### spec-ctrl (x86)
> >  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
> > ->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
> > +>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,
> 
> Is it really appropriate to hide this underneath an x86-only option? Even
> of other architectures won't support it right away, they surely will want
> to down the road? In which case making as much of this common right away
> is probably the best we can do. This goes along with the question whether,
> like e.g. "xpti", this should be a top-level option.

I think it's better placed in spec-ctrl as it's a speculation
mitigation.  I can see your point about sharing with other arches,
maybe when that's needed we can introduce a generic parser of
spec-ctrl options?

It might end up needing slightly different processing for arches
other than x86: on x86 it should be possible to enable the option
only for PV or only for HVM domains, while on other arches that split
might make no sense, as they have no PV support.

> > @@ -2414,10 +2414,10 @@ in place for guests to use.
> >  
> >  Use of a positive boolean value for either of these options is invalid.
> >  
> > -The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=` and `bhb-entry=`
> > -options offer fine grained control over the primitives by Xen.  These impact
> > -Xen's ability to protect itself, and/or Xen's ability to virtualise support
> > -for guests to use.
> > +The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=`, `bhb-entry=` and
> > +`asi=` options offer fine grained control over the primitives by Xen.  These
> 
> Here, ahead of "by Xen", it looks like "used" was missing. Maybe a good
> opportunity to add it?

Oh, yes.

> > @@ -2449,6 +2449,11 @@ for guests to use.
> >    is not available (see `bhi-dis-s`).  The choice of scrubbing sequence can be
> >    selected using the `bhb-seq=` option.  If it is necessary to protect dom0
> >    too, boot with `spec-ctrl=bhb-entry`.
> > +* `asi=` offers control over whether the hypervisor will engage in Address
> > +  Space Isolation, by not having sensitive information mapped in the VMM
> > +  page-tables.  Not having sensitive information on the page-tables avoids
> > +  having to perform some mitigations for speculative attacks when
> > +  context-switching to the hypervisor.
> 
> Is "not having" and ...
> 
> > --- a/xen/arch/x86/include/asm/domain.h
> > +++ b/xen/arch/x86/include/asm/domain.h
> > @@ -458,6 +458,9 @@ struct arch_domain
> >      /* Don't unconditionally inject #GP for unhandled MSRs. */
> >      bool msr_relaxed;
> >  
> > +    /* Run the guest without sensitive information in the VMM page-tables. */
> > +    bool asi;
> 
> ... "without" really going to be fully true? Wouldn't we better say "as little
> as possible" or alike?

Maybe better use:

"...by not having sensitive information permanently mapped..."

And a similar adjustment to the comment?

The key point is that we would only map sensitive information
transiently IMO.

> > @@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
> >              opt_unpriv_mmio = false;
> >              opt_gds_mit = 0;
> >              opt_div_scrub = 0;
> > +
> > +            opt_asi_pv = 0;
> > +            opt_asi_hwdom = 0;
> > +            opt_asi_hvm = 0;
> >          }
> >          else if ( val > 0 )
> >              rc = -EINVAL;
> 
> I'm frequently in trouble when deciding where the split between "=no" and
> "=xen" should be. opt_xpti_* are cleared ahead of the disable_common label;
> considering the similarity I wonder whether the same should be true for ASI
> (as this is also or even mainly about protecting guests from one another),
> or whether the XPTI placement is actually wrong.

Hm, that's a difficult one.  ASI is a Xen implemented mitigation, so
it should be turned off when spec-ctrl=no-xen is used according to the
description of the option:

"spec-ctrl=no-xen can be used to turn off all of Xen’s mitigations"

OTOH, there's no "virtualisation support in place for guests to use"
when no-xen is used.

I have to admit the description of that option is not obviously clear
to me, so

> > @@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
> >  
> >  static __init void xpti_init_default(void)
> >  {
> > +    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
> > +    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
> 
> There is a separate opt_asi_hwdom which isn't used here, but only ...

opt_asi_pv (and opt_asi_hvm) must be set for opt_asi_hwdom to also be
set.  XPTI is slightly different, in that XPTI could be set only for
the hwdom by using `xpti=dom0`.
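
To make that relationship concrete, here is a standalone model of one
reading of the parsing behaviour described in this thread. It is not the
patch's parser: struct asi_opts, enum asi_arg and asi_apply() are made-up
names, and the command line token is pre-decoded into an enum so that no
string parsing has to be guessed at.

```c
#include <stdbool.h>

/* Model of the three knobs named in the patch. */
struct asi_opts {
    bool pv, hvm, hwdom;
};

/* Hypothetical encoding of what followed "asi=" on the command line. */
enum asi_arg {
    ASI_ARG_BOOL,   /* asi=<bool> */
    ASI_ARG_PV,     /* asi=pv     */
    ASI_ARG_HVM,    /* asi=hvm    */
};

/*
 * A plain boolean applies to every domain type, hwdom included.  A
 * sub-option only touches the domUs of the selected mode and leaves
 * hwdom alone.
 */
static struct asi_opts asi_apply(struct asi_opts o, enum asi_arg arg, bool val)
{
    switch ( arg )
    {
    case ASI_ARG_BOOL:
        o.pv = o.hvm = o.hwdom = val;
        break;

    case ASI_ARG_PV:
        o.pv = val;
        break;

    case ASI_ARG_HVM:
        o.hvm = val;
        break;
    }

    return o;
}
```

In this model hwdom can only become true together with pv and hvm, which is
why checking opt_asi_pv alone is enough in xpti_init_default().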

> > +    {
> > +        printk(XENLOG_ERR
> > +               "XPTI is incompatible with Address Space Isolation - disabling ASI\n");
> > +        opt_asi_pv = 0;
> > +    }
> >      if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) ||
> >           cpu_has_rdcl_no )
> >      {
> > @@ -389,9 +428,9 @@ static __init void xpti_init_default(void)
> >      else
> >      {
> >          if ( opt_xpti_hwdom < 0 )
> > -            opt_xpti_hwdom = 1;
> > +            opt_xpti_hwdom = !opt_asi_hwdom;
> >          if ( opt_xpti_domu < 0 )
> > -            opt_xpti_domu = 1;
> > +            opt_xpti_domu = !opt_asi_pv;
> >      }
> 
> ... here?
> 
> It would further seem desirable to me if opt_asi_hwdom had its default set
> later, when we know the kind of Dom0, such that it could be defaulted to
> what opt_asi_{hvm,pv} are set to. This, however, wouldn't be compatible
> with the use here. Perhaps the invocation of xpti_init_default() would
> need deferring, too.

Given the current parsing logic, opt_asi_hwdom will only be set when
both opt_asi_{hvm,pv} are set.  Setting spec-ctrl=asi={pv,hvm} will
only enable ASI for the domUs of the selected mode.

Hence deferring won't make any practical difference, as having
opt_asi_hwdom enabled implies having ASI enabled for all domain
types.

I think the most common case is either having ASI enabled everywhere,
or having ASI enabled only for domUs and the speculation mitigations
also disabled for the hw domain.
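
For reference, the way the XPTI defaults fall out of the ASI settings in
the hunk quoted above can be modelled standalone as below. Assumptions:
struct opts_model and xpti_defaults() are made-up names, the vendor and
RDCL_NO checks are collapsed into a single meltdown_immune flag, and the
body of that first branch (not visible in the quoted hunk) is assumed to
default XPTI to off.

```c
#include <stdbool.h>

/* Tristate XPTI knobs (-1 = not set on the command line) plus ASI knobs. */
struct opts_model {
    int xpti_hwdom, xpti_domu;
    int asi_pv, asi_hwdom;
};

static struct opts_model xpti_defaults(struct opts_model o, bool meltdown_immune)
{
    /* Explicitly requested XPTI wins over ASI for PV guests. */
    if ( (o.xpti_hwdom == 1 || o.xpti_domu == 1) && o.asi_pv == 1 )
        o.asi_pv = 0;

    if ( meltdown_immune )
    {
        /* Assumed: XPTI is not needed on such hardware, default to off. */
        if ( o.xpti_hwdom < 0 )
            o.xpti_hwdom = 0;
        if ( o.xpti_domu < 0 )
            o.xpti_domu = 0;
    }
    else
    {
        /* XPTI only defaults to on where ASI does not already cover it. */
        if ( o.xpti_hwdom < 0 )
            o.xpti_hwdom = !o.asi_hwdom;
        if ( o.xpti_domu < 0 )
            o.xpti_domu = !o.asi_pv;
    }

    return o;
}
```

With ASI enabled only for PV domUs (asi_pv = 1, asi_hwdom = 0) this leaves
XPTI on for the hardware domain and off for domUs.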

> > @@ -643,22 +683,24 @@ static void __init print_details(enum ind_thunk thunk)
> >             opt_eager_fpu                             ? " EAGER_FPU"     : "",
> >             opt_verw_hvm                              ? " VERW"          : "",
> >             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "",
> > -           opt_bhb_entry_hvm                         ? " BHB-entry"     : "");
> > +           opt_bhb_entry_hvm                         ? " BHB-entry"     : "",
> > +           opt_asi_hvm                               ? " ASI"           : "");
> >  
> >  #endif
> >  #ifdef CONFIG_PV
> > -    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
> > +    printk("  Support for PV VMs:%s%s%s%s%s%s%s%s\n",
> >             (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
> >              boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
> >              boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
> > -            opt_bhb_entry_pv ||
> > +            opt_bhb_entry_pv || opt_asi_pv ||
> >              opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
> >             boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
> >             boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
> >             opt_eager_fpu                             ? " EAGER_FPU"     : "",
> >             opt_verw_pv                               ? " VERW"          : "",
> >             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
> > -           opt_bhb_entry_pv                          ? " BHB-entry"     : "");
> > +           opt_bhb_entry_pv                          ? " BHB-entry"     : "",
> > +           opt_asi_pv                                ? " ASI"           : "");
> >  
> >      printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
> >             opt_xpti_hwdom ? "enabled" : "disabled",
> 
> Should this printk() perhaps be suppressed when ASI is in use?

Maybe, I found it useful during development to ensure the logic was
correct, but I guess it's not of much use for plain users.  I will
make the printing conditional on ASI not being uniformly enabled.

Maybe it would be useful to unify XPTI printing with the rest of
mitigations listed in the "Support for PV VMs:" line?  Albeit that
would drop the signaling of opt_xpti_hwdom.

Thanks, Roger.



* Re: [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-09-25 13:31     ` Roger Pau Monné
@ 2024-09-25 14:03       ` Jan Beulich
  2024-09-25 15:27         ` Roger Pau Monné
  0 siblings, 1 reply; 64+ messages in thread
From: Jan Beulich @ 2024-09-25 14:03 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: alejandro.vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 25.09.2024 15:31, Roger Pau Monné wrote:
> On Wed, Aug 14, 2024 at 12:10:56PM +0200, Jan Beulich wrote:
>> On 26.07.2024 17:21, Roger Pau Monne wrote:
>>> --- a/docs/misc/xen-command-line.pandoc
>>> +++ b/docs/misc/xen-command-line.pandoc
>>> @@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>>>  
>>>  ### spec-ctrl (x86)
>>>  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
>>> ->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
>>> +>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,
>>
>> Is it really appropriate to hide this underneath an x86-only option? Even
>> of other architectures won't support it right away, they surely will want
>> to down the road? In which case making as much of this common right away
>> is probably the best we can do. This goes along with the question whether,
>> like e.g. "xpti", this should be a top-level option.
> 
> I think it's better placed in spec-ctrl as it's a speculation
> mitigation.

As is XPTI.

>  I can see your point about sharing with other arches,
> maybe when that's needed we can introduce a generic parser of
> spec-ctrl options?

Not sure how much could be generalized there.

>>> @@ -2449,6 +2449,11 @@ for guests to use.
>>>    is not available (see `bhi-dis-s`).  The choice of scrubbing sequence can be
>>>    selected using the `bhb-seq=` option.  If it is necessary to protect dom0
>>>    too, boot with `spec-ctrl=bhb-entry`.
>>> +* `asi=` offers control over whether the hypervisor will engage in Address
>>> +  Space Isolation, by not having sensitive information mapped in the VMM
>>> +  page-tables.  Not having sensitive information on the page-tables avoids
>>> +  having to perform some mitigations for speculative attacks when
>>> +  context-switching to the hypervisor.
>>
>> Is "not having" and ...
>>
>>> --- a/xen/arch/x86/include/asm/domain.h
>>> +++ b/xen/arch/x86/include/asm/domain.h
>>> @@ -458,6 +458,9 @@ struct arch_domain
>>>      /* Don't unconditionally inject #GP for unhandled MSRs. */
>>>      bool msr_relaxed;
>>>  
>>> +    /* Run the guest without sensitive information in the VMM page-tables. */
>>> +    bool asi;
>>
>> ... "without" really going to be fully true? Wouldn't we better say "as little
>> as possible" or alike?
> 
> Maybe better use:
> 
> "...by not having sensitive information permanently mapped..."
> 
> And a similar adjustment to the comment?

Yes, that's better.

>>> @@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
>>>              opt_unpriv_mmio = false;
>>>              opt_gds_mit = 0;
>>>              opt_div_scrub = 0;
>>> +
>>> +            opt_asi_pv = 0;
>>> +            opt_asi_hwdom = 0;
>>> +            opt_asi_hvm = 0;
>>>          }
>>>          else if ( val > 0 )
>>>              rc = -EINVAL;
>>
>> I'm frequently in trouble when deciding where the split between "=no" and
>> "=xen" should be. opt_xpti_* are cleared ahead of the disable_common label;
>> considering the similarity I wonder whether the same should be true for ASI
>> (as this is also or even mainly about protecting guests from one another),
>> or whether the XPTI placement is actually wrong.
> 
> Hm, that's a difficult one.  ASI is a Xen implemented mitigation, so
> it should be turned off when spec-ctrl=no-xen is used according to the
> description of the option:
> 
> "spec-ctrl=no-xen can be used to turn off all of Xen’s mitigations"

Meaning (aiui) mitigations to protect Xen itself.

>>> @@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
>>>  
>>>  static __init void xpti_init_default(void)
>>>  {
>>> +    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
>>> +    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
>>
>> There is a separate opt_asi_hwdom which isn't used here, but only ...
> 
> opt_asi_pv (and opt_asi_hvm) must be set for opt_asi_hwdom to also be
> set.  XPTI is slightly different, in that XPTI could be set only for
> the hwdom by using `xpti=dom0`.

Hmm, I didn't even notice this oddity (as it feels to me) in parsing.
From the doc provided it wouldn't occur to me that e.g. "asi=pv" won't
affect a PV Dom0. That's (iirc) specifically why "xpti=" has a "hwdom"
sub-option.

>>> @@ -389,9 +428,9 @@ static __init void xpti_init_default(void)
>>>      else
>>>      {
>>>          if ( opt_xpti_hwdom < 0 )
>>> -            opt_xpti_hwdom = 1;
>>> +            opt_xpti_hwdom = !opt_asi_hwdom;
>>>          if ( opt_xpti_domu < 0 )
>>> -            opt_xpti_domu = 1;
>>> +            opt_xpti_domu = !opt_asi_pv;
>>>      }
>>
>> ... here?
>>
>> It would further seem desirable to me if opt_asi_hwdom had its default set
>> later, when we know the kind of Dom0, such that it could be defaulted to
>> what opt_asi_{hvm,pv} are set to. This, however, wouldn't be compatible
>> with the use here. Perhaps the invocation of xpti_init_default() would
>> need deferring, too.
> 
> Given the current parsing logic, opt_asi_hwdom will only be set when
> both opt_asi_{hvm,pv} are set.  Setting spec-ctrl=asi={pv,hvm} will
> only enable ASI for the domUs of the selected mode.
> 
> Hence deferring won't make any practical difference, as having
> opt_asi_hwdom enabled implies having ASI enabled for all domain
> types.

Right, another effect of me not having paid enough attention to that parsing
detail.

>>> @@ -643,22 +683,24 @@ static void __init print_details(enum ind_thunk thunk)
>>>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
>>>             opt_verw_hvm                              ? " VERW"          : "",
>>>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "",
>>> -           opt_bhb_entry_hvm                         ? " BHB-entry"     : "");
>>> +           opt_bhb_entry_hvm                         ? " BHB-entry"     : "",
>>> +           opt_asi_hvm                               ? " ASI"           : "");
>>>  
>>>  #endif
>>>  #ifdef CONFIG_PV
>>> -    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
>>> +    printk("  Support for PV VMs:%s%s%s%s%s%s%s%s\n",
>>>             (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
>>>              boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
>>>              boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
>>> -            opt_bhb_entry_pv ||
>>> +            opt_bhb_entry_pv || opt_asi_pv ||
>>>              opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
>>>             boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
>>>             boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
>>>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
>>>             opt_verw_pv                               ? " VERW"          : "",
>>>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
>>> -           opt_bhb_entry_pv                          ? " BHB-entry"     : "");
>>> +           opt_bhb_entry_pv                          ? " BHB-entry"     : "",
>>> +           opt_asi_pv                                ? " ASI"           : "");
>>>  
>>>      printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
>>>             opt_xpti_hwdom ? "enabled" : "disabled",
>>
>> Should this printk() perhaps be suppressed when ASI is in use?
> 
> Maybe, I found it useful during development to ensure the logic was
> correct, but I guess it's not of much use for plain users.  I will
> make the printing conditional to ASI not being uniformly enabled.
> 
> Maybe it would be useful to unify XPTI printing with the rest of
> mitigations listed in the "Support for PV VMs:" line?  Albeit that
> would drop the signaling of opt_xpti_hwdom.

Which is why I wouldn't want to "unify" it.

Jan



* Re: [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-09-25 14:03       ` Jan Beulich
@ 2024-09-25 15:27         ` Roger Pau Monné
  2024-09-25 15:47           ` Jan Beulich
  0 siblings, 1 reply; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-25 15:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: alejandro.vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On Wed, Sep 25, 2024 at 04:03:04PM +0200, Jan Beulich wrote:
> On 25.09.2024 15:31, Roger Pau Monné wrote:
> > On Wed, Aug 14, 2024 at 12:10:56PM +0200, Jan Beulich wrote:
> >> On 26.07.2024 17:21, Roger Pau Monne wrote:
> >>> --- a/docs/misc/xen-command-line.pandoc
> >>> +++ b/docs/misc/xen-command-line.pandoc
> >>> @@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
> >>>  
> >>>  ### spec-ctrl (x86)
> >>>  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
> >>> ->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
> >>> +>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,
> >>
> >> Is it really appropriate to hide this underneath an x86-only option? Even
> >> of other architectures won't support it right away, they surely will want
> >> to down the road? In which case making as much of this common right away
> >> is probably the best we can do. This goes along with the question whether,
> >> like e.g. "xpti", this should be a top-level option.
> > 
> > I think it's better placed in spec-ctrl as it's a speculation
> > mitigation.
> 
> As is XPTI.

But XPTI predates the introduction of the spec-ctrl option, I assumed
that's why xpti is not part of spec-ctrl.

> >  I can see your point about sharing with other arches,
> > maybe when that's needed we can introduce a generic parser of
> > spec-ctrl options?
> 
> Not sure how much could be generalized there.

Oh, so your point was not about sharing the parsing code, but sharing
the command line documentation about it, sorry, I missed that.

Along the lines of:

asi= boolean | { pv, hvm, hwdom }

Or similar?

Even then sub-options would likely be different between architectures.

> >>> @@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
> >>>              opt_unpriv_mmio = false;
> >>>              opt_gds_mit = 0;
> >>>              opt_div_scrub = 0;
> >>> +
> >>> +            opt_asi_pv = 0;
> >>> +            opt_asi_hwdom = 0;
> >>> +            opt_asi_hvm = 0;
> >>>          }
> >>>          else if ( val > 0 )
> >>>              rc = -EINVAL;
> >>
> >> I'm frequently in trouble when deciding where the split between "=no" and
> >> "=xen" should be. opt_xpti_* are cleared ahead of the disable_common label;
> >> considering the similarity I wonder whether the same should be true for ASI
> >> (as this is also or even mainly about protecting guests from one another),
> >> or whether the XPTI placement is actually wrong.
> > 
> > Hm, that's a difficult one.  ASI is a Xen implemented mitigation, so
> > it should be turned off when spec-ctrl=no-xen is used according to the
> > description of the option:
> > 
> > "spec-ctrl=no-xen can be used to turn off all of Xen’s mitigations"
> 
> Meaning (aiui) mitigations to protect Xen itself.

So that would be speculation attacks that take place in Xen context,
which is what ASI would protect against?

I don't have a strong opinion, but I also have a hard time seeing what
`no-xen` should disable.

> >>> @@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
> >>>  
> >>>  static __init void xpti_init_default(void)
> >>>  {
> >>> +    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
> >>> +    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
> >>
> >> There is a separate opt_asi_hwdom which isn't used here, but only ...
> > 
> > opt_asi_pv (and opt_asi_hvm) must be set for opt_asi_hwdom to also be
> > set.  XPTI is slightly different, in that XPTI could be set only for
> > the hwdom by using `xpti=dom0`.
> 
> Hmm, I didn't even notice this oddity (as it feels to me) in parsing.
> From the doc provided it wouldn't occur to me that e.g. "asi=pv" won't
> affect a PV Dom0. That's (iirc) specifically why "xpti=" has a "hwdom"
> sub-option.

It seems to be like that for all spec-ctrl options, see `bhb-entry`
for example.

> >>> @@ -643,22 +683,24 @@ static void __init print_details(enum ind_thunk thunk)
> >>>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
> >>>             opt_verw_hvm                              ? " VERW"          : "",
> >>>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM)  ? " IBPB-entry"    : "",
> >>> -           opt_bhb_entry_hvm                         ? " BHB-entry"     : "");
> >>> +           opt_bhb_entry_hvm                         ? " BHB-entry"     : "",
> >>> +           opt_asi_hvm                               ? " ASI"           : "");
> >>>  
> >>>  #endif
> >>>  #ifdef CONFIG_PV
> >>> -    printk("  Support for PV VMs:%s%s%s%s%s%s%s\n",
> >>> +    printk("  Support for PV VMs:%s%s%s%s%s%s%s%s\n",
> >>>             (boot_cpu_has(X86_FEATURE_SC_MSR_PV) ||
> >>>              boot_cpu_has(X86_FEATURE_SC_RSB_PV) ||
> >>>              boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ||
> >>> -            opt_bhb_entry_pv ||
> >>> +            opt_bhb_entry_pv || opt_asi_pv ||
> >>>              opt_eager_fpu || opt_verw_pv)            ? ""               : " None",
> >>>             boot_cpu_has(X86_FEATURE_SC_MSR_PV)       ? " MSR_SPEC_CTRL" : "",
> >>>             boot_cpu_has(X86_FEATURE_SC_RSB_PV)       ? " RSB"           : "",
> >>>             opt_eager_fpu                             ? " EAGER_FPU"     : "",
> >>>             opt_verw_pv                               ? " VERW"          : "",
> >>>             boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
> >>> -           opt_bhb_entry_pv                          ? " BHB-entry"     : "");
> >>> +           opt_bhb_entry_pv                          ? " BHB-entry"     : "",
> >>> +           opt_asi_pv                                ? " ASI"           : "");
> >>>  
> >>>      printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
> >>>             opt_xpti_hwdom ? "enabled" : "disabled",
> >>
> >> Should this printk() perhaps be suppressed when ASI is in use?
> > 
> > Maybe, I found it useful during development to ensure the logic was
> > correct, but I guess it's not of much use for plain users.  I will
> > make the printing conditional to ASI not being uniformly enabled.
> > 
> > Maybe it would be useful to unify XPTI printing with the rest of
> > mitigations listed in the "Support for PV VMs:" line?  Albeit that
> > would drop the signaling of opt_xpti_hwdom.
> 
> Which is why I wouldn't want to "unify" it.

Right, I will avoid printing the line if ASI is uniformly enabled.

Thanks, Roger.



* Re: [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option
  2024-09-25 15:27         ` Roger Pau Monné
@ 2024-09-25 15:47           ` Jan Beulich
  0 siblings, 0 replies; 64+ messages in thread
From: Jan Beulich @ 2024-09-25 15:47 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: alejandro.vallejo, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel

On 25.09.2024 17:27, Roger Pau Monné wrote:
> On Wed, Sep 25, 2024 at 04:03:04PM +0200, Jan Beulich wrote:
>> On 25.09.2024 15:31, Roger Pau Monné wrote:
>>> On Wed, Aug 14, 2024 at 12:10:56PM +0200, Jan Beulich wrote:
>>>> On 26.07.2024 17:21, Roger Pau Monne wrote:
>>>>> --- a/docs/misc/xen-command-line.pandoc
>>>>> +++ b/docs/misc/xen-command-line.pandoc
>>>>> @@ -2387,7 +2387,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>>>>>  
>>>>>  ### spec-ctrl (x86)
>>>>>  > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>,
>>>>> ->              {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>,
>>>>> +>              {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>,
>>>>
>>>> Is it really appropriate to hide this underneath an x86-only option? Even
>>>> if other architectures won't support it right away, they surely will want
>>>> to down the road? In which case making as much of this common right away
>>>> is probably the best we can do. This goes along with the question whether,
>>>> like e.g. "xpti", this should be a top-level option.
>>>
>>> I think it's better placed in spec-ctrl as it's a speculation
>>> mitigation.
>>
>> As is XPTI.
> 
> But XPTI predates the introduction of spec-ctrl option, I assumed
> that's why xpti is not part of spec-ctrl.
> 
>>>  I can see your point about sharing with other arches,
>>> maybe when that's needed we can introduce a generic parser of
>>> spec-ctrl options?
>>
>> Not sure how much could be generalized there.
> 
> Oh, so your point was not about sharing the parsing code, but sharing
> the command line documentation about it, sorry, I missed that.

My point was really to share as much as possible, if this was a top-level
option. Of course ...

> Along the lines of:
> 
> asi= boolean | { pv, hvm, hwdom }
> 
> Or similar?
> 
> Even then sub-options would likely be different between architectures.

... the sub-options wouldn't all be generalizable.

>>>>> @@ -143,6 +148,10 @@ static int __init cf_check parse_spec_ctrl(const char *s)
>>>>>              opt_unpriv_mmio = false;
>>>>>              opt_gds_mit = 0;
>>>>>              opt_div_scrub = 0;
>>>>> +
>>>>> +            opt_asi_pv = 0;
>>>>> +            opt_asi_hwdom = 0;
>>>>> +            opt_asi_hvm = 0;
>>>>>          }
>>>>>          else if ( val > 0 )
>>>>>              rc = -EINVAL;
>>>>
>>>> I'm frequently in trouble when deciding where the split between "=no" and
>>>> "=xen" should be. opt_xpti_* are cleared ahead of the disable_common label;
>>>> considering the similarity I wonder whether the same should be true for ASI
>>>> (as this is also or even mainly about protecting guests from one another),
>>>> or whether the XPTI placement is actually wrong.
>>>
>>> Hm, that's a difficult one.  ASI is a Xen implemented mitigation, so
>>> it should be turned off when spec-ctrl=no-xen is used according to the
>>> description of the option:
>>>
>>> "spec-ctrl=no-xen can be used to turn off all of Xen’s mitigations"
>>
>> Meaning (aiui) mitigations to protect Xen itself.
> 
> So that would be speculation attacks that take place in Xen context,
> which is what ASI would protect against?
> 
> I don't have a strong opinion, but I also have a hard time seeing what
> `no-xen` should disable.

I wonder whether Andrew knows of a clear way of expressing where that line
is intended to be drawn.

>>>>> @@ -378,6 +410,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
>>>>>  
>>>>>  static __init void xpti_init_default(void)
>>>>>  {
>>>>> +    ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0);
>>>>> +    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 )
>>>>
>>>> There is a separate opt_asi_hwdom which isn't used here, but only ...
>>>
>>> opt_asi_pv (and opt_asi_hvm) must be set for opt_asi_hwdom to also be
>>> set.  XPTI is slightly different, in that XPTI could be set only for
>>> the hwdom by using `xpti=dom0`.
>>
>> Hmm, I didn't even notice this oddity (as it feels to me) in parsing.
>> From the doc provided it wouldn't occur to me that e.g. "asi=pv" won't
>> affect a PV Dom0. That's (iirc) specifically why "xpti=" has a "hwdom"
>> sub-option.
> 
> It seems to be like that for all spec-ctrl options, see `bhb-entry`
> for example.

Hmm, indeed.

Jan



* Re: [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode
  2024-08-16 18:02   ` Alejandro Vallejo
  2024-08-19  8:29     ` Jan Beulich
  2024-08-19 18:22     ` Alejandro Vallejo
@ 2024-09-25 16:19     ` Roger Pau Monné
  2 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-25 16:19 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri, Aug 16, 2024 at 07:02:54PM +0100, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> > Instead of allocating a monitor table for each vCPU when running in HVM HAP
> > mode, use a per-pCPU monitor table, which gets the per-domain slot updated on
> > guest context switch.
> >
> > This limits the amount of memory used for HVM HAP monitor tables to the amount
> > of active pCPUs, rather than to the number of vCPUs.  It also simplifies vCPU
> > allocation and teardown, since the monitor table handling is removed from
> > there.
> >
> > Note the switch to using a per-CPU monitor table is done regardless of whether
> 
> s/per-CPU/per-pCPU/

Sorry, I might not have been as consistent as I wanted with using pCPU
everywhere.

> > Address Space Isolation is enabled or not.  Partly for the memory usage
> > reduction, and also because it allows to simplify the VM tear down path by not
> > having to cleanup the per-vCPU monitor tables.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Note the monitor table is not made static because uses outside of the file
> > where it's defined will be added by further patches.
> > ---
> >  xen/arch/x86/hvm/hvm.c             | 60 ++++++++++++++++++++++++
> >  xen/arch/x86/hvm/svm/svm.c         |  5 ++
> >  xen/arch/x86/hvm/vmx/vmcs.c        |  1 +
> >  xen/arch/x86/hvm/vmx/vmx.c         |  4 ++
> >  xen/arch/x86/include/asm/hap.h     |  1 -
> >  xen/arch/x86/include/asm/hvm/hvm.h |  8 ++++
> >  xen/arch/x86/mm.c                  |  8 ++++
> >  xen/arch/x86/mm/hap/hap.c          | 75 ------------------------------
> >  xen/arch/x86/mm/paging.c           |  4 +-
> >  9 files changed, 87 insertions(+), 79 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 7f4b627b1f5f..3f771bc65677 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -104,6 +104,54 @@ static const char __initconst warning_hvm_fep[] =
> >  static bool __initdata opt_altp2m_enabled;
> >  boolean_param("altp2m", opt_altp2m_enabled);
> >  
> > +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt);
> > +
> > +static int allocate_cpu_monitor_table(unsigned int cpu)
> 
> To avoid ambiguity, could we call these *_pcpu_*() instead?

As Jan replied, plain 'cpu' in hypervisor function names usually means
a physical CPU.  '_pcpu_' here would IMO imply per-CPU, which it also
is, but that likely doesn't need spelling out in the function name.

> > +{
> > +    root_pgentry_t *pgt = alloc_xenheap_page();
> > +
> > +    if ( !pgt )
> > +        return -ENOMEM;
> > +
> > +    clear_page(pgt);
> > +
> > +    init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL,
> > +                      false, true, false);
> > +
> > +    ASSERT(!per_cpu(monitor_pgt, cpu));
> > +    per_cpu(monitor_pgt, cpu) = pgt;
> > +
> > +    return 0;
> > +}
> > +
> > +static void free_cpu_monitor_table(unsigned int cpu)
> > +{
> > +    root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu);
> > +
> > +    if ( !pgt )
> > +        return;
> > +
> > +    per_cpu(monitor_pgt, cpu) = NULL;
> > +    free_xenheap_page(pgt);
> > +}
> > +
> > +void hvm_set_cpu_monitor_table(struct vcpu *v)
> > +{
> > +    root_pgentry_t *pgt = this_cpu(monitor_pgt);
> > +
> > +    ASSERT(pgt);
> > +
> > +    setup_perdomain_slot(v, pgt);
> 
> Why not modify them as part of write_ptbase() instead? As it stands, it appears
> to be modifying the PTEs of what may very well be our current PT, which makes
> the perdomain slot be in a $DEITY-knows-what state until the next flush
> (presumably the write to cr3 in write_ptbase()?; assuming no PCIDs).
> 
> Setting the slot up right before the cr3 change should reduce the potential for
> misuse.

The reasoning for doing it here is that the per-domain slot only needs
setting on context switch.  In the PV case write_ptbase() will be
called each time the guest switches %cr3, but setting the per-domain
slot is not required for each call if the vCPU hasn't changed.

Let me see if I can arrange for the current contents of
setup_perdomain_slot() to be merged into write_ptbase(). Note
setup_perdomain_slot() started as a wrapper to extract XPTI specific
code from paravirt_ctxt_switch_to().

> > +
> > +    make_cr3(v, _mfn(virt_to_mfn(pgt)));
> > +}
> > +
> > +void hvm_clear_cpu_monitor_table(struct vcpu *v)
> > +{
> > +    /* Poison %cr3, it will be updated when the vCPU is scheduled. */
> > +    make_cr3(v, INVALID_MFN);
> 
> I think this would benefit from more exposition in the comment. If I'm getting
> this right, after descheduling this vCPU we can't assume it'll be rescheduled
> on the same pCPU, and if it's not it'll end up using a different monitor table.
> This poison value is meant to highlight forgetting to set cr3 in the
> "ctxt_switch_to()" path. 

Indeed, we would like to avoid running on a different pCPU while still
using the monitor page-tables from whatever pCPU the vCPU previously
had been running.

> All of that can be deduced from what you wrote and sufficient headscratching
> but seeing how this is invoked from the context switch path it's not incredibly
> clear whether you meant the perdomain slot would be updated by the next vCPU or
> what I stated in the previous paragraph.

No, it's just about not leaving stale values in the vcpu struct.
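
To illustrate the intent, here is a toy model (not code from the patch)
of binding and unbinding a vCPU to a per-pCPU monitor table.  Every
name below (toy_vcpu, toy_monitor_pgt, TOY_POISON_CR3 and the helpers)
is made up for the example; the poison value merely stands in for a
%cr3 built from INVALID_MFN:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_PCPUS   2
/* Hypothetical poison value, standing in for a cr3 made from INVALID_MFN. */
#define TOY_POISON_CR3 UINT64_MAX

/* Hypothetical per-pCPU root page-table addresses. */
static const uint64_t toy_monitor_pgt[TOY_NR_PCPUS] = { 0x1000, 0x2000 };

struct toy_vcpu {
    uint64_t cr3;
};

/* Context switch in: point the vCPU at the table of the pCPU it runs on. */
static void toy_bind_monitor_table(struct toy_vcpu *v, unsigned int cpu)
{
    assert(cpu < TOY_NR_PCPUS);
    v->cr3 = toy_monitor_pgt[cpu];
}

/* Context switch out: drop the stale value from the vCPU state. */
static void toy_unbind_monitor_table(struct toy_vcpu *v)
{
    v->cr3 = TOY_POISON_CR3;
}

/* A vCPU may only run on a pCPU whose table it is currently bound to. */
static int toy_cr3_is_usable(const struct toy_vcpu *v, unsigned int cpu)
{
    return cpu < TOY_NR_PCPUS && v->cr3 == toy_monitor_pgt[cpu];
}
```

With this model a vCPU that was descheduled from pCPU 0 holds the
poison value, so it cannot be mistaken for being bound to any pCPU
until the next bind call.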

> Assuming it is as I mentioned, maybe hvm_forget_cpu_monitor_table() would
> convey what it does better? i.e: the vCPU forgets/unbinds the monitor table
> from its internal state.

Right, I assumed that 'clear' already conveyed the concept of
unbinding from a pCPU.  If I use unbind, then I guess I should also
use 'bind' for what I currently call 'set'.

Thanks, Roger.



* Re: [PATCH 15/22] x86/idle: allow using a per-pCPU L4
  2024-08-21 16:42   ` Alejandro Vallejo
@ 2024-09-27  9:29     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-27  9:29 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Wed, Aug 21, 2024 at 05:42:26PM +0100, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:21 PM BST, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index 9cfcf0dc63f3..b62c4311da6c 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -555,6 +555,7 @@ void arch_vcpu_regs_init(struct vcpu *v)
> >  int arch_vcpu_create(struct vcpu *v)
> >  {
> >      struct domain *d = v->domain;
> > +    root_pgentry_t *pgt = NULL;
> >      int rc;
> >  
> >      v->arch.flags = TF_kernel_mode;
> > @@ -589,7 +590,23 @@ int arch_vcpu_create(struct vcpu *v)
> >      else
> >      {
> >          /* Idle domain */
> > -        v->arch.cr3 = __pa(idle_pg_table);
> > +        if ( (opt_asi_pv || opt_asi_hvm) && v->vcpu_id )
> > +        {
> > +            pgt = alloc_xenheap_page();
> > +
> > +            /*
> > +             * For the idle vCPU 0 (the BSP idle vCPU) use idle_pg_table
> > +             * directly, there's no need to create yet another copy.
> > +             */
> 
> Shouldn't this comment be in the else branch instead? Or reworded to refer to
> non-0 vCPUs.

Sure, moved to the else branch.

> > +            rc = -ENOMEM;
> 
> While it's true rc is overridden later, I feel uneasy leaving it with -ENOMEM
> after the check. Could we have it immediately before "goto fail"?

I have to admit I found this coding style weird at first, but it's
used all over Xen.  I don't mind setting rc ahead of the goto; AFAICT
the only benefit of the current style is that we can avoid the braces
around the if body, since it is then a single statement.
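
For comparison, a standalone sketch of the two styles (not code from
the patch).  The toy_* names and the injectable allocator are invented
for the example so that the failure path can be exercised:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical allocator hook, so the failure path can be triggered. */
typedef void *(*toy_alloc_fn)(void);

static char toy_page[4096];
static void *toy_alloc_ok(void)   { return toy_page; }
static void *toy_alloc_fail(void) { return NULL; }

/* Style used in the patch: preset rc so the if body is one statement. */
static int toy_create_preset_rc(toy_alloc_fn alloc, void **out)
{
    int rc;
    void *pgt;

    assert(alloc && out);
    pgt = alloc();

    rc = -ENOMEM;
    if ( !pgt )
        goto fail;

    *out = pgt;
    rc = 0;

 fail:
    return rc;
}

/* Alternative: set rc right before the goto, at the cost of braces. */
static int toy_create_rc_at_goto(toy_alloc_fn alloc, void **out)
{
    int rc = 0;
    void *pgt;

    assert(alloc && out);
    pgt = alloc();

    if ( !pgt )
    {
        rc = -ENOMEM;
        goto fail;
    }

    *out = pgt;

 fail:
    return rc;
}
```

Both variants return the same values; the second never leaves rc
holding -ENOMEM on the success path, which is the concern raised above.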

> > +            if ( !pgt )
> > +                goto fail;
> > +
> > +            copy_page(pgt, idle_pg_table);
> > +            v->arch.cr3 = __pa(pgt);
> > +        }
> > +        else
> > +            v->arch.cr3 = __pa(idle_pg_table);
> >          rc = 0;
> >          v->arch.msrs = ZERO_BLOCK_PTR; /* Catch stray misuses */
> >      }
> > @@ -611,6 +628,7 @@ int arch_vcpu_create(struct vcpu *v)
> >      vcpu_destroy_fpu(v);
> >      xfree(v->arch.msrs);
> >      v->arch.msrs = NULL;
> > +    free_xenheap_page(pgt);
> >  
> >      return rc;
> >  }
> 
> I guess the idle domain has a forever lifetime and its vCPUs are kept around
> forever too, right?; otherwise we'd need extra logic in vcpu_destroy()
> to free the page table copies should they exist too.

Indeed, vcpus are only destroyed when destroying domains, and system
domains are never destroyed.

Thanks, Roger.



* Re: [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot
  2024-08-16 18:40   ` Alejandro Vallejo
@ 2024-09-27  9:46     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-27  9:46 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri, Aug 16, 2024 at 07:40:40PM +0100, Alejandro Vallejo wrote:
> On Fri Jul 26, 2024 at 4:22 PM BST, Roger Pau Monne wrote:
> > So far L4 slot 260 has always been per-domain, in other words: all vCPUs of a
> > domain share the same L3 entry.  Currently only 3 slots are used in that L3
> > table, which leaves plenty of room.
> >
> > Introduce a per-CPU L3 that's used when the domain has Address Space Isolation
> > enabled.  Such per-CPU L3 gets currently populated using the same L3 entries
> > present on the per-domain L3 (d->arch.perdomain_l3_pg).
> >
> > No functional change expected, as the per-CPU L3 is always a copy of the
> > contents of d->arch.perdomain_l3_pg.
> >
> > Note that all the per-domain L3 entries are populated at domain create, and
> > hence there's no need to sync the state of the per-CPU L3 as the domain won't
> > yet be running when the L3 is modified.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Still scratching my head with the details on this, but in general I'm utterly
> confused whenever I read per-CPU in the series because it's not obvious which
> CPU (p or v) I should be thinking about. A general change that would help a lot
> is to replace every instance of per-CPU with per-vCPU or per-pCPU as needed.

per-CPU is always per-pCPU, as CPU without any prefix should always
refer to a physical CPU.  I think it's only recently that we have
started using pCPU vs vCPU, in the past it always was CPU vs vCPU.

I will attempt to be better at explicitly using pCPU instead of CPU in
the commit messages, sorry.

Thanks, Roger.



* Re: [PATCH 22/22] x86/mm: zero stack on stack switch or reset
  2024-08-13 13:16   ` Jan Beulich
@ 2024-09-27 10:22     ` Roger Pau Monné
  0 siblings, 0 replies; 64+ messages in thread
From: Roger Pau Monné @ 2024-09-27 10:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: alejandro.vallejo, Andrew Cooper, xen-devel

On Tue, Aug 13, 2024 at 03:16:42PM +0200, Jan Beulich wrote:
> On 26.07.2024 17:22, Roger Pau Monne wrote:
> > With the stack mapped on a per-CPU basis there's no risk of other CPUs being
> > able to read the stack contents, but vCPUs running on the current pCPU could
> > read stack rubble from operations of previous vCPUs.
> > 
> > The #DF stack is not zeroed because handling of #DF results in a panic.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
> > index 75b9a341f814..02b4118b03ef 100644
> > --- a/xen/arch/x86/include/asm/current.h
> > +++ b/xen/arch/x86/include/asm/current.h
> > @@ -177,6 +177,14 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
> >  # define SHADOW_STACK_WORK ""
> >  #endif
> >  
> > +#define ZERO_STACK                                              \
> > +    "test %[stk_size], %[stk_size];"                            \
> > +    "jz .L_skip_zeroing.%=;"                                    \
> > +    "std;"                                                      \
> > +    "rep stosb;"                                                \
> > +    "cld;"                                                      \
> 
> Is ERMS actually helping with backwards copies? I didn't think so, and hence
> it may be that REP STOSQ might be more efficient here?

Possibly, the Intel optimization guide says:

"However, setting the DF to force REP MOVSB to copy bytes from high
towards low addresses will experience significant performance
degradation."

I will see what I can do.

Thanks, Roger.



end of thread, other threads:[~2024-09-27 10:23 UTC | newest]

Thread overview: 64+ messages
2024-07-26 15:21 [PATCH 00/22] x86: adventures in Address Space Isolation Roger Pau Monne
2024-07-26 15:21 ` [PATCH 01/22] x86/mm: drop l{1,2,3,4}e_write_atomic() Roger Pau Monne
2024-07-29  7:52   ` Jan Beulich
2024-07-29 12:53     ` Roger Pau Monné
2024-07-26 15:21 ` [PATCH 02/22] x86/mm: rename l{1,2,3,4}e_read_atomic() Roger Pau Monne
2024-07-29  7:53   ` Jan Beulich
2024-07-26 15:21 ` [PATCH 03/22] x86/dom0: only disable SMAP for the PV dom0 build Roger Pau Monne
2024-07-29  8:17   ` Roger Pau Monné
2024-07-29 11:53   ` Jan Beulich
2024-07-29 15:52     ` Andrew Cooper
2024-07-29 16:18       ` Roger Pau Monné
2024-07-29 17:51         ` Andrew Cooper
2024-07-30 10:55           ` Roger Pau Monné
2024-07-30 11:06             ` Andrew Cooper
2024-07-30 13:03               ` Roger Pau Monné
2024-07-29 15:59   ` Andrew Cooper
2024-07-29 16:32     ` Roger Pau Monné
2024-07-26 15:21 ` [PATCH 04/22] x86/mm: ensure L4 idle_pg_table is not modified past boot Roger Pau Monne
2024-08-13 15:54   ` Jan Beulich
2024-09-10  8:54     ` Roger Pau Monné
2024-09-10  9:00       ` Jan Beulich
2024-09-10  9:32         ` Roger Pau Monné
2024-07-26 15:21 ` [PATCH 05/22] x86/mm: make virt_to_xen_l1e() static Roger Pau Monne
2024-07-30 13:12   ` Andrew Cooper
2024-07-26 15:21 ` [PATCH 06/22] x86/mm: introduce a local domain variable to write_ptbase() Roger Pau Monne
2024-07-30 13:19   ` Andrew Cooper
2024-07-26 15:21 ` [PATCH 07/22] x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain() Roger Pau Monne
2024-08-14  9:47   ` Jan Beulich
2024-07-26 15:21 ` [PATCH 08/22] x86/mm: avoid passing a domain parameter to L4 init function Roger Pau Monne
2024-07-29 13:36   ` Alejandro Vallejo
2024-07-29 13:43     ` Jan Beulich
2024-07-29 14:18     ` Roger Pau Monné
2024-08-14 10:24   ` Jan Beulich
2024-07-26 15:21 ` [PATCH 09/22] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
2024-07-26 15:21 ` [PATCH 10/22] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
2024-07-26 15:21 ` [PATCH 11/22] x86/mm: split setup of the per-domain slot on context switch Roger Pau Monne
2024-07-26 15:21 ` [PATCH 12/22] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
2024-08-14 10:10   ` Jan Beulich
2024-09-25 13:31     ` Roger Pau Monné
2024-09-25 14:03       ` Jan Beulich
2024-09-25 15:27         ` Roger Pau Monné
2024-09-25 15:47           ` Jan Beulich
2024-07-26 15:21 ` [PATCH 13/22] x86/hvm: use a per-pCPU monitor table in HAP mode Roger Pau Monne
2024-08-16 18:02   ` Alejandro Vallejo
2024-08-19  8:29     ` Jan Beulich
2024-08-19 18:22     ` Alejandro Vallejo
2024-09-25 16:19     ` Roger Pau Monné
2024-07-26 15:21 ` [PATCH 14/22] x86/hvm: use a per-pCPU monitor table in shadow mode Roger Pau Monne
2024-07-26 15:21 ` [PATCH 15/22] x86/idle: allow using a per-pCPU L4 Roger Pau Monne
2024-08-21 16:42   ` Alejandro Vallejo
2024-09-27  9:29     ` Roger Pau Monné
2024-07-26 15:22 ` [PATCH 16/22] x86/mm: introduce a per-CPU L3 table for the per-domain slot Roger Pau Monne
2024-08-16 18:40   ` Alejandro Vallejo
2024-09-27  9:46     ` Roger Pau Monné
2024-07-26 15:22 ` [PATCH 17/22] x86/mm: introduce support to populate a per-CPU page-table region Roger Pau Monne
2024-07-26 15:22 ` [PATCH 18/22] x86/mm: allow modifying per-CPU entries of remote page-tables Roger Pau Monne
2024-07-26 15:22 ` [PATCH 19/22] x86/mm: introduce a per-CPU fixmap area Roger Pau Monne
2024-07-26 15:22 ` [PATCH 20/22] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
2024-07-26 15:22 ` [PATCH 21/22] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
2024-07-26 15:22 ` [PATCH 22/22] x86/mm: zero stack on stack switch or reset Roger Pau Monne
2024-07-29 15:40   ` Andrew Cooper
2024-07-30 10:49     ` Roger Pau Monné
2024-08-13 13:16   ` Jan Beulich
2024-09-27 10:22     ` Roger Pau Monné
