[PATCH v2 00/18] x86: adventures in Address Space Isolation

* [PATCH v2 00/18] x86: adventures in Address Space Isolation
@ 2025-01-08 14:26 Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping() Roger Pau Monne
                   ` (18 more replies)
  0 siblings, 19 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Anthony PERARD,
	Michal Orzel, Julien Grall, Stefano Stabellini, Tim Deegan

Hello,

The aim of this series is to introduce the functionality required to
create linear mappings visible to a single pCPU.

Doing so requires having a per-vCPU root page-table (L4), and hence
requires shadowing the guest selected L4 on PV guests.  As follow ups
(and partially to ensure the per-CPU mappings work fine) the CPU stacks
are switched to use per-CPU mappings, so that remote stack contents are
not by default mapped on all page-tables (note: for this to be true the
directmap entries for the stack pages would need to be removed also).

There's one known shortcoming with the presented code: migration of PV
guests using per-vCPU root page-tables is not working.  I need to
introduce extra logic to deal with PV shadow mode when using unique root
page-tables.  I don't think this should block the series, however: such
missing functionality can always be added as follow-up work.
paging_domctl() is adjusted to reflect this restriction.

The main differences compared to v1 are the usage of per-vCPU root page
tables (as opposed to per-pCPU), and the usage of the existing perdomain
family of functions to manage the mappings in the per-domain slot, that
now becomes per-vCPU.

All patches until 17 are mostly preparatory.  I think there's a nice
cleanup and generalization of the creation and managing of per-domain
mappings, by no longer storing references to L1 page-tables in the vCPU
or domain struct.

Patch 13 introduces the command line option, and would need discussion
and integration with the sparse direct map series.  IMO we should get
consensus on how we want the command line to look ASAP, so that we can
get basic parsing logic in place to be used by both the work here and the
direct map removal series.

As part of this series the map_domain_page() helpers are also switched
to create per-vCPU mappings (see patch 15), which converts an existing
interface into creating per-vCPU mappings.  Such an interface can be used
to hide (map per-vCPU) further data that we don't want to be part of the
direct map, or even shared between vCPUs of the same domain.  Also all
existing users of the interface will already create per-vCPU mappings
without needing additional changes.

Note that none of the logic introduced in the series removes entries from
the directmap, so even when creating the per-CPU mappings the underlying
physical addresses are fully accessible when using their direct map
entries.

I also haven't done any benchmarking.  It doesn't seem to cripple
performance to the point that XenRT jobs would time out before
finishing; that's the only objective reference I can provide at the
moment.

The series has been extensively tested on XenRT, but that doesn't cover
all possible use-cases, so it's likely to still have some rough edges;
handle with care.

Thanks, Roger.

Roger Pau Monne (18):
  x86/mm: purge unneeded destroy_perdomain_mapping()
  x86/domain: limit window where curr_vcpu != current on context switch
  x86/mm: introduce helper to detect per-domain L1 entries that need
    freeing
  x86/pv: introduce function to populate perdomain area and use it to
    map Xen GDT
  x86/mm: switch destroy_perdomain_mapping() parameter from domain to
    vCPU
  x86/pv: set/clear guest GDT mappings using
    {populate,destroy}_perdomain_mapping()
  x86/pv: update guest LDT mappings using the linear entries
  x86/pv: remove stashing of GDT/LDT L1 page-tables
  x86/mm: simplify create_perdomain_mapping() interface
  x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter
    to vCPU
  x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
  x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
  x86/spec-ctrl: introduce Address Space Isolation command line option
  x86/mm: introduce per-vCPU L3 page-table
  x86/mm: introduce a per-vCPU mapcache when using ASI
  x86/pv: allow using a unique per-pCPU root page table (L4)
  x86/mm: switch to a per-CPU mapped stack when using ASI
  x86/mm: zero stack on context switch

 docs/misc/xen-command-line.pandoc    |  24 +++
 xen/arch/x86/cpu/mcheck/mce.c        |   4 +
 xen/arch/x86/domain.c                | 157 +++++++++++----
 xen/arch/x86/domain_page.c           | 105 ++++++----
 xen/arch/x86/flushtlb.c              |  28 ++-
 xen/arch/x86/hvm/hvm.c               |   6 -
 xen/arch/x86/include/asm/config.h    |  16 +-
 xen/arch/x86/include/asm/current.h   |  58 +++++-
 xen/arch/x86/include/asm/desc.h      |   6 +-
 xen/arch/x86/include/asm/domain.h    |  50 +++--
 xen/arch/x86/include/asm/flushtlb.h  |   2 +-
 xen/arch/x86/include/asm/mm.h        |  15 +-
 xen/arch/x86/include/asm/processor.h |   5 +
 xen/arch/x86/include/asm/pv/mm.h     |   5 +
 xen/arch/x86/include/asm/smp.h       |  12 ++
 xen/arch/x86/include/asm/spec_ctrl.h |   4 +
 xen/arch/x86/mm.c                    | 291 +++++++++++++++++++++------
 xen/arch/x86/mm/hap/hap.c            |   2 +-
 xen/arch/x86/mm/paging.c             |   6 +
 xen/arch/x86/mm/shadow/hvm.c         |   2 +-
 xen/arch/x86/mm/shadow/multi.c       |   2 +-
 xen/arch/x86/pv/descriptor-tables.c  |  47 ++---
 xen/arch/x86/pv/dom0_build.c         |  12 +-
 xen/arch/x86/pv/domain.c             |  57 ++++--
 xen/arch/x86/pv/mm.c                 |  43 +++-
 xen/arch/x86/setup.c                 |  32 ++-
 xen/arch/x86/smp.c                   |  39 ++++
 xen/arch/x86/smpboot.c               |  26 ++-
 xen/arch/x86/spec_ctrl.c             | 205 ++++++++++++++++++-
 xen/arch/x86/traps.c                 |  25 ++-
 xen/arch/x86/x86_64/mm.c             |   7 +-
 xen/common/smp.c                     |  10 +
 xen/common/stop_machine.c            |  10 +
 xen/include/xen/smp.h                |   8 +
 34 files changed, 1052 insertions(+), 269 deletions(-)

-- 
2.46.0


* [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping()
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 15:59   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch Roger Pau Monne
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The destroy_perdomain_mapping() call in the hvm_domain_initialise() fail path
is useless.  destroy_perdomain_mapping() called with nr == 0 is effectively a
no-op, as no entries are torn down.  Remove the call, as
arch_domain_create() already calls free_perdomain_mappings() on failure.

There's also a call to destroy_perdomain_mapping() in pv_domain_destroy() which
is also not needed.  arch_domain_destroy() will already unconditionally call
free_perdomain_mappings(), which does the same as destroy_perdomain_mapping(),
plus additionally frees the page table structures.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/hvm/hvm.c   | 1 -
 xen/arch/x86/pv/domain.c | 3 ---
 2 files changed, 4 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 922c9b3af64d..70fdddae583d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -708,7 +708,6 @@ int hvm_domain_initialise(struct domain *d,
     XFREE(d->arch.hvm.irq);
  fail0:
     hvm_destroy_cacheattr_region_list(d);
-    destroy_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0);
  fail:
     hvm_domain_relinquish_resources(d);
     XFREE(d->arch.hvm.io_handler);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7aef628f55be..bc7cd0c62f0e 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -345,9 +345,6 @@ void pv_domain_destroy(struct domain *d)
 {
     pv_l1tf_domain_destroy(d);
 
-    destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
-                              GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
-
     XFREE(d->arch.pv.cpuidmasks);
 
     FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab);
-- 
2.46.0




* [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping() Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 16:26   ` Alejandro Vallejo
  2025-01-09  8:59   ` Jan Beulich
  2025-01-08 14:26 ` [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing Roger Pau Monne
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

On x86 Xen will perform lazy context switches to the idle vCPU, where the
previously running vCPU context is not overwritten, and only current is updated
to point to the idle vCPU.  The state is then disjunct between current and
curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
whose context is loaded on the pCPU.

While on that lazy context switched state, certain calls (like
map_domain_page()) will trigger a full synchronization of the pCPU state by
forcing a context switch.  Note however that calling any such function
inside the context switch code itself is very likely to trigger an infinite
recursion loop.

Attempt to limit the window where curr_vcpu != current in the context switch
code, so as to prevent an infinite recursion loop around sync_local_execstate().

This is required for using map_domain_page() in the vCPU context switch code,
otherwise using map_domain_page() in that context ends up in a recursive
sync_local_execstate() loop:

map_domain_page() -> sync_local_execstate() -> map_domain_page() -> ...
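
For readers less familiar with the lazy switch, the state handling
described above can be modelled with a small standalone sketch.  This is
a toy, not the Xen code: every toy_* name and the single global pCPU
state are assumptions made for illustration only.  It tries to show that
a lazy switch moves only "current", and that the full switch path updates
"current" and "curr_vcpu" back-to-back.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these types or helpers are the real Xen ones. */
struct toy_vcpu {
    int id;
    int is_idle;
};

static struct toy_vcpu toy_idle = { .id = -1, .is_idle = 1 };
static struct toy_vcpu *toy_current = &toy_idle;   /* what "current" reports */
static struct toy_vcpu *toy_curr_vcpu = &toy_idle; /* whose state is loaded */
static unsigned int toy_full_switches;

/* Full switch: load n's state and update both pointers back-to-back. */
static void toy_full_switch(struct toy_vcpu *n)
{
    assert(toy_current == toy_curr_vcpu); /* mirrors ASSERT(p == current) */
    assert(toy_curr_vcpu != n);           /* mirrors ASSERT(p != n) */
    toy_current = n;
    toy_curr_vcpu = n;
    toy_full_switches++;
}

/* Returns 1 if a full switch was required, 0 if state was already in sync. */
static int toy_sync_local_execstate(void)
{
    struct toy_vcpu *p = toy_curr_vcpu;

    if ( p == toy_current )
        return 0;

    assert(toy_current == &toy_idle);
    toy_current = p;            /* restore current before the full switch */
    toy_full_switch(&toy_idle);
    return 1;
}

static void toy_context_switch(struct toy_vcpu *next)
{
    if ( toy_curr_vcpu == next || next->is_idle )
    {
        /* Lazy switch: only current changes, loaded state is untouched. */
        toy_current = next;
        return;
    }

    if ( toy_curr_vcpu != toy_current )
    {
        /* Leaving a lazy state: point current at the loaded context first. */
        assert(toy_current == &toy_idle);
        toy_current = toy_curr_vcpu;
    }

    toy_full_switch(next);
}
```

In this model a switch to the idle vCPU leaves toy_curr_vcpu pointing at
the previous vCPU until toy_sync_local_execstate() or a switch to a
different non-idle vCPU forces the full switch.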

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - New in this version.
---
 xen/arch/x86/domain.c | 58 +++++++++++++++++++++++++++++++++++--------
 xen/arch/x86/traps.c  |  2 --
 2 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 78a13e6812c9..1f680bf176ee 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
     per_cpu(full_gdt_loaded, cpu) = false;
 }
 
-static void __context_switch(void)
+static void __context_switch(struct vcpu *n)
 {
     struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
     unsigned int          cpu = smp_processor_id();
     struct vcpu          *p = per_cpu(curr_vcpu, cpu);
-    struct vcpu          *n = current;
     struct domain        *pd = p->domain, *nd = n->domain;
 
     ASSERT(p != n);
     ASSERT(!vcpu_cpu_dirty(n));
+    ASSERT(p == current);
 
     if ( !is_idle_domain(pd) )
     {
@@ -2036,6 +2036,18 @@ static void __context_switch(void)
 
     write_ptbase(n);
 
+    /*
+     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
+     * window where calls to mapcache_current_vcpu() during the context switch
+     * could trigger a recursive loop.
+     *
+     * Do the current switch immediately after switching to the new guest
+     * page-tables, so that current is (almost) always in sync with the
+     * currently loaded page-tables.
+     */
+    set_current(n);
+    per_cpu(curr_vcpu, cpu) = n;
+
 #ifdef CONFIG_PV
     /* Prefetch the VMCB if we expect to use it later in the context switch */
     if ( using_svm() && is_pv_64bit_domain(nd) && !is_idle_domain(nd) )
@@ -2048,8 +2060,6 @@ static void __context_switch(void)
     if ( pd != nd )
         cpumask_clear_cpu(cpu, pd->dirty_cpumask);
     write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
-
-    per_cpu(curr_vcpu, cpu) = n;
 }
 
 void context_switch(struct vcpu *prev, struct vcpu *next)
@@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
 
     local_irq_disable();
 
-    set_current(next);
-
     if ( (per_cpu(curr_vcpu, cpu) == next) ||
          (is_idle_domain(nextd) && cpu_online(cpu)) )
     {
+        /*
+         * Lazy context switch to the idle vCPU, set current == idle.  Full
+         * context switch happens if/when sync_local_execstate() is called.
+         */
+        set_current(next);
         local_irq_enable();
     }
     else
     {
-        __context_switch();
+        /*
+         * curr_vcpu will always point to the currently loaded vCPU context, as
+         * it's not updated when doing a lazy switch to the idle vCPU.
+         */
+        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
+
+        if ( prev_ctx != current )
+        {
+            /*
+             * Doing a full context switch to a non-idle vCPU from a lazy
+             * context switched state.  Adjust current to point to the
+             * currently loaded vCPU context.
+             */
+            ASSERT(current == idle_vcpu[cpu]);
+            ASSERT(!is_idle_vcpu(next));
+            set_current(prev_ctx);
+        }
+        __context_switch(next);
 
         /* Re-enable interrupts before restoring state which may fault. */
         local_irq_enable();
@@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
 {
     unsigned long flags;
     int switch_required;
+    unsigned int cpu = smp_processor_id();
+    struct vcpu *p;
 
     local_irq_save(flags);
 
-    switch_required = (this_cpu(curr_vcpu) != current);
+    p = per_cpu(curr_vcpu, cpu);
+    switch_required = (p != current);
 
     if ( switch_required )
     {
-        ASSERT(current == idle_vcpu[smp_processor_id()]);
-        __context_switch();
+        ASSERT(current == idle_vcpu[cpu]);
+        /*
+         * Restore current to the previously running vCPU, __context_switch()
+         * will update current together with curr_vcpu.
+         */
+        set_current(p);
+        __context_switch(idle_vcpu[cpu]);
     }
 
     local_irq_restore(flags);
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 87b30ce4df2a..487b8c5a78c5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2232,8 +2232,6 @@ void __init trap_init(void)
 
 void activate_debugregs(const struct vcpu *curr)
 {
-    ASSERT(curr == current);
-
     write_debugreg(0, curr->arch.dr[0]);
     write_debugreg(1, curr->arch.dr[1]);
     write_debugreg(2, curr->arch.dr[2]);
-- 
2.46.0




* [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping() Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09  9:03   ` Jan Beulich
  2025-01-08 14:26 ` [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT Roger Pau Monne
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

L1 present entries that require the underlying page to be freed have the
_PAGE_AVAIL0 bit set.  Introduce a helper to unify the checking logic into a
single place.

No functional change intended.
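
As a standalone illustration of the check being factored out, the sketch
below tests that both bits are set at once by comparing against the full
mask, instead of testing for either bit.  The TOY_* flag values and the
function name are assumptions for the example; the real _PAGE_* constants
and l1e accessors come from the Xen headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real _PAGE_* constants live in Xen headers. */
#define TOY_PAGE_PRESENT 0x001u
#define TOY_PAGE_AVAIL0  0x200u

/*
 * An entry needs its page freed only when it is both present and tagged
 * with the AVAIL0 software bit.  Comparing against the whole mask rejects
 * entries that have just one of the two bits set.
 */
static bool toy_l1e_needs_freeing(uint32_t flags)
{
    const uint32_t mask = TOY_PAGE_PRESENT | TOY_PAGE_AVAIL0;

    return (flags & mask) == mask;
}
```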

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/mm.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fa21903eb25a..3d5dd22b6c36 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6294,6 +6294,12 @@ void __iomem *__init ioremap_wc(paddr_t pa, size_t len)
     return (void __force __iomem *)(va + offs);
 }
 
+static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e)
+{
+    return (l1e_get_flags(l1e) & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
+           (_PAGE_PRESENT | _PAGE_AVAIL0);
+}
+
 int create_perdomain_mapping(struct domain *d, unsigned long va,
                              unsigned int nr, l1_pgentry_t **pl1tab,
                              struct page_info **ppg)
@@ -6446,9 +6452,7 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
 
                 for ( ; nr && i < L1_PAGETABLE_ENTRIES; --nr, ++i )
                 {
-                    if ( (l1e_get_flags(l1tab[i]) &
-                          (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
-                         (_PAGE_PRESENT | _PAGE_AVAIL0) )
+                    if ( perdomain_l1e_needs_freeing(l1tab[i]) )
                         free_domheap_page(l1e_get_page(l1tab[i]));
                     l1tab[i] = l1e_empty();
                 }
@@ -6498,9 +6502,7 @@ void free_perdomain_mappings(struct domain *d)
                         unsigned int k;
 
                         for ( k = 0; k < L1_PAGETABLE_ENTRIES; ++k )
-                            if ( (l1e_get_flags(l1tab[k]) &
-                                  (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
-                                 (_PAGE_PRESENT | _PAGE_AVAIL0) )
+                            if ( perdomain_l1e_needs_freeing(l1tab[k]) )
                                 free_domheap_page(l1e_get_page(l1tab[k]));
 
                         unmap_domain_page(l1tab);
-- 
2.46.0




* [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (2 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09  9:10   ` Jan Beulich
  2025-01-09  9:55   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU Roger Pau Monne
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current code to update the Xen part of the GDT when running a PV guest
relies on caching the direct map address of all the L1 tables used to map the
GDT and LDT, so that entries can be modified.

Introduce a new function that populates the per-domain region, either using the
recursive linear mappings when the target vCPU is the current one, or by
directly modifying the L1 table of the per-domain region.

Using such a function to populate per-domain addresses drops the need to keep a
reference to per-domain L1 tables previously used to change the per-domain
mappings.
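
To make the index arithmetic behind the slow path easier to follow, here
is a minimal standalone sketch of x86-64 4-level page-table offsets and of
the range check used in the new function's ASSERT().  The toy_* names are
hypothetical; Xen has its own l1/l2/l3/l4_table_offset() helpers.  The
sketch assumes 4 KiB pages and 9 index bits per level.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_SIZE   (UINT64_C(1) << TOY_PAGE_SHIFT)
#define TOY_LEVEL_BITS  9
#define TOY_ENTRIES     (1u << TOY_LEVEL_BITS)

/* Index into the page table at `level` (1 = L1 ... 4 = L4) for `va`. */
static unsigned int toy_table_offset(uint64_t va, unsigned int level)
{
    unsigned int shift = TOY_PAGE_SHIFT + (level - 1) * TOY_LEVEL_BITS;

    return (unsigned int)(va >> shift) & (TOY_ENTRIES - 1);
}

/*
 * True if the first and last byte of [va, va + nr pages) share the same
 * L3 index, i.e. the XOR of both addresses has no bits set in the L3
 * index field.  An empty range trivially passes.
 */
static bool toy_range_within_one_l3e(uint64_t va, uint64_t nr)
{
    return !nr || !toy_table_offset(va ^ (va + nr * TOY_PAGE_SIZE - 1), 3);
}
```

With this check in place the slow path only has to look up a single L3
entry and then walk L2/L1 tables below it.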

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c                | 11 +++-
 xen/arch/x86/include/asm/desc.h      |  6 +-
 xen/arch/x86/include/asm/mm.h        |  2 +
 xen/arch/x86/include/asm/processor.h |  5 ++
 xen/arch/x86/mm.c                    | 88 ++++++++++++++++++++++++++++
 xen/arch/x86/smpboot.c               |  6 +-
 xen/arch/x86/traps.c                 | 10 ++--
 7 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 1f680bf176ee..0bd0ef7e40f4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1953,9 +1953,14 @@ static always_inline bool need_full_gdt(const struct domain *d)
 
 static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
 {
-    l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
-              !is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu)
-                                   : per_cpu(compat_gdt_l1e, cpu));
+    ASSERT(v != current);
+
+    populate_perdomain_mapping(v,
+                               GDT_VIRT_START(v) +
+                               (FIRST_RESERVED_GDT_PAGE << PAGE_SHIFT),
+                               !is_pv_32bit_vcpu(v) ? &per_cpu(gdt_mfn, cpu)
+                                                    : &per_cpu(compat_gdt_mfn,
+                                                               cpu), 1);
 }
 
 static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
diff --git a/xen/arch/x86/include/asm/desc.h b/xen/arch/x86/include/asm/desc.h
index a1e0807d97ed..33981bfca588 100644
--- a/xen/arch/x86/include/asm/desc.h
+++ b/xen/arch/x86/include/asm/desc.h
@@ -44,6 +44,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include <xen/mm-frame.h>
+
 #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
 
 /* Fix up the RPL of a guest segment selector. */
@@ -212,10 +214,10 @@ struct __packed desc_ptr {
 
 extern seg_desc_t boot_gdt[];
 DECLARE_PER_CPU(seg_desc_t *, gdt);
-DECLARE_PER_CPU(l1_pgentry_t, gdt_l1e);
+DECLARE_PER_CPU(mfn_t, gdt_mfn);
 extern seg_desc_t boot_compat_gdt[];
 DECLARE_PER_CPU(seg_desc_t *, compat_gdt);
-DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_l1e);
+DECLARE_PER_CPU(mfn_t, compat_gdt_mfn);
 DECLARE_PER_CPU(bool, full_gdt_loaded);
 
 static inline void lgdt(const struct desc_ptr *gdtr)
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 6c7e66ee21ab..b50a51327b2b 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -603,6 +603,8 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
 int create_perdomain_mapping(struct domain *d, unsigned long va,
                              unsigned int nr, l1_pgentry_t **pl1tab,
                              struct page_info **ppg);
+void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
+                                mfn_t *mfn, unsigned long nr);
 void destroy_perdomain_mapping(struct domain *d, unsigned long va,
                                unsigned int nr);
 void free_perdomain_mappings(struct domain *d);
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index d247ef8dd226..82ee89f736c2 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -243,6 +243,11 @@ static inline unsigned long cr3_pa(unsigned long cr3)
     return cr3 & X86_CR3_ADDR_MASK;
 }
 
+static inline mfn_t cr3_mfn(unsigned long cr3)
+{
+    return maddr_to_mfn(cr3_pa(cr3));
+}
+
 static inline unsigned int cr3_pcid(unsigned long cr3)
 {
     return IS_ENABLED(CONFIG_PV) ? cr3 & X86_CR3_PCID_MASK : 0;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3d5dd22b6c36..0abea792486c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6423,6 +6423,94 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
     return rc;
 }
 
+void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
+                                mfn_t *mfn, unsigned long nr)
+{
+    l1_pgentry_t *l1tab = NULL, *pl1e;
+    const l3_pgentry_t *l3tab;
+    const l2_pgentry_t *l2tab;
+    struct domain *d = v->domain;
+
+    ASSERT(va >= PERDOMAIN_VIRT_START &&
+           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
+    ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
+
+    /* Use likely to force the optimization for the fast path. */
+    if ( likely(v == current) )
+    {
+        unsigned int i;
+
+        /* Ensure page-tables are from current (if current != curr_vcpu). */
+        sync_local_execstate();
+
+        /* Fast path: get L1 entries using the recursive linear mappings. */
+        pl1e = &__linear_l1_table[l1_linear_offset(va)];
+
+        for ( i = 0; i < nr; i++, pl1e++ )
+        {
+            if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) )
+            {
+                ASSERT_UNREACHABLE();
+                free_domheap_page(l1e_get_page(*pl1e));
+            }
+            l1e_write(pl1e, l1e_from_mfn(mfn[i], __PAGE_HYPERVISOR_RW));
+        }
+
+        return;
+    }
+
+    ASSERT(d->arch.perdomain_l3_pg);
+    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+
+    if ( unlikely(!(l3e_get_flags(l3tab[l3_table_offset(va)]) &
+                    _PAGE_PRESENT)) )
+    {
+        unmap_domain_page(l3tab);
+        gprintk(XENLOG_ERR, "unable to map at VA %lx: L3e not present\n", va);
+        ASSERT_UNREACHABLE();
+        domain_crash(d);
+
+        return;
+    }
+
+    l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
+
+    for ( ; nr--; va += PAGE_SIZE, mfn++ )
+    {
+        if ( !l1tab || !l1_table_offset(va) )
+        {
+            const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
+
+            if ( unlikely(!(l2e_get_flags(*pl2e) & _PAGE_PRESENT)) )
+            {
+                gprintk(XENLOG_ERR, "unable to map at VA %lx: L2e not present\n",
+                        va);
+                ASSERT_UNREACHABLE();
+                domain_crash(d);
+
+                break;
+            }
+
+            unmap_domain_page(l1tab);
+            l1tab = map_l1t_from_l2e(*pl2e);
+        }
+
+        pl1e = &l1tab[l1_table_offset(va)];
+
+        if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) )
+        {
+            ASSERT_UNREACHABLE();
+            free_domheap_page(l1e_get_page(*pl1e));
+        }
+
+        l1e_write(pl1e, l1e_from_mfn(*mfn, __PAGE_HYPERVISOR_RW));
+    }
+
+    unmap_domain_page(l1tab);
+    unmap_domain_page(l2tab);
+    unmap_domain_page(l3tab);
+}
+
 void destroy_perdomain_mapping(struct domain *d, unsigned long va,
                                unsigned int nr)
 {
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 79a79c54c304..a740a6402272 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1059,8 +1059,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     if ( gdt == NULL )
         goto out;
     per_cpu(gdt, cpu) = gdt;
-    per_cpu(gdt_l1e, cpu) =
-        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
+    per_cpu(gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt));
     memcpy(gdt, boot_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
     BUILD_BUG_ON(NR_CPUS > 0x10000);
     gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
@@ -1069,8 +1068,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     per_cpu(compat_gdt, cpu) = gdt = alloc_xenheap_pages(0, memflags);
     if ( gdt == NULL )
         goto out;
-    per_cpu(compat_gdt_l1e, cpu) =
-        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
+    per_cpu(compat_gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt));
     memcpy(gdt, boot_compat_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
     gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
 #endif
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 487b8c5a78c5..a7f6fb611c34 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -92,10 +92,10 @@ DEFINE_PER_CPU(uint64_t, efer);
 static DEFINE_PER_CPU(unsigned long, last_extable_addr);
 
 DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, gdt);
-DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_l1e);
+DEFINE_PER_CPU_READ_MOSTLY(mfn_t, gdt_mfn);
 #ifdef CONFIG_PV32
 DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, compat_gdt);
-DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, compat_gdt_l1e);
+DEFINE_PER_CPU_READ_MOSTLY(mfn_t, compat_gdt_mfn);
 #endif
 
 /* Master table, used by CPU0. */
@@ -2219,11 +2219,9 @@ void __init trap_init(void)
     init_ler();
 
     /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
-    this_cpu(gdt_l1e) =
-        l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
+    this_cpu(gdt_mfn) = _mfn(virt_to_mfn(boot_gdt));
     if ( IS_ENABLED(CONFIG_PV32) )
-        this_cpu(compat_gdt_l1e) =
-            l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
+        this_cpu(compat_gdt_mfn) = _mfn(virt_to_mfn(boot_compat_gdt));
 
     percpu_traps_init();
 
-- 
2.46.0




* [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (3 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09 10:02   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping() Roger Pau Monne
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

In preparation for the per-domain area being populated with per-vCPU mappings,
change the parameter of destroy_perdomain_mapping() to be a vCPU instead of a
domain, and also update the function logic to allow manipulation of per-domain
mappings using the linear page table mappings.
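
The zapping loop of the fast path can be sketched in isolation as below.
This is only a model under stated assumptions: a plain array stands in
for the linear L1 table, the TOY_* flag values are illustrative, and the
function merely counts the pages a real implementation would free.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PRESENT 0x001u
#define TOY_AVAIL0  0x200u

/*
 * Clear nr entries starting at l1tab[first].  Returns the number of
 * entries that were both present and marked AVAIL0, i.e. the pages that
 * would have to be freed.  With nr == 0 the loop body never runs.
 */
static unsigned int toy_destroy_mappings(uint64_t *l1tab, size_t first,
                                         unsigned int nr)
{
    const uint64_t mask = TOY_PRESENT | TOY_AVAIL0;
    uint64_t *pl1e = &l1tab[first];
    unsigned int freed = 0;

    for ( ; nr--; pl1e++ )
    {
        if ( (*pl1e & mask) == mask )
            freed++;
        *pl1e = 0; /* stand-in for l1e_write(pl1e, l1e_empty()) */
    }

    return freed;
}
```

Because consecutive virtual pages map to consecutive entries in a linear
table, no L2/L3 walk is needed here, which is what makes this the cheap
path when the target vCPU is the current one.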

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/mm.h |  2 +-
 xen/arch/x86/mm.c             | 24 +++++++++++++++++++++++-
 xen/arch/x86/pv/domain.c      |  3 +--
 xen/arch/x86/x86_64/mm.c      |  2 +-
 4 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index b50a51327b2b..65cd751087dc 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -605,7 +605,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
                              struct page_info **ppg);
 void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                 mfn_t *mfn, unsigned long nr);
-void destroy_perdomain_mapping(struct domain *d, unsigned long va,
+void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                unsigned int nr);
 void free_perdomain_mappings(struct domain *d);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0abea792486c..713ae8dd6fa3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6511,10 +6511,11 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
     unmap_domain_page(l3tab);
 }
 
-void destroy_perdomain_mapping(struct domain *d, unsigned long va,
+void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                unsigned int nr)
 {
     const l3_pgentry_t *l3tab, *pl3e;
+    const struct domain *d = v->domain;
 
     ASSERT(va >= PERDOMAIN_VIRT_START &&
            va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
@@ -6523,6 +6524,27 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
     if ( !d->arch.perdomain_l3_pg )
         return;
 
+    /* Use likely to force the optimization for the fast path. */
+    if ( likely(v == current) )
+    {
+        l1_pgentry_t *pl1e;
+
+        /* Ensure page-tables are from current (if current != curr_vcpu). */
+        sync_local_execstate();
+
+        pl1e = &__linear_l1_table[l1_linear_offset(va)];
+
+        /* Fast path: zap L1 entries using the recursive linear mappings. */
+        for ( ; nr--; pl1e++ )
+        {
+            if ( perdomain_l1e_needs_freeing(*pl1e) )
+                free_domheap_page(l1e_get_page(*pl1e));
+            l1e_write(pl1e, l1e_empty());
+        }
+
+        return;
+    }
+
     l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
     pl3e = l3tab + l3_table_offset(va);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index bc7cd0c62f0e..7e8bffaae9a0 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -285,8 +285,7 @@ static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
 
 static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
 {
-    destroy_perdomain_mapping(v->domain, GDT_VIRT_START(v),
-                              1U << GDT_LDT_VCPU_SHIFT);
+    destroy_perdomain_mapping(v, GDT_VIRT_START(v), 1U << GDT_LDT_VCPU_SHIFT);
 }
 
 void pv_vcpu_destroy(struct vcpu *v)
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 389d813ebe63..c08b28d9693b 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -737,7 +737,7 @@ int setup_compat_arg_xlat(struct vcpu *v)
 
 void free_compat_arg_xlat(struct vcpu *v)
 {
-    destroy_perdomain_mapping(v->domain, ARG_XLAT_START(v),
+    destroy_perdomain_mapping(v, ARG_XLAT_START(v),
                               PFN_UP(COMPAT_ARG_XLAT_SIZE));
 }
 
-- 
2.46.0




* [PATCH v2 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (4 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 15:11   ` [PATCH v2.1 " Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries Roger Pau Monne
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
mappings being stashed in the domain structure, and thus such mappings being
modified by merely updating the L1 entries.

Switch both pv_{set,destroy}_gdt() to instead use
{populate,destroy}_perdomain_mapping().

Note that this requires moving the pv_set_gdt() call in arch_set_info_guest()
strictly after update_cr3(), so v->arch.cr3 is valid when
populate_perdomain_mapping() is called.
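
As an illustration only (not part of the patch), the ordering constraint and
the new reference handling can be modelled with a toy structure.  Every name
below (toy_vcpu, toy_update_cr3(), toy_set_gdt(), toy_destroy_gdt(),
TOY_GDT_FRAMES) is hypothetical, and failing when cr3 is unset is a choice of
this sketch, not behaviour of pv_set_gdt():

```c
#include <assert.h>

/* Toy model, not Xen code: all names here are hypothetical. */
#define TOY_GDT_FRAMES 4

struct toy_vcpu {
    unsigned long cr3;                        /* 0 until a root table is set */
    unsigned long gdt_frames[TOY_GDT_FRAMES]; /* 0 marks an unused slot */
    unsigned int mapped;                      /* frames currently mapped */
    unsigned int refs;                        /* page references held */
};

static void toy_update_cr3(struct toy_vcpu *v, unsigned long root)
{
    v->cr3 = root;
}

static void toy_destroy_gdt(struct toy_vcpu *v)
{
    unsigned int i;

    /* Mappings can only exist (and be zapped) once a root table is set. */
    if ( v->cr3 )
        v->mapped = 0;

    /* References are dropped from the frame array, not by reading PTEs. */
    for ( i = 0; i < TOY_GDT_FRAMES; i++ )
    {
        if ( !v->gdt_frames[i] )
            break;

        v->refs--;
        v->gdt_frames[i] = 0;
    }
}

static int toy_set_gdt(struct toy_vcpu *v, const unsigned long frames[],
                       unsigned int nr)
{
    unsigned int i;

    /* Populating needs a valid root table, hence the call ordering. */
    if ( nr > TOY_GDT_FRAMES || !v->cr3 )
        return -1;

    /* Tear down the old GDT, then record and map the new frames. */
    toy_destroy_gdt(v);
    for ( i = 0; i < nr; i++ )
    {
        v->gdt_frames[i] = frames[i];
        v->refs++;
    }
    v->mapped = nr;

    return 0;
}
```

In this model calling toy_set_gdt() before toy_update_cr3() fails, which is the
situation the reordering in arch_set_info_guest() is meant to avoid.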

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c               | 33 ++++++++++++++---------------
 xen/arch/x86/pv/descriptor-tables.c | 28 +++++++++++-------------
 2 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 0bd0ef7e40f4..0481164f3727 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1376,22 +1376,6 @@ int arch_set_info_guest(
     if ( rc )
         return rc;
 
-    if ( !compat )
-        rc = pv_set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents);
-#ifdef CONFIG_COMPAT
-    else
-    {
-        unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv.gdt_frames)];
-
-        for ( i = 0; i < nr_gdt_frames; ++i )
-            gdt_frames[i] = c.cmp->gdt_frames[i];
-
-        rc = pv_set_gdt(v, gdt_frames, c.cmp->gdt_ents);
-    }
-#endif
-    if ( rc != 0 )
-        return rc;
-
     set_bit(_VPF_in_reset, &v->pause_flags);
 
 #ifdef CONFIG_COMPAT
@@ -1492,7 +1476,6 @@ int arch_set_info_guest(
     {
         if ( cr3_page )
             put_page(cr3_page);
-        pv_destroy_gdt(v);
         return rc;
     }
 
@@ -1508,6 +1491,22 @@ int arch_set_info_guest(
         paging_update_paging_modes(v);
     else
         update_cr3(v);
+
+    if ( !compat )
+        rc = pv_set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents);
+#ifdef CONFIG_COMPAT
+    else
+    {
+        unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv.gdt_frames)];
+
+        for ( i = 0; i < nr_gdt_frames; ++i )
+            gdt_frames[i] = c.cmp->gdt_frames[i];
+
+        rc = pv_set_gdt(v, gdt_frames, c.cmp->gdt_ents);
+    }
+#endif
+    if ( rc != 0 )
+        return rc;
 #endif /* CONFIG_PV */
 
  out:
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index 02647a2c5047..5a79f022ce13 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v)
 
 void pv_destroy_gdt(struct vcpu *v)
 {
-    l1_pgentry_t *pl1e = pv_gdt_ptes(v);
-    mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
-    l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
     unsigned int i;
 
     ASSERT(v == current || !vcpu_cpu_dirty(v));
 
-    v->arch.pv.gdt_ents = 0;
-    for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
-    {
-        mfn_t mfn = l1e_get_mfn(pl1e[i]);
+    if ( v->arch.cr3 )
+        destroy_perdomain_mapping(v, GDT_VIRT_START(v),
+                                  ARRAY_SIZE(v->arch.pv.gdt_frames));
 
-        if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) &&
-             !mfn_eq(mfn, zero_mfn) )
-            put_page_and_type(mfn_to_page(mfn));
+    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++)
+    {
+        if ( !v->arch.pv.gdt_frames[i] )
+            break;
 
-        l1e_write(&pl1e[i], zero_l1e);
+        put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i])));
         v->arch.pv.gdt_frames[i] = 0;
     }
 }
@@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
                unsigned int entries)
 {
     struct domain *d = v->domain;
-    l1_pgentry_t *pl1e;
     unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
+    mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)];
 
     ASSERT(v == current || !vcpu_cpu_dirty(v));
 
@@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
         if ( !mfn_valid(mfn) ||
              !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) )
             goto fail;
+
+        mfns[i] = mfn;
     }
 
     /* Tear down the old GDT. */
@@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
 
     /* Install the new GDT. */
     v->arch.pv.gdt_ents = entries;
-    pl1e = pv_gdt_ptes(v);
     for ( i = 0; i < nr_frames; i++ )
-    {
         v->arch.pv.gdt_frames[i] = frames[i];
-        l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
-    }
+    populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames);
 
     return 0;
 
-- 
2.46.0




* [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (5 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping() Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09 14:34   ` Alejandro Vallejo
  2025-01-14 15:42   ` Jan Beulich
  2025-01-08 14:26 ` [PATCH v2 08/18] x86/pv: remove stashing of GDT/LDT L1 page-tables Roger Pau Monne
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
table(s) that contain such mappings being stashed in the domain structure, and
thus such mappings being modified by merely updating the required L1 entries.

Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive
mappings, as that logic is always called while the vCPU is running on the
current pCPU.

For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
running on the pCPU, otherwise use destroy_perdomain_mapping().

Note this requires keeping an array with the pages currently mapped at the LDT
area, as that allows dropping the extra taken page reference when removing the
mappings.
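
As an illustration only (not part of the patch), the bookkeeping described
above can be sketched as a fixed array of frames with a sentinel for unused
slots.  The sizing follows from 8192 descriptors of 8 bytes each, i.e. 64KiB or
16 pages of 4KiB.  All toy_* and TOY_* names are hypothetical stand-ins:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, not Xen code: all names here are hypothetical. */
#define TOY_PAGE_SIZE    4096u
#define TOY_LDT_MAX_ENTS 8192u
#define TOY_DESC_SIZE    8u
#define TOY_LDT_FRAMES   ((TOY_LDT_MAX_ENTS * TOY_DESC_SIZE) / TOY_PAGE_SIZE)
#define TOY_INVALID_MFN  UINT64_MAX

struct toy_ldt {
    uint64_t frames[TOY_LDT_FRAMES]; /* frame backing each LDT page */
};

/* Mark every slot unused, as done once at vCPU initialisation. */
static void toy_ldt_init(struct toy_ldt *l)
{
    unsigned int i;

    for ( i = 0; i < TOY_LDT_FRAMES; i++ )
        l->frames[i] = TOY_INVALID_MFN;
}

/* Record the frame mapped for the page covering a byte offset. */
static void toy_ldt_map(struct toy_ldt *l, unsigned int offset, uint64_t mfn)
{
    unsigned int idx = offset / TOY_PAGE_SIZE;

    assert(idx < TOY_LDT_FRAMES);
    l->frames[idx] = mfn;
}

/* Forget all recorded frames; returns how many references to drop. */
static unsigned int toy_ldt_destroy(struct toy_ldt *l)
{
    unsigned int i, dropped = 0;

    for ( i = 0; i < TOY_LDT_FRAMES; i++ )
    {
        if ( l->frames[i] == TOY_INVALID_MFN )
            continue;

        l->frames[i] = TOY_INVALID_MFN;
        dropped++;
    }

    return dropped;
}
```

Because the array is the only record consulted on teardown, the page-table
entries themselves never need to be read back to find the pages.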

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/domain.h   |  2 ++
 xen/arch/x86/pv/descriptor-tables.c | 19 ++++++++++---------
 xen/arch/x86/pv/domain.c            |  4 ++++
 xen/arch/x86/pv/mm.c                |  3 ++-
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index b79d6badd71c..b659cffc7f81 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -523,6 +523,8 @@ struct pv_vcpu
     struct trap_info *trap_ctxt;
 
     unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
+    /* Max LDT entries is 8192, so 8192 * 8 = 64KiB (16 pages). */
+    mfn_t ldt_frames[16];
     unsigned long ldt_base;
     unsigned int gdt_ents, ldt_ents;
 
diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index 5a79f022ce13..95b598a4c0cf 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -20,28 +20,29 @@
  */
 bool pv_destroy_ldt(struct vcpu *v)
 {
-    l1_pgentry_t *pl1e;
+    const unsigned int nr_frames = ARRAY_SIZE(v->arch.pv.ldt_frames);
     unsigned int i, mappings_dropped = 0;
-    struct page_info *page;
 
     ASSERT(!in_irq());
 
     ASSERT(v == current || !vcpu_cpu_dirty(v));
 
-    pl1e = pv_ldt_ptes(v);
+    destroy_perdomain_mapping(v, LDT_VIRT_START(v), nr_frames);
 
-    for ( i = 0; i < 16; i++ )
+    for ( i = 0; i < nr_frames; i++ )
     {
-        if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
-            continue;
+        mfn_t mfn = v->arch.pv.ldt_frames[i];
+        struct page_info *page;
 
-        page = l1e_get_page(pl1e[i]);
-        l1e_write(&pl1e[i], l1e_empty());
-        mappings_dropped++;
+        if ( mfn_eq(mfn, INVALID_MFN) )
+            continue;
 
+        v->arch.pv.ldt_frames[i] = INVALID_MFN;
+        page = mfn_to_page(mfn);
         ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
         ASSERT_PAGE_IS_DOMAIN(page, v->domain);
         put_page_and_type(page);
+        mappings_dropped++;
     }
 
     return mappings_dropped;
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7e8bffaae9a0..32d7488cc186 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -303,6 +303,7 @@ void pv_vcpu_destroy(struct vcpu *v)
 int pv_vcpu_initialise(struct vcpu *v)
 {
     struct domain *d = v->domain;
+    unsigned int i;
     int rc;
 
     ASSERT(!is_idle_domain(d));
@@ -311,6 +312,9 @@ int pv_vcpu_initialise(struct vcpu *v)
     if ( rc )
         return rc;
 
+    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.ldt_frames); i++ )
+        v->arch.pv.ldt_frames[i] = INVALID_MFN;
+
     BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) >
                  PAGE_SIZE);
     v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS);
diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index 187f5f6a3e8c..4853e619f2a7 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -86,7 +86,8 @@ bool pv_map_ldt_shadow_page(unsigned int offset)
         return false;
     }
 
-    pl1e = &pv_ldt_ptes(curr)[offset >> PAGE_SHIFT];
+    curr->arch.pv.ldt_frames[offset >> PAGE_SHIFT] = page_to_mfn(page);
+    pl1e = &__linear_l1_table[l1_linear_offset(LDT_VIRT_START(curr) + offset)];
     l1e_add_flags(gl1e, _PAGE_RW);
 
     l1e_write(pl1e, gl1e);
-- 
2.46.0




* [PATCH v2 08/18] x86/pv: remove stashing of GDT/LDT L1 page-tables
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (6 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface Roger Pau Monne
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

There are no remaining callers of pv_gdt_ptes() or pv_ldt_ptes() that use the
stashed L1 page-tables in the domain structure.  As such, the helpers and the
fields can now be removed.

No functional change intended, as the removed logic is not used.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/domain.h |  9 ---------
 xen/arch/x86/pv/domain.c          | 10 +---------
 2 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index b659cffc7f81..fbe59baa82ec 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -271,8 +271,6 @@ struct time_scale {
 
 struct pv_domain
 {
-    l1_pgentry_t **gdt_ldt_l1tab;
-
     atomic_t nr_l4_pages;
 
     /* Is a 32-bit PV guest? */
@@ -506,13 +504,6 @@ struct arch_domain
 #define has_pirq(d)        (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ))
 #define has_vpci(d)        (!!((d)->arch.emulation_flags & X86_EMU_VPCI))
 
-#define gdt_ldt_pt_idx(v) \
-      ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT))
-#define pv_gdt_ptes(v) \
-    ((v)->domain->arch.pv.gdt_ldt_l1tab[gdt_ldt_pt_idx(v)] + \
-     (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & (L1_PAGETABLE_ENTRIES - 1)))
-#define pv_ldt_ptes(v) (pv_gdt_ptes(v) + 16)
-
 struct pv_vcpu
 {
     /* map_domain_page() mapping cache. */
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 32d7488cc186..dfaeeb2e2cc2 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -279,7 +279,7 @@ static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
 {
     return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
                                     1U << GDT_LDT_VCPU_SHIFT,
-                                    v->domain->arch.pv.gdt_ldt_l1tab,
+                                    NIL(l1_pgentry_t *),
                                     NULL);
 }
 
@@ -349,8 +349,6 @@ void pv_domain_destroy(struct domain *d)
     pv_l1tf_domain_destroy(d);
 
     XFREE(d->arch.pv.cpuidmasks);
-
-    FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab);
 }
 
 void noreturn cf_check continue_pv_domain(void);
@@ -366,12 +364,6 @@ int pv_domain_initialise(struct domain *d)
 
     pv_l1tf_domain_init(d);
 
-    d->arch.pv.gdt_ldt_l1tab =
-        alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
-    if ( !d->arch.pv.gdt_ldt_l1tab )
-        goto fail;
-    clear_page(d->arch.pv.gdt_ldt_l1tab);
-
     if ( levelling_caps & ~LCAP_faulting &&
          (d->arch.pv.cpuidmasks = xmemdup(&cpuidmask_defaults)) == NULL )
         goto fail;
-- 
2.46.0




* [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (7 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 08/18] x86/pv: remove stashing of GDT/LDT L1 page-tables Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09 11:01   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU Roger Pau Monne
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

There are no longer any callers of create_perdomain_mapping() that request a
reference to the used L1 tables, and hence the only difference between them is
whether the caller wants the region to be populated, or just the paging
structures to be allocated.

Simplify the arguments to create_perdomain_mapping() to reflect the current
usages: drop the last two arguments and instead introduce a boolean to signal
whether the caller wants the region populated.
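
As an illustration only (not part of the patch), the two remaining modes can be
modelled with a single boolean.  The toy_region type, TOY_SLOTS, and the toy_*
functions are hypothetical, and a plain -1 stands in for the real error codes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model, not Xen code: all names here are hypothetical. */
#define TOY_SLOTS 8

struct toy_region {
    bool l1_present;       /* paging structure has been allocated */
    void *page[TOY_SLOTS]; /* backing pages, NULL when not populated */
};

static int toy_create_mapping(struct toy_region *r, unsigned int first,
                              unsigned int nr, bool populate)
{
    unsigned int i;

    if ( first > TOY_SLOTS || nr > TOY_SLOTS - first )
        return -1;

    /* The paging structure is always set up. */
    r->l1_present = true;

    /* Backing pages are only allocated when the caller asks for them. */
    if ( !populate )
        return 0;

    for ( i = first; i < first + nr; i++ )
    {
        if ( r->page[i] )
            continue; /* already populated, leave as is */

        r->page[i] = calloc(1, 64);
        if ( !r->page[i] )
            return -1;
    }

    return 0;
}

static void toy_free_mappings(struct toy_region *r)
{
    unsigned int i;

    for ( i = 0; i < TOY_SLOTS; i++ )
    {
        free(r->page[i]);
        r->page[i] = NULL;
    }
    r->l1_present = false;
}
```

A caller that only needs the structures in place passes false and fills the
entries later, while a caller that wants zeroed backing memory passes true.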

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain_page.c    | 10 ++++----
 xen/arch/x86/hvm/hvm.c        |  2 +-
 xen/arch/x86/include/asm/mm.h |  3 +--
 xen/arch/x86/mm.c             | 43 +++++++----------------------------
 xen/arch/x86/pv/domain.c      |  4 +---
 xen/arch/x86/x86_64/mm.c      |  3 +--
 6 files changed, 16 insertions(+), 49 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..ad6d86be6918 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -254,8 +254,7 @@ int mapcache_domain_init(struct domain *d)
     spin_lock_init(&dcache->lock);
 
     return create_perdomain_mapping(d, (unsigned long)dcache->inuse,
-                                    2 * bitmap_pages + 1,
-                                    NIL(l1_pgentry_t *), NULL);
+                                    2 * bitmap_pages + 1, false);
 }
 
 int mapcache_vcpu_init(struct vcpu *v)
@@ -272,16 +271,15 @@ int mapcache_vcpu_init(struct vcpu *v)
     if ( ents > dcache->entries )
     {
         /* Populate page tables. */
-        int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents,
-                                          NIL(l1_pgentry_t *), NULL);
+        int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents, false);
 
         /* Populate bit maps. */
         if ( !rc )
             rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse,
-                                          nr, NULL, NIL(struct page_info *));
+                                          nr, true);
         if ( !rc )
             rc = create_perdomain_mapping(d, (unsigned long)dcache->garbage,
-                                          nr, NULL, NIL(struct page_info *));
+                                          nr, true);
 
         if ( rc )
             return rc;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 70fdddae583d..e7817144059e 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -601,7 +601,7 @@ int hvm_domain_initialise(struct domain *d,
     INIT_LIST_HEAD(&d->arch.hvm.mmcfg_regions);
     INIT_LIST_HEAD(&d->arch.hvm.msix_tables);
 
-    rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL);
+    rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, false);
     if ( rc )
         goto fail;
 
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 65cd751087dc..0c57442c9593 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -601,8 +601,7 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
 #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))
 
 int create_perdomain_mapping(struct domain *d, unsigned long va,
-                             unsigned int nr, l1_pgentry_t **pl1tab,
-                             struct page_info **ppg);
+                             unsigned int nr, bool populate);
 void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                 mfn_t *mfn, unsigned long nr);
 void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 713ae8dd6fa3..45664c56cb8f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6301,8 +6301,7 @@ static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e)
 }
 
 int create_perdomain_mapping(struct domain *d, unsigned long va,
-                             unsigned int nr, l1_pgentry_t **pl1tab,
-                             struct page_info **ppg)
+                             unsigned int nr, bool populate)
 {
     struct page_info *pg;
     l3_pgentry_t *l3tab;
@@ -6351,55 +6350,32 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
 
     unmap_domain_page(l3tab);
 
-    if ( !pl1tab && !ppg )
-    {
-        unmap_domain_page(l2tab);
-        return 0;
-    }
-
     for ( l1tab = NULL; !rc && nr--; )
     {
         l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
 
         if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
         {
-            if ( pl1tab && !IS_NIL(pl1tab) )
-            {
-                l1tab = alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
-                if ( !l1tab )
-                {
-                    rc = -ENOMEM;
-                    break;
-                }
-                ASSERT(!pl1tab[l2_table_offset(va)]);
-                pl1tab[l2_table_offset(va)] = l1tab;
-                pg = virt_to_page(l1tab);
-            }
-            else
+            pg = alloc_domheap_page(d, MEMF_no_owner);
+            if ( !pg )
             {
-                pg = alloc_domheap_page(d, MEMF_no_owner);
-                if ( !pg )
-                {
-                    rc = -ENOMEM;
-                    break;
-                }
-                l1tab = __map_domain_page(pg);
+                rc = -ENOMEM;
+                break;
             }
+            l1tab = __map_domain_page(pg);
             clear_page(l1tab);
             *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW);
         }
         else if ( !l1tab )
             l1tab = map_l1t_from_l2e(*pl2e);
 
-        if ( ppg &&
+        if ( populate &&
              !(l1e_get_flags(l1tab[l1_table_offset(va)]) & _PAGE_PRESENT) )
         {
             pg = alloc_domheap_page(d, MEMF_no_owner);
             if ( pg )
             {
                 clear_domain_page(page_to_mfn(pg));
-                if ( !IS_NIL(ppg) )
-                    *ppg++ = pg;
                 l1tab[l1_table_offset(va)] =
                     l1e_from_page(pg, __PAGE_HYPERVISOR_RW | _PAGE_AVAIL0);
                 l2e_add_flags(*pl2e, _PAGE_AVAIL0);
@@ -6618,10 +6594,7 @@ void free_perdomain_mappings(struct domain *d)
                         unmap_domain_page(l1tab);
                     }
 
-                    if ( is_xen_heap_page(l1pg) )
-                        free_xenheap_page(page_to_virt(l1pg));
-                    else
-                        free_domheap_page(l1pg);
+                    free_domheap_page(l1pg);
                 }
 
             unmap_domain_page(l2tab);
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index dfaeeb2e2cc2..ca32e7b5d686 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -278,9 +278,7 @@ int switch_compat(struct domain *d)
 static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
 {
     return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
-                                    1U << GDT_LDT_VCPU_SHIFT,
-                                    NIL(l1_pgentry_t *),
-                                    NULL);
+                                    1U << GDT_LDT_VCPU_SHIFT, false);
 }
 
 static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index c08b28d9693b..55bba7e473ae 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -731,8 +731,7 @@ void __init zap_low_mappings(void)
 int setup_compat_arg_xlat(struct vcpu *v)
 {
     return create_perdomain_mapping(v->domain, ARG_XLAT_START(v),
-                                    PFN_UP(COMPAT_ARG_XLAT_SIZE),
-                                    NULL, NIL(struct page_info *));
+                                    PFN_UP(COMPAT_ARG_XLAT_SIZE), true);
 }
 
 void free_compat_arg_xlat(struct vcpu *v)
-- 
2.46.0




* [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (8 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-14 16:27   ` Jan Beulich
  2025-01-08 14:26 ` [PATCH v2 11/18] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

In preparation for the per-domain area being per-vCPU.  This requires moving
some of the {create,destroy}_perdomain_mapping() calls from the domain
initialization and tear down paths into vCPU initialization and tear down.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain.c             | 12 ++++++++----
 xen/arch/x86/domain_page.c        | 13 +++++--------
 xen/arch/x86/hvm/hvm.c            |  5 -----
 xen/arch/x86/include/asm/domain.h |  2 +-
 xen/arch/x86/include/asm/mm.h     |  4 ++--
 xen/arch/x86/mm.c                 |  6 ++++--
 xen/arch/x86/pv/domain.c          |  2 +-
 xen/arch/x86/x86_64/mm.c          |  2 +-
 8 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 0481164f3727..6e1f622f7385 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -559,6 +559,10 @@ int arch_vcpu_create(struct vcpu *v)
 
     v->arch.flags = TF_kernel_mode;
 
+    rc = create_perdomain_mapping(v, PERDOMAIN_VIRT_START, 0, false);
+    if ( rc )
+        return rc;
+
     rc = mapcache_vcpu_init(v);
     if ( rc )
         return rc;
@@ -607,6 +611,7 @@ int arch_vcpu_create(struct vcpu *v)
     return rc;
 
  fail:
+    free_perdomain_mappings(v);
     paging_vcpu_teardown(v);
     vcpu_destroy_fpu(v);
     xfree(v->arch.msrs);
@@ -629,6 +634,8 @@ void arch_vcpu_destroy(struct vcpu *v)
         hvm_vcpu_destroy(v);
     else
         pv_vcpu_destroy(v);
+
+    free_perdomain_mappings(v);
 }
 
 int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
@@ -870,8 +877,7 @@ int arch_domain_create(struct domain *d,
     }
     else if ( is_pv_domain(d) )
     {
-        if ( (rc = mapcache_domain_init(d)) != 0 )
-            goto fail;
+        mapcache_domain_init(d);
 
         if ( (rc = pv_domain_initialise(d)) != 0 )
             goto fail;
@@ -909,7 +915,6 @@ int arch_domain_create(struct domain *d,
     XFREE(d->arch.cpu_policy);
     if ( paging_initialised )
         paging_final_teardown(d);
-    free_perdomain_mappings(d);
 
     return rc;
 }
@@ -935,7 +940,6 @@ void arch_domain_destroy(struct domain *d)
 
     if ( is_pv_domain(d) )
         pv_domain_destroy(d);
-    free_perdomain_mappings(d);
 
     free_xenheap_page(d->shared_info);
     cleanup_domain_irq_mapping(d);
diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index ad6d86be6918..1372be20224e 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -231,7 +231,7 @@ void unmap_domain_page(const void *ptr)
     local_irq_restore(flags);
 }
 
-int mapcache_domain_init(struct domain *d)
+void mapcache_domain_init(struct domain *d)
 {
     struct mapcache_domain *dcache = &d->arch.pv.mapcache;
     unsigned int bitmap_pages;
@@ -240,7 +240,7 @@ int mapcache_domain_init(struct domain *d)
 
 #ifdef NDEBUG
     if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
-        return 0;
+        return;
 #endif
 
     BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 +
@@ -252,9 +252,6 @@ int mapcache_domain_init(struct domain *d)
                       (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
 
     spin_lock_init(&dcache->lock);
-
-    return create_perdomain_mapping(d, (unsigned long)dcache->inuse,
-                                    2 * bitmap_pages + 1, false);
 }
 
 int mapcache_vcpu_init(struct vcpu *v)
@@ -271,14 +268,14 @@ int mapcache_vcpu_init(struct vcpu *v)
     if ( ents > dcache->entries )
     {
         /* Populate page tables. */
-        int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents, false);
+        int rc = create_perdomain_mapping(v, MAPCACHE_VIRT_START, ents, false);
 
         /* Populate bit maps. */
         if ( !rc )
-            rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse,
+            rc = create_perdomain_mapping(v, (unsigned long)dcache->inuse,
                                           nr, true);
         if ( !rc )
-            rc = create_perdomain_mapping(d, (unsigned long)dcache->garbage,
+            rc = create_perdomain_mapping(v, (unsigned long)dcache->garbage,
                                           nr, true);
 
         if ( rc )
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e7817144059e..0dc693818349 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -601,10 +601,6 @@ int hvm_domain_initialise(struct domain *d,
     INIT_LIST_HEAD(&d->arch.hvm.mmcfg_regions);
     INIT_LIST_HEAD(&d->arch.hvm.msix_tables);
 
-    rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, false);
-    if ( rc )
-        goto fail;
-
     hvm_init_cacheattr_region_list(d);
 
     rc = paging_enable(d, PG_refcounts|PG_translate|PG_external);
@@ -708,7 +704,6 @@ int hvm_domain_initialise(struct domain *d,
     XFREE(d->arch.hvm.irq);
  fail0:
     hvm_destroy_cacheattr_region_list(d);
- fail:
     hvm_domain_relinquish_resources(d);
     XFREE(d->arch.hvm.io_handler);
     XFREE(d->arch.hvm.pl_time);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index fbe59baa82ec..7c143d2a6c46 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -73,7 +73,7 @@ struct mapcache_domain {
     unsigned long *garbage;
 };
 
-int mapcache_domain_init(struct domain *d);
+void mapcache_domain_init(struct domain *d);
 int mapcache_vcpu_init(struct vcpu *v);
 void mapcache_override_current(struct vcpu *v);
 
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index 0c57442c9593..f501e5e115ff 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -600,13 +600,13 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
 #define NIL(type) ((type *)-sizeof(type))
 #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))
 
-int create_perdomain_mapping(struct domain *d, unsigned long va,
+int create_perdomain_mapping(struct vcpu *v, unsigned long va,
                              unsigned int nr, bool populate);
 void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                 mfn_t *mfn, unsigned long nr);
 void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                unsigned int nr);
-void free_perdomain_mappings(struct domain *d);
+void free_perdomain_mappings(struct vcpu *v);
 
 void __iomem *ioremap_wc(paddr_t pa, size_t len);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 45664c56cb8f..c321f5723b04 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -6300,9 +6300,10 @@ static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e)
            (_PAGE_PRESENT | _PAGE_AVAIL0);
 }
 
-int create_perdomain_mapping(struct domain *d, unsigned long va,
+int create_perdomain_mapping(struct vcpu *v, unsigned long va,
                              unsigned int nr, bool populate)
 {
+    struct domain *d = v->domain;
     struct page_info *pg;
     l3_pgentry_t *l3tab;
     l2_pgentry_t *l2tab;
@@ -6560,8 +6561,9 @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
     unmap_domain_page(l3tab);
 }
 
-void free_perdomain_mappings(struct domain *d)
+void free_perdomain_mappings(struct vcpu *v)
 {
+    struct domain *d = v->domain;
     l3_pgentry_t *l3tab;
     unsigned int i;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index ca32e7b5d686..534d2899100f 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -277,7 +277,7 @@ int switch_compat(struct domain *d)
 
 static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
 {
-    return create_perdomain_mapping(v->domain, GDT_VIRT_START(v),
+    return create_perdomain_mapping(v, GDT_VIRT_START(v),
                                     1U << GDT_LDT_VCPU_SHIFT, false);
 }
 
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index 55bba7e473ae..3b421d218e0b 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -730,7 +730,7 @@ void __init zap_low_mappings(void)
 
 int setup_compat_arg_xlat(struct vcpu *v)
 {
-    return create_perdomain_mapping(v->domain, ARG_XLAT_START(v),
+    return create_perdomain_mapping(v, ARG_XLAT_START(v),
                                     PFN_UP(COMPAT_ARG_XLAT_SIZE), true);
 }
 
-- 
2.46.0




* [PATCH v2 11/18] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (9 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 12/18] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current logic gates issuing TLB flush requests with the FLUSH_ROOT_PGTBL
flag on XPTI being enabled.

In preparation for FLUSH_ROOT_PGTBL also being needed when not using XPTI,
untie it from the xpti domain boolean and instead introduce a new flush_root_pt
field.

No functional change intended, as flush_root_pt == xpti.
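
As an illustration only, the decoupling can be modelled outside of Xen with the
minimal sketch below.  The `pv_domain_sketch` structure and both helpers are
hypothetical stand-ins, not hypervisor code; only the `xpti` and
`flush_root_pt` field names come from this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the PV domain state. */
struct pv_domain_sketch {
    bool xpti;          /* XPTI enabled for the domain */
    bool flush_root_pt; /* issue FLUSH_ROOT_PGTBL on root page-table changes */
};

/* Mirrors the assignment this patch adds to pv_domain_initialise(). */
static void init_flush_root_pt(struct pv_domain_sketch *pv)
{
    assert(pv);
    pv->flush_root_pt = pv->xpti;
}

/* Callers test the dedicated flag instead of looking at xpti directly. */
static bool needs_root_pgtbl_flush(const struct pv_domain_sketch *pv)
{
    assert(pv);
    return pv->flush_root_pt;
}
```

Since the new field is initialised from `xpti`, both predicates agree for now;
a later change can set `flush_root_pt` independently without touching callers.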

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/domain.h   | 2 ++
 xen/arch/x86/include/asm/flushtlb.h | 2 +-
 xen/arch/x86/mm.c                   | 2 +-
 xen/arch/x86/pv/domain.c            | 2 ++
 4 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 7c143d2a6c46..5af414fa64ac 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -281,6 +281,8 @@ struct pv_domain
     bool pcid;
     /* Mitigate L1TF with shadow/crashing? */
     bool check_l1tf;
+    /* Issue FLUSH_ROOT_PGTBL for root page-table changes. */
+    bool flush_root_pt;
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index bb0ad58db49b..1b98d03decdc 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -177,7 +177,7 @@ void flush_area_mask(const cpumask_t *mask, const void *va,
 
 #define flush_root_pgtbl_domain(d)                                       \
 {                                                                        \
-    if ( is_pv_domain(d) && (d)->arch.pv.xpti )                          \
+    if ( is_pv_domain(d) && (d)->arch.pv.flush_root_pt )                 \
         flush_mask((d)->dirty_cpumask, FLUSH_ROOT_PGTBL);                \
 }
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c321f5723b04..49403196d56e 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4178,7 +4178,7 @@ long do_mmu_update(
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                     if ( !rc )
                         flush_linear_pt = true;
-                    if ( !rc && pt_owner->arch.pv.xpti )
+                    if ( !rc && pt_owner->arch.pv.flush_root_pt )
                     {
                         bool local_in_use = false;
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 534d2899100f..5bda168eadff 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -368,6 +368,8 @@ int pv_domain_initialise(struct domain *d)
 
     d->arch.ctxt_switch = &pv_csw;
 
+    d->arch.pv.flush_root_pt = d->arch.pv.xpti;
+
     if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid )
         switch ( ACCESS_ONCE(opt_pcid) )
         {
-- 
2.46.0




* [PATCH v2 12/18] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (10 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 11/18] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Move the handling of FLUSH_ROOT_PGTBL in flush_area_local() ahead of the logic
that does the TLB flushing, in preparation for further changes requiring the
TLB flush to be strictly done after having handled FLUSH_ROOT_PGTBL.

No functional change intended.
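
The ordering can be pictured with the stand-alone sketch below.  It is not the
Xen implementation: the flag values, the `cpu_info_sketch` structure and the
bookkeeping fields are assumptions made up for the example; only the
`root_pgt_changed` name is taken from the code being moved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SK_FLUSH_TLB        0x1u
#define SK_FLUSH_ROOT_PGTBL 0x2u

/* Hypothetical per-CPU state. */
struct cpu_info_sketch {
    bool root_pgt_changed;       /* set when FLUSH_ROOT_PGTBL is requested */
    bool changed_seen_at_flush;  /* what the TLB flush step observed */
    unsigned int tlb_flushes;    /* number of simulated TLB flushes */
};

static unsigned int flush_area_local_sketch(struct cpu_info_sketch *ci,
                                            unsigned int flags)
{
    assert(ci);

    /* Root page-table handling comes first ... */
    if ( flags & SK_FLUSH_ROOT_PGTBL )
        ci->root_pgt_changed = true;

    /* ... so that the TLB flush step can already rely on it. */
    if ( flags & SK_FLUSH_TLB )
    {
        ci->changed_seen_at_flush = ci->root_pgt_changed;
        ci->tlb_flushes++;
    }

    return flags;
}
```

With the previous ordering the simulated flush step would observe the flag as
still clear when both flags are passed in a single request.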

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/flushtlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 65be0474a8ea..a64c28f854ea 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -191,6 +191,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
 {
     unsigned int order = (flags - 1) & FLUSH_ORDER_MASK;
 
+    if ( flags & FLUSH_ROOT_PGTBL )
+        get_cpu_info()->root_pgt_changed = true;
+
     if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) )
     {
         if ( order == 0 )
@@ -254,9 +257,6 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
         }
     }
 
-    if ( flags & FLUSH_ROOT_PGTBL )
-        get_cpu_info()->root_pgt_changed = true;
-
     return flags;
 }
 
-- 
2.46.0




* [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (11 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 12/18] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09 14:58   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 14/18] x86/mm: introduce per-vCPU L3 page-table Roger Pau Monne
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

No functional change, as the option is not used.

Introduce the option now so that newly added functionality can be keyed on it
being enabled, even while the feature itself is non-functional.

When ASI is enabled for PV domains, the XPTI status is not printed if XPTI
must be uniformly disabled due to the usage of ASI.
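
For reference, the interaction between the tri-state ASI and XPTI options can
be sketched as below.  This is a simplified model, not the hypervisor code: the
`asi_opts_sketch` structure and the function name are hypothetical, and only
the path taken on CPUs where XPTI defaults to enabled is modelled (the other
branch is not touched by this patch).  A value of -1 means "not set on the
command line".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the options; -1 == unset, 0 == off, 1 == on. */
struct asi_opts_sketch {
    int8_t vcpu_pt_pv, vcpu_pt_hwdom, vcpu_pt_hvm;
    int8_t xpti_hwdom, xpti_domu;
};

static void resolve_defaults_sketch(struct asi_opts_sketch *o)
{
    assert(o);

    /* ASI stays disabled unless explicitly requested. */
    if ( o->vcpu_pt_pv == -1 )
        o->vcpu_pt_pv = 0;
    if ( o->vcpu_pt_hwdom == -1 )
        o->vcpu_pt_hwdom = 0;
    if ( o->vcpu_pt_hvm == -1 )
        o->vcpu_pt_hvm = 0;

    /* An explicit XPTI request wins over per-vCPU page-tables for PV. */
    if ( (o->xpti_hwdom == 1 || o->xpti_domu == 1) && o->vcpu_pt_pv == 1 )
        o->vcpu_pt_pv = 0;

    /* Unset XPTI options default to the opposite of the ASI setting. */
    if ( o->xpti_hwdom < 0 )
        o->xpti_hwdom = !o->vcpu_pt_hwdom;
    if ( o->xpti_domu < 0 )
        o->xpti_domu = !o->vcpu_pt_pv;
}
```

In this model, booting with no options leaves XPTI enabled and ASI disabled,
while requesting ASI for both PV domUs and the hardware domain turns the XPTI
defaults off.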

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Improve comments and documentation about what ASI provides.
 - Do not print the XPTI information if ASI is used for pv domUs and dom0 is
   PVH, or if ASI is used for both domU and dom0.

FWIW, I would print the state of XPTI uniformly, as otherwise I find the output
might be confusing for users expecting to assert the state of XPTI.
---
 docs/misc/xen-command-line.pandoc    |  19 +++++
 xen/arch/x86/include/asm/domain.h    |   3 +
 xen/arch/x86/include/asm/spec_ctrl.h |   2 +
 xen/arch/x86/spec_ctrl.c             | 115 +++++++++++++++++++++++++--
 4 files changed, 133 insertions(+), 6 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 08b0053f9ced..3c1ad7b5fe7d 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -202,6 +202,25 @@ to appropriate auditing by Xen.  Argo is disabled by default.
     This option is disabled by default, to protect domains from a DoS by a
     buggy or malicious other domain spamming the ring.
 
+### asi (x86)
+> `= List of [ <bool>, {pv,hvm}=<bool>,
+               {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]`
+
+Offers control over whether the hypervisor will engage in Address Space
+Isolation, by not having potentially sensitive information permanently mapped
+in the VMM page-tables.  Using this option might avoid the need to apply
+mitigations for certain speculative related attacks, at the cost of mapping
+sensitive information on-demand.
+
+* `pv=` and `hvm=` sub-options allow enabling for specific guest types.
+
+**WARNING: manual de-selection of enabled options will invalidate any
+protection offered by the feature.  The fine grained options provided below are
+meant to be used for debugging purposes only.**
+
+* `vcpu-pt` ensures each vCPU uses a unique top-level page-table and sets up a
+  virtual address space region to map memory on a per-vCPU basis.
+
 ### asid (x86)
 > `= <boolean>`
 
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 5af414fa64ac..fb92a10bf3b7 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -456,6 +456,9 @@ struct arch_domain
     /* Don't unconditionally inject #GP for unhandled MSRs. */
     bool msr_relaxed;
 
+    /* Use a per-vCPU root pt, and switch per-domain slot to per-vCPU. */
+    bool vcpu_pt;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index 077225418956..c58afbaab671 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -88,6 +88,8 @@ extern uint8_t default_scf;
 
 extern int8_t opt_xpti_hwdom, opt_xpti_domu;
 
+extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm;
+
 extern bool cpu_has_bug_l1tf;
 extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu;
 extern bool opt_bp_spec_reduce;
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index ced84750015c..9463a8624701 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -85,6 +85,11 @@ static int8_t __initdata opt_gds_mit = -1;
 static int8_t __initdata opt_div_scrub = -1;
 bool __ro_after_init opt_bp_spec_reduce = true;
 
+/* Use a per-vCPU root page-table and switch the per-domain slot to per-vCPU. */
+int8_t __ro_after_init opt_vcpu_pt_hvm = -1;
+int8_t __ro_after_init opt_vcpu_pt_hwdom = -1;
+int8_t __ro_after_init opt_vcpu_pt_pv = -1;
+
 static int __init cf_check parse_spec_ctrl(const char *s)
 {
     const char *ss;
@@ -384,6 +389,13 @@ int8_t __ro_after_init opt_xpti_domu = -1;
 
 static __init void xpti_init_default(void)
 {
+    ASSERT(opt_vcpu_pt_pv >= 0 && opt_vcpu_pt_hwdom >= 0);
+    if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_vcpu_pt_pv == 1 )
+    {
+        printk(XENLOG_ERR
+               "XPTI incompatible with per-vCPU page-tables, disabling ASI\n");
+        opt_vcpu_pt_pv = 0;
+    }
     if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) ||
          cpu_has_rdcl_no )
     {
@@ -395,9 +407,9 @@ static __init void xpti_init_default(void)
     else
     {
         if ( opt_xpti_hwdom < 0 )
-            opt_xpti_hwdom = 1;
+            opt_xpti_hwdom = !opt_vcpu_pt_hwdom;
         if ( opt_xpti_domu < 0 )
-            opt_xpti_domu = 1;
+            opt_xpti_domu = !opt_vcpu_pt_pv;
     }
 }
 
@@ -488,6 +500,66 @@ static int __init cf_check parse_pv_l1tf(const char *s)
 }
 custom_param("pv-l1tf", parse_pv_l1tf);
 
+static int __init cf_check parse_asi(const char *s)
+{
+    const char *ss;
+    int val, rc = 0;
+
+    /* Interpret 'asi' alone in its positive boolean form. */
+    if ( *s == '\0' )
+        opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1;
+
+    do {
+        ss = strchr(s, ',');
+        if ( !ss )
+            ss = strchr(s, '\0');
+
+        val = parse_bool(s, ss);
+        switch ( val )
+        {
+        case 0:
+        case 1:
+            opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val;
+            break;
+
+        default:
+            if ( (val = parse_boolean("pv", s, ss)) >= 0 )
+                opt_vcpu_pt_pv = val;
+            else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
+                opt_vcpu_pt_hvm = val;
+            else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 )
+            {
+                switch ( val )
+                {
+                case 1:
+                case 0:
+                    opt_vcpu_pt_pv = opt_vcpu_pt_hvm = opt_vcpu_pt_hwdom = val;
+                    break;
+
+                case -2:
+                    s += strlen("vcpu-pt=");
+                    if ( (val = parse_boolean("pv", s, ss)) >= 0 )
+                        opt_vcpu_pt_pv = val;
+                    else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
+                        opt_vcpu_pt_hvm = val;
+                    else
+                default:
+                        rc = -EINVAL;
+                    break;
+                }
+            }
+            else if ( *s )
+                rc = -EINVAL;
+            break;
+        }
+
+        s = ss + 1;
+    } while ( *ss );
+
+    return rc;
+}
+custom_param("asi", parse_asi);
+
 static void __init print_details(enum ind_thunk thunk)
 {
     unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp;
@@ -668,15 +740,29 @@ static void __init print_details(enum ind_thunk thunk)
            boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV)   ? " IBPB-entry"    : "",
            opt_bhb_entry_pv                          ? " BHB-entry"     : "");
 
-    printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
-           opt_xpti_hwdom ? "enabled" : "disabled",
-           opt_xpti_domu  ? "enabled" : "disabled",
-           xpti_pcid_enabled() ? "" : "out");
+    if ( !opt_vcpu_pt_pv || (!opt_dom0_pvh && !opt_vcpu_pt_hwdom) )
+        printk("  XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n",
+               opt_xpti_hwdom ? "enabled" : "disabled",
+               opt_xpti_domu  ? "enabled" : "disabled",
+               xpti_pcid_enabled() ? "" : "out");
 
     printk("  PV L1TF shadowing: Dom0 %s, DomU %s\n",
            opt_pv_l1tf_hwdom ? "enabled"  : "disabled",
            opt_pv_l1tf_domu  ? "enabled"  : "disabled");
 #endif
+
+#ifdef CONFIG_HVM
+    printk("  ASI features for HVM VMs:%s%s\n",
+           opt_vcpu_pt_hvm                           ? ""               : " None",
+           opt_vcpu_pt_hvm                           ? " vCPU-PT"       : "");
+
+#endif
+#ifdef CONFIG_PV
+    printk("  ASI features for PV VMs:%s%s\n",
+           opt_vcpu_pt_pv                            ? ""               : " None",
+           opt_vcpu_pt_pv                            ? " vCPU-PT"       : "");
+
+#endif
 }
 
 static bool __init check_smt_enabled(void)
@@ -1779,6 +1865,10 @@ void spec_ctrl_init_domain(struct domain *d)
     if ( pv )
         d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom
                                                 : opt_xpti_domu;
+
+    d->arch.vcpu_pt = is_hardware_domain(d) ? opt_vcpu_pt_hwdom
+                                            : pv ? opt_vcpu_pt_pv
+                                                 : opt_vcpu_pt_hvm;
 }
 
 void __init init_speculation_mitigations(void)
@@ -2075,6 +2165,19 @@ void __init init_speculation_mitigations(void)
          hw_smt_enabled && default_xen_spec_ctrl )
         setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE);
 
+    /* Disable all ASI options by default until feature is finished. */
+    if ( opt_vcpu_pt_pv == -1 )
+        opt_vcpu_pt_pv = 0;
+    if ( opt_vcpu_pt_hwdom == -1 )
+        opt_vcpu_pt_hwdom = 0;
+    if ( opt_vcpu_pt_hvm == -1 )
+        opt_vcpu_pt_hvm = 0;
+
+    if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm )
+        warning_add(
+            "Address Space Isolation is not functional, this option is\n"
+            "intended to be used only for development purposes.\n");
+
     xpti_init_default();
 
     l1tf_calculations();
-- 
2.46.0




* [PATCH v2 14/18] x86/mm: introduce per-vCPU L3 page-table
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (12 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI Roger Pau Monne
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, Tim Deegan

Such a table is to be used in the per-domain slot when running with Address
Space Isolation enabled for the domain.
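
The selection repeated throughout the patch boils down to the pattern sketched
below.  The `*_sketch` types and the `perdomain_l3()` helper are hypothetical
and exist only for this example (the patch open-codes the conditional at each
site); the field names match the ones used in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page_info, struct domain, struct vcpu. */
struct page_sketch { int id; };

struct domain_sketch {
    bool vcpu_pt;                        /* ASI: per-vCPU page-tables in use */
    struct page_sketch *perdomain_l3_pg; /* shared by all vCPUs */
};

struct vcpu_sketch {
    struct domain_sketch *domain;
    struct page_sketch *pervcpu_l3_pg;   /* private to this vCPU */
};

/* Pick the L3 table that backs the per-domain L4 slot for this vCPU. */
static struct page_sketch *perdomain_l3(const struct vcpu_sketch *v)
{
    assert(v && v->domain);
    return v->domain->vcpu_pt ? v->pervcpu_l3_pg
                              : v->domain->perdomain_l3_pg;
}
```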

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/domain.h |  3 +++
 xen/arch/x86/include/asm/mm.h     |  2 +-
 xen/arch/x86/mm.c                 | 45 ++++++++++++++++++++++---------
 xen/arch/x86/mm/hap/hap.c         |  2 +-
 xen/arch/x86/mm/shadow/hvm.c      |  2 +-
 xen/arch/x86/mm/shadow/multi.c    |  2 +-
 xen/arch/x86/pv/dom0_build.c      |  2 +-
 xen/arch/x86/pv/domain.c          |  2 +-
 8 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index fb92a10bf3b7..5bf0ad3fdcf7 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -666,6 +666,9 @@ struct arch_vcpu
 
     struct vcpu_msrs *msrs;
 
+    /* ASI: per-vCPU L3 table to use in the L4 per-domain slot. */
+    struct page_info *pervcpu_l3_pg;
+
     struct {
         bool next_interrupt_enabled;
     } monitor;
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index f501e5e115ff..f79d1594fde4 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -375,7 +375,7 @@ int devalidate_page(struct page_info *page, unsigned long type,
 
 void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d);
 void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
-                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt);
+                       const struct vcpu *v, mfn_t sl4mfn, bool ro_mpt);
 bool fill_ro_mpt(mfn_t mfn);
 void zap_ro_mpt(mfn_t mfn);
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 49403196d56e..583bf4c58bf9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1658,8 +1658,9 @@ static int promote_l3_table(struct page_info *page)
  * extended directmap.
  */
 void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
-                       const struct domain *d, mfn_t sl4mfn, bool ro_mpt)
+                       const struct vcpu *v, mfn_t sl4mfn, bool ro_mpt)
 {
+    const struct domain *d = v->domain;
     /*
      * PV vcpus need a shortened directmap.  HVM and Idle vcpus get the full
      * directmap.
@@ -1687,7 +1688,9 @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn,
 
     /* Slot 260: Per-domain mappings. */
     l4t[l4_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+        l4e_from_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                      : d->arch.perdomain_l3_pg,
+                      __PAGE_HYPERVISOR_RW);
 
     /* Slot 4: Per-domain mappings mirror. */
     BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) &&
@@ -1842,8 +1845,15 @@ static int promote_l4_table(struct page_info *page)
 
     if ( !rc )
     {
+        /*
+         * Use vCPU#0 unconditionally.  When not running with ASI enabled the
+         * per-domain table is shared between all vCPUs, so it doesn't matter
+         * which vCPU gets passed to init_xen_l4_slots().  When running with
+         * ASI enabled this L4 will not be used, as a shadow per-vCPU L4 is
+         * used instead.
+         */
         init_xen_l4_slots(pl4e, l4mfn,
-                          d, INVALID_MFN, VM_ASSIST(d, m2p_strict));
+                          d->vcpu[0], INVALID_MFN, VM_ASSIST(d, m2p_strict));
         atomic_inc(&d->arch.pv.nr_l4_pages);
     }
     unmap_domain_page(pl4e);
@@ -6313,14 +6323,17 @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va,
     ASSERT(va >= PERDOMAIN_VIRT_START &&
            va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
 
-    if ( !d->arch.perdomain_l3_pg )
+    if ( !v->arch.pervcpu_l3_pg && !d->arch.perdomain_l3_pg )
     {
         pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( !pg )
             return -ENOMEM;
         l3tab = __map_domain_page(pg);
         clear_page(l3tab);
-        d->arch.perdomain_l3_pg = pg;
+        if ( d->arch.vcpu_pt )
+            v->arch.pervcpu_l3_pg = pg;
+        else
+            d->arch.perdomain_l3_pg = pg;
         if ( !nr )
         {
             unmap_domain_page(l3tab);
@@ -6330,7 +6343,8 @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va,
     else if ( !nr )
         return 0;
     else
-        l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+        l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                                  : d->arch.perdomain_l3_pg);
 
     ASSERT(!l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
 
@@ -6436,8 +6450,9 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
         return;
     }
 
-    ASSERT(d->arch.perdomain_l3_pg);
-    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    ASSERT(d->arch.perdomain_l3_pg || v->arch.pervcpu_l3_pg);
+    l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                              : d->arch.perdomain_l3_pg);
 
     if ( unlikely(!(l3e_get_flags(l3tab[l3_table_offset(va)]) &
                     _PAGE_PRESENT)) )
@@ -6498,7 +6513,7 @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
            va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
     ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
 
-    if ( !d->arch.perdomain_l3_pg )
+    if ( !d->arch.perdomain_l3_pg && !v->arch.pervcpu_l3_pg )
         return;
 
     /* Use likely to force the optimization for the fast path. */
@@ -6522,7 +6537,8 @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
         return;
     }
 
-    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                              : d->arch.perdomain_l3_pg);
     pl3e = l3tab + l3_table_offset(va);
 
     if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT )
@@ -6567,10 +6583,11 @@ void free_perdomain_mappings(struct vcpu *v)
     l3_pgentry_t *l3tab;
     unsigned int i;
 
-    if ( !d->arch.perdomain_l3_pg )
+    if ( !v->arch.pervcpu_l3_pg && !d->arch.perdomain_l3_pg )
         return;
 
-    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
+    l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                              : d->arch.perdomain_l3_pg);
 
     for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
         if ( l3e_get_flags(l3tab[i]) & _PAGE_PRESENT )
@@ -6604,8 +6621,10 @@ void free_perdomain_mappings(struct vcpu *v)
         }
 
     unmap_domain_page(l3tab);
-    free_domheap_page(d->arch.perdomain_l3_pg);
+    free_domheap_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
+                                      : d->arch.perdomain_l3_pg);
     d->arch.perdomain_l3_pg = NULL;
+    v->arch.pervcpu_l3_pg = NULL;
 }
 
 static void write_sss_token(unsigned long *ptr)
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index ec5043a8aa9e..c7d9bf7c71bf 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -402,7 +402,7 @@ static mfn_t hap_make_monitor_table(struct vcpu *v)
     m4mfn = page_to_mfn(pg);
     l4e = map_domain_page(m4mfn);
 
-    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
+    init_xen_l4_slots(l4e, m4mfn, v, INVALID_MFN, false);
     unmap_domain_page(l4e);
 
     return m4mfn;
diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c
index 114957a3e1ec..d588dbbae003 100644
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -776,7 +776,7 @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels)
      * shadow-linear mapping will either be inserted below when creating
      * lower level monitor tables, or later in sh_update_cr3().
      */
-    init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false);
+    init_xen_l4_slots(l4e, m4mfn, v, INVALID_MFN, false);
 
     if ( shadow_levels < 4 )
     {
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 10ddc408ff73..a1f8147e197a 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -973,7 +973,7 @@ sh_make_shadow(struct vcpu *v, mfn_t gmfn, u32 shadow_type)
 
             BUILD_BUG_ON(sizeof(l4_pgentry_t) != sizeof(shadow_l4e_t));
 
-            init_xen_l4_slots(l4t, gmfn, d, smfn, (!is_pv_32bit_domain(d) &&
+            init_xen_l4_slots(l4t, gmfn, v, smfn, (!is_pv_32bit_domain(d) &&
                                                    VM_ASSIST(d, m2p_strict)));
             unmap_domain_page(l4t);
         }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index f54d1da5c6f4..5081c19b9a9a 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -737,7 +737,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
         l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
         clear_page(l4tab);
         init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-                          d, INVALID_MFN, true);
+                          d->vcpu[0], INVALID_MFN, true);
         v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
     }
     else
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 5bda168eadff..8d2428051607 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -125,7 +125,7 @@ static int setup_compat_l4(struct vcpu *v)
     mfn = page_to_mfn(pg);
     l4tab = map_domain_page(mfn);
     clear_page(l4tab);
-    init_xen_l4_slots(l4tab, mfn, v->domain, INVALID_MFN, false);
+    init_xen_l4_slots(l4tab, mfn, v, INVALID_MFN, false);
     unmap_domain_page(l4tab);
 
     /* This page needs to look like a pagetable so that it can be shadowed */
-- 
2.46.0




* [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (13 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 14/18] x86/mm: introduce per-vCPU L3 page-table Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-09 15:08   ` Alejandro Vallejo
  2025-01-08 14:26 ` [PATCH v2 16/18] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

When using a unique per-vCPU root page table the per-domain region becomes
per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
to create per-vCPU mappings when possible.  Note the lock is also not needed
when using per-vCPU mapcaches, as the structure is no longer shared.

This introduces some duplication in the domain and vcpu structures, as both
contain a mapcache field to support running with and without per-vCPU
page-tables.
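
The resulting cache selection and conditional locking can be pictured with the
stand-alone model below.  It is a deliberately reduced sketch: the `*_sketch`
types are hypothetical, the slot allocator is a plain cursor rather than the
inuse/garbage bitmaps, and a counter stands in for the spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced mapcache: only a cursor for the next free slot. */
struct mapcache_sketch {
    unsigned int cursor;
};

struct domain_sketch {
    bool vcpu_pt;                    /* per-vCPU page-tables in use */
    struct mapcache_sketch cache;    /* shared cache, used without ASI */
    unsigned int lock_acquisitions;  /* stands in for spin_lock() calls */
};

struct vcpu_sketch {
    struct domain_sketch *domain;
    struct mapcache_sketch cache;    /* private cache, used with ASI */
};

static unsigned int alloc_slot_sketch(struct vcpu_sketch *v)
{
    struct domain_sketch *d;
    struct mapcache_sketch *cache;

    assert(v && v->domain);
    d = v->domain;
    cache = d->vcpu_pt ? &v->cache : &d->cache;

    /* Only the shared cache needs serialising between vCPUs. */
    if ( !d->vcpu_pt )
        d->lock_acquisitions++;

    return cache->cursor++;
}
```

With per-vCPU caches two vCPUs can both be handed slot 0, as each index refers
to a different, private region; with the shared cache the indexes are unique
across the domain and every allocation takes the lock.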

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
 xen/arch/x86/include/asm/domain.h | 20 ++++---
 2 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index 1372be20224e..65900d6218f8 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
     struct vcpu *v;
     struct mapcache_domain *dcache;
     struct mapcache_vcpu *vcache;
+    struct mapcache *cache;
     struct vcpu_maphash_entry *hashent;
+    struct domain *d;
 
 #ifdef NDEBUG
     if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
@@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
     if ( !v || !is_pv_vcpu(v) )
         return mfn_to_virt(mfn_x(mfn));
 
-    dcache = &v->domain->arch.pv.mapcache;
+    d = v->domain;
+    dcache = &d->arch.pv.mapcache;
     vcache = &v->arch.pv.mapcache;
-    if ( !dcache->inuse )
+    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
+                            : &d->arch.pv.mapcache.cache;
+    if ( !cache->inuse )
         return mfn_to_virt(mfn_x(mfn));
 
     perfc_incr(map_domain_page_count);
@@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
     if ( hashent->mfn == mfn_x(mfn) )
     {
         idx = hashent->idx;
-        ASSERT(idx < dcache->entries);
+        ASSERT(idx < cache->entries);
         hashent->refcnt++;
         ASSERT(hashent->refcnt);
         ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
         goto out;
     }
 
-    spin_lock(&dcache->lock);
+    if ( !d->arch.vcpu_pt )
+        spin_lock(&dcache->lock);
 
     /* Has some other CPU caused a wrap? We must flush if so. */
-    if ( unlikely(dcache->epoch != vcache->shadow_epoch) )
+    if ( unlikely(!d->arch.vcpu_pt && dcache->epoch != vcache->shadow_epoch) )
     {
         vcache->shadow_epoch = dcache->epoch;
         if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) )
@@ -118,21 +124,21 @@ void *map_domain_page(mfn_t mfn)
         }
     }
 
-    idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
-    if ( unlikely(idx >= dcache->entries) )
+    idx = find_next_zero_bit(cache->inuse, cache->entries, cache->cursor);
+    if ( unlikely(idx >= cache->entries) )
     {
         unsigned long accum = 0, prev = 0;
 
         /* /First/, clean the garbage map and update the inuse list. */
-        for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
+        for ( i = 0; i < BITS_TO_LONGS(cache->entries); i++ )
         {
             accum |= prev;
-            dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
-            prev = ~dcache->inuse[i];
+            cache->inuse[i] &= ~xchg(&cache->garbage[i], 0);
+            prev = ~cache->inuse[i];
         }
 
-        if ( accum | (prev & BITMAP_LAST_WORD_MASK(dcache->entries)) )
-            idx = find_first_zero_bit(dcache->inuse, dcache->entries);
+        if ( accum | (prev & BITMAP_LAST_WORD_MASK(cache->entries)) )
+            idx = find_first_zero_bit(cache->inuse, cache->entries);
         else
         {
             /* Replace a hash entry instead. */
@@ -152,19 +158,23 @@ void *map_domain_page(mfn_t mfn)
                     i = 0;
             } while ( i != MAPHASH_HASHFN(mfn_x(mfn)) );
         }
-        BUG_ON(idx >= dcache->entries);
+        BUG_ON(idx >= cache->entries);
 
         /* /Second/, flush TLBs. */
         perfc_incr(domain_page_tlb_flush);
         flush_tlb_local();
-        vcache->shadow_epoch = ++dcache->epoch;
-        dcache->tlbflush_timestamp = tlbflush_current_time();
+        if ( !d->arch.vcpu_pt )
+        {
+            vcache->shadow_epoch = ++dcache->epoch;
+            dcache->tlbflush_timestamp = tlbflush_current_time();
+        }
     }
 
-    set_bit(idx, dcache->inuse);
-    dcache->cursor = idx + 1;
+    set_bit(idx, cache->inuse);
+    cache->cursor = idx + 1;
 
-    spin_unlock(&dcache->lock);
+    if ( !d->arch.vcpu_pt )
+        spin_unlock(&dcache->lock);
 
     l1e_write(&MAPCACHE_L1ENT(idx), l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW));
 
@@ -178,6 +188,7 @@ void unmap_domain_page(const void *ptr)
     unsigned int idx;
     struct vcpu *v;
     struct mapcache_domain *dcache;
+    struct mapcache *cache;
     unsigned long va = (unsigned long)ptr, mfn, flags;
     struct vcpu_maphash_entry *hashent;
 
@@ -190,7 +201,9 @@ void unmap_domain_page(const void *ptr)
     ASSERT(v && is_pv_vcpu(v));
 
     dcache = &v->domain->arch.pv.mapcache;
-    ASSERT(dcache->inuse);
+    cache = v->domain->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
+                                    : &v->domain->arch.pv.mapcache.cache;
+    ASSERT(cache->inuse);
 
     idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
     mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx));
@@ -213,7 +226,7 @@ void unmap_domain_page(const void *ptr)
                    hashent->mfn);
             l1e_write(&MAPCACHE_L1ENT(hashent->idx), l1e_empty());
             /* /Second/, mark as garbage. */
-            set_bit(hashent->idx, dcache->garbage);
+            set_bit(hashent->idx, cache->garbage);
         }
 
         /* Add newly-freed mapping to the maphash. */
@@ -225,7 +238,7 @@ void unmap_domain_page(const void *ptr)
         /* /First/, zap the PTE. */
         l1e_write(&MAPCACHE_L1ENT(idx), l1e_empty());
         /* /Second/, mark as garbage. */
-        set_bit(idx, dcache->garbage);
+        set_bit(idx, cache->garbage);
     }
 
     local_irq_restore(flags);
@@ -234,7 +247,6 @@ void unmap_domain_page(const void *ptr)
 void mapcache_domain_init(struct domain *d)
 {
     struct mapcache_domain *dcache = &d->arch.pv.mapcache;
-    unsigned int bitmap_pages;
 
     ASSERT(is_pv_domain(d));
 
@@ -243,13 +255,12 @@ void mapcache_domain_init(struct domain *d)
         return;
 #endif
 
+    if ( d->arch.vcpu_pt )
+        return;
+
     BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 +
                  2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) >
                  MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
-    bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
-    dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
-    dcache->garbage = dcache->inuse +
-                      (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
 
     spin_lock_init(&dcache->lock);
 }
@@ -258,30 +269,45 @@ int mapcache_vcpu_init(struct vcpu *v)
 {
     struct domain *d = v->domain;
     struct mapcache_domain *dcache = &d->arch.pv.mapcache;
+    struct mapcache *cache;
     unsigned long i;
-    unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
+    unsigned int ents = (d->arch.vcpu_pt ? 1 : d->max_vcpus) *
+                        MAPCACHE_VCPU_ENTRIES;
     unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
 
-    if ( !is_pv_vcpu(v) || !dcache->inuse )
+    if ( !is_pv_vcpu(v) )
         return 0;
 
-    if ( ents > dcache->entries )
+    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
+                            : &dcache->cache;
+
+    if ( !cache->inuse )
+        return 0;
+
+    if ( ents > cache->entries )
     {
         /* Populate page tables. */
         int rc = create_perdomain_mapping(v, MAPCACHE_VIRT_START, ents, false);
+        const unsigned int bitmap_pages =
+            PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
+
+        cache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
+        cache->garbage = cache->inuse +
+                         (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
+
 
         /* Populate bit maps. */
         if ( !rc )
-            rc = create_perdomain_mapping(v, (unsigned long)dcache->inuse,
+            rc = create_perdomain_mapping(v, (unsigned long)cache->inuse,
                                           nr, true);
         if ( !rc )
-            rc = create_perdomain_mapping(v, (unsigned long)dcache->garbage,
+            rc = create_perdomain_mapping(v, (unsigned long)cache->garbage,
                                           nr, true);
 
         if ( rc )
             return rc;
 
-        dcache->entries = ents;
+        cache->entries = ents;
     }
 
     /* Mark all maphash entries as not in use. */
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 5bf0ad3fdcf7..ba5440099d90 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -41,6 +41,16 @@ struct trap_bounce {
     unsigned long eip;
 };
 
+struct mapcache {
+    /* The number of array entries, and a cursor into the array. */
+    unsigned int entries;
+    unsigned int cursor;
+
+    /* Which mappings are in use, and which are garbage to reap next epoch? */
+    unsigned long *inuse;
+    unsigned long *garbage;
+};
+
 #define MAPHASH_ENTRIES 8
 #define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1))
 #define MAPHASHENT_NOTINUSE ((u32)~0U)
@@ -54,13 +64,11 @@ struct mapcache_vcpu {
         uint32_t      idx;
         uint32_t      refcnt;
     } hash[MAPHASH_ENTRIES];
+
+    struct mapcache cache;
 };
 
 struct mapcache_domain {
-    /* The number of array entries, and a cursor into the array. */
-    unsigned int entries;
-    unsigned int cursor;
-
     /* Protects map_domain_page(). */
     spinlock_t lock;
 
@@ -68,9 +76,7 @@ struct mapcache_domain {
     unsigned int epoch;
     u32 tlbflush_timestamp;
 
-    /* Which mappings are in use, and which are garbage to reap next epoch? */
-    unsigned long *inuse;
-    unsigned long *garbage;
+    struct mapcache cache;
 };
 
 void mapcache_domain_init(struct domain *d);
-- 
2.46.0




* [PATCH v2 16/18] x86/pv: allow using a unique per-pCPU root page table (L4)
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (14 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 17/18] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

When running PV guests it's possible for the guest to use the same root page
table (L4) for all vCPUs, which in turn results in Xen also using the same
root page table on all pCPUs that are running any vCPU of the domain.

When using XPTI, Xen switches to a per-CPU shadow L4 when running in guest
context, and switches back to the fully populated L4 when in Xen context.

Take advantage of this existing shadowing and force the usage of a per-CPU L4
that shadows the guest selected L4 when Address Space Isolation is requested
for PV guests.

The guest L4 is mapped using a per-CPU fixmap entry, which however requires
that the currently loaded L4 has the per-CPU slot set up.  To ensure this,
first switch to the shadow per-CPU L4 with just the Xen slots populated, then
map the guest L4 and copy the contents of the guest controlled slots.
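
As a rough illustration of the copy step described above (not the code in
this patch), the following standalone sketch copies only the guest controlled
slots of an L4 into a shadow.  The slot numbers and the l4e_t type are
assumptions chosen to resemble 64-bit Xen, and sync_shadow_l4() is a
hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values resembling 64-bit Xen; illustrative only. */
#define L4_ENTRIES      512u
#define FIRST_XEN_SLOT  256u /* assumption: ROOT_PAGETABLE_FIRST_XEN_SLOT */
#define LAST_XEN_SLOT   271u /* assumption: ROOT_PAGETABLE_LAST_XEN_SLOT */
#define RO_MPT_SLOT     256u /* assumption: l4_table_offset(RO_MPT_VIRT_START) */

typedef uint64_t l4e_t;      /* stand-in for root_pgentry_t */

/*
 * Hypothetical helper: refresh the shadow L4 from the guest L4, leaving the
 * Xen owned slots of the shadow untouched, except for the read-only
 * machine-to-phys slot which is taken from the guest table.
 */
static void sync_shadow_l4(l4e_t *shadow, const l4e_t *guest)
{
    unsigned int i;

    /* Slots below the Xen range are guest controlled. */
    for ( i = 0; i < FIRST_XEN_SLOT; i++ )
        shadow[i] = guest[i];

    /* Slots above the Xen range are guest controlled as well. */
    for ( i = LAST_XEN_SLOT + 1; i < L4_ENTRIES; i++ )
        shadow[i] = guest[i];

    /* The RO M2P slot depends on VM assists, so mirror the guest L4. */
    shadow[RO_MPT_SLOT] = guest[RO_MPT_SLOT];
}
```

Calling sync_shadow_l4(shadow, guest) on two 512-entry arrays leaves slots
257-271 of the shadow as they were, which is what keeps the per-CPU Xen
mappings private to the shadow.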

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/flushtlb.c           | 22 +++++++++++++++++
 xen/arch/x86/include/asm/config.h |  6 +++++
 xen/arch/x86/include/asm/domain.h |  3 +++
 xen/arch/x86/include/asm/pv/mm.h  |  5 ++++
 xen/arch/x86/mm.c                 | 12 +++++++++-
 xen/arch/x86/mm/paging.c          |  6 +++++
 xen/arch/x86/pv/dom0_build.c      | 10 ++++++--
 xen/arch/x86/pv/domain.c          | 31 +++++++++++++++++++++++-
 xen/arch/x86/pv/mm.c              | 40 +++++++++++++++++++++++++++++++
 9 files changed, 131 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index a64c28f854ea..72692b504dd4 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -17,6 +17,7 @@
 #include <asm/nops.h>
 #include <asm/page.h>
 #include <asm/pv/domain.h>
+#include <asm/pv/mm.h>
 #include <asm/spec_ctrl.h>
 
 /* Debug builds: Wrap frequently to stress-test the wrap logic. */
@@ -192,7 +193,28 @@ unsigned int flush_area_local(const void *va, unsigned int flags)
     unsigned int order = (flags - 1) & FLUSH_ORDER_MASK;
 
     if ( flags & FLUSH_ROOT_PGTBL )
+    {
         get_cpu_info()->root_pgt_changed = true;
+        /*
+         * Use opt_vcpu_pt_pv instead of current->arch.vcpu_pt to avoid doing a
+         * sync_local_execstate() when per-vCPU page-tables are not enabled for
+         * PV.
+         */
+        if ( opt_vcpu_pt_pv )
+        {
+            const struct vcpu *curr;
+            const struct domain *curr_d;
+
+            sync_local_execstate();
+
+            curr = current;
+            curr_d = curr->domain;
+
+            if ( is_pv_domain(curr_d) && curr_d->arch.vcpu_pt )
+                /* Update shadow root page-table ahead of doing TLB flush. */
+                pv_asi_update_shadow_l4(curr);
+        }
+    }
 
     if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) )
     {
diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h
index 19746f956ec3..af3ff3cb8705 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -265,6 +265,12 @@ extern unsigned long xen_phys_start;
 /* The address of a particular VCPU's GDT or LDT. */
 #define GDT_VIRT_START(v)    \
     (PERDOMAIN_VIRT_START + ((v)->vcpu_id << GDT_LDT_VCPU_VA_SHIFT))
+/*
+ * There are 2 GDT pages reserved for Xen, but only one is used.  Use the
+ * remaining one to map the guest L4 when running with ASI enabled.
+ */
+#define L4_SHADOW(v) \
+    (GDT_VIRT_START(v) + ((FIRST_RESERVED_GDT_PAGE + 1) << PAGE_SHIFT))
 #define LDT_VIRT_START(v)    \
     (GDT_VIRT_START(v) + (64*1024))
 
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index ba5440099d90..a3c75e323cde 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -591,6 +591,9 @@ struct pv_vcpu
     /* Deferred VA-based update state. */
     bool need_update_runstate_area;
     struct vcpu_time_info pending_system_time;
+
+    /* For ASI: page to use as L4 shadow of the guest selected L4. */
+    root_pgentry_t *root_pgt;
 };
 
 struct arch_vcpu
diff --git a/xen/arch/x86/include/asm/pv/mm.h b/xen/arch/x86/include/asm/pv/mm.h
index 182764542c1f..540202f9712a 100644
--- a/xen/arch/x86/include/asm/pv/mm.h
+++ b/xen/arch/x86/include/asm/pv/mm.h
@@ -23,6 +23,8 @@ bool pv_destroy_ldt(struct vcpu *v);
 
 int validate_segdesc_page(struct page_info *page);
 
+void pv_asi_update_shadow_l4(const struct vcpu *v);
+
 #else
 
 #include <xen/errno.h>
@@ -44,6 +46,9 @@ static inline bool pv_map_ldt_shadow_page(unsigned int off) { return false; }
 static inline bool pv_destroy_ldt(struct vcpu *v)
 { ASSERT_UNREACHABLE(); return false; }
 
+static inline void pv_asi_update_shadow_l4(const struct vcpu *v)
+{ ASSERT_UNREACHABLE(); }
+
 #endif
 
 #endif /* __X86_PV_MM_H__ */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 583bf4c58bf9..3a637e508ff3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -546,6 +546,8 @@ void write_ptbase(struct vcpu *v)
     }
     else
     {
+        if ( is_pv_domain(d) && d->arch.vcpu_pt )
+            pv_asi_update_shadow_l4(v);
         /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
@@ -565,6 +567,7 @@ void write_ptbase(struct vcpu *v)
  */
 pagetable_t update_cr3(struct vcpu *v)
 {
+    const struct domain *d = v->domain;
     mfn_t cr3_mfn;
 
     if ( paging_mode_enabled(v->domain) )
@@ -575,7 +578,14 @@ pagetable_t update_cr3(struct vcpu *v)
     else
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table);
 
-    make_cr3(v, cr3_mfn);
+    make_cr3(v, d->arch.vcpu_pt ? virt_to_mfn(v->arch.pv.root_pgt) : cr3_mfn);
+
+    if ( d->arch.vcpu_pt )
+    {
+        populate_perdomain_mapping(v, L4_SHADOW(v), &cr3_mfn, 1);
+        if ( v == this_cpu(curr_vcpu) )
+            flush_tlb_one_local(L4_SHADOW(v));
+    }
 
     return pagetable_null();
 }
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index c77f4c1dac52..be30f21c1a7b 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -695,6 +695,12 @@ int paging_domctl(struct domain *d, struct xen_domctl_shadow_op *sc,
         return -EINVAL;
     }
 
+    if ( is_pv_domain(d) && d->arch.vcpu_pt )
+    {
+        gprintk(XENLOG_ERR, "Paging not supported on PV domains with ASI\n");
+        return -EOPNOTSUPP;
+    }
+
     if ( resuming
          ? (d->arch.paging.preempt.dom != current->domain ||
             d->arch.paging.preempt.op != sc->op)
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 5081c19b9a9a..6c1d99a9bf0d 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -838,8 +838,11 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
 
     d->arch.paging.mode = 0;
 
-    /* Set up CR3 value for switch_cr3_cr4(). */
-    update_cr3(v);
+    /*
+     * Set up CR3 value for switch_cr3_cr4().  Use make_cr3() instead of
+     * update_cr3() to avoid using an ASI page-table for dom0 building.
+     */
+    make_cr3(v, pagetable_get_mfn(v->arch.guest_table));
 
     /* We run on dom0's page tables for the final part of the build process. */
     switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
@@ -1068,6 +1071,9 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
     }
 #endif
 
+    /* Must be called in case ASI is enabled. */
+    update_cr3(v);
+
     v->is_initialised = 1;
     clear_bit(_VPF_down, &v->pause_flags);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 8d2428051607..583723c5d360 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -15,6 +15,7 @@
 #include <asm/invpcid.h>
 #include <asm/spec_ctrl.h>
 #include <asm/pv/domain.h>
+#include <asm/pv/mm.h>
 #include <asm/shadow.h>
 
 #ifdef CONFIG_PV32
@@ -296,6 +297,7 @@ void pv_vcpu_destroy(struct vcpu *v)
 
     pv_destroy_gdt_ldt_l1tab(v);
     XFREE(v->arch.pv.trap_ctxt);
+    FREE_XENHEAP_PAGE(v->arch.pv.root_pgt);
 }
 
 int pv_vcpu_initialise(struct vcpu *v)
@@ -336,6 +338,24 @@ int pv_vcpu_initialise(struct vcpu *v)
             goto done;
     }
 
+    if ( d->arch.vcpu_pt )
+    {
+        v->arch.pv.root_pgt = alloc_xenheap_page();
+        if ( !v->arch.pv.root_pgt )
+        {
+            rc = -ENOMEM;
+            goto done;
+        }
+
+        /*
+         * VM assists are not yet known, RO machine-to-phys slot will be copied
+         * from the guest L4.
+         */
+        init_xen_l4_slots(v->arch.pv.root_pgt,
+                          _mfn(virt_to_mfn(v->arch.pv.root_pgt)),
+                          v, INVALID_MFN, false);
+    }
+
  done:
     if ( rc )
         pv_vcpu_destroy(v);
@@ -368,7 +388,7 @@ int pv_domain_initialise(struct domain *d)
 
     d->arch.ctxt_switch = &pv_csw;
 
-    d->arch.pv.flush_root_pt = d->arch.pv.xpti;
+    d->arch.pv.flush_root_pt = d->arch.pv.xpti || d->arch.vcpu_pt;
 
     if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid )
         switch ( ACCESS_ONCE(opt_pcid) )
@@ -409,6 +429,7 @@ bool __init xpti_pcid_enabled(void)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    const struct domain *d = v->domain;
     bool guest_update;
     pagetable_t old_shadow;
     unsigned long cr3;
@@ -417,6 +438,14 @@ static void _toggle_guest_pt(struct vcpu *v)
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
 
+    if ( d->arch.vcpu_pt )
+        /*
+         * _toggle_guest_pt() might switch between user and kernel page tables,
+         * but doesn't use write_ptbase(), and hence needs an explicit call to
+         * sync the shadow L4.
+         */
+        pv_asi_update_shadow_l4(v);
+
     /*
      * Don't flush user global mappings from the TLB. Don't tick TLB clock.
      *
diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
index 4853e619f2a7..46c437692bea 100644
--- a/xen/arch/x86/pv/mm.c
+++ b/xen/arch/x86/pv/mm.c
@@ -12,6 +12,7 @@
 
 #include <asm/current.h>
 #include <asm/p2m.h>
+#include <asm/pv/domain.h>
 
 #include "mm.h"
 
@@ -104,6 +105,45 @@ void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d)
 }
 #endif
 
+void pv_asi_update_shadow_l4(const struct vcpu *v)
+{
+    const root_pgentry_t *guest_pgt;
+    root_pgentry_t *root_pgt = v->arch.pv.root_pgt;
+    const struct domain *d = v->domain;
+
+    ASSERT(!d->arch.pv.xpti);
+    ASSERT(is_pv_domain(d));
+    ASSERT(!is_idle_domain(d));
+    ASSERT(current == this_cpu(curr_vcpu));
+
+    if ( likely(v == current) )
+        guest_pgt = (void *)L4_SHADOW(v);
+    else if ( !(v->arch.flags & TF_kernel_mode) )
+        guest_pgt =
+            map_domain_page(pagetable_get_mfn(v->arch.guest_table_user));
+    else
+        guest_pgt = map_domain_page(pagetable_get_mfn(v->arch.guest_table));
+
+    if ( is_pv_64bit_domain(d) )
+    {
+        unsigned int i;
+
+        for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ )
+            l4e_write(&root_pgt[i], guest_pgt[i]);
+        for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1;
+              i < L4_PAGETABLE_ENTRIES; i++ )
+            l4e_write(&root_pgt[i], guest_pgt[i]);
+
+        l4e_write(&root_pgt[l4_table_offset(RO_MPT_VIRT_START)],
+                  guest_pgt[l4_table_offset(RO_MPT_VIRT_START)]);
+    }
+    else
+        l4e_write(&root_pgt[0], guest_pgt[0]);
+
+    if ( v != this_cpu(curr_vcpu) )
+        unmap_domain_page(guest_pgt);
+}
+
 /*
  * Local variables:
  * mode: C
-- 
2.46.0




* [PATCH v2 17/18] x86/mm: switch to a per-CPU mapped stack when using ASI
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (15 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 16/18] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-08 14:26 ` [PATCH v2 18/18] x86/mm: zero stack on context switch Roger Pau Monne
  2025-01-14 16:20 ` [PATCH v2 00/18] x86: adventures in Address Space Isolation Jan Beulich
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

When using ASI the CPU stack is mapped using a range of fixmap entries in the
per-CPU region.  This ensures the stack is only accessible by the current CPU.

Note however there's further work required in order to allocate the stack from
domheap instead of xenheap, and ensure the stack is not part of the direct
map.

For domains not running with ASI enabled all the CPU stacks are mapped in the
per-domain L3, so that the stack is always at the same linear address,
regardless of whether ASI is enabled or not for the domain.
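
To make the fixed linear address concrete, here is a small standalone sketch
of the address arithmetic behind PCPU_STACK_VIRT() as introduced in the
config.h hunk below.  It assumes 4k pages, STACK_ORDER == 3, and that PML4
slot 260 sign-extends to 0xffff820000000000; the helper names are
hypothetical and only mirror the macros:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants; they mirror the macros in the patch for illustration. */
#define PAGE_SHIFT            12
#define PAGE_SIZE             (UINT64_C(1) << PAGE_SHIFT)
#define PAGETABLE_ORDER       9
#define PML4_ENTRY_BYTES      (UINT64_C(1) << 39)
#define STACK_ORDER           3 /* assumption: 8 pages per CPU stack */
#define PERDOMAIN_VIRT_START  UINT64_C(0xffff820000000000) /* PML4 slot 260 */
#define PERDOMAIN_SLOT_MBYTES (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
#define PCPU_STACK_SLOT       3

/* Start of per-domain slot 'slot'; each slot covers one L3 entry (1GiB). */
static uint64_t perdomain_virt_slot(unsigned int slot)
{
    return PERDOMAIN_VIRT_START + slot * (PERDOMAIN_SLOT_MBYTES << 20);
}

/* Linear address of the stack of 'cpu' inside the per-CPU stack slot. */
static uint64_t pcpu_stack_virt(unsigned int cpu)
{
    return perdomain_virt_slot(PCPU_STACK_SLOT) +
           (((uint64_t)cpu << STACK_ORDER) * PAGE_SIZE);
}
```

Under these assumptions pcpu_stack_virt(0) is 0xffff8200c0000000 and each
further CPU is offset by 32k, so the address of a given CPU stack does not
depend on which domain page-tables are loaded.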

When calling UEFI runtime methods the current per-domain slot needs to be added
to the EFI L4, so that the stack is available in UEFI.

Finally, some users of callfunc IPIs pass parameters from the stack, so when
handling a callfunc IPI the stack of the caller CPU is mapped into the address
space of the CPU handling the IPI.  This needs further work to use a bounce
buffer in order to avoid having to map remote CPU stacks.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
There's also further work required in order to avoid mapping remote stacks
when handling callfunc IPIs.
---
 docs/misc/xen-command-line.pandoc    |  5 +-
 xen/arch/x86/domain.c                | 30 ++++++++++++
 xen/arch/x86/include/asm/config.h    | 10 +++-
 xen/arch/x86/include/asm/current.h   |  5 ++
 xen/arch/x86/include/asm/domain.h    |  3 ++
 xen/arch/x86/include/asm/mm.h        |  2 +-
 xen/arch/x86/include/asm/smp.h       | 12 +++++
 xen/arch/x86/include/asm/spec_ctrl.h |  1 +
 xen/arch/x86/mm.c                    | 69 ++++++++++++++++++++++------
 xen/arch/x86/setup.c                 | 32 ++++++++++---
 xen/arch/x86/smp.c                   | 39 ++++++++++++++++
 xen/arch/x86/smpboot.c               | 20 +++++++-
 xen/arch/x86/spec_ctrl.c             | 67 +++++++++++++++++++++++----
 xen/arch/x86/traps.c                 |  8 +++-
 xen/common/smp.c                     | 10 ++++
 xen/common/stop_machine.c            | 10 ++++
 xen/include/xen/smp.h                |  8 ++++
 17 files changed, 295 insertions(+), 36 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 3c1ad7b5fe7d..e7828d092098 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -204,7 +204,7 @@ to appropriate auditing by Xen.  Argo is disabled by default.
 
 ### asi (x86)
 > `= List of [ <bool>, {pv,hvm}=<bool>,
-               {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]`
+               {vcpu-pt,cpu-stack}=<bool>|{pv,hvm}=<bool> ]`
 
 Offers control over whether the hypervisor will engage in Address Space
 Isolation, by not having potentially sensitive information permanently mapped
@@ -221,6 +221,9 @@ meant to be used for debugging purposes only.**
 * `vcpu-pt` ensure each vCPU uses a unique top-level page-table and setup a
   virtual address space region to map memory on a per-vCPU basis.
 
+* `cpu-stack` prevent CPUs from having permanent mappings of stacks different
+  than their own.  Depends on the `vcpu-pt` option.
+
 ### asid (x86)
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 6e1f622f7385..ac6332266e95 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -563,6 +563,26 @@ int arch_vcpu_create(struct vcpu *v)
     if ( rc )
         return rc;
 
+    if ( opt_cpu_stack_hvm || opt_cpu_stack_pv )
+    {
+        if ( is_idle_vcpu(v) || d->arch.cpu_stack )
+            create_perdomain_mapping(v, PCPU_STACK_VIRT(0),
+                                     nr_cpu_ids << STACK_ORDER, false);
+        else if ( !v->vcpu_id )
+        {
+            l3_pgentry_t *idle_perdomain =
+                __map_domain_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg);
+            l3_pgentry_t *guest_perdomain =
+                __map_domain_page(d->arch.perdomain_l3_pg);
+
+            l3e_write(&guest_perdomain[PCPU_STACK_SLOT],
+                      idle_perdomain[PCPU_STACK_SLOT]);
+
+            unmap_domain_page(guest_perdomain);
+            unmap_domain_page(idle_perdomain);
+        }
+    }
+
     rc = mapcache_vcpu_init(v);
     if ( rc )
         return rc;
@@ -2031,6 +2051,16 @@ static void __context_switch(struct vcpu *n)
         }
         vcpu_restore_fpu_nonlazy(n, false);
         nd->arch.ctxt_switch->to(n);
+        if ( nd->arch.cpu_stack )
+        {
+            /*
+             * Tear down previous stack mappings and map current pCPU stack.
+             * This is safe because not yet running on 'n' page-tables.
+             */
+            destroy_perdomain_mapping(n, PCPU_STACK_VIRT(0),
+                                      nr_cpu_ids << STACK_ORDER);
+            vcpu_set_stack_mappings(n, cpu, true);
+        }
     }
 
     psr_ctxt_switch_to(nd);
diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h
index af3ff3cb8705..016d6c8b21a9 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -168,7 +168,7 @@
 /* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
 #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS         3
+#define PERDOMAIN_SLOTS         4
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
                                  (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
@@ -288,6 +288,14 @@ extern unsigned long xen_phys_start;
 #define ARG_XLAT_START(v)        \
     (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
 
+/* Per-CPU stacks area when using ASI. */
+#define PCPU_STACK_SLOT         3
+#define PCPU_STACK_VIRT_START   PERDOMAIN_VIRT_SLOT(PCPU_STACK_SLOT)
+#define PCPU_STACK_VIRT_END     (PCPU_STACK_VIRT_START + \
+                                 (PERDOMAIN_SLOT_MBYTES << 20))
+#define PCPU_STACK_VIRT(cpu)    (PCPU_STACK_VIRT_START + \
+                                 (cpu << STACK_ORDER) * PAGE_SIZE)
+
 #define ELFSIZE 64
 
 #define ARCH_CRASH_SAVE_VMCOREINFO
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index bcec328c9875..4a9776f87a7a 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -24,6 +24,11 @@
  * 0 - IST Shadow Stacks (4x 1k, read-only)
  */
 
+static inline bool is_shstk_slot(unsigned int i)
+{
+    return (i == 0 || i == PRIMARY_SHSTK_SLOT);
+}
+
 /*
  * Identify which stack page the stack pointer is on.  Returns an index
  * as per the comment above.
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index a3c75e323cde..f83d2860c0b4 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -465,6 +465,9 @@ struct arch_domain
     /* Use a per-vCPU root pt, and switch per-domain slot to per-vCPU. */
     bool vcpu_pt;
 
+    /* Use per-CPU mapped stacks. */
+    bool cpu_stack;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
index f79d1594fde4..77f31685fd95 100644
--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -519,7 +519,7 @@ extern struct rangeset *mmio_ro_ranges;
 #define compat_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20))
 #define compat_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20))
 
-void memguard_guard_stack(void *p);
+void memguard_guard_stack(void *p, unsigned int cpu);
 void memguard_unguard_stack(void *p);
 
 /*
diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h
index c8c79601343d..a356f0bf0a61 100644
--- a/xen/arch/x86/include/asm/smp.h
+++ b/xen/arch/x86/include/asm/smp.h
@@ -79,6 +79,18 @@ extern bool unaccounted_cpus;
 
 void *cpu_alloc_stack(unsigned int cpu);
 
+/*
+ * Setup the per-CPU area stack mappings.
+ *
+ * @v:         vCPU where the mappings are to appear.
+ * @stack_cpu: CPU whose stacks should be mapped.
+ * @map_shstk: create mappings for shadow stack regions.
+ */
+void vcpu_set_stack_mappings(const struct vcpu *v, unsigned int stack_cpu,
+                             bool map_shstk);
+
+#define HAS_ARCH_SMP_CALLFUNC_PREAMBLE
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index c58afbaab671..c8943e81befa 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -89,6 +89,7 @@ extern uint8_t default_scf;
 extern int8_t opt_xpti_hwdom, opt_xpti_domu;
 
 extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm;
+extern int8_t opt_cpu_stack_pv, opt_cpu_stack_hwdom, opt_cpu_stack_hvm;
 
 extern bool cpu_has_bug_l1tf;
 extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3a637e508ff3..22ee3170b86d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -87,6 +87,7 @@
  * doing the final put_page(), and remove it from the iommu if so.
  */
 
+#include <xen/cpu.h>
 #include <xen/init.h>
 #include <xen/ioreq.h>
 #include <xen/kernel.h>
@@ -6424,8 +6425,10 @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va,
     return rc;
 }
 
-void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
-                                mfn_t *mfn, unsigned long nr)
+static void populate_perdomain_mapping_flags(const struct vcpu *v,
+                                             unsigned long va, mfn_t *mfn,
+                                             unsigned long nr,
+                                             unsigned int flags)
 {
     l1_pgentry_t *l1tab = NULL, *pl1e;
     const l3_pgentry_t *l3tab;
@@ -6454,7 +6457,7 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
                 ASSERT_UNREACHABLE();
                 free_domheap_page(l1e_get_page(*pl1e));
             }
-            l1e_write(pl1e, l1e_from_mfn(mfn[i], __PAGE_HYPERVISOR_RW));
+            l1e_write(pl1e, l1e_from_mfn(mfn[i], flags));
         }
 
         return;
@@ -6505,7 +6508,7 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
             free_domheap_page(l1e_get_page(*pl1e));
         }
 
-        l1e_write(pl1e, l1e_from_mfn(*mfn, __PAGE_HYPERVISOR_RW));
+        l1e_write(pl1e, l1e_from_mfn(*mfn, flags));
     }
 
     unmap_domain_page(l1tab);
@@ -6513,6 +6516,31 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
     unmap_domain_page(l3tab);
 }
 
+void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
+                                mfn_t *mfn, unsigned long nr)
+{
+    populate_perdomain_mapping_flags(v, va, mfn, nr, __PAGE_HYPERVISOR_RW);
+}
+
+void vcpu_set_stack_mappings(const struct vcpu *v, unsigned int stack_cpu,
+                             bool map_shstk)
+{
+    unsigned int i;
+
+    for ( i = 0; i < (1U << STACK_ORDER); i++ )
+    {
+        unsigned int flags = is_shstk_slot(i) ? __PAGE_HYPERVISOR_SHSTK
+                                              : __PAGE_HYPERVISOR_RW;
+        mfn_t mfn = virt_to_mfn(stack_base[stack_cpu] + i * PAGE_SIZE);
+
+        if ( is_shstk_slot(i) && !map_shstk )
+            continue;
+
+        populate_perdomain_mapping_flags(v,
+            PCPU_STACK_VIRT(stack_cpu) + i * PAGE_SIZE, &mfn, 1, flags);
+    }
+}
+
 void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
                                unsigned int nr)
 {
@@ -6599,7 +6627,12 @@ void free_perdomain_mappings(struct vcpu *v)
     l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
                                               : d->arch.perdomain_l3_pg);
 
-    for ( i = 0; i < PERDOMAIN_SLOTS; ++i)
+    for ( i = 0; i < PERDOMAIN_SLOTS; ++i )
+    {
+        if ( i == PCPU_STACK_SLOT && !d->arch.cpu_stack )
+            /* Without ASI the stack L3e is shared with the idle page-tables. */
+            continue;
+
         if ( l3e_get_flags(l3tab[i]) & _PAGE_PRESENT )
         {
             struct page_info *l2pg = l3e_get_page(l3tab[i]);
@@ -6629,6 +6662,7 @@ void free_perdomain_mappings(struct vcpu *v)
             unmap_domain_page(l2tab);
             free_domheap_page(l2pg);
         }
+    }
 
     unmap_domain_page(l3tab);
     free_domheap_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
@@ -6637,31 +6671,40 @@ void free_perdomain_mappings(struct vcpu *v)
     v->arch.pervcpu_l3_pg = NULL;
 }
 
-static void write_sss_token(unsigned long *ptr)
+static void write_sss_token(unsigned long *ptr, unsigned long va)
 {
     /*
      * A supervisor shadow stack token is its own linear address, with the
      * busy bit (0) clear.
      */
-    *ptr = (unsigned long)ptr;
+    *ptr = va;
 }
 
-void memguard_guard_stack(void *p)
+void memguard_guard_stack(void *p, unsigned int cpu)
 {
+    unsigned long va =
+        (opt_cpu_stack_hvm || opt_cpu_stack_pv) ? PCPU_STACK_VIRT(cpu)
+                                                : (unsigned long)p;
+
     /* IST Shadow stacks.  4x 1k in stack page 0. */
     if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
     {
-        write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_DB  * IST_SHSTK_SIZE) - 8);
-        write_sss_token(p + (IST_DF  * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8,
+                        va + (IST_MCE * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8,
+                        va + (IST_NMI * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_DB  * IST_SHSTK_SIZE) - 8,
+                        va + (IST_DB  * IST_SHSTK_SIZE) - 8);
+        write_sss_token(p + (IST_DF  * IST_SHSTK_SIZE) - 8,
+                        va + (IST_DF  * IST_SHSTK_SIZE) - 8);
     }
     map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
 
     /* Primary Shadow Stack.  1x 4k in stack page 5. */
     p += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
+    va += PRIMARY_SHSTK_SLOT * PAGE_SIZE;
     if ( IS_ENABLED(CONFIG_XEN_SHSTK) )
-        write_sss_token(p + PAGE_SIZE - 8);
+        write_sss_token(p + PAGE_SIZE - 8, va + PAGE_SIZE - 8);
 
     map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK);
 }
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 8ebe5a9443f3..d0b2c986962a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -402,6 +402,11 @@ static void __init init_idle_domain(void)
     scheduler_init();
     set_current(idle_vcpu[0]);
     this_cpu(curr_vcpu) = current;
+    if ( opt_cpu_stack_hvm || opt_cpu_stack_pv )
+        /* Set per-domain slot in the idle page-tables to access stack mappings. */
+        l4e_write(&idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)],
+                  l4e_from_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg,
+                                __PAGE_HYPERVISOR_RW));
 }
 
 void srat_detect_node(int cpu)
@@ -896,8 +901,6 @@ static void __init noreturn reinit_bsp_stack(void)
     /* Update SYSCALL trampolines */
     percpu_traps_init();
 
-    stack_base[0] = stack;
-
     rc = setup_cpu_root_pgt(0);
     if ( rc )
         panic("Error %d setting up PV root page table\n", rc);
@@ -1864,10 +1867,6 @@ void asmlinkage __init noreturn __start_xen(void)
 
     system_state = SYS_STATE_boot;
 
-    bsp_stack = cpu_alloc_stack(0);
-    if ( !bsp_stack )
-        panic("No memory for BSP stack\n");
-
     console_init_ring();
     vesa_init();
 
@@ -2050,6 +2049,16 @@ void asmlinkage __init noreturn __start_xen(void)
 
     alternative_branches();
 
+    /*
+     * Alloc the BSP stack closer to the point where the AP ones also get
+     * allocated - and after the speculation mitigations have been initialized.
+     * In order to set up the shadow stack token correctly Xen needs to know
+     * whether per-CPU mapped stacks are being used.
+     */
+    bsp_stack = cpu_alloc_stack(0);
+    if ( !bsp_stack )
+        panic("No memory for BSP stack\n");
+
     /*
      * NB: when running as a PV shim VCPUOP_up/down is wired to the shim
      * physical cpu_add/remove functions, so launch the guest with only
@@ -2155,8 +2164,17 @@ void asmlinkage __init noreturn __start_xen(void)
         info->last_spec_ctrl = default_xen_spec_ctrl;
     }
 
+    stack_base[0] = bsp_stack;
+
     /* Copy the cpu info block, and move onto the BSP stack. */
-    bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack);
+    if ( opt_cpu_stack_hvm || opt_cpu_stack_pv )
+    {
+        vcpu_set_stack_mappings(idle_vcpu[0], 0, true);
+        bsp_info = get_cpu_info_from_stack(PCPU_STACK_VIRT(0));
+    }
+    else
+        bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack);
+
     *bsp_info = *info;
 
     asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" ::
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 02a6ed7593f3..1b11017d5722 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -9,6 +9,7 @@
  */
 
 #include <xen/cpu.h>
+#include <xen/efi.h>
 #include <xen/irq.h>
 #include <xen/sched.h>
 #include <xen/delay.h>
@@ -27,6 +28,8 @@
 #include <asm/hpet.h>
 #include <asm/setup.h>
 
+#include <asm/spec_ctrl.h>
+
 /* Helper functions to prepare APIC register values. */
 static unsigned int prepare_ICR(unsigned int shortcut, int vector)
 {
@@ -435,3 +438,39 @@ long cf_check cpu_down_helper(void *data)
         ret = cpu_down(cpu);
     return ret;
 }
+
+void arch_smp_pre_callfunc(unsigned int cpu)
+{
+    if ( !opt_cpu_stack_hvm && !opt_cpu_stack_pv )
+        /*
+         * Avoid the unconditional sync_local_execstate() call below if ASI is
+         * not enabled for any domain.
+         */
+        return;
+
+    /*
+     * Sync execution state, so that the page-tables cannot change while
+     * creating or destroying the stack mappings.
+     */
+    sync_local_execstate();
+    if ( cpu == smp_processor_id() || !current->domain->arch.cpu_stack ||
+         /* EFI page-tables have all pCPU stacks mapped. */
+         efi_rs_using_pgtables() )
+        return;
+
+    vcpu_set_stack_mappings(current, cpu, false);
+}
+
+void arch_smp_post_callfunc(unsigned int cpu)
+{
+    if ( cpu == smp_processor_id() || !current->domain->arch.cpu_stack ||
+         /* EFI page-tables have all pCPU stacks mapped. */
+         efi_rs_using_pgtables() )
+        return;
+
+    ASSERT(current == this_cpu(curr_vcpu));
+    destroy_perdomain_mapping(current, PCPU_STACK_VIRT(cpu),
+                              (1U << STACK_ORDER));
+
+    flush_area_local((void *)PCPU_STACK_VIRT(cpu), FLUSH_ORDER(STACK_ORDER));
+}
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index a740a6402272..515ab3cb9c75 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -582,7 +582,21 @@ static int do_boot_cpu(int apicid, int cpu)
         printk("Booting processor %d/%d eip %lx\n",
                cpu, apicid, start_eip);
 
-    stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
+    if ( opt_cpu_stack_hvm || opt_cpu_stack_pv )
+    {
+        /*
+         * Uniformly run with the stack mappings in the per-domain area if ASI
+         * is enabled for any domain type.
+         */
+        vcpu_set_stack_mappings(idle_vcpu[cpu], cpu, true);
+
+        ASSERT(IS_ALIGNED(PCPU_STACK_VIRT(cpu), STACK_SIZE));
+
+        stack_start = (void *)PCPU_STACK_VIRT(cpu) + STACK_SIZE -
+                      sizeof(struct cpu_info);
+    }
+    else
+        stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info);
 
     /* This grunge runs the startup process for the targeted processor. */
 
@@ -1030,7 +1044,7 @@ void *cpu_alloc_stack(unsigned int cpu)
     stack = alloc_xenheap_pages(STACK_ORDER, memflags);
 
     if ( stack )
-        memguard_guard_stack(stack);
+        memguard_guard_stack(stack, cpu);
 
     return stack;
 }
@@ -1146,6 +1160,8 @@ static struct notifier_block cpu_smpboot_nfb = {
 
 void __init smp_prepare_cpus(void)
 {
+    BUILD_BUG_ON(PCPU_STACK_VIRT(CONFIG_NR_CPUS) > PCPU_STACK_VIRT_END);
+
     register_cpu_notifier(&cpu_smpboot_nfb);
 
     mtrr_aps_sync_begin();
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 9463a8624701..4f1e912f8057 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -89,6 +89,10 @@ bool __ro_after_init opt_bp_spec_reduce = true;
 int8_t __ro_after_init opt_vcpu_pt_hvm = -1;
 int8_t __ro_after_init opt_vcpu_pt_hwdom = -1;
 int8_t __ro_after_init opt_vcpu_pt_pv = -1;
+/* Per-CPU stacks. */
+int8_t __ro_after_init opt_cpu_stack_hvm = -1;
+int8_t __ro_after_init opt_cpu_stack_hwdom = -1;
+int8_t __ro_after_init opt_cpu_stack_pv = -1;
 
 static int __init cf_check parse_spec_ctrl(const char *s)
 {
@@ -395,6 +399,7 @@ static __init void xpti_init_default(void)
         printk(XENLOG_ERR
                "XPTI incompatible with per-vCPU page-tables, disabling ASI\n");
         opt_vcpu_pt_pv = 0;
+        opt_cpu_stack_pv = 0;
     }
     if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) ||
          cpu_has_rdcl_no )
@@ -507,7 +512,10 @@ static int __init cf_check parse_asi(const char *s)
 
     /* Interpret 'asi' alone in its positive boolean form. */
     if ( *s == '\0' )
+    {
         opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1;
+        opt_cpu_stack_pv = opt_cpu_stack_hwdom = opt_cpu_stack_hvm = 1;
+    }
 
     do {
         ss = strchr(s, ',');
@@ -520,13 +528,14 @@ static int __init cf_check parse_asi(const char *s)
         case 0:
         case 1:
             opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val;
+            opt_cpu_stack_pv = opt_cpu_stack_hvm = opt_cpu_stack_hwdom = val;
             break;
 
         default:
             if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                opt_vcpu_pt_pv = val;
+                opt_cpu_stack_pv = opt_vcpu_pt_pv = val;
             else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                opt_vcpu_pt_hvm = val;
+                opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val;
             else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 )
             {
                 switch ( val )
@@ -548,6 +557,28 @@ static int __init cf_check parse_asi(const char *s)
                     break;
                 }
             }
+            else if ( (val = parse_boolean("cpu-stack", s, ss)) != -1 )
+            {
+                switch ( val )
+                {
+                case 1:
+                case 0:
+                    opt_cpu_stack_pv = opt_cpu_stack_hvm =
+                        opt_cpu_stack_hwdom = val;
+                    break;
+
+                case -2:
+                    s += strlen("cpu-stack=");
+                    if ( (val = parse_boolean("pv", s, ss)) >= 0 )
+                        opt_cpu_stack_pv = val;
+                    else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
+                        opt_cpu_stack_hvm = val;
+                    else
+                default:
+                        rc = -EINVAL;
+                    break;
+                }
+            }
             else if ( *s )
                 rc = -EINVAL;
             break;
@@ -556,6 +587,14 @@ static int __init cf_check parse_asi(const char *s)
         s = ss + 1;
     } while ( *ss );
 
+    /* Per-CPU stacks depend on per-vCPU mappings. */
+    if ( opt_cpu_stack_pv == 1 )
+        opt_vcpu_pt_pv = 1;
+    if ( opt_cpu_stack_hvm == 1 )
+        opt_vcpu_pt_hvm = 1;
+    if ( opt_cpu_stack_hwdom == 1 )
+        opt_vcpu_pt_hwdom = 1;
+
     return rc;
 }
 custom_param("asi", parse_asi);
@@ -752,16 +791,17 @@ static void __init print_details(enum ind_thunk thunk)
 #endif
 
 #ifdef CONFIG_HVM
-    printk("  ASI features for HVM VMs:%s%s\n",
-           opt_vcpu_pt_hvm                           ? ""               : " None",
-           opt_vcpu_pt_hvm                           ? " vCPU-PT"       : "");
+    printk("  ASI features for HVM VMs:%s%s%s\n",
+           opt_vcpu_pt_hvm || opt_cpu_stack_hvm      ? ""               : " None",
+           opt_vcpu_pt_hvm                           ? " vCPU-PT"       : "",
+           opt_cpu_stack_hvm                         ? " CPU-STACK"     : "");
 
 #endif
 #ifdef CONFIG_PV
-    printk("  ASI features for PV VMs:%s%s\n",
-           opt_vcpu_pt_pv                            ? ""               : " None",
-           opt_vcpu_pt_pv                            ? " vCPU-PT"       : "");
-
+    printk("  ASI features for PV VMs:%s%s%s\n",
+           opt_vcpu_pt_pv || opt_cpu_stack_pv        ? ""               : " None",
+           opt_vcpu_pt_pv                            ? " vCPU-PT"       : "",
+           opt_cpu_stack_pv                          ? " CPU-STACK"     : "");
 #endif
 }
 
@@ -1869,6 +1909,9 @@ void spec_ctrl_init_domain(struct domain *d)
     d->arch.vcpu_pt = is_hardware_domain(d) ? opt_vcpu_pt_hwdom
                                             : pv ? opt_vcpu_pt_pv
                                                  : opt_vcpu_pt_hvm;
+    d->arch.cpu_stack = is_hardware_domain(d) ? opt_cpu_stack_hwdom
+                                              : pv ? opt_cpu_stack_pv
+                                                   : opt_cpu_stack_hvm;
 }
 
 void __init init_speculation_mitigations(void)
@@ -2172,6 +2215,12 @@ void __init init_speculation_mitigations(void)
         opt_vcpu_pt_hwdom = 0;
     if ( opt_vcpu_pt_hvm == -1 )
         opt_vcpu_pt_hvm = 0;
+    if ( opt_cpu_stack_pv == -1 )
+        opt_cpu_stack_pv = 0;
+    if ( opt_cpu_stack_hwdom == -1 )
+        opt_cpu_stack_hwdom = 0;
+    if ( opt_cpu_stack_hvm == -1 )
+        opt_cpu_stack_hvm = 0;
 
     if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm )
         warning_add(
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index a7f6fb611c34..c80ef2268e94 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -74,6 +74,7 @@
 #include <asm/pv/trace.h>
 #include <asm/pv/mm.h>
 #include <asm/shstk.h>
+#include <asm/spec_ctrl.h>
 
 /*
  * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -609,10 +610,13 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs)
     unsigned long esp = regs->rsp;
     unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1);
     unsigned long esp_top, esp_bottom;
+    const void *stack =
+        (opt_cpu_stack_hvm || opt_cpu_stack_pv) ? (void *)PCPU_STACK_VIRT(cpu)
+                                                : stack_base[cpu];
 
-    if ( _p(curr_stack_base) != stack_base[cpu] )
+    if ( _p(curr_stack_base) != stack )
         printk("Current stack base %p differs from expected %p\n",
-               _p(curr_stack_base), stack_base[cpu]);
+               _p(curr_stack_base), stack);
 
     esp_bottom = (esp | (STACK_SIZE - 1)) + 1;
     esp_top    = esp_bottom - PRIMARY_STACK_SIZE;
diff --git a/xen/common/smp.c b/xen/common/smp.c
index a011f541f1ea..04f5aede0d3d 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -29,6 +29,7 @@ static struct call_data_struct {
     void (*func) (void *info);
     void *info;
     int wait;
+    unsigned int caller;
     cpumask_t selected;
 } call_data;
 
@@ -63,6 +64,7 @@ void on_selected_cpus(
     call_data.func = func;
     call_data.info = info;
     call_data.wait = wait;
+    call_data.caller = smp_processor_id();
 
     smp_send_call_function_mask(&call_data.selected);
 
@@ -82,6 +84,12 @@ void smp_call_function_interrupt(void)
     if ( !cpumask_test_cpu(cpu, &call_data.selected) )
         return;
 
+    /*
+     * TODO: use bounce buffers to pass callfunc data, so that when using ASI
+     * there's no need to map remote CPU stacks.
+     */
+    arch_smp_pre_callfunc(call_data.caller);
+
     irq_enter();
 
     if ( unlikely(!func) )
@@ -102,6 +110,8 @@ void smp_call_function_interrupt(void)
     }
 
     irq_exit();
+
+    arch_smp_post_callfunc(call_data.caller);
 }
 
 /*
diff --git a/xen/common/stop_machine.c b/xen/common/stop_machine.c
index 398cfd507c10..142059c36374 100644
--- a/xen/common/stop_machine.c
+++ b/xen/common/stop_machine.c
@@ -40,6 +40,7 @@ enum stopmachine_state {
 
 struct stopmachine_data {
     unsigned int nr_cpus;
+    unsigned int caller;
 
     enum stopmachine_state state;
     atomic_t done;
@@ -104,6 +105,7 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
     stopmachine_data.fn_result = 0;
     atomic_set(&stopmachine_data.done, 0);
     stopmachine_data.state = STOPMACHINE_START;
+    stopmachine_data.caller = this;
 
     smp_wmb();
 
@@ -148,6 +150,12 @@ static void cf_check stopmachine_action(void *data)
 
     BUG_ON(cpu != smp_processor_id());
 
+    /*
+     * TODO: use bounce buffers to pass callfunc data, so that when using ASI
+     * there's no need to map remote CPU stacks.
+     */
+    arch_smp_pre_callfunc(stopmachine_data.caller);
+
     smp_mb();
 
     while ( state != STOPMACHINE_EXIT )
@@ -180,6 +188,8 @@ static void cf_check stopmachine_action(void *data)
     }
 
     local_irq_enable();
+
+    arch_smp_post_callfunc(stopmachine_data.caller);
 }
 
 static int cf_check cpu_callback(
diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h
index 2ca9ff1bfcc1..a25d47e29dce 100644
--- a/xen/include/xen/smp.h
+++ b/xen/include/xen/smp.h
@@ -76,4 +76,12 @@ extern void *stack_base[NR_CPUS];
 void initialize_cpu_data(unsigned int cpu);
 int setup_cpu_root_pgt(unsigned int cpu);
 
+#ifdef HAS_ARCH_SMP_CALLFUNC_PREAMBLE
+void arch_smp_pre_callfunc(unsigned int cpu);
+void arch_smp_post_callfunc(unsigned int cpu);
+#else
+static inline void arch_smp_pre_callfunc(unsigned int cpu) {}
+static inline void arch_smp_post_callfunc(unsigned int cpu) {}
+#endif
+
 #endif /* __XEN_SMP_H__ */
-- 
2.46.0




* [PATCH v2 18/18] x86/mm: zero stack on context switch
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (16 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 17/18] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
@ 2025-01-08 14:26 ` Roger Pau Monne
  2025-01-14 16:20 ` [PATCH v2 00/18] x86: adventures in Address Space Isolation Jan Beulich
  18 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 14:26 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

With the stack mapped on a per-CPU basis there's no risk of other CPUs being
able to read the stack contents, but vCPUs running on the current pCPU could
still read stale stack contents left behind by operations of previous vCPUs.

The #DF stack is not zeroed because handling of #DF results in a panic.

The contents of the shadow stack are not cleared as part of this change.  It's
arguable that internal Xen return addresses are not guest confidential data.
At most those could be used by an attacker to figure out which paths inside of
Xen previous execution flows have taken.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Is it required to zero the stack when doing a non-lazy context switch from the
idle vCPU to the previously running vCPU?

d0v0 -> IDLE -> sync_execstate -> zero stack? -> d0v0

This is currently done in this proposal, as when running in the idle vCPU
context (iow: not lazy switched) stacks from remote pCPUs can be mapped or
tasklets executed.
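
For reference, the decision described above can be modelled in isolation.  The
following is only a sketch: struct domain_model and should_zero_stack() are
hypothetical names made up for illustration, and the only thing taken from the
patch is the condition passed to reset_stack_and_call_ind() in
context_switch(), i.e. "!lazy && nextd->arch.zero_stack".  How 'lazy' gets
computed is left to the caller here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-domain flag (d->arch.zero_stack). */
struct domain_model {
    bool zero_stack;
};

/*
 * Same condition the patch passes to reset_stack_and_call_ind(): zero the
 * primary stack only on a full (non-lazy) switch into a domain that has the
 * zero-stack option enabled.
 */
static bool should_zero_stack(bool lazy, const struct domain_model *next)
{
    return !lazy && next->zero_stack;
}

/* Walk through the cases discussed in the note above. */
static void self_check(void)
{
    const struct domain_model d0 = { .zero_stack = true };
    const struct domain_model idle = { .zero_stack = false };

    /* d0v0 -> IDLE, lazy switch: nothing is zeroed. */
    assert(!should_zero_stack(true, &idle));
    /* IDLE -> d0v0 while d0v0's state is still loaded: still lazy. */
    assert(!should_zero_stack(true, &d0));
    /* IDLE -> d0v0 after the execution state has been synced: full switch. */
    assert(should_zero_stack(false, &d0));
    /* Full switch to the idle vCPU: idle domain has zero_stack clear. */
    assert(!should_zero_stack(false, &idle));
}
```

The third case is the one the question above is about: once the state has been
synced the switch back to d0v0 is no longer lazy, so the stack gets zeroed.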
---
Changes since v1:
 - Zero the stack forward to use ERMS.
 - Only zero the IST stacks if they have been used.
 - Only zero the primary stack for full context switches.
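
The "only zero the IST stacks if they have been used" item can be sketched as
below.  This is a standalone model and not the code in the patch: struct
cpu_model, the SLOT_* indices, ist_entered() and zero_used_ist() are
hypothetical, and memset() stands in for clear_page().  The idea taken from the
patch is a per-CPU counter bumped by each IST handler (slice_mce_count and
friends) that is checked and reset at context switch time.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical slot indices; the patch uses IST_MCE, IST_NMI and IST_DB. */
enum { SLOT_MCE, SLOT_NMI, SLOT_DB, NR_SLOTS };

struct cpu_model {
    unsigned char ist[NR_SLOTS][MODEL_PAGE_SIZE]; /* one page per IST stack */
    unsigned int slice_count[NR_SLOTS];           /* uses since last switch */
};

/* Called on entry to the matching handler, like this_cpu(slice_*_count)++. */
static void ist_entered(struct cpu_model *c, unsigned int slot)
{
    c->slice_count[slot]++;
}

/*
 * Zero only the IST stacks used since the previous call, resetting their
 * counters.  Returns the number of pages cleared.
 */
static unsigned int zero_used_ist(struct cpu_model *c)
{
    unsigned int slot, cleared = 0;

    for ( slot = 0; slot < NR_SLOTS; slot++ )
    {
        if ( !c->slice_count[slot] )
            continue;

        c->slice_count[slot] = 0;
        memset(c->ist[slot], 0, MODEL_PAGE_SIZE); /* clear_page() stand-in */
        cleared++;
    }

    return cleared;
}

static void self_check(void)
{
    static struct cpu_model c;

    /* Pretend an NMI left data behind on its stack. */
    memset(c.ist[SLOT_NMI], 0xaa, MODEL_PAGE_SIZE);
    ist_entered(&c, SLOT_NMI);

    assert(zero_used_ist(&c) == 1);               /* only the NMI page */
    assert(c.ist[SLOT_NMI][0] == 0);
    assert(c.ist[SLOT_NMI][MODEL_PAGE_SIZE - 1] == 0);
    assert(zero_used_ist(&c) == 0);               /* nothing used since */
}
```

The #DF stack has no counter in this model for the reason given in the commit
message: handling #DF ends in a panic, so there is no later vCPU to leak to.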
---
 docs/misc/xen-command-line.pandoc    |  4 +-
 xen/arch/x86/cpu/mcheck/mce.c        |  4 ++
 xen/arch/x86/domain.c                | 13 ++++++-
 xen/arch/x86/include/asm/current.h   | 53 +++++++++++++++++++++++---
 xen/arch/x86/include/asm/domain.h    |  3 ++
 xen/arch/x86/include/asm/spec_ctrl.h |  1 +
 xen/arch/x86/spec_ctrl.c             | 57 ++++++++++++++++++++++++----
 xen/arch/x86/traps.c                 |  5 +++
 8 files changed, 124 insertions(+), 16 deletions(-)

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e7828d092098..9cde9e84aff2 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -204,7 +204,7 @@ to appropriate auditing by Xen.  Argo is disabled by default.
 
 ### asi (x86)
 > `= List of [ <bool>, {pv,hvm}=<bool>,
-               {vcpu-pt,cpu-stack}=<bool>|{pv,hvm}=<bool> ]`
+               {vcpu-pt,cpu-stack,zero-stack}=<bool>|{pv,hvm}=<bool> ]`
 
 Offers control over whether the hypervisor will engage in Address Space
 Isolation, by not having potentially sensitive information permanently mapped
@@ -224,6 +224,8 @@ meant to be used for debugging purposes only.**
 * `cpu-stack` prevent CPUs from having permanent mappings of stacks different
   than their own.  Depends on the `vcpu-pt` option.
 
+* `zero-stack` zero CPU stacks when context switching vCPUs.
+
 ### asid (x86)
 > `= <boolean>`
 
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
index 9028ccde5477..eaaaefe7f8ba 100644
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -92,10 +92,14 @@ struct mce_callbacks __ro_after_init mce_callbacks = {
 static const typeof(mce_callbacks.handler) __initconst_cf_clobber __used
     default_handler = unexpected_machine_check;
 
+DEFINE_PER_CPU(unsigned int, slice_mce_count);
+
 /* Call the installed machine check handler for this CPU setup. */
 
 void do_machine_check(const struct cpu_user_regs *regs)
 {
+    this_cpu(slice_mce_count)++;
+
     mce_enter();
     alternative_vcall(mce_callbacks.handler, regs);
     mce_exit();
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ac6332266e95..1ff9200eb081 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2106,6 +2106,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
     struct cpu_info *info = get_cpu_info();
     const struct domain *prevd = prev->domain, *nextd = next->domain;
     unsigned int dirty_cpu = read_atomic(&next->dirty_cpu);
+    bool lazy = false;
 
     ASSERT(prev != next);
     ASSERT(local_irq_is_enabled());
@@ -2138,6 +2139,7 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
          */
         set_current(next);
         local_irq_enable();
+        lazy = true;
     }
     else
     {
@@ -2212,12 +2214,19 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
     /* Ensure that the vcpu has an up-to-date time base. */
     update_vcpu_system_time(next);
 
-    reset_stack_and_call_ind(nextd->arch.ctxt_switch->tail);
+    /*
+     * Context switches to the idle vCPU (either lazy or full) will never
+     * trigger zeroing of the stack, because the idle domain doesn't have ASI
+     * enabled.  Switching back to the previously running vCPU after a lazy
+     * switch shouldn't zero the stack either.
+     */
+    reset_stack_and_call_ind(nextd->arch.ctxt_switch->tail,
+                             !lazy && nextd->arch.zero_stack);
 }
 
 void continue_running(struct vcpu *same)
 {
-    reset_stack_and_call_ind(same->domain->arch.ctxt_switch->tail);
+    reset_stack_and_call_ind(same->domain->arch.ctxt_switch->tail, false);
 }
 
 int __sync_local_execstate(void)
diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h
index 4a9776f87a7a..9abb4e55aeea 100644
--- a/xen/arch/x86/include/asm/current.h
+++ b/xen/arch/x86/include/asm/current.h
@@ -170,6 +170,12 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define SHADOW_STACK_WORK ""
 #endif
 
+#define ZERO_STACK                                              \
+    "test %[stk_size], %[stk_size];"                            \
+    "jz .L_skip_zeroing.%=;"                                    \
+    "rep stosb;"                                                \
+    ".L_skip_zeroing.%=:"
+
 #if __GNUC__ >= 9
 # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__)
 #else
@@ -177,13 +183,43 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
 # define ssaj_has_attr_noreturn(fn) true
 #endif
 
-#define switch_stack_and_jump(fn, instr, constr)                        \
+DECLARE_PER_CPU(unsigned int, slice_mce_count);
+DECLARE_PER_CPU(unsigned int, slice_nmi_count);
+DECLARE_PER_CPU(unsigned int, slice_db_count);
+
+#define switch_stack_and_jump(fn, instr, constr, zero_stk)              \
     ({                                                                  \
         unsigned int tmp;                                               \
+                                                                        \
         BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn));                      \
+        ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() -        \
+                          PRIMARY_STACK_SIZE +                          \
+                          sizeof(struct cpu_info), PAGE_SIZE));         \
+        if ( zero_stk )                                                 \
+        {                                                               \
+            unsigned long stack_top = get_stack_bottom() &              \
+                                      ~(STACK_SIZE - 1);                \
+                                                                        \
+            if ( this_cpu(slice_mce_count) )                            \
+            {                                                           \
+                this_cpu(slice_mce_count) = 0;                          \
+                clear_page((void *)stack_top + IST_MCE * PAGE_SIZE);    \
+            }                                                           \
+            if ( this_cpu(slice_nmi_count) )                            \
+            {                                                           \
+                this_cpu(slice_nmi_count) = 0;                          \
+                clear_page((void *)stack_top + IST_NMI * PAGE_SIZE);    \
+            }                                                           \
+            if ( this_cpu(slice_db_count) )                             \
+            {                                                           \
+                this_cpu(slice_db_count) = 0;                           \
+                clear_page((void *)stack_top + IST_DB  * PAGE_SIZE);    \
+            }                                                           \
+        }                                                               \
         __asm__ __volatile__ (                                          \
             SHADOW_STACK_WORK                                           \
             "mov %[stk], %%rsp;"                                        \
+            ZERO_STACK                                                  \
             CHECK_FOR_LIVEPATCH_WORK                                    \
             instr "[fun]"                                               \
             : [val] "=&r" (tmp),                                        \
@@ -194,19 +230,26 @@ unsigned long get_stack_dump_bottom (unsigned long sp);
               ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8),               \
               [stack_mask] "i" (STACK_SIZE - 1),                        \
               _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__,                \
-                                 __FILE__, NULL)                        \
+                                 __FILE__, NULL),                       \
+              /* For stack zeroing. */                                  \
+              "D" ((void *)guest_cpu_user_regs() -                      \
+                   PRIMARY_STACK_SIZE + sizeof(struct cpu_info)),       \
+              [stk_size] "c"                                            \
+              ((zero_stk) ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\
+                          : 0),                                         \
+              "a" (0)                                                   \
             : "memory" );                                               \
         unreachable();                                                  \
     })
 
 #define reset_stack_and_jump(fn)                                        \
-    switch_stack_and_jump(fn, "jmp %c", "i")
+    switch_stack_and_jump(fn, "jmp %c", "i", false)
 
 /* The constraint may only specify non-call-clobbered registers. */
-#define reset_stack_and_call_ind(fn)                                    \
+#define reset_stack_and_call_ind(fn, zero_stk)                          \
     ({                                                                  \
         (void)((fn) == (void (*)(void))NULL);                           \
-        switch_stack_and_jump(fn, "INDIRECT_CALL %", "b");              \
+        switch_stack_and_jump(fn, "INDIRECT_CALL %", "b", zero_stk);    \
     })
 
 /*
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index f83d2860c0b4..c2cbd73a42b4 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -468,6 +468,9 @@ struct arch_domain
     /* Use per-CPU mapped stacks. */
     bool cpu_stack;
 
+    /* Zero CPU stack on non lazy context switch. */
+    bool zero_stack;
+
     /* Emulated devices enabled bitmap. */
     uint32_t emulation_flags;
 } __cacheline_aligned;
diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h
index c8943e81befa..c335c5eca35d 100644
--- a/xen/arch/x86/include/asm/spec_ctrl.h
+++ b/xen/arch/x86/include/asm/spec_ctrl.h
@@ -90,6 +90,7 @@ extern int8_t opt_xpti_hwdom, opt_xpti_domu;
 
 extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm;
 extern int8_t opt_cpu_stack_pv, opt_cpu_stack_hwdom, opt_cpu_stack_hvm;
+extern int8_t opt_zero_stack_pv, opt_zero_stack_hwdom, opt_zero_stack_hvm;
 
 extern bool cpu_has_bug_l1tf;
 extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu;
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 4f1e912f8057..edae4b802e67 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -93,6 +93,10 @@ int8_t __ro_after_init opt_vcpu_pt_pv = -1;
 int8_t __ro_after_init opt_cpu_stack_hvm = -1;
 int8_t __ro_after_init opt_cpu_stack_hwdom = -1;
 int8_t __ro_after_init opt_cpu_stack_pv = -1;
+/* Zero CPU stacks. */
+int8_t __ro_after_init opt_zero_stack_hvm = -1;
+int8_t __ro_after_init opt_zero_stack_hwdom = -1;
+int8_t __ro_after_init opt_zero_stack_pv = -1;
 
 static int __init cf_check parse_spec_ctrl(const char *s)
 {
@@ -515,6 +519,7 @@ static int __init cf_check parse_asi(const char *s)
     {
         opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1;
         opt_cpu_stack_pv = opt_cpu_stack_hwdom = opt_cpu_stack_hvm = 1;
+        opt_zero_stack_pv = opt_zero_stack_hvm = opt_zero_stack_hwdom = 1;
     }
 
     do {
@@ -529,13 +534,14 @@ static int __init cf_check parse_asi(const char *s)
         case 1:
             opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val;
             opt_cpu_stack_pv = opt_cpu_stack_hvm = opt_cpu_stack_hwdom = val;
+            opt_zero_stack_pv = opt_zero_stack_hvm = opt_zero_stack_hwdom = val;
             break;
 
         default:
             if ( (val = parse_boolean("pv", s, ss)) >= 0 )
-                opt_cpu_stack_pv = opt_vcpu_pt_pv = val;
+                opt_zero_stack_pv = opt_cpu_stack_pv = opt_vcpu_pt_pv = val;
             else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
-                opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val;
+                opt_zero_stack_hvm = opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val;
             else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 )
             {
                 switch ( val )
@@ -579,6 +585,28 @@ static int __init cf_check parse_asi(const char *s)
                     break;
                 }
             }
+            else if ( (val = parse_boolean("zero-stack", s, ss)) != -1 )
+            {
+                switch ( val )
+                {
+                case 1:
+                case 0:
+                    opt_zero_stack_pv = opt_zero_stack_hvm =
+                        opt_zero_stack_hwdom = val;
+                    break;
+
+                case -2:
+                    s += strlen("zero-stack=");
+                    if ( (val = parse_boolean("pv", s, ss)) >= 0 )
+                        opt_zero_stack_pv = val;
+                    else if ( (val = parse_boolean("hvm", s, ss)) >= 0 )
+                        opt_zero_stack_hvm = val;
+                    else
+                default:
+                        rc = -EINVAL;
+                    break;
+                }
+            }
             else if ( *s )
                 rc = -EINVAL;
             break;
@@ -791,17 +819,21 @@ static void __init print_details(enum ind_thunk thunk)
 #endif
 
 #ifdef CONFIG_HVM
-    printk("  ASI features for HVM VMs:%s%s%s\n",
-           opt_vcpu_pt_hvm || opt_cpu_stack_hvm      ? ""               : " None",
+    printk("  ASI features for HVM VMs:%s%s%s%s\n",
+           opt_vcpu_pt_hvm || opt_cpu_stack_hvm ||
+           opt_zero_stack_hvm                        ? ""               : " None",
            opt_vcpu_pt_hvm                           ? " vCPU-PT"       : "",
-           opt_cpu_stack_hvm                         ? " CPU-STACK"     : "");
+           opt_cpu_stack_hvm                         ? " CPU-STACK"     : "",
+           opt_zero_stack_hvm                        ? " ZERO-STACK"    : "");
 
 #endif
 #ifdef CONFIG_PV
-    printk("  ASI features for PV VMs:%s%s%s\n",
-           opt_vcpu_pt_pv || opt_cpu_stack_pv        ? ""               : " None",
+    printk("  ASI features for PV VMs:%s%s%s%s\n",
+           opt_vcpu_pt_pv || opt_cpu_stack_pv ||
+           opt_zero_stack_pv                         ? ""               : " None",
            opt_vcpu_pt_pv                            ? " vCPU-PT"       : "",
-           opt_cpu_stack_pv                          ? " CPU-STACK"     : "");
+           opt_cpu_stack_pv                          ? " CPU-STACK"     : "",
+           opt_zero_stack_pv                         ? " ZERO-STACK"    : "");
 #endif
 }
 
@@ -1912,6 +1944,9 @@ void spec_ctrl_init_domain(struct domain *d)
     d->arch.cpu_stack = is_hardware_domain(d) ? opt_cpu_stack_hwdom
                                               : pv ? opt_cpu_stack_pv
                                                    : opt_cpu_stack_hvm;
+    d->arch.zero_stack = is_hardware_domain(d) ? opt_zero_stack_hwdom
+                                               : pv ? opt_zero_stack_pv
+                                                    : opt_zero_stack_hvm;
 }
 
 void __init init_speculation_mitigations(void)
@@ -2221,6 +2256,12 @@ void __init init_speculation_mitigations(void)
         opt_cpu_stack_hwdom = 0;
     if ( opt_cpu_stack_hvm == -1 )
         opt_cpu_stack_hvm = 0;
+    if ( opt_zero_stack_pv == -1 )
+        opt_zero_stack_pv = 0;
+    if ( opt_zero_stack_hwdom == -1 )
+        opt_zero_stack_hwdom = 0;
+    if ( opt_zero_stack_hvm == -1 )
+        opt_zero_stack_hvm = 0;
 
     if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm )
         warning_add(
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index c80ef2268e94..2aa53550e8e6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -1792,6 +1792,7 @@ static void unknown_nmi_error(const struct cpu_user_regs *regs,
 static nmi_callback_t *__read_mostly nmi_callback;
 
 DEFINE_PER_CPU(unsigned int, nmi_count);
+DEFINE_PER_CPU(unsigned int, slice_nmi_count);
 
 void do_nmi(const struct cpu_user_regs *regs)
 {
@@ -1801,6 +1802,7 @@ void do_nmi(const struct cpu_user_regs *regs)
     bool handle_unknown = false;
 
     this_cpu(nmi_count)++;
+    this_cpu(slice_nmi_count)++;
     nmi_enter();
 
     /*
@@ -1919,6 +1921,8 @@ void asmlinkage do_device_not_available(struct cpu_user_regs *regs)
 
 void nocall sysenter_eflags_saved(void);
 
+DEFINE_PER_CPU(unsigned int, slice_db_count);
+
 void asmlinkage do_debug(struct cpu_user_regs *regs)
 {
     unsigned long dr6;
@@ -1927,6 +1931,7 @@ void asmlinkage do_debug(struct cpu_user_regs *regs)
     /* Stash dr6 as early as possible. */
     dr6 = read_debugreg(6);
 
+    this_cpu(slice_db_count)++;
     /*
      * At the time of writing (March 2018), on the subject of %dr6:
      *
-- 
2.46.0




* [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
  2025-01-08 14:26 ` [PATCH v2 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping() Roger Pau Monne
@ 2025-01-08 15:11   ` Roger Pau Monne
  2025-01-09 10:25     ` Alejandro Vallejo
  2025-01-14 15:30     ` Jan Beulich
  0 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monne @ 2025-01-08 15:11 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
mappings being stashed in the domain structure, and thus such mappings being
modified by merely updating the L1 entries.

Switch both pv_{set,destroy}_gdt() to instead use
{populate,destroy}_perdomain_mapping().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
 - Do not change ordering setup of arch_set_info_guest().
---
 xen/arch/x86/pv/descriptor-tables.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
index 02647a2c5047..5a79f022ce13 100644
--- a/xen/arch/x86/pv/descriptor-tables.c
+++ b/xen/arch/x86/pv/descriptor-tables.c
@@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v)
 
 void pv_destroy_gdt(struct vcpu *v)
 {
-    l1_pgentry_t *pl1e = pv_gdt_ptes(v);
-    mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
-    l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
     unsigned int i;
 
     ASSERT(v == current || !vcpu_cpu_dirty(v));
 
-    v->arch.pv.gdt_ents = 0;
-    for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
-    {
-        mfn_t mfn = l1e_get_mfn(pl1e[i]);
+    if ( v->arch.cr3 )
+        destroy_perdomain_mapping(v, GDT_VIRT_START(v),
+                                  ARRAY_SIZE(v->arch.pv.gdt_frames));
 
-        if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) &&
-             !mfn_eq(mfn, zero_mfn) )
-            put_page_and_type(mfn_to_page(mfn));
+    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++)
+    {
+        if ( !v->arch.pv.gdt_frames[i] )
+            break;
 
-        l1e_write(&pl1e[i], zero_l1e);
+        put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i])));
         v->arch.pv.gdt_frames[i] = 0;
     }
 }
@@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
                unsigned int entries)
 {
     struct domain *d = v->domain;
-    l1_pgentry_t *pl1e;
     unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
+    mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)];
 
     ASSERT(v == current || !vcpu_cpu_dirty(v));
 
@@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
         if ( !mfn_valid(mfn) ||
              !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) )
             goto fail;
+
+        mfns[i] = mfn;
     }
 
     /* Tear down the old GDT. */
@@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
 
     /* Install the new GDT. */
     v->arch.pv.gdt_ents = entries;
-    pl1e = pv_gdt_ptes(v);
     for ( i = 0; i < nr_frames; i++ )
-    {
         v->arch.pv.gdt_frames[i] = frames[i];
-        l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
-    }
+    populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames);
 
     return 0;
 
-- 
2.46.0




* Re: [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping()
  2025-01-08 14:26 ` [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping() Roger Pau Monne
@ 2025-01-08 15:59   ` Alejandro Vallejo
  0 siblings, 0 replies; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-08 15:59 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

Hi,

I noticed the same duplication while moving mapcache initialization code, but
didn't want to touch it while doing that. Good to see these two lines gone.

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> The destroy_perdomain_mapping() call in the hvm_domain_initialise() fail path
> is useless.  destroy_perdomain_mapping() called with nr == 0 is effectively a
> no op, as there are not entries torn down.  Remove the call, as
> arch_domain_create() already calls free_perdomain_mappings() on failure.
>
> There's also a call to destroy_perdomain_mapping() in pv_domain_destroy() which
> is also not needed.  arch_domain_destroy() will already unconditionally call
> free_perdomain_mappings(), which does the same as destroy_perdomain_mapping(),
> plus additionally frees the page table structures.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/hvm/hvm.c   | 1 -
>  xen/arch/x86/pv/domain.c | 3 ---
>  2 files changed, 4 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index 922c9b3af64d..70fdddae583d 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -708,7 +708,6 @@ int hvm_domain_initialise(struct domain *d,
>      XFREE(d->arch.hvm.irq);
>   fail0:
>      hvm_destroy_cacheattr_region_list(d);
> -    destroy_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0);
>   fail:
>      hvm_domain_relinquish_resources(d);
>      XFREE(d->arch.hvm.io_handler);
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 7aef628f55be..bc7cd0c62f0e 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -345,9 +345,6 @@ void pv_domain_destroy(struct domain *d)
>  {
>      pv_l1tf_domain_destroy(d);
>  
> -    destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
> -                              GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
> -
>      XFREE(d->arch.pv.cpuidmasks);
>  
>      FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab);

  Reviewed-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>

Cheers,
Alejandro



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-08 14:26 ` [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch Roger Pau Monne
@ 2025-01-08 16:26   ` Alejandro Vallejo
  2025-01-09 17:39     ` Roger Pau Monné
  2025-01-09  8:59   ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-08 16:26 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

This is a net gain even without ASI. Having "current" hold the previous vCPU on
__context_switch() makes it _a lot_ easier to follow the lazy switch path.

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> On x86 Xen will perform lazy context switches to the idle vCPU, where the
> previously running vCPU context is not overwritten, and only current is updated
> to point to the idle vCPU.  The state is then disjunct between current and
> curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> whose context is loaded on the pCPU.
>
> While on that lazy context switched state, certain calls (like
> map_domain_page()) will trigger a full synchronization of the pCPU state by
> forcing a context switch.  Note however how calling any of such functions
> inside the context switch code itself is very likely to trigger an infinite
> recursion loop.
>
> Attempt to limit the window where curr_vcpu != current in the context switch
> code, as to prevent and infinite recursion loop around sync_local_execstate().
>
> This is required for using map_domain_page() in the vCPU context switch code,
> otherwise using map_domain_page() in that context ends up in a recursive
> sync_local_execstate() loop:
>
> map_domain_page() -> sync_local_execstate() -> map_domain_page() -> ...
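
To make the two states concrete, here is a toy, single-CPU model of the lazy
switch logic as reworked by this patch.  It is only a sketch: struct vcpu is
reduced to two fields, current_vcpu stands in for "current", the names
full_switch() and full_switches are made up for illustration, and nothing
about page tables or register state is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for Xen's struct vcpu. */
struct vcpu {
    unsigned int id;
    bool is_idle;
};

static struct vcpu idle = { 0, true };

/* Models "current": the vCPU the scheduler considers to be running. */
static struct vcpu *current_vcpu = &idle;
/* Models per-CPU curr_vcpu: the vCPU whose context is actually loaded. */
static struct vcpu *curr_vcpu = &idle;
/* Counts full (non-lazy) switches, for illustration only. */
static unsigned int full_switches;

/* Models __context_switch(n): both pointers are updated back-to-back. */
static void full_switch(struct vcpu *n)
{
    assert(curr_vcpu != n);
    assert(curr_vcpu == current_vcpu); /* mirrors ASSERT(p == current) */

    /* Saving old state and loading new page tables would happen here. */
    current_vcpu = n;
    curr_vcpu = n;
    full_switches++;
}

/* Models context_switch(): lazy when switching to idle or back to the
 * vCPU whose context is still loaded. */
static void context_switch(struct vcpu *next)
{
    if ( curr_vcpu == next || next->is_idle )
    {
        current_vcpu = next; /* lazy: loaded context is left untouched */
        return;
    }

    /* Coming from a lazy state: point current at the loaded context. */
    if ( curr_vcpu != current_vcpu )
        current_vcpu = curr_vcpu;

    full_switch(next);
}

/* Models __sync_local_execstate(): complete a pending lazy switch. */
static bool sync_local_execstate(void)
{
    bool switch_required = (curr_vcpu != current_vcpu);

    if ( switch_required )
    {
        current_vcpu = curr_vcpu; /* restore current before switching */
        full_switch(&idle);
    }

    return switch_required;
}
```

In this model, switching a guest vCPU to idle leaves curr_vcpu pointing at the
guest until sync_local_execstate() forces the full switch, and the assertion
in full_switch() only holds because both callers first point current back at
the loaded context.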

More generally, it's worth mentioning that we want to establish an invariant
between a per-cpu variable (curr_vcpu) and the currently running page tables.
That way it can be used as discriminant to know which are the currently active
per-vCPU mappings.

That's essential for implementing FPU hiding as proposed here:

  https://lore.kernel.org/xen-devel/20241105143310.28301-1-alejandro.vallejo@cloud.com/

A shorter form of that should probably be mentioned also...

>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
>  - New in this version.
> ---
>  xen/arch/x86/domain.c | 58 +++++++++++++++++++++++++++++++++++--------
>  xen/arch/x86/traps.c  |  2 --
>  2 files changed, 48 insertions(+), 12 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 78a13e6812c9..1f680bf176ee 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
>      per_cpu(full_gdt_loaded, cpu) = false;
>  }
>  
> -static void __context_switch(void)
> +static void __context_switch(struct vcpu *n)
>  {
>      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
>      unsigned int          cpu = smp_processor_id();
>      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> -    struct vcpu          *n = current;
>      struct domain        *pd = p->domain, *nd = n->domain;
>  
>      ASSERT(p != n);
>      ASSERT(!vcpu_cpu_dirty(n));
> +    ASSERT(p == current);
>  
>      if ( !is_idle_domain(pd) )
>      {
> @@ -2036,6 +2036,18 @@ static void __context_switch(void)
>  
>      write_ptbase(n);
>  
> +    /*
> +     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
> +     * window where calls to mapcache_current_vcpu() during the context switch
> +     * could trigger a recursive loop.
> +     *
> +     * Do the current switch immediately after switching to the new guest
> +     * page-tables, so that current is (almost) always in sync with the
> +     * currently loaded page-tables.
> +     */
> +    set_current(n);
> +    per_cpu(curr_vcpu, cpu) = n;

... here. So we're not tempted to move these 2 far off from write_ptbase().

> +
>  #ifdef CONFIG_PV
>      /* Prefetch the VMCB if we expect to use it later in the context switch */
>      if ( using_svm() && is_pv_64bit_domain(nd) && !is_idle_domain(nd) )
> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>      if ( pd != nd )
>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> -
> -    per_cpu(curr_vcpu, cpu) = n;
>  }
>  
>  void context_switch(struct vcpu *prev, struct vcpu *next)
> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>  
>      local_irq_disable();
>  
> -    set_current(next);
> -
>      if ( (per_cpu(curr_vcpu, cpu) == next) ||
>           (is_idle_domain(nextd) && cpu_online(cpu)) )
>      {
> +        /*
> +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> +         * context switch happens if/when sync_local_execstate() is called.
> +         */
> +        set_current(next);
>          local_irq_enable();
>      }
>      else
>      {
> -        __context_switch();
> +        /*
> +         * curr_vcpu will always point to the currently loaded vCPU context, as

nit: s/will always point/always points/ ? It's an unconditional invariant,
after all.

> +         * it's not updated when doing a lazy switch to the idle vCPU.
> +         */
> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> +
> +        if ( prev_ctx != current )
> +        {
> +            /*
> +             * Doing a full context switch to a non-idle vCPU from a lazy
> +             * context switched state.  Adjust current to point to the
> +             * currently loaded vCPU context.
> +             */
> +            ASSERT(current == idle_vcpu[cpu]);
> +            ASSERT(!is_idle_vcpu(next));
> +            set_current(prev_ctx);
> +        }
> +        __context_switch(next);
>  
>          /* Re-enable interrupts before restoring state which may fault. */
>          local_irq_enable();
> @@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
>  {
>      unsigned long flags;
>      int switch_required;
> +    unsigned int cpu = smp_processor_id();
> +    struct vcpu *p;
>  
>      local_irq_save(flags);
>  
> -    switch_required = (this_cpu(curr_vcpu) != current);
> +    p = per_cpu(curr_vcpu, cpu);
> +    switch_required = (p != current);
>  
>      if ( switch_required )
>      {
> -        ASSERT(current == idle_vcpu[smp_processor_id()]);
> -        __context_switch();
> +        ASSERT(current == idle_vcpu[cpu]);
> +        /*
> +         * Restore current to the previously running vCPU, __context_switch()
> +         * will update current together with curr_vcpu.
> +         */
> +        set_current(p);
> +        __context_switch(idle_vcpu[cpu]);
>      }
>  
>      local_irq_restore(flags);
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 87b30ce4df2a..487b8c5a78c5 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>  
>  void activate_debugregs(const struct vcpu *curr)
>  {
> -    ASSERT(curr == current);
> -
>      write_debugreg(0, curr->arch.dr[0]);
>      write_debugreg(1, curr->arch.dr[1]);
>      write_debugreg(2, curr->arch.dr[2]);

Cheers,
Alejandro



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-08 14:26 ` [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch Roger Pau Monne
  2025-01-08 16:26   ` Alejandro Vallejo
@ 2025-01-09  8:59   ` Jan Beulich
  2025-01-09 17:33     ` Roger Pau Monné
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-01-09  8:59 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> On x86 Xen will perform lazy context switches to the idle vCPU, where the
> previously running vCPU context is not overwritten, and only current is updated
> to point to the idle vCPU.  The state is then disjunct between current and
> curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> whose context is loaded on the pCPU.
> 
> While on that lazy context switched state, certain calls (like
> map_domain_page()) will trigger a full synchronization of the pCPU state by
> forcing a context switch.  Note however how calling any of such functions
> inside the context switch code itself is very likely to trigger an infinite
> recursion loop.
> 
> Attempt to limit the window where curr_vcpu != current in the context switch
> code, as to prevent and infinite recursion loop around sync_local_execstate().
> 
> This is required for using map_domain_page() in the vCPU context switch code,
> otherwise using map_domain_page() in that context ends up in a recursive
> sync_local_execstate() loop:

Question is whether it's a good idea in the first place to start using
map_domain_page() from the context switch path. Surely there are possible
alternatives.

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
>      per_cpu(full_gdt_loaded, cpu) = false;
>  }
>  
> -static void __context_switch(void)
> +static void __context_switch(struct vcpu *n)
>  {
>      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
>      unsigned int          cpu = smp_processor_id();
>      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> -    struct vcpu          *n = current;
>      struct domain        *pd = p->domain, *nd = n->domain;
>  
>      ASSERT(p != n);
>      ASSERT(!vcpu_cpu_dirty(n));
> +    ASSERT(p == current);
>  
>      if ( !is_idle_domain(pd) )
>      {
> @@ -2036,6 +2036,18 @@ static void __context_switch(void)
>  
>      write_ptbase(n);
>  
> +    /*
> +     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
> +     * window where calls to mapcache_current_vcpu() during the context switch
> +     * could trigger a recursive loop.
> +     *
> +     * Do the current switch immediately after switching to the new guest
> +     * page-tables, so that current is (almost) always in sync with the
> +     * currently loaded page-tables.
> +     */
> +    set_current(n);
> +    per_cpu(curr_vcpu, cpu) = n;

The latter paragraph of the comment states something that so far wasn't intended,
and imo also shouldn't be going forward. It's curr_vcpu which wants to be in sync
with the loaded page tables. (Whether pulling ahead its updating is okay is a
separate question. All of these actions used to be very carefully placed the
way they are. Which isn't to say that I can exclude things having gone stale ...)
And yes, that has always meant that mapcache_current_vcpu()'s condition for
calling sync_local_execstate() was building upon the fact that it won't be called
from context switching contexts.

Did you consider updating that condition (evaluating curr_vcpu) instead?

> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>      if ( pd != nd )
>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> -
> -    per_cpu(curr_vcpu, cpu) = n;
>  }
>  
>  void context_switch(struct vcpu *prev, struct vcpu *next)
> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>  
>      local_irq_disable();
>  
> -    set_current(next);
> -
>      if ( (per_cpu(curr_vcpu, cpu) == next) ||
>           (is_idle_domain(nextd) && cpu_online(cpu)) )
>      {
> +        /*
> +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> +         * context switch happens if/when sync_local_execstate() is called.
> +         */
> +        set_current(next);
>          local_irq_enable();

The comment is misleading as far as the first half of the if() condition goes:
No further switching is going to happen in that case, aiui.

>      }
>      else
>      {
> -        __context_switch();
> +        /*
> +         * curr_vcpu will always point to the currently loaded vCPU context, as
> +         * it's not updated when doing a lazy switch to the idle vCPU.
> +         */
> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> +
> +        if ( prev_ctx != current )
> +        {
> +            /*
> +             * Doing a full context switch to a non-idle vCPU from a lazy
> +             * context switched state.  Adjust current to point to the
> +             * currently loaded vCPU context.
> +             */
> +            ASSERT(current == idle_vcpu[cpu]);
> +            ASSERT(!is_idle_vcpu(next));
> +            set_current(prev_ctx);

This feels wrong, as in "current" then not representing what it should represent,
for a certain time window. I may be dense, but neither comment nor description
clarifies to me why this might be needed. I can see that it's needed to please the
ASSERT() you add to __context_switch(), yet then I might ask why that assertion
is put there.

> +        }
> +        __context_switch(next);
>  
>          /* Re-enable interrupts before restoring state which may fault. */
>          local_irq_enable();
> @@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
>  {
>      unsigned long flags;
>      int switch_required;
> +    unsigned int cpu = smp_processor_id();
> +    struct vcpu *p;
>  
>      local_irq_save(flags);
>  
> -    switch_required = (this_cpu(curr_vcpu) != current);
> +    p = per_cpu(curr_vcpu, cpu);
> +    switch_required = (p != current);
>  
>      if ( switch_required )
>      {
> -        ASSERT(current == idle_vcpu[smp_processor_id()]);
> -        __context_switch();
> +        ASSERT(current == idle_vcpu[cpu]);
> +        /*
> +         * Restore current to the previously running vCPU, __context_switch()
> +         * will update current together with curr_vcpu.
> +         */
> +        set_current(p);

Similarly here.

> +        __context_switch(idle_vcpu[cpu]);
>      }
>  
>      local_irq_restore(flags);
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>  
>  void activate_debugregs(const struct vcpu *curr)
>  {
> -    ASSERT(curr == current);
> -
>      write_debugreg(0, curr->arch.dr[0]);
>      write_debugreg(1, curr->arch.dr[1]);
>      write_debugreg(2, curr->arch.dr[2]);

Why would this assertion go away? If it suddenly triggers, the parameter name
would now end up being wrong.

Jan



* Re: [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing
  2025-01-08 14:26 ` [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing Roger Pau Monne
@ 2025-01-09  9:03   ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-01-09  9:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> L1 present entries that require the underlying page to be freed have the
> _PAGE_AVAIL0 bit set, introduce a helper to unify the checking logic into a
> single place.
> 
> No functional change intended.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

The name feels longish, yet perhaps that's acceptable here.
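
For readers without the patch at hand, the check described in the commit
message could look roughly like the sketch below.  The helper name is taken
from its use in patch 04; the types, the simplified l1e_get_flags() and the
flag values (bit 0 for present, bit 9 for the first software-available bit)
are stand-ins assumed for illustration, not the actual Xen definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag values: present is bit 0, first software-available bit is 9. */
#define _PAGE_PRESENT 0x001u
#define _PAGE_AVAIL0  0x200u

/* Hypothetical stand-in for Xen's l1_pgentry_t. */
typedef struct { uint64_t l1; } l1_pgentry_t;

/* Simplified: only looks at the low 12 flag bits of the entry. */
static inline unsigned int l1e_get_flags(l1_pgentry_t e)
{
    return (unsigned int)(e.l1 & 0xfffu);
}

/* An entry needs its page freed only if it is present AND tagged AVAIL0. */
static inline bool perdomain_l1e_needs_freeing(l1_pgentry_t e)
{
    const unsigned int mask = _PAGE_PRESENT | _PAGE_AVAIL0;

    return (l1e_get_flags(e) & mask) == mask;
}
```

Comparing against the full mask, rather than testing for any bit set, is what
keeps a non-present entry with a stale AVAIL0 bit from being treated as owning
a page.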

Jan



* Re: [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-08 14:26 ` [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT Roger Pau Monne
@ 2025-01-09  9:10   ` Jan Beulich
  2025-01-10 14:15     ` Roger Pau Monné
  2025-01-09  9:55   ` Alejandro Vallejo
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-01-09  9:10 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> The current code to update the Xen part of the GDT when running a PV guest
> relies on caching the direct map address of all the L1 tables used to map the
> GDT and LDT, so that entries can be modified.
> 
> Introduce a new function that populates the per-domain region, either using the
> recursive linear mappings when the target vCPU is the current one, or by
> directly modifying the L1 table of the per-domain region.
> 
> Using such function to populate per-domain addresses drops the need to keep a
> reference to per-domain L1 tables previously used to change the per-domain
> mappings.

Well, yes. You now record MFNs instead. And you do so at the expense of about
100 lines of new code. I'm afraid I'm lacking justification for this price to
be paid.

> @@ -2219,11 +2219,9 @@ void __init trap_init(void)
>      init_ler();
>  
>      /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
> -    this_cpu(gdt_l1e) =
> -        l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
> +    this_cpu(gdt_mfn) = _mfn(virt_to_mfn(boot_gdt));
>      if ( IS_ENABLED(CONFIG_PV32) )
> -        this_cpu(compat_gdt_l1e) =
> -            l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
> +        this_cpu(compat_gdt_mfn) = _mfn(virt_to_mfn(boot_compat_gdt));

The comment's going stale this way.

Jan



* Re: [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-08 14:26 ` [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT Roger Pau Monne
  2025-01-09  9:10   ` Jan Beulich
@ 2025-01-09  9:55   ` Alejandro Vallejo
  2025-01-10 14:29     ` Roger Pau Monné
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09  9:55 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> The current code to update the Xen part of the GDT when running a PV guest
> relies on caching the direct map address of all the L1 tables used to map the
> GDT and LDT, so that entries can be modified.
>
> Introduce a new function that populates the per-domain region, either using the
> recursive linear mappings when the target vCPU is the current one, or by
> directly modifying the L1 table of the per-domain region.
>
> Using such function to populate per-domain addresses drops the need to keep a
> reference to per-domain L1 tables previously used to change the per-domain
> mappings.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/domain.c                | 11 +++-
>  xen/arch/x86/include/asm/desc.h      |  6 +-
>  xen/arch/x86/include/asm/mm.h        |  2 +
>  xen/arch/x86/include/asm/processor.h |  5 ++
>  xen/arch/x86/mm.c                    | 88 ++++++++++++++++++++++++++++
>  xen/arch/x86/smpboot.c               |  6 +-
>  xen/arch/x86/traps.c                 | 10 ++--
>  7 files changed, 113 insertions(+), 15 deletions(-)
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 1f680bf176ee..0bd0ef7e40f4 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1953,9 +1953,14 @@ static always_inline bool need_full_gdt(const struct domain *d)
>  
>  static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
>  {
> -    l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
> -              !is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu)
> -                                   : per_cpu(compat_gdt_l1e, cpu));
> +    ASSERT(v != current);

For this assert, and others below. IIUC, curr_vcpu == current when we're
properly switched. When we're idling current == idle and curr_vcpu == prev_ctx.

Granted, calling this in the middle of a lazy idle loop would be weird, but
would it make sense for PT consistency to use curr_vcpu here...

> +
> +    populate_perdomain_mapping(v,
> +                               GDT_VIRT_START(v) +
> +                               (FIRST_RESERVED_GDT_PAGE << PAGE_SHIFT),
> +                               !is_pv_32bit_vcpu(v) ? &per_cpu(gdt_mfn, cpu)
> +                                                    : &per_cpu(compat_gdt_mfn,
> +                                                               cpu), 1);
>  }
>  
>  static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
> diff --git a/xen/arch/x86/include/asm/desc.h b/xen/arch/x86/include/asm/desc.h
> index a1e0807d97ed..33981bfca588 100644
> --- a/xen/arch/x86/include/asm/desc.h
> +++ b/xen/arch/x86/include/asm/desc.h
> @@ -44,6 +44,8 @@
>  
>  #ifndef __ASSEMBLY__
>  
> +#include <xen/mm-frame.h>
> +
>  #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
>  
>  /* Fix up the RPL of a guest segment selector. */
> @@ -212,10 +214,10 @@ struct __packed desc_ptr {
>  
>  extern seg_desc_t boot_gdt[];
>  DECLARE_PER_CPU(seg_desc_t *, gdt);
> -DECLARE_PER_CPU(l1_pgentry_t, gdt_l1e);
> +DECLARE_PER_CPU(mfn_t, gdt_mfn);
>  extern seg_desc_t boot_compat_gdt[];
>  DECLARE_PER_CPU(seg_desc_t *, compat_gdt);
> -DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_l1e);
> +DECLARE_PER_CPU(mfn_t, compat_gdt_mfn);
>  DECLARE_PER_CPU(bool, full_gdt_loaded);
>  
>  static inline void lgdt(const struct desc_ptr *gdtr)
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index 6c7e66ee21ab..b50a51327b2b 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -603,6 +603,8 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
>  int create_perdomain_mapping(struct domain *d, unsigned long va,
>                               unsigned int nr, l1_pgentry_t **pl1tab,
>                               struct page_info **ppg);
> +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> +                                mfn_t *mfn, unsigned long nr);
>  void destroy_perdomain_mapping(struct domain *d, unsigned long va,
>                                 unsigned int nr);
>  void free_perdomain_mappings(struct domain *d);
> diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
> index d247ef8dd226..82ee89f736c2 100644
> --- a/xen/arch/x86/include/asm/processor.h
> +++ b/xen/arch/x86/include/asm/processor.h
> @@ -243,6 +243,11 @@ static inline unsigned long cr3_pa(unsigned long cr3)
>      return cr3 & X86_CR3_ADDR_MASK;
>  }
>  
> +static inline mfn_t cr3_mfn(unsigned long cr3)
> +{
> +    return maddr_to_mfn(cr3_pa(cr3));
> +}
> +
>  static inline unsigned int cr3_pcid(unsigned long cr3)
>  {
>      return IS_ENABLED(CONFIG_PV) ? cr3 & X86_CR3_PCID_MASK : 0;
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 3d5dd22b6c36..0abea792486c 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -6423,6 +6423,94 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>      return rc;
>  }
>  
> +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> +                                mfn_t *mfn, unsigned long nr)
> +{
> +    l1_pgentry_t *l1tab = NULL, *pl1e;
> +    const l3_pgentry_t *l3tab;
> +    const l2_pgentry_t *l2tab;
> +    struct domain *d = v->domain;
> +
> +    ASSERT(va >= PERDOMAIN_VIRT_START &&
> +           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> +    ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
> +
> +    /* Use likely to force the optimization for the fast path. */
> +    if ( likely(v == current) )

... and here? In particular I'd expect using curr_vcpu here means...

> +    {
> +        unsigned int i;
> +
> +        /* Ensure page-tables are from current (if current != curr_vcpu). */
> +        sync_local_execstate();

... this should not be needed.

> +
> +        /* Fast path: get L1 entries using the recursive linear mappings. */
> +        pl1e = &__linear_l1_table[l1_linear_offset(va)];
> +
> +        for ( i = 0; i < nr; i++, pl1e++ )
> +        {
> +            if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) )
> +            {
> +                ASSERT_UNREACHABLE();
> +                free_domheap_page(l1e_get_page(*pl1e));
> +            }
> +            l1e_write(pl1e, l1e_from_mfn(mfn[i], __PAGE_HYPERVISOR_RW));
> +        }
> +
> +        return;
> +    }
> +
> +    ASSERT(d->arch.perdomain_l3_pg);
> +    l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
> +
> +    if ( unlikely(!(l3e_get_flags(l3tab[l3_table_offset(va)]) &
> +                    _PAGE_PRESENT)) )
> +    {
> +        unmap_domain_page(l3tab);
> +        gprintk(XENLOG_ERR, "unable to map at VA %lx: L3e not present\n", va);
> +        ASSERT_UNREACHABLE();
> +        domain_crash(d);
> +
> +        return;
> +    }
> +
> +    l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]);
> +
> +    for ( ; nr--; va += PAGE_SIZE, mfn++ )
> +    {
> +        if ( !l1tab || !l1_table_offset(va) )
> +        {
> +            const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va);
> +
> +            if ( unlikely(!(l2e_get_flags(*pl2e) & _PAGE_PRESENT)) )
> +            {
> +                gprintk(XENLOG_ERR, "unable to map at VA %lx: L2e not present\n",
> +                        va);
> +                ASSERT_UNREACHABLE();
> +                domain_crash(d);
> +
> +                break;
> +            }
> +
> +            unmap_domain_page(l1tab);
> +            l1tab = map_l1t_from_l2e(*pl2e);
> +        }
> +
> +        pl1e = &l1tab[l1_table_offset(va)];
> +
> +        if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) )
> +        {
> +            ASSERT_UNREACHABLE();
> +            free_domheap_page(l1e_get_page(*pl1e));
> +        }
> +
> +        l1e_write(pl1e, l1e_from_mfn(*mfn, __PAGE_HYPERVISOR_RW));
> +    }
> +
> +    unmap_domain_page(l1tab);
> +    unmap_domain_page(l2tab);
> +    unmap_domain_page(l3tab);
> +}
> +
>  void destroy_perdomain_mapping(struct domain *d, unsigned long va,
>                                 unsigned int nr)
>  {
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index 79a79c54c304..a740a6402272 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -1059,8 +1059,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>      if ( gdt == NULL )
>          goto out;
>      per_cpu(gdt, cpu) = gdt;
> -    per_cpu(gdt_l1e, cpu) =
> -        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
> +    per_cpu(gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt));
>      memcpy(gdt, boot_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
>      BUILD_BUG_ON(NR_CPUS > 0x10000);
>      gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
> @@ -1069,8 +1068,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
>      per_cpu(compat_gdt, cpu) = gdt = alloc_xenheap_pages(0, memflags);
>      if ( gdt == NULL )
>          goto out;
> -    per_cpu(compat_gdt_l1e, cpu) =
> -        l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
> +    per_cpu(compat_gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt));
>      memcpy(gdt, boot_compat_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
>      gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
>  #endif
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 487b8c5a78c5..a7f6fb611c34 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -92,10 +92,10 @@ DEFINE_PER_CPU(uint64_t, efer);
>  static DEFINE_PER_CPU(unsigned long, last_extable_addr);
>  
>  DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, gdt);
> -DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_l1e);
> +DEFINE_PER_CPU_READ_MOSTLY(mfn_t, gdt_mfn);
>  #ifdef CONFIG_PV32
>  DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, compat_gdt);
> -DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, compat_gdt_l1e);
> +DEFINE_PER_CPU_READ_MOSTLY(mfn_t, compat_gdt_mfn);
>  #endif
>  
>  /* Master table, used by CPU0. */
> @@ -2219,11 +2219,9 @@ void __init trap_init(void)
>      init_ler();
>  
>      /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
> -    this_cpu(gdt_l1e) =
> -        l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
> +    this_cpu(gdt_mfn) = _mfn(virt_to_mfn(boot_gdt));
>      if ( IS_ENABLED(CONFIG_PV32) )
> -        this_cpu(compat_gdt_l1e) =
> -            l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
> +        this_cpu(compat_gdt_mfn) = _mfn(virt_to_mfn(boot_compat_gdt));
>  
>      percpu_traps_init();
>  




* Re: [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU
  2025-01-08 14:26 ` [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU Roger Pau Monne
@ 2025-01-09 10:02   ` Alejandro Vallejo
  2025-01-10 14:30     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 10:02 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> In preparation for the per-domain area being populated with per-vCPU mappings
> change the parameter of destroy_perdomain_mapping() to be a vCPU instead of a
> domain, and also update the function logic to allow manipulation of per-domain
> mappings using the linear page table mappings.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/mm.h |  2 +-
>  xen/arch/x86/mm.c             | 24 +++++++++++++++++++++++-
>  xen/arch/x86/pv/domain.c      |  3 +--
>  xen/arch/x86/x86_64/mm.c      |  2 +-
>  4 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index b50a51327b2b..65cd751087dc 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -605,7 +605,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
>                               struct page_info **ppg);
>  void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
>                                  mfn_t *mfn, unsigned long nr);
> -void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
>                                 unsigned int nr);
>  void free_perdomain_mappings(struct domain *d);
>  
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 0abea792486c..713ae8dd6fa3 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -6511,10 +6511,11 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
>      unmap_domain_page(l3tab);
>  }
>  
> -void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
>                                 unsigned int nr)
>  {
>      const l3_pgentry_t *l3tab, *pl3e;
> +    const struct domain *d = v->domain;
>  
>      ASSERT(va >= PERDOMAIN_VIRT_START &&
>             va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> @@ -6523,6 +6524,27 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
>      if ( !d->arch.perdomain_l3_pg )
>          return;
>  
> +    /* Use likely to force the optimization for the fast path. */
> +    if ( likely(v == current) )

As in the previous patch, doesn't using curr_vcpu here...

> +    {
> +        l1_pgentry_t *pl1e;
> +
> +        /* Ensure page-tables are from current (if current != curr_vcpu). */
> +        sync_local_execstate();

... avoid the need for this?

> +
> +        pl1e = &__linear_l1_table[l1_linear_offset(va)];
> +
> +        /* Fast path: zap L1 entries using the recursive linear mappings. */
> +        for ( ; nr--; pl1e++ )
> +        {
> +            if ( perdomain_l1e_needs_freeing(*pl1e) )
> +                free_domheap_page(l1e_get_page(*pl1e));
> +            l1e_write(pl1e, l1e_empty());
> +        }
> +
> +        return;
> +    }
> +
>      l3tab = __map_domain_page(d->arch.perdomain_l3_pg);
>      pl3e = l3tab + l3_table_offset(va);
>  
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index bc7cd0c62f0e..7e8bffaae9a0 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -285,8 +285,7 @@ static int pv_create_gdt_ldt_l1tab(struct vcpu *v)
>  
>  static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
>  {
> -    destroy_perdomain_mapping(v->domain, GDT_VIRT_START(v),
> -                              1U << GDT_LDT_VCPU_SHIFT);
> +    destroy_perdomain_mapping(v, GDT_VIRT_START(v), 1U << GDT_LDT_VCPU_SHIFT);
>  }
>  
>  void pv_vcpu_destroy(struct vcpu *v)
> diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
> index 389d813ebe63..c08b28d9693b 100644
> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -737,7 +737,7 @@ int setup_compat_arg_xlat(struct vcpu *v)
>  
>  void free_compat_arg_xlat(struct vcpu *v)
>  {
> -    destroy_perdomain_mapping(v->domain, ARG_XLAT_START(v),
> +    destroy_perdomain_mapping(v, ARG_XLAT_START(v),
>                                PFN_UP(COMPAT_ARG_XLAT_SIZE));
>  }
>  

Cheers,
Alejandro



* Re: [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
  2025-01-08 15:11   ` [PATCH v2.1 " Roger Pau Monne
@ 2025-01-09 10:25     ` Alejandro Vallejo
  2025-01-10 14:33       ` Roger Pau Monné
  2025-01-14 15:30     ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 10:25 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 3:11 PM GMT, Roger Pau Monne wrote:
> The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
> mappings being stashed in the domain structure, and thus such mappings being
> modified by merely updating the L1 entries.
>
> Switch both pv_{set,destroy}_gdt() to instead use
> {populate,destory}_perdomain_mapping().

nit: s/destory/destroy

How come pv_set_gdt() doesn't need to be reordered here (as opposed to v2)?

>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v2:
>  - Do not change ordering setup of arch_set_info_guest().
> ---
>  xen/arch/x86/pv/descriptor-tables.c | 28 ++++++++++++----------------
>  1 file changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
> index 02647a2c5047..5a79f022ce13 100644
> --- a/xen/arch/x86/pv/descriptor-tables.c
> +++ b/xen/arch/x86/pv/descriptor-tables.c
> @@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v)
>  
>  void pv_destroy_gdt(struct vcpu *v)
>  {
> -    l1_pgentry_t *pl1e = pv_gdt_ptes(v);
> -    mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
> -    l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
>      unsigned int i;
>  
>      ASSERT(v == current || !vcpu_cpu_dirty(v));
>  
> -    v->arch.pv.gdt_ents = 0;
> -    for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
> -    {
> -        mfn_t mfn = l1e_get_mfn(pl1e[i]);
> +    if ( v->arch.cr3 )
> +        destroy_perdomain_mapping(v, GDT_VIRT_START(v),
> +                                  ARRAY_SIZE(v->arch.pv.gdt_frames));
>  
> -        if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) &&
> -             !mfn_eq(mfn, zero_mfn) )
> -            put_page_and_type(mfn_to_page(mfn));
> +    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++)
> +    {
> +        if ( !v->arch.pv.gdt_frames[i] )
> +            break;
>  
> -        l1e_write(&pl1e[i], zero_l1e);
> +        put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i])));
>          v->arch.pv.gdt_frames[i] = 0;
>      }
>  }
> @@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>                 unsigned int entries)
>  {
>      struct domain *d = v->domain;
> -    l1_pgentry_t *pl1e;
>      unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
> +    mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)];
>  
>      ASSERT(v == current || !vcpu_cpu_dirty(v));
>  
> @@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>          if ( !mfn_valid(mfn) ||
>               !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) )
>              goto fail;
> +
> +        mfns[i] = mfn;
>      }
>  
>      /* Tear down the old GDT. */
> @@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>  
>      /* Install the new GDT. */
>      v->arch.pv.gdt_ents = entries;
> -    pl1e = pv_gdt_ptes(v);
>      for ( i = 0; i < nr_frames; i++ )
> -    {
>          v->arch.pv.gdt_frames[i] = frames[i];
> -        l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
> -    }
> +    populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames);
>  
>      return 0;
>  

Cheers,
Alejandro



* Re: [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface
  2025-01-08 14:26 ` [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface Roger Pau Monne
@ 2025-01-09 11:01   ` Alejandro Vallejo
  2025-01-10 14:45     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 11:01 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> index 65cd751087dc..0c57442c9593 100644
> --- a/xen/arch/x86/include/asm/mm.h
> +++ b/xen/arch/x86/include/asm/mm.h
> @@ -601,8 +601,7 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
>  #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))

Shouldn't IS_NIL() and NIL (out of context) be removed too?
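
For reference, a minimal standalone sketch of how the sentinel behaves.
IS_NIL() is copied from the context line above; the NIL() definition is an
assumption from memory (it is not visible in this hunk), and l1_pgentry_t,
is_nil_l1tab() and nil_selfcheck() are stand-ins for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed definition (not shown in the quoted hunk): a pointer whose value
 * is -sizeof(type), i.e. an address that can never hold a valid object.
 */
#define NIL(type) ((type *)-sizeof(type))

/* Copied from the quoted context line. */
#define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))

/* Stand-in for the real page-table entry type. */
typedef struct { uint64_t l1; } l1_pgentry_t;

/* Hypothetical helper: non-zero if the caller passed the NIL sentinel. */
static int is_nil_l1tab(l1_pgentry_t **pl1tab)
{
    return IS_NIL(pl1tab);
}

/* Hypothetical self-check exercising the three interesting cases. */
static void nil_selfcheck(void)
{
    l1_pgentry_t *tab = NULL;

    assert(is_nil_l1tab(NIL(l1_pgentry_t *)));  /* sentinel is detected */
    assert(!is_nil_l1tab(NULL));                /* NULL is not NIL */
    assert(!is_nil_l1tab(&tab));                /* real pointer is not NIL */
}
```

If no caller passes NIL() any more after this patch, both macros would be dead
code, hence the question.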

Cheers,
Alejandro



* Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
  2025-01-08 14:26 ` [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries Roger Pau Monne
@ 2025-01-09 14:34   ` Alejandro Vallejo
  2025-01-10 14:44     ` Roger Pau Monné
  2025-01-14 15:42   ` Jan Beulich
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 14:34 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
> table(s) that contain such mappings being stashed in the domain structure, and
> thus such mappings being modified by merely updating the require L1 entries.
>
> Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, as
> that logic is always called while the vCPU is running on the current pCPU.
>
> For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
> running on the pCPU, otherwise use destroy_mappings().
>
> Note this requires keeping an array with the pages currently mapped at the LDT
> area, as that allows dropping the extra taken page reference when removing the
> mappings.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/include/asm/domain.h   |  2 ++
>  xen/arch/x86/pv/descriptor-tables.c | 19 ++++++++++---------
>  xen/arch/x86/pv/domain.c            |  4 ++++
>  xen/arch/x86/pv/mm.c                |  3 ++-
>  4 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
> index b79d6badd71c..b659cffc7f81 100644
> --- a/xen/arch/x86/include/asm/domain.h
> +++ b/xen/arch/x86/include/asm/domain.h
> @@ -523,6 +523,8 @@ struct pv_vcpu
>      struct trap_info *trap_ctxt;
>  
>      unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
> +    /* Max LDT entries is 8192, so 8192 * 8 = 64KiB (16 pages). */
> +    mfn_t ldt_frames[16];
>      unsigned long ldt_base;
>      unsigned int gdt_ents, ldt_ents;
>  
> diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
> index 5a79f022ce13..95b598a4c0cf 100644
> --- a/xen/arch/x86/pv/descriptor-tables.c
> +++ b/xen/arch/x86/pv/descriptor-tables.c
> @@ -20,28 +20,29 @@
>   */
>  bool pv_destroy_ldt(struct vcpu *v)
>  {
> -    l1_pgentry_t *pl1e;
> +    const unsigned int nr_frames = ARRAY_SIZE(v->arch.pv.ldt_frames);
>      unsigned int i, mappings_dropped = 0;
> -    struct page_info *page;
>  
>      ASSERT(!in_irq());
>  
>      ASSERT(v == current || !vcpu_cpu_dirty(v));
>  
> -    pl1e = pv_ldt_ptes(v);
> +    destroy_perdomain_mapping(v, LDT_VIRT_START(v), nr_frames);
>  
> -    for ( i = 0; i < 16; i++ )
> +    for ( i = 0; i < nr_frames; i++ )

nit: While at it, can the "unsigned int" be moved here too?

>      {
> -        if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
> -            continue;
> +        mfn_t mfn = v->arch.pv.ldt_frames[i];
> +        struct page_info *page;
>  
> -        page = l1e_get_page(pl1e[i]);
> -        l1e_write(&pl1e[i], l1e_empty());
> -        mappings_dropped++;
> +        if ( mfn_eq(mfn, INVALID_MFN) )
> +            continue;

Can it really be disjoint? As in, why "continue" and not "break"? Not that it
matters in the slightest, and I prefer this form; but I'm curious.
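
To make the question concrete: the pv_map_ldt_shadow_page() hunk further down
stores the frame at index (offset >> PAGE_SHIFT), so as far as I can tell the
populated slots need not be contiguous. A standalone sketch of the difference,
with stand-in names (NR_SLOTS, INVALID_SLOT and both helpers are hypothetical,
not Xen code):

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS     16
#define INVALID_SLOT (~0UL)   /* stand-in for INVALID_MFN */

/* Drop every populated slot, skipping holes.  Returns the number dropped. */
static unsigned int drop_all_continue(unsigned long slots[NR_SLOTS])
{
    unsigned int dropped = 0;

    for ( unsigned int i = 0; i < NR_SLOTS; i++ )
    {
        if ( slots[i] == INVALID_SLOT )
            continue;

        slots[i] = INVALID_SLOT;
        dropped++;
    }

    return dropped;
}

/* Same, but stop at the first hole: only correct if the array is dense. */
static unsigned int drop_until_hole_break(unsigned long slots[NR_SLOTS])
{
    unsigned int dropped = 0;

    for ( unsigned int i = 0; i < NR_SLOTS; i++ )
    {
        if ( slots[i] == INVALID_SLOT )
            break;

        slots[i] = INVALID_SLOT;
        dropped++;
    }

    return dropped;
}
```

With slots 0 and 5 populated, the "continue" form releases both, while the
"break" form would stop at slot 1 and leak the reference held by slot 5.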

>  
> +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> +        page = mfn_to_page(mfn);
>          ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
>          ASSERT_PAGE_IS_DOMAIN(page, v->domain);
>          put_page_and_type(page);
> +        mappings_dropped++;
>      }
>  
>      return mappings_dropped;
> diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> index 7e8bffaae9a0..32d7488cc186 100644
> --- a/xen/arch/x86/pv/domain.c
> +++ b/xen/arch/x86/pv/domain.c
> @@ -303,6 +303,7 @@ void pv_vcpu_destroy(struct vcpu *v)
>  int pv_vcpu_initialise(struct vcpu *v)
>  {
>      struct domain *d = v->domain;
> +    unsigned int i;
>      int rc;
>  
>      ASSERT(!is_idle_domain(d));
> @@ -311,6 +312,9 @@ int pv_vcpu_initialise(struct vcpu *v)
>      if ( rc )
>          return rc;
>  
> +    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.ldt_frames); i++ )
> +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> +

I think it makes more sense to move this earlier so ldt_frames[] is initialised
even if pv_vcpu_initialise() fails. It may be benign, but it looks like an
accident about to happen.

Also, nit: "unsigned int i"'s scope can be restricted to the loop itself.

  As in, "for ( unsigned int i =..."
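
Roughly what I have in mind, as a standalone sketch rather than a patch. All
names here (struct vcpu_sketch, vcpu_init_sketch(), count_populated(),
NR_FRAMES, INVALID_FRAME) are made up for illustration, and fail_early merely
stands in for whichever earlier step can return an error:

```c
#include <assert.h>
#include <stddef.h>

#define NR_FRAMES     16
#define INVALID_FRAME (~0UL)   /* stand-in for INVALID_MFN */

/* Stand-in for the relevant part of struct pv_vcpu. */
struct vcpu_sketch {
    unsigned long frames[NR_FRAMES];
};

/*
 * Fill the sentinel before anything that can fail, so that a teardown path
 * running after an early error never sees zero-filled (valid-looking) slots.
 */
static int vcpu_init_sketch(struct vcpu_sketch *v, int fail_early)
{
    for ( unsigned int i = 0; i < NR_FRAMES; i++ )
        v->frames[i] = INVALID_FRAME;

    if ( fail_early )
        return -1;   /* stands in for the earlier fallible step */

    return 0;
}

/* Count the slots a teardown loop would try to release. */
static unsigned int count_populated(const struct vcpu_sketch *v)
{
    unsigned int n = 0;

    for ( unsigned int i = 0; i < NR_FRAMES; i++ )
        if ( v->frames[i] != INVALID_FRAME )
            n++;

    return n;
}
```

With this ordering a failed initialisation leaves zero populated slots, so a
later destroy loop has nothing to release.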

>      BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) >
>                   PAGE_SIZE);
>      v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS);
> diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c
> index 187f5f6a3e8c..4853e619f2a7 100644
> --- a/xen/arch/x86/pv/mm.c
> +++ b/xen/arch/x86/pv/mm.c
> @@ -86,7 +86,8 @@ bool pv_map_ldt_shadow_page(unsigned int offset)
>          return false;
>      }
>  
> -    pl1e = &pv_ldt_ptes(curr)[offset >> PAGE_SHIFT];
> +    curr->arch.pv.ldt_frames[offset >> PAGE_SHIFT] = page_to_mfn(page);
> +    pl1e = &__linear_l1_table[l1_linear_offset(LDT_VIRT_START(curr) + offset)];
>      l1e_add_flags(gl1e, _PAGE_RW);
>  
>      l1e_write(pl1e, gl1e);

Cheers,
Alejandro



* Re: [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option
  2025-01-08 14:26 ` [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
@ 2025-01-09 14:58   ` Alejandro Vallejo
  2025-01-10 14:55     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 14:58 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Jan Beulich,
	Julien Grall, Stefano Stabellini

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> No functional change, as the option is not used.
>
> Introduced new so newly added functionality is keyed on the option being
> enabled, even if the feature is non-functional.
>
> When ASI is enabled for PV domains, printing the usage of XPTI might be
> omitted if it must be uniformly disabled given the usage of ASI.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
>  - Improve comments and documentation about what ASI provides.
>  - Do not print the XPTI information if ASI is used for pv domUs and dom0 is
>    PVH, or if ASI is used for both domU and dom0.
>
> FWIW, I would print the state of XPTI uniformly, as otherwise I find the output
> might be confusing for user expecting to assert the state of XPTI.
> ---
>  docs/misc/xen-command-line.pandoc    |  19 +++++
>  xen/arch/x86/include/asm/domain.h    |   3 +
>  xen/arch/x86/include/asm/spec_ctrl.h |   2 +
>  xen/arch/x86/spec_ctrl.c             | 115 +++++++++++++++++++++++++--
>  4 files changed, 133 insertions(+), 6 deletions(-)
>
> diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> index 08b0053f9ced..3c1ad7b5fe7d 100644
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -202,6 +202,25 @@ to appropriate auditing by Xen.  Argo is disabled by default.
>      This option is disabled by default, to protect domains from a DoS by a
>      buggy or malicious other domain spamming the ring.
>  
> +### asi (x86)
> +> `= List of [ <bool>, {pv,hvm}=<bool>,
> +               {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]`

nit: While this grows later, the braces around vcpu-pt aren't strictly needed here.

> +
> +Offers control over whether the hypervisor will engage in Address Space
> +Isolation, by not having potentially sensitive information permanently mapped
> +in the VMM page-tables.  Using this option might avoid the need to apply
> +mitigations for certain speculative related attacks, at the cost of mapping
> +sensitive information on-demand.

Might be worth mentioning that this provides some defense in depth against
unmitigated attacks too.

> +
> +* `pv=` and `hvm=` sub-options allow enabling for specific guest types.
> +
> +**WARNING: manual de-selection of enabled options will invalidate any
> +protection offered by the feature.  The fine grained options provided below are
> +meant to be used for debugging purposes only.**
> +
> +* `vcpu-pt` ensure each vCPU uses a unique top-level page-table and setup a
> +  virtual address space region to map memory on a per-vCPU basis.
> +
>  ### asid (x86)
>  > `= <boolean>`
>  
> diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
> index ced84750015c..9463a8624701 100644
> --- a/xen/arch/x86/spec_ctrl.c
> +++ b/xen/arch/x86/spec_ctrl.c
> @@ -2075,6 +2165,19 @@ void __init init_speculation_mitigations(void)
>           hw_smt_enabled && default_xen_spec_ctrl )
>          setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE);
>  
> +    /* Disable all ASI options by default until feature is finished. */
> +    if ( opt_vcpu_pt_pv == -1 )
> +        opt_vcpu_pt_pv = 0;
> +    if ( opt_vcpu_pt_hwdom == -1 )
> +        opt_vcpu_pt_hwdom = 0;
> +    if ( opt_vcpu_pt_hvm == -1 )
> +        opt_vcpu_pt_hvm = 0;

Why not preinitialise them to zero instead in the static declarations?
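
My guess is that -1 is kept as "not given on the command line" so that a later
patch can choose the default at runtime, in which case the pattern would look
something like the sketch below. This is standalone illustration only:
opt_sketch, parse_sketch() and resolve_default_sketch() are hypothetical names,
not the ones in spec_ctrl.c.

```c
#include <assert.h>
#include <stdint.h>

/* -1 = not specified by the user, 0 = disabled, 1 = enabled. */
static int8_t opt_sketch = -1;

/* Stand-in for the command line parser: records an explicit user choice. */
static void parse_sketch(int val)
{
    opt_sketch = !!val;
}

/* Apply the default only if the user did not express a preference. */
static void resolve_default_sketch(int dflt)
{
    if ( opt_sketch == -1 )
        opt_sketch = dflt;
}
```

As long as the default is a hard-coded 0 the two approaches behave the same,
which is why initialising to zero looked simpler to me.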

> +
> +    if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm )
> +        warning_add(
> +            "Address Space Isolation is not functional, this option is\n"
> +            "intended to be used only for development purposes.\n");
> +
>      xpti_init_default();
>  
>      l1tf_calculations();

Cheers,
Alejandro



* Re: [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-08 14:26 ` [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI Roger Pau Monne
@ 2025-01-09 15:08   ` Alejandro Vallejo
  2025-01-10 15:02     ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-09 15:08 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich, Andrew Cooper

On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> When using a unique per-vCPU root page table the per-domain region becomes
> per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
> domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
> to create per-vCPU mappings when possible.  Note the lock is also not needed
> with using per-vCPU map caches, as the structure is no longer shared.
>
> This introduces some duplication in the domain and vcpu structures, as both
> contain a mapcache field to support running with and without per-vCPU
> page-tables.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
>  xen/arch/x86/include/asm/domain.h | 20 ++++---
>  2 files changed, 71 insertions(+), 39 deletions(-)
>
> diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> index 1372be20224e..65900d6218f8 100644
> --- a/xen/arch/x86/domain_page.c
> +++ b/xen/arch/x86/domain_page.c
> @@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
>      struct vcpu *v;
>      struct mapcache_domain *dcache;
>      struct mapcache_vcpu *vcache;
> +    struct mapcache *cache;
>      struct vcpu_maphash_entry *hashent;
> +    struct domain *d;
>  
>  #ifdef NDEBUG
>      if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> @@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
>      if ( !v || !is_pv_vcpu(v) )
>          return mfn_to_virt(mfn_x(mfn));
>  
> -    dcache = &v->domain->arch.pv.mapcache;
> +    d = v->domain;
> +    dcache = &d->arch.pv.mapcache;
>      vcache = &v->arch.pv.mapcache;
> -    if ( !dcache->inuse )
> +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> +                            : &d->arch.pv.mapcache.cache;
> +    if ( !cache->inuse )
>          return mfn_to_virt(mfn_x(mfn));
>  
>      perfc_incr(map_domain_page_count);
> @@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
>      if ( hashent->mfn == mfn_x(mfn) )
>      {
>          idx = hashent->idx;
> -        ASSERT(idx < dcache->entries);
> +        ASSERT(idx < cache->entries);
>          hashent->refcnt++;
>          ASSERT(hashent->refcnt);
>          ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
>          goto out;
>      }
>  
> -    spin_lock(&dcache->lock);
> +    if ( !d->arch.vcpu_pt )
> +        spin_lock(&dcache->lock);

Hmmm. I wonder whether we might not want a nospec here...

>  
>      /* Has some other CPU caused a wrap? We must flush if so. */
> -    if ( unlikely(dcache->epoch != vcache->shadow_epoch) )
> +    if ( unlikely(!d->arch.vcpu_pt && dcache->epoch != vcache->shadow_epoch) )
>      {
>          vcache->shadow_epoch = dcache->epoch;
>          if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) )
> @@ -118,21 +124,21 @@ void *map_domain_page(mfn_t mfn)
>          }
>      }
>  
> -    idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor);
> -    if ( unlikely(idx >= dcache->entries) )
> +    idx = find_next_zero_bit(cache->inuse, cache->entries, cache->cursor);
> +    if ( unlikely(idx >= cache->entries) )
>      {
>          unsigned long accum = 0, prev = 0;
>  
>          /* /First/, clean the garbage map and update the inuse list. */
> -        for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ )
> +        for ( i = 0; i < BITS_TO_LONGS(cache->entries); i++ )
>          {
>              accum |= prev;
> -            dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0);
> -            prev = ~dcache->inuse[i];
> +            cache->inuse[i] &= ~xchg(&cache->garbage[i], 0);
> +            prev = ~cache->inuse[i];
>          }
>  
> -        if ( accum | (prev & BITMAP_LAST_WORD_MASK(dcache->entries)) )
> -            idx = find_first_zero_bit(dcache->inuse, dcache->entries);
> +        if ( accum | (prev & BITMAP_LAST_WORD_MASK(cache->entries)) )
> +            idx = find_first_zero_bit(cache->inuse, cache->entries);
>          else
>          {
>              /* Replace a hash entry instead. */
> @@ -152,19 +158,23 @@ void *map_domain_page(mfn_t mfn)
>                      i = 0;
>              } while ( i != MAPHASH_HASHFN(mfn_x(mfn)) );
>          }
> -        BUG_ON(idx >= dcache->entries);
> +        BUG_ON(idx >= cache->entries);
>  
>          /* /Second/, flush TLBs. */
>          perfc_incr(domain_page_tlb_flush);
>          flush_tlb_local();
> -        vcache->shadow_epoch = ++dcache->epoch;
> -        dcache->tlbflush_timestamp = tlbflush_current_time();
> +        if ( !d->arch.vcpu_pt )
> +        {
> +            vcache->shadow_epoch = ++dcache->epoch;
> +            dcache->tlbflush_timestamp = tlbflush_current_time();
> +        }
>      }
>  
> -    set_bit(idx, dcache->inuse);
> -    dcache->cursor = idx + 1;
> +    set_bit(idx, cache->inuse);
> +    cache->cursor = idx + 1;
>  
> -    spin_unlock(&dcache->lock);
> +    if ( !d->arch.vcpu_pt )
> +        spin_unlock(&dcache->lock);

... and here.

>  
>      l1e_write(&MAPCACHE_L1ENT(idx), l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW));
>  
> @@ -178,6 +188,7 @@ void unmap_domain_page(const void *ptr)
>      unsigned int idx;
>      struct vcpu *v;
>      struct mapcache_domain *dcache;
> +    struct mapcache *cache;
>      unsigned long va = (unsigned long)ptr, mfn, flags;
>      struct vcpu_maphash_entry *hashent;
>  
> @@ -190,7 +201,9 @@ void unmap_domain_page(const void *ptr)
>      ASSERT(v && is_pv_vcpu(v));
>  
>      dcache = &v->domain->arch.pv.mapcache;
> -    ASSERT(dcache->inuse);
> +    cache = v->domain->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> +                                    : &v->domain->arch.pv.mapcache.cache;
> +    ASSERT(cache->inuse);
>  
>      idx = PFN_DOWN(va - MAPCACHE_VIRT_START);
>      mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx));
> @@ -213,7 +226,7 @@ void unmap_domain_page(const void *ptr)
>                     hashent->mfn);
>              l1e_write(&MAPCACHE_L1ENT(hashent->idx), l1e_empty());
>              /* /Second/, mark as garbage. */
> -            set_bit(hashent->idx, dcache->garbage);
> +            set_bit(hashent->idx, cache->garbage);
>          }
>  
>          /* Add newly-freed mapping to the maphash. */
> @@ -225,7 +238,7 @@ void unmap_domain_page(const void *ptr)
>          /* /First/, zap the PTE. */
>          l1e_write(&MAPCACHE_L1ENT(idx), l1e_empty());
>          /* /Second/, mark as garbage. */
> -        set_bit(idx, dcache->garbage);
> +        set_bit(idx, cache->garbage);
>      }
>  
>      local_irq_restore(flags);
> @@ -234,7 +247,6 @@ void unmap_domain_page(const void *ptr)
>  void mapcache_domain_init(struct domain *d)
>  {
>      struct mapcache_domain *dcache = &d->arch.pv.mapcache;
> -    unsigned int bitmap_pages;
>  
>      ASSERT(is_pv_domain(d));
>  
> @@ -243,13 +255,12 @@ void mapcache_domain_init(struct domain *d)
>          return;
>  #endif
>  
> +    if ( d->arch.vcpu_pt )
> +        return;
> +
>      BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 +
>                   2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) >
>                   MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20));
> -    bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
> -    dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
> -    dcache->garbage = dcache->inuse +
> -                      (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
>  
>      spin_lock_init(&dcache->lock);
>  }
> @@ -258,30 +269,45 @@ int mapcache_vcpu_init(struct vcpu *v)
>  {
>      struct domain *d = v->domain;
>      struct mapcache_domain *dcache = &d->arch.pv.mapcache;
> +    struct mapcache *cache;
>      unsigned long i;
> -    unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES;
> +    unsigned int ents = (d->arch.vcpu_pt ? 1 : d->max_vcpus) *
> +                        MAPCACHE_VCPU_ENTRIES;
>      unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long));
>  
> -    if ( !is_pv_vcpu(v) || !dcache->inuse )
> +    if ( !is_pv_vcpu(v) )
>          return 0;
>  
> -    if ( ents > dcache->entries )
> +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> +                            : &dcache->cache;
> +
> +    if ( !cache->inuse )
> +        return 0;
> +
> +    if ( ents > cache->entries )
>      {
>          /* Populate page tables. */
>          int rc = create_perdomain_mapping(v, MAPCACHE_VIRT_START, ents, false);
> +        const unsigned int bitmap_pages =
> +            PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long));
> +
> +        cache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE;
> +        cache->garbage = cache->inuse +
> +                         (bitmap_pages + 1) * PAGE_SIZE / sizeof(long);
> +
>  
>          /* Populate bit maps. */
>          if ( !rc )
> -            rc = create_perdomain_mapping(v, (unsigned long)dcache->inuse,
> +            rc = create_perdomain_mapping(v, (unsigned long)cache->inuse,
>                                            nr, true);
>          if ( !rc )
> -            rc = create_perdomain_mapping(v, (unsigned long)dcache->garbage,
> +            rc = create_perdomain_mapping(v, (unsigned long)cache->garbage,
>                                            nr, true);
>  
>          if ( rc )
>              return rc;
>  
> -        dcache->entries = ents;
> +        cache->entries = ents;
>      }
>  
>      /* Mark all maphash entries as not in use. */

Cheers,
Alejandro



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-09  8:59   ` Jan Beulich
@ 2025-01-09 17:33     ` Roger Pau Monné
  2025-01-14 15:02       ` Jan Beulich
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-09 17:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Thu, Jan 09, 2025 at 09:59:58AM +0100, Jan Beulich wrote:
> On 08.01.2025 15:26, Roger Pau Monne wrote:
> > On x86 Xen will perform lazy context switches to the idle vCPU, where the
> > previously running vCPU context is not overwritten, and only current is updated
> > to point to the idle vCPU.  The state is then disjunct between current and
> > curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> > whose context is loaded on the pCPU.
> > 
> > While on that lazy context switched state, certain calls (like
> > map_domain_page()) will trigger a full synchronization of the pCPU state by
> > forcing a context switch.  Note however how calling any of such functions
> > inside the context switch code itself is very likely to trigger an infinite
> > recursion loop.
> > 
> > Attempt to limit the window where curr_vcpu != current in the context switch
> > code, so as to prevent an infinite recursion loop around sync_local_execstate().
> > 
> > This is required for using map_domain_page() in the vCPU context switch code,
> > otherwise using map_domain_page() in that context ends up in a recursive
> > sync_local_execstate() loop:
> 
> Question is whether it's a good idea in the first place to start using
> map_domain_page() from the context switch path. Surely there are possible
> alternatives.

It seemed more natural than introducing yet something new to use
in the context switch path.  I'm happy to hear recommendations, but
overall introducing yet another interface to map stuff just for the
context switch path seems worse than extending an existing interface
to work in that context.

> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
> >      per_cpu(full_gdt_loaded, cpu) = false;
> >  }
> >  
> > -static void __context_switch(void)
> > +static void __context_switch(struct vcpu *n)
> >  {
> >      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
> >      unsigned int          cpu = smp_processor_id();
> >      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> > -    struct vcpu          *n = current;
> >      struct domain        *pd = p->domain, *nd = n->domain;
> >  
> >      ASSERT(p != n);
> >      ASSERT(!vcpu_cpu_dirty(n));
> > +    ASSERT(p == current);
> >  
> >      if ( !is_idle_domain(pd) )
> >      {
> > @@ -2036,6 +2036,18 @@ static void __context_switch(void)
> >  
> >      write_ptbase(n);
> >  
> > +    /*
> > +     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
> > +     * window where calls to mapcache_current_vcpu() during the context switch
> > +     * could trigger a recursive loop.
> > +     *
> > +     * Do the current switch immediately after switching to the new guest
> > +     * page-tables, so that current is (almost) always in sync with the
> > +     * currently loaded page-tables.
> > +     */
> > +    set_current(n);
> > +    per_cpu(curr_vcpu, cpu) = n;
> 
> The latter paragraph of the comment states something that so far wasn't intended,
> and imo also shouldn't be going forward. It's curr_vcpu which wants to be in sync
> with the loaded page tables. (Whether pulling ahead its updating is okay is a
> separate question. All of these actions used to be very carefully placed the
> way they are. Which isn't to say that I can exclude things having gone stale ...)

I've noticed this was all quite carefully placed.  I've also attempted
to take care with the changes I've done here (and tested them
extensively).

> And yes, that has always meant that mapcache_current_vcpu()'s condition for
> calling sync_local_execstate() was building upon the fact that it won't be called
> from context switching contexts.
> 
> Did you consider updating that condition (evaluating curr_cpu) instead?

We cannot safely use map_domain_page() if current != curr_vcpu,
because at any point (as a result of an interrupt) a call to
sync_local_execstate() could happen, and thus remove the mappings
created by map_domain_page() as a result of performing a full context
switch to the idle vCPU (and the idle vCPU page tables).
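
To illustrate the hazard, here is a toy model of the lazy state.  It is
not Xen code: every toy_* name is made up for this example, and the
"mapping" is reduced to a pointer recording which vCPU's page-tables
hold it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT Xen code: all names here are illustrative only. */
struct toy_vcpu {
    int id;
    bool is_idle;
};

struct toy_pcpu {
    struct toy_vcpu *current;       /* vCPU the scheduler considers running */
    struct toy_vcpu *curr_vcpu;     /* vCPU whose page-tables are loaded */
    struct toy_vcpu *mapping_owner; /* page-tables holding a live mapping */
};

/* Lazy switch to idle: only current changes, the loaded context is kept. */
static void toy_lazy_switch_to_idle(struct toy_pcpu *c, struct toy_vcpu *idle)
{
    c->current = idle;
}

/* Full sync: load the page-tables of current, losing the old mappings. */
static void toy_sync_local_execstate(struct toy_pcpu *c)
{
    if ( c->curr_vcpu == c->current )
        return;

    c->curr_vcpu = c->current;
    if ( c->mapping_owner != c->curr_vcpu )
        c->mapping_owner = NULL; /* it lived in the previous page-tables */
}

/* Create a mapping in whatever page-tables are loaded right now. */
static void toy_map(struct toy_pcpu *c)
{
    c->mapping_owner = c->curr_vcpu;
}

static bool toy_mapping_alive(const struct toy_pcpu *c)
{
    return c->mapping_owner != NULL && c->mapping_owner == c->curr_vcpu;
}
```

Mapping while lazily switched and then calling toy_sync_local_execstate()
(standing in for the interrupt) leaves toy_mapping_alive() returning
false.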

> 
> > @@ -2048,8 +2060,6 @@ static void __context_switch(void)
> >      if ( pd != nd )
> >          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
> >      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> > -
> > -    per_cpu(curr_vcpu, cpu) = n;
> >  }
> >  
> >  void context_switch(struct vcpu *prev, struct vcpu *next)
> > @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> >  
> >      local_irq_disable();
> >  
> > -    set_current(next);
> > -
> >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> >           (is_idle_domain(nextd) && cpu_online(cpu)) )
> >      {
> > +        /*
> > +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> > +         * context switch happens if/when sync_local_execstate() is called.
> > +         */
> > +        set_current(next);
> >          local_irq_enable();
> 
> The comment is misleading as far as the first half of the if() condition goes:
> No further switching is going to happen in that case, aiui.

Right, I should clarify that comment: this is either a lazy context
switch, or the return from a lazy state to the previously running
vCPU.

> >      }
> >      else
> >      {
> > -        __context_switch();
> > +        /*
> > +         * curr_vcpu will always point to the currently loaded vCPU context, as
> > +         * it's not updated when doing a lazy switch to the idle vCPU.
> > +         */
> > +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> > +
> > +        if ( prev_ctx != current )
> > +        {
> > +            /*
> > +             * Doing a full context switch to a non-idle vCPU from a lazy
> > +             * context switched state.  Adjust current to point to the
> > +             * currently loaded vCPU context.
> > +             */
> > +            ASSERT(current == idle_vcpu[cpu]);
> > +            ASSERT(!is_idle_vcpu(next));
> > +            set_current(prev_ctx);
> 
> This feels wrong, as in "current" then not representing what it should represent,
> for a certain time window. I may be dense, but neither comment nor description
> clarify to me why this might be needed. I can see that it's needed to please the
> ASSERT() you add to __context_switch(), yet then I might ask why that assertion
> is put there.

This is done so that when calling __context_switch() current ==
curr_vcpu, and map_domain_page() can be used without getting into an
infinite sync_local_execstate() recursion loop.
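
A toy model of the two orderings may make this easier to see.  Again
this is not Xen code: the toy_* names, the integer vCPU ids and the
depth guard are assumptions made only so the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_DEPTH 8 /* guard so the model terminates instead of looping */

/* Toy model, NOT Xen code: vCPUs are plain ids, 0 is the idle vCPU. */
struct toy_cpu {
    int current;    /* vCPU the scheduler considers running */
    int curr_vcpu;  /* vCPU whose context is loaded */
    int depth;      /* nesting level of toy_context_switch() */
    bool overflow;  /* set when the recursion guard trips */
};

static void toy_context_switch(struct toy_cpu *c, int next, bool new_order);

/* Force the loaded context to match current. */
static void toy_sync(struct toy_cpu *c, bool new_order)
{
    int target = c->current;

    if ( c->curr_vcpu == c->current )
        return;

    /* New ordering: restore current first, so current == curr_vcpu. */
    if ( new_order )
        c->current = c->curr_vcpu;

    toy_context_switch(c, target, new_order);
}

/* Like map_domain_page(): syncs whenever current != curr_vcpu. */
static void toy_map_domain_page(struct toy_cpu *c, bool new_order)
{
    if ( c->current != c->curr_vcpu )
        toy_sync(c, new_order);
}

static void toy_context_switch(struct toy_cpu *c, int next, bool new_order)
{
    if ( ++c->depth > TOY_MAX_DEPTH )
    {
        c->overflow = true;
        c->depth--;
        return;
    }

    /* The switch path itself wants a temporary mapping. */
    toy_map_domain_page(c, new_order);

    /* Update both back-to-back, as the patch does after write_ptbase(). */
    c->current = next;
    c->curr_vcpu = next;
    c->depth--;
}
```

Starting from the lazy state (current == 0, curr_vcpu == 1), toy_sync()
with new_order == false trips the guard, while new_order == true
completes at depth 1.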

> 
> > +        }
> > +        __context_switch(next);
> >  
> >          /* Re-enable interrupts before restoring state which may fault. */
> >          local_irq_enable();
> > @@ -2156,15 +2186,23 @@ int __sync_local_execstate(void)
> >  {
> >      unsigned long flags;
> >      int switch_required;
> > +    unsigned int cpu = smp_processor_id();
> > +    struct vcpu *p;
> >  
> >      local_irq_save(flags);
> >  
> > -    switch_required = (this_cpu(curr_vcpu) != current);
> > +    p = per_cpu(curr_vcpu, cpu);
> > +    switch_required = (p != current);
> >  
> >      if ( switch_required )
> >      {
> > -        ASSERT(current == idle_vcpu[smp_processor_id()]);
> > -        __context_switch();
> > +        ASSERT(current == idle_vcpu[cpu]);
> > +        /*
> > +         * Restore current to the previously running vCPU, __context_switch()
> > +         * will update current together with curr_vcpu.
> > +         */
> > +        set_current(p);
> 
> Similarly here.

Same reason, so that when calling __context_switch() current ==
curr_vcpu and map_domain_page() can be used (and in general
sync_local_execstate() becomes a no-op because a switch is already in
progress.)

> 
> > +        __context_switch(idle_vcpu[cpu]);
> >      }
> >  
> >      local_irq_restore(flags);
> > --- a/xen/arch/x86/traps.c
> > +++ b/xen/arch/x86/traps.c
> > @@ -2232,8 +2232,6 @@ void __init trap_init(void)
> >  
> >  void activate_debugregs(const struct vcpu *curr)
> >  {
> > -    ASSERT(curr == current);
> > -
> >      write_debugreg(0, curr->arch.dr[0]);
> >      write_debugreg(1, curr->arch.dr[1]);
> >      write_debugreg(2, curr->arch.dr[2]);
> 
> Why would this assertion go away? If it suddenly triggers, the parameter name
> would now end up being wrong.

Well, at the point where activate_debugregs() gets called (in
paravirt_ctxt_switch_to()), current == previous as a result of this
change, so the assert is no longer true on purpose on that call
path.

Thanks, Roger.



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-08 16:26   ` Alejandro Vallejo
@ 2025-01-09 17:39     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-09 17:39 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Wed, Jan 08, 2025 at 04:26:46PM +0000, Alejandro Vallejo wrote:
> This is a net gain even without ASI. Having "current" hold the previous vCPU on
> __context_switch() makes it _a lot_ easier to follow the lazy switch path.
> 
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > On x86 Xen will perform lazy context switches to the idle vCPU, where the
> > previously running vCPU context is not overwritten, and only current is updated
> > to point to the idle vCPU.  The state is then disjunct between current and
> > curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
> > whose context is loaded on the pCPU.
> >
> > While on that lazy context switched state, certain calls (like
> > map_domain_page()) will trigger a full synchronization of the pCPU state by
> > forcing a context switch.  Note however how calling any of such functions
> > inside the context switch code itself is very likely to trigger an infinite
> > recursion loop.
> >
> > Attempt to limit the window where curr_vcpu != current in the context switch
> > code, so as to prevent an infinite recursion loop around sync_local_execstate().
> >
> > This is required for using map_domain_page() in the vCPU context switch code,
> > otherwise using map_domain_page() in that context ends up in a recursive
> > sync_local_execstate() loop:
> >
> > map_domain_page() -> sync_local_execstate() -> map_domain_page() -> ...
> 
> More generally, it's worth mentioning that we want to establish an invariant
> between a per-cpu variable (curr_vcpu) and the currently running page tables.
> That way it can be used as discriminant to know which are the currently active
> per-vCPU mappings.

You kind of already do this by checking curr_vcpu, as with these
changes there's still a window where the vCPU is lazy context
switched, and hence current != curr_vcpu (and curr_vcpu should signal
what page-tables are loaded).

The main point apart from more accurate signaling of the loaded
page-tables is to avoid infinite recursion if sync_local_execstate()
is called inside the context switch path.

> That's essential for implementing FPU hiding as proposed here:
> 
>   https://lore.kernel.org/xen-devel/20241105143310.28301-1-alejandro.vallejo@cloud.com/
> 
> A shorter form of that should probably be mentioned also...
> 
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Changes since v1:
> >  - New in this version.
> > ---
> >  xen/arch/x86/domain.c | 58 +++++++++++++++++++++++++++++++++++--------
> >  xen/arch/x86/traps.c  |  2 --
> >  2 files changed, 48 insertions(+), 12 deletions(-)
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index 78a13e6812c9..1f680bf176ee 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1982,16 +1982,16 @@ static void load_default_gdt(unsigned int cpu)
> >      per_cpu(full_gdt_loaded, cpu) = false;
> >  }
> >  
> > -static void __context_switch(void)
> > +static void __context_switch(struct vcpu *n)
> >  {
> >      struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
> >      unsigned int          cpu = smp_processor_id();
> >      struct vcpu          *p = per_cpu(curr_vcpu, cpu);
> > -    struct vcpu          *n = current;
> >      struct domain        *pd = p->domain, *nd = n->domain;
> >  
> >      ASSERT(p != n);
> >      ASSERT(!vcpu_cpu_dirty(n));
> > +    ASSERT(p == current);
> >  
> >      if ( !is_idle_domain(pd) )
> >      {
> > @@ -2036,6 +2036,18 @@ static void __context_switch(void)
> >  
> >      write_ptbase(n);
> >  
> > +    /*
> > +     * It's relevant to set both current and curr_vcpu back-to-back, to avoid a
> > +     * window where calls to mapcache_current_vcpu() during the context switch
> > +     * could trigger a recursive loop.
> > +     *
> > +     * Do the current switch immediately after switching to the new guest
> > +     * page-tables, so that current is (almost) always in sync with the
> > +     * currently loaded page-tables.
> > +     */
> > +    set_current(n);
> > +    per_cpu(curr_vcpu, cpu) = n;
> 
> ... here. So we're not tempted to move these 2 far off from write_ptbase().

I think the "Do the current switch immediately after switching to the
new guest page-tables" sentence already signals that it's important to
keep the setting of current and curr_vcpu as close to the
write_ptbase() call as possible, but I'm open to suggestions for
better wording.

> > +
> >  #ifdef CONFIG_PV
> >      /* Prefetch the VMCB if we expect to use it later in the context switch */
> >      if ( using_svm() && is_pv_64bit_domain(nd) && !is_idle_domain(nd) )
> > @@ -2048,8 +2060,6 @@ static void __context_switch(void)
> >      if ( pd != nd )
> >          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
> >      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
> > -
> > -    per_cpu(curr_vcpu, cpu) = n;
> >  }
> >  
> >  void context_switch(struct vcpu *prev, struct vcpu *next)
> > @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
> >  
> >      local_irq_disable();
> >  
> > -    set_current(next);
> > -
> >      if ( (per_cpu(curr_vcpu, cpu) == next) ||
> >           (is_idle_domain(nextd) && cpu_online(cpu)) )
> >      {
> > +        /*
> > +         * Lazy context switch to the idle vCPU, set current == idle.  Full
> > +         * context switch happens if/when sync_local_execstate() is called.
> > +         */
> > +        set_current(next);
> >          local_irq_enable();
> >      }
> >      else
> >      {
> > -        __context_switch();
> > +        /*
> > +         * curr_vcpu will always point to the currently loaded vCPU context, as
> 
> nit: s/will always point/always points/ ? It's an unconditional invariant,
> after all.

Sure.

Thanks, Roger.



* Re: [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-09  9:10   ` Jan Beulich
@ 2025-01-10 14:15     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Thu, Jan 09, 2025 at 10:10:20AM +0100, Jan Beulich wrote:
> On 08.01.2025 15:26, Roger Pau Monne wrote:
> > The current code to update the Xen part of the GDT when running a PV guest
> > relies on caching the direct map address of all the L1 tables used to map the
> > GDT and LDT, so that entries can be modified.
> > 
> > Introduce a new function that populates the per-domain region, either using the
> > recursive linear mappings when the target vCPU is the current one, or by
> > directly modifying the L1 table of the per-domain region.
> > 
> > Using such function to populate per-domain addresses drops the need to keep a
> > reference to per-domain L1 tables previously used to change the per-domain
> > mappings.
> 
> Well, yes. You now record MFNs instead. And you do so at the expense of about
> 100 lines of new code. I'm afraid I'm lacking justification for this price to
> be paid.

Oh, I should probably have been more explicit in the commit message.
The cover letter kind of covers this: the objective is to remove the
stashing of L1 page-table references in the domain struct.  Currently
the per-vCPU GDT L1 are stored in the domain struct, so PTEs can be
easily manipulated.

When moving the per-domain slot to being per-vCPU this stashing of the
L1 tables will become much more complex, and hence I wanted to get rid
of it.

With the introduction of populate_perdomain_mapping() I'm attempting
to get rid of all those L1 references in the domain struct, by having
a generic function that allows modifying the linear address range that
belongs to the per-domain slot.

See for example how patch 8 gets rid of all the l1_pgentry_t GDT/LDT
references in the domain struct, and how patch 9 makes the
create_perdomain_mapping() interface much simpler.  All this is
built upon the addition of the populate_perdomain_mapping() helper and
the dropping of the l1_pgentry_t references in the domain struct.

Hope this helps clarify the intent of the change here.
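
As a rough illustration of the idea (toy code, not the actual
populate_perdomain_mapping() implementation): the caller only keeps
MFNs, and the L1 entries are built when the range is populated.  The
toy_* types, the flag value and the flat table are assumptions for the
sake of the example.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model, NOT Xen code: names and values are illustrative only. */
#define TOY_PAGE_SHIFT 12
#define TOY_FLAGS_RW   0x3u /* present | writable */
#define TOY_L1_ENTRIES 512

typedef struct { uint64_t mfn; } toy_mfn_t;
typedef struct { uint64_t l1; } toy_l1e_t;

/* Build a PTE from a frame number and flags. */
static toy_l1e_t toy_l1e_from_mfn(toy_mfn_t mfn, unsigned int flags)
{
    return (toy_l1e_t){ (mfn.mfn << TOY_PAGE_SHIFT) | flags };
}

/*
 * Fill nr consecutive slots of an L1 table from an array of MFNs.  The
 * caller never needs to stash a pointer to the L1 table or to a
 * pre-built PTE, only the MFNs it wants mapped.
 */
static void toy_populate(toy_l1e_t *l1tab, unsigned int first,
                         const toy_mfn_t *mfn, unsigned int nr)
{
    assert(first + nr <= TOY_L1_ENTRIES);

    for ( unsigned int i = 0; i < nr; i++ )
        l1tab[first + i] = toy_l1e_from_mfn(mfn[i], TOY_FLAGS_RW);
}
```

In the series itself the function also has to locate the right L1
table, which this sketch skips entirely.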

> > @@ -2219,11 +2219,9 @@ void __init trap_init(void)
> >      init_ler();
> >  
> >      /* Cache {,compat_}gdt_l1e now that physically relocation is done. */
> > -    this_cpu(gdt_l1e) =
> > -        l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW);
> > +    this_cpu(gdt_mfn) = _mfn(virt_to_mfn(boot_gdt));
> >      if ( IS_ENABLED(CONFIG_PV32) )
> > -        this_cpu(compat_gdt_l1e) =
> > -            l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW);
> > +        this_cpu(compat_gdt_mfn) = _mfn(virt_to_mfn(boot_compat_gdt));
> 
> The comment's going stale this way.

Right, the cache is still there but using a different field name.  I
can adjust to:

/* Cache {,compat_}gdt_mfn now that physically relocation is done. */

Thanks, Roger.


* Re: [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-09  9:55   ` Alejandro Vallejo
@ 2025-01-10 14:29     ` Roger Pau Monné
  2025-01-10 15:50       ` Alejandro Vallejo
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:29 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 09:55:44AM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > The current code to update the Xen part of the GDT when running a PV guest
> > relies on caching the direct map address of all the L1 tables used to map the
> > GDT and LDT, so that entries can be modified.
> >
> > Introduce a new function that populates the per-domain region, either using the
> > recursive linear mappings when the target vCPU is the current one, or by
> > directly modifying the L1 table of the per-domain region.
> >
> > Using such function to populate per-domain addresses drops the need to keep a
> > reference to per-domain L1 tables previously used to change the per-domain
> > mappings.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/domain.c                | 11 +++-
> >  xen/arch/x86/include/asm/desc.h      |  6 +-
> >  xen/arch/x86/include/asm/mm.h        |  2 +
> >  xen/arch/x86/include/asm/processor.h |  5 ++
> >  xen/arch/x86/mm.c                    | 88 ++++++++++++++++++++++++++++
> >  xen/arch/x86/smpboot.c               |  6 +-
> >  xen/arch/x86/traps.c                 | 10 ++--
> >  7 files changed, 113 insertions(+), 15 deletions(-)
> >
> > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > index 1f680bf176ee..0bd0ef7e40f4 100644
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -1953,9 +1953,14 @@ static always_inline bool need_full_gdt(const struct domain *d)
> >  
> >  static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
> >  {
> > -    l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
> > -              !is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu)
> > -                                   : per_cpu(compat_gdt_l1e, cpu));
> > +    ASSERT(v != current);
> 
> For this assert, and others below. IIUC, curr_vcpu == current when we're
> properly switched. When we're idling current == idle and curr_vcpu == prev_ctx.
> 
> Granted, calling this in the middle of a lazy idle loop would be weird, but
> would it make sense for PT consistency to use curr_vcpu here...

Hm, this function is called in a very specific context, and the assert
intends to reflect that.  TBH I could just drop it, as
populate_perdomain_mapping() will DTRT also when v == current. The
expectation for the context is also that current == curr_vcpu.

Note however that if v == current we would need a flush after the
populate_perdomain_mapping() call, since populate_perdomain_mapping()
doesn't perform any flushing of the modified entries.  The main
purpose of the ASSERT() is to notice this.

> > +
> > +    populate_perdomain_mapping(v,
> > +                               GDT_VIRT_START(v) +
> > +                               (FIRST_RESERVED_GDT_PAGE << PAGE_SHIFT),
> > +                               !is_pv_32bit_vcpu(v) ? &per_cpu(gdt_mfn, cpu)
> > +                                                    : &per_cpu(compat_gdt_mfn,
> > +                                                               cpu), 1);
> >  }
> >  
> >  static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
> > diff --git a/xen/arch/x86/include/asm/desc.h b/xen/arch/x86/include/asm/desc.h
> > index a1e0807d97ed..33981bfca588 100644
> > --- a/xen/arch/x86/include/asm/desc.h
> > +++ b/xen/arch/x86/include/asm/desc.h
> > @@ -44,6 +44,8 @@
> >  
> >  #ifndef __ASSEMBLY__
> >  
> > +#include <xen/mm-frame.h>
> > +
> >  #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
> >  
> >  /* Fix up the RPL of a guest segment selector. */
> > @@ -212,10 +214,10 @@ struct __packed desc_ptr {
> >  
> >  extern seg_desc_t boot_gdt[];
> >  DECLARE_PER_CPU(seg_desc_t *, gdt);
> > -DECLARE_PER_CPU(l1_pgentry_t, gdt_l1e);
> > +DECLARE_PER_CPU(mfn_t, gdt_mfn);
> >  extern seg_desc_t boot_compat_gdt[];
> >  DECLARE_PER_CPU(seg_desc_t *, compat_gdt);
> > -DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_l1e);
> > +DECLARE_PER_CPU(mfn_t, compat_gdt_mfn);
> >  DECLARE_PER_CPU(bool, full_gdt_loaded);
> >  
> >  static inline void lgdt(const struct desc_ptr *gdtr)
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index 6c7e66ee21ab..b50a51327b2b 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -603,6 +603,8 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
> >  int create_perdomain_mapping(struct domain *d, unsigned long va,
> >                               unsigned int nr, l1_pgentry_t **pl1tab,
> >                               struct page_info **ppg);
> > +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> > +                                mfn_t *mfn, unsigned long nr);
> >  void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> >                                 unsigned int nr);
> >  void free_perdomain_mappings(struct domain *d);
> > diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
> > index d247ef8dd226..82ee89f736c2 100644
> > --- a/xen/arch/x86/include/asm/processor.h
> > +++ b/xen/arch/x86/include/asm/processor.h
> > @@ -243,6 +243,11 @@ static inline unsigned long cr3_pa(unsigned long cr3)
> >      return cr3 & X86_CR3_ADDR_MASK;
> >  }
> >  
> > +static inline mfn_t cr3_mfn(unsigned long cr3)
> > +{
> > +    return maddr_to_mfn(cr3_pa(cr3));
> > +}
> > +
> >  static inline unsigned int cr3_pcid(unsigned long cr3)
> >  {
> >      return IS_ENABLED(CONFIG_PV) ? cr3 & X86_CR3_PCID_MASK : 0;
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index 3d5dd22b6c36..0abea792486c 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -6423,6 +6423,94 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
> >      return rc;
> >  }
> >  
> > +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> > +                                mfn_t *mfn, unsigned long nr)
> > +{
> > +    l1_pgentry_t *l1tab = NULL, *pl1e;
> > +    const l3_pgentry_t *l3tab;
> > +    const l2_pgentry_t *l2tab;
> > +    struct domain *d = v->domain;
> > +
> > +    ASSERT(va >= PERDOMAIN_VIRT_START &&
> > +           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> > +    ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
> > +
> > +    /* Use likely to force the optimization for the fast path. */
> > +    if ( likely(v == current) )
> 
> ... and here? In particular I'd expect using curr_vcpu here means...

I'm afraid not, this is a trap I originally fell into when doing this
series, as I indeed had v == curr_vcpu here (and no
sync_local_execstate() call).

However as a result of an interrupt, a call to sync_local_execstate()
might happen, at which point the previous check of v == curr_vcpu
becomes stale.

> > +    {
> > +        unsigned int i;
> > +
> > +        /* Ensure page-tables are from current (if current != curr_vcpu). */
> > +        sync_local_execstate();
> 
> ... this should not be needed.

As kind of mentioned above, this is required to ensure the page-tables
are in-sync with the vCPU in current, and cannot change as a result of
an interrupt triggering a call to sync_local_execstate().

Otherwise the page-tables could change while or after the call to
populate_perdomain_mapping(), and the mappings could end up being
created on the wrong page-tables.
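
The check-then-use race can be sketched with a toy model.  It is not
Xen code: the toy_* names and the integer vCPU ids (0 being idle) are
made up, and the "interrupt" is just a callback invoked between the
check and the write.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT Xen code: vCPUs are plain ids, 0 is the idle vCPU. */
struct toy_cpu {
    int current;    /* vCPU the scheduler considers running */
    int curr_vcpu;  /* vCPU whose page-tables are loaded */
};

typedef void (*toy_irq_fn)(struct toy_cpu *c);

/* Load the page-tables of current; a no-op when already in sync. */
static void toy_sync(struct toy_cpu *c)
{
    c->curr_vcpu = c->current;
}

/*
 * Racy variant: checks v against curr_vcpu without syncing first.
 * Returns the id of the vCPU whose page-tables received the write, or
 * -1 when the slow path (not using the loaded page-tables) is taken.
 */
static int toy_populate_racy(struct toy_cpu *c, int v, toy_irq_fn irq)
{
    if ( v != c->curr_vcpu )
        return -1;

    if ( irq != NULL )
        irq(c);          /* interrupt between the check and the write */

    return c->curr_vcpu; /* write lands in whatever is loaded now */
}

/* Variant from the patch: check against current, then sync. */
static int toy_populate_synced(struct toy_cpu *c, int v, toy_irq_fn irq)
{
    if ( v != c->current )
        return -1;

    toy_sync(c);         /* from here on curr_vcpu == current */

    if ( irq != NULL )
        irq(c);          /* a later sync is now a no-op */

    return c->curr_vcpu;
}
```

With current == 0 and curr_vcpu == 1, toy_populate_racy(&cpu, 1,
toy_sync) writes into vCPU 0's page-tables, whereas the synced variant
either takes the slow path or writes into the tables of the vCPU it
was asked about.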

Thanks, Roger.



* Re: [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU
  2025-01-09 10:02   ` Alejandro Vallejo
@ 2025-01-10 14:30     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:30 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 10:02:19AM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > In preparation for the per-domain area being populated with per-vCPU mappings
> > change the parameter of destroy_perdomain_mapping() to be a vCPU instead of a
> > domain, and also update the function logic to allow manipulation of per-domain
> > mappings using the linear page table mappings.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/mm.h |  2 +-
> >  xen/arch/x86/mm.c             | 24 +++++++++++++++++++++++-
> >  xen/arch/x86/pv/domain.c      |  3 +--
> >  xen/arch/x86/x86_64/mm.c      |  2 +-
> >  4 files changed, 26 insertions(+), 5 deletions(-)
> >
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index b50a51327b2b..65cd751087dc 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -605,7 +605,7 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
> >                               struct page_info **ppg);
> >  void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> >                                  mfn_t *mfn, unsigned long nr);
> > -void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> > +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
> >                                 unsigned int nr);
> >  void free_perdomain_mappings(struct domain *d);
> >  
> > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > index 0abea792486c..713ae8dd6fa3 100644
> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -6511,10 +6511,11 @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> >      unmap_domain_page(l3tab);
> >  }
> >  
> > -void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> > +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va,
> >                                 unsigned int nr)
> >  {
> >      const l3_pgentry_t *l3tab, *pl3e;
> > +    const struct domain *d = v->domain;
> >  
> >      ASSERT(va >= PERDOMAIN_VIRT_START &&
> >             va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> > @@ -6523,6 +6524,27 @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> >      if ( !d->arch.perdomain_l3_pg )
> >          return;
> >  
> > +    /* Use likely to force the optimization for the fast path. */
> > +    if ( likely(v == current) )
> 
> As in the previous patch, doesn't using curr_vcpu here...
> 
> > +    {
> > +        l1_pgentry_t *pl1e;
> > +
> > +        /* Ensure page-tables are from current (if current != curr_vcpu). */
> > +        sync_local_execstate();
> 
> ... avoid the need for this?

See previous reply and the hazards of curr_vcpu changing as a result
of an interrupt triggering a sync_local_execstate() call.
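
For readers following the thread, the hazard can be sketched with a toy
model.  This is only an illustration under stated assumptions: struct
pcpu, model_sync_local_execstate() and model_interrupt() are
hypothetical stand-ins, not the real Xen definitions, and the
"interrupt" is just a function call placed between the check and the
use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real Xen definitions. */
struct vcpu { int id; };

struct pcpu {
    const struct vcpu *current;   /* vCPU selected by the scheduler */
    const struct vcpu *curr_vcpu; /* vCPU whose page-tables are loaded */
};

/* Model of sync_local_execstate(): complete any pending lazy switch. */
static void model_sync_local_execstate(struct pcpu *cpu)
{
    cpu->curr_vcpu = cpu->current;
}

/* Model of an interrupt whose handler happens to force a sync. */
static void model_interrupt(struct pcpu *cpu)
{
    model_sync_local_execstate(cpu);
}

/*
 * Racy variant: keyed on curr_vcpu, with no sync beforehand.  Returns 1
 * if the loaded page-tables still belong to v at the time of use, 0 if
 * they no longer do, and -1 if the (unmodelled) slow path would be taken.
 */
static int racy_update(struct pcpu *cpu, const struct vcpu *v)
{
    if ( cpu->curr_vcpu != v )
        return -1;
    model_interrupt(cpu);       /* lands between the check and the use */
    return cpu->curr_vcpu == v;
}

/* Variant from the patch: keyed on current, and synced before use. */
static int synced_update(struct pcpu *cpu, const struct vcpu *v)
{
    if ( cpu->current != v )
        return -1;
    model_sync_local_execstate(cpu);
    model_interrupt(cpu);       /* nothing left to switch at this point */
    return cpu->curr_vcpu == v;
}
```

In the lazy-idle case (current is the idle vCPU, curr_vcpu is the
previous guest vCPU) the racy variant passes its check and then finds
the page-tables switched underneath it, while the synced variant cannot
be affected by a later sync.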

Thanks, Roger.



* Re: [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
  2025-01-09 10:25     ` Alejandro Vallejo
@ 2025-01-10 14:33       ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:33 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 10:25:50AM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 3:11 PM GMT, Roger Pau Monne wrote:
> > The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
> > mappings being stashed in the domain structure, and thus such mappings being
> > modified by merely updating the L1 entries.
> >
> > Switch both pv_{set,destroy}_gdt() to instead use
> > {populate,destory}_perdomain_mapping().
> 
> nit: s/destory/destroy
> 
> How come pv_set_gdt() doesn't need to be reordered here (as opposed to v2)?

In a previous version (that I've never published)
populate_perdomain_mapping() was using v->arch.cr3 as the root pointer
in which to do a page-walk and modify the requested entries.  That
required v->arch.cr3 to be valid when populate_perdomain_mapping() was
called, and hence needed the reordering.

Since populate_perdomain_mapping() no longer uses v->arch.cr3 it
doesn't matter whether the vCPU cr3 is valid when calling
populate_perdomain_mapping().

Thanks, Roger.


* Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
  2025-01-09 14:34   ` Alejandro Vallejo
@ 2025-01-10 14:44     ` Roger Pau Monné
  2025-01-10 15:36       ` Alejandro Vallejo
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:44 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 02:34:05PM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
> > table(s) that contain such mappings being stashed in the domain structure, and
> > thus such mappings being modified by merely updating the required L1 entries.
> >
> > Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, as
> > that logic is always called while the vCPU is running on the current pCPU.
> >
> > For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
> > running on the pCPU, otherwise use destroy_mappings().
> >
> > Note this requires keeping an array with the pages currently mapped at the LDT
> > area, as that allows dropping the extra taken page reference when removing the
> > mappings.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/include/asm/domain.h   |  2 ++
> >  xen/arch/x86/pv/descriptor-tables.c | 19 ++++++++++---------
> >  xen/arch/x86/pv/domain.c            |  4 ++++
> >  xen/arch/x86/pv/mm.c                |  3 ++-
> >  4 files changed, 18 insertions(+), 10 deletions(-)
> >
> > diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
> > index b79d6badd71c..b659cffc7f81 100644
> > --- a/xen/arch/x86/include/asm/domain.h
> > +++ b/xen/arch/x86/include/asm/domain.h
> > @@ -523,6 +523,8 @@ struct pv_vcpu
> >      struct trap_info *trap_ctxt;
> >  
> >      unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
> > +    /* Max LDT entries is 8192, so 8192 * 8 = 64KiB (16 pages). */
> > +    mfn_t ldt_frames[16];
> >      unsigned long ldt_base;
> >      unsigned int gdt_ents, ldt_ents;
> >  
> > diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
> > index 5a79f022ce13..95b598a4c0cf 100644
> > --- a/xen/arch/x86/pv/descriptor-tables.c
> > +++ b/xen/arch/x86/pv/descriptor-tables.c
> > @@ -20,28 +20,29 @@
> >   */
> >  bool pv_destroy_ldt(struct vcpu *v)
> >  {
> > -    l1_pgentry_t *pl1e;
> > +    const unsigned int nr_frames = ARRAY_SIZE(v->arch.pv.ldt_frames);
> >      unsigned int i, mappings_dropped = 0;
> > -    struct page_info *page;
> >  
> >      ASSERT(!in_irq());
> >  
> >      ASSERT(v == current || !vcpu_cpu_dirty(v));
> >  
> > -    pl1e = pv_ldt_ptes(v);
> > +    destroy_perdomain_mapping(v, LDT_VIRT_START(v), nr_frames);
> >  
> > -    for ( i = 0; i < 16; i++ )
> > +    for ( i = 0; i < nr_frames; i++ )
> 
> nit: While at this, can the "unsigned int" be moved here too?

I don't mind much, but I also don't usually do such changes as I think
it adds more noise.

> >      {
> > -        if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
> > -            continue;
> > +        mfn_t mfn = v->arch.pv.ldt_frames[i];
> > +        struct page_info *page;
> >  
> > -        page = l1e_get_page(pl1e[i]);
> > -        l1e_write(&pl1e[i], l1e_empty());
> > -        mappings_dropped++;
> > +        if ( mfn_eq(mfn, INVALID_MFN) )
> > +            continue;
> 
> Can it really be disjoint? As in, why "continue" and not "break"?. Not that it
> matters in the slightest, and I prefer this form; but I'm curious.

I think so?  The PV guest LDT is populated as a result of page-faults,
so if the guest only happens to use segment descriptors that are on
the third page, the second page might not be mapped?

The continue was there already, and I really didn't dare to change
it, nor did I question it much.  I assumed that, because the guest
LDT is mapped on a page-fault basis, it could indeed be disjointly
mapped.
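
As an illustration of why the loop skips holes rather than stopping at
the first one, here is a minimal sketch.  It assumes a plain uint64_t
array with a sentinel value in place of mfn_t and INVALID_MFN; the
model_* names and MODEL_* macros are hypothetical, and reference
dropping is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

/* 8192 descriptors of 8 bytes each, on 4KiB pages: 64KiB, 16 pages. */
#define LDT_MAX_ENTRIES   8192u
#define LDT_ENTRY_SIZE    8u
#define MODEL_PAGE_SIZE   4096u
#define LDT_NR_FRAMES     (LDT_MAX_ENTRIES * LDT_ENTRY_SIZE / MODEL_PAGE_SIZE)
#define MODEL_INVALID_MFN UINT64_MAX   /* stand-in for INVALID_MFN */

_Static_assert(LDT_NR_FRAMES == 16, "LDT area is expected to span 16 pages");

/* Mark every slot as unmapped, as done at vCPU initialisation. */
static void model_init_frames(uint64_t frames[LDT_NR_FRAMES])
{
    unsigned int i;

    for ( i = 0; i < LDT_NR_FRAMES; i++ )
        frames[i] = MODEL_INVALID_MFN;
}

/* Drop all tracked frames and return how many mappings were dropped. */
static unsigned int model_destroy_ldt(uint64_t frames[LDT_NR_FRAMES])
{
    unsigned int i, dropped = 0;

    for ( i = 0; i < LDT_NR_FRAMES; i++ )
    {
        if ( frames[i] == MODEL_INVALID_MFN )
            continue;   /* a hole: later slots may still be populated */

        frames[i] = MODEL_INVALID_MFN;
        dropped++;      /* the real code drops the page reference here */
    }

    return dropped;
}
```

With slots 0 and 2 populated and slot 1 empty, a "break" at the first
hole would leave slot 2 (and its page reference) behind, whereas
"continue" releases both.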

> >  
> > +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> > +        page = mfn_to_page(mfn);
> >          ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
> >          ASSERT_PAGE_IS_DOMAIN(page, v->domain);
> >          put_page_and_type(page);
> > +        mappings_dropped++;
> >      }
> >  
> >      return mappings_dropped;
> > diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> > index 7e8bffaae9a0..32d7488cc186 100644
> > --- a/xen/arch/x86/pv/domain.c
> > +++ b/xen/arch/x86/pv/domain.c
> > @@ -303,6 +303,7 @@ void pv_vcpu_destroy(struct vcpu *v)
> >  int pv_vcpu_initialise(struct vcpu *v)
> >  {
> >      struct domain *d = v->domain;
> > +    unsigned int i;
> >      int rc;
> >  
> >      ASSERT(!is_idle_domain(d));
> > @@ -311,6 +312,9 @@ int pv_vcpu_initialise(struct vcpu *v)
> >      if ( rc )
> >          return rc;
> >  
> > +    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.ldt_frames); i++ )
> > +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> > +
> 
> I think it makes more sense to move this earlier so ldt_frames[] is initialised
> even if pv_vcpu_initialise() fails. It may be benign, but it looks like an
> accident about to happen.

Right, pv_destroy_gdt_ldt_l1tab() doesn't care at all about the
contents of ldt_frames[], but it will be safe to change the
ordering in pv_vcpu_initialise().

Thanks, Roger.



* Re: [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface
  2025-01-09 11:01   ` Alejandro Vallejo
@ 2025-01-10 14:45     ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:45 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 11:01:00AM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > index 65cd751087dc..0c57442c9593 100644
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -601,8 +601,7 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
> >  #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr))))
> 
> Shouldn't IS_NIL() and NIL (out of context) be removed too?

Indeed, yet more cleanup, thanks!

Roger.



* Re: [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option
  2025-01-09 14:58   ` Alejandro Vallejo
@ 2025-01-10 14:55     ` Roger Pau Monné
  2025-01-10 15:51       ` Alejandro Vallejo
  0 siblings, 1 reply; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 14:55 UTC (permalink / raw)
  To: Alejandro Vallejo
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

On Thu, Jan 09, 2025 at 02:58:29PM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > No functional change, as the option is not used.
> >
> > Introduced new so newly added functionality is keyed on the option being
> > enabled, even if the feature is non-functional.
> >
> > When ASI is enabled for PV domains, printing the usage of XPTI might be
> > omitted if it must be uniformly disabled given the usage of ASI.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Changes since v1:
> >  - Improve comments and documentation about what ASI provides.
> >  - Do not print the XPTI information if ASI is used for pv domUs and dom0 is
> >    PVH, or if ASI is used for both domU and dom0.
> >
> > FWIW, I would print the state of XPTI uniformly, as otherwise I find the output
> > might be confusing for user expecting to assert the state of XPTI.
> > ---
> >  docs/misc/xen-command-line.pandoc    |  19 +++++
> >  xen/arch/x86/include/asm/domain.h    |   3 +
> >  xen/arch/x86/include/asm/spec_ctrl.h |   2 +
> >  xen/arch/x86/spec_ctrl.c             | 115 +++++++++++++++++++++++++--
> >  4 files changed, 133 insertions(+), 6 deletions(-)
> >
> > diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> > index 08b0053f9ced..3c1ad7b5fe7d 100644
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -202,6 +202,25 @@ to appropriate auditing by Xen.  Argo is disabled by default.
> >      This option is disabled by default, to protect domains from a DoS by a
> >      buggy or malicious other domain spamming the ring.
> >  
> > +### asi (x86)
> > +> `= List of [ <bool>, {pv,hvm}=<bool>,
> > +               {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]`
> 
> nit: While this grows later, the braces around vcpu-pt aren't strictly needed here.

Since I have to modify the whole line I can indeed add the braces
later.

> > +
> > +Offers control over whether the hypervisor will engage in Address Space
> > +Isolation, by not having potentially sensitive information permanently mapped
> > +in the VMM page-tables.  Using this option might avoid the need to apply
> > +mitigations for certain speculative related attacks, at the cost of mapping
> > +sensitive information on-demand.
> 
> Might be worth mentioning that this provides some defense in depth against
> unmitigated attacks too.

It's IMO a bit too vague to make such promises, but I can add:

Offers control over whether the hypervisor will engage in Address Space
Isolation, by not having potentially sensitive information permanently mapped
in the VMM page-tables.  Using this option might avoid the need to apply
mitigations for certain speculative related attacks, at the cost of mapping
sensitive information on-demand.  It might also offer some protection
against unmitigated speculation-related attacks.

> > +
> > +* `pv=` and `hvm=` sub-options allow enabling for specific guest types.
> > +
> > +**WARNING: manual de-selection of enabled options will invalidate any
> > +protection offered by the feature.  The fine grained options provided below are
> > +meant to be used for debugging purposes only.**
> > +
> > +* `vcpu-pt` ensure each vCPU uses a unique top-level page-table and setup a
> > +  virtual address space region to map memory on a per-vCPU basis.
> > +
> >  ### asid (x86)
> >  > `= <boolean>`
> >  
> > diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
> > index ced84750015c..9463a8624701 100644
> > --- a/xen/arch/x86/spec_ctrl.c
> > +++ b/xen/arch/x86/spec_ctrl.c
> > @@ -2075,6 +2165,19 @@ void __init init_speculation_mitigations(void)
> >           hw_smt_enabled && default_xen_spec_ctrl )
> >          setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE);
> >  
> > +    /* Disable all ASI options by default until feature is finished. */
> > +    if ( opt_vcpu_pt_pv == -1 )
> > +        opt_vcpu_pt_pv = 0;
> > +    if ( opt_vcpu_pt_hwdom == -1 )
> > +        opt_vcpu_pt_hwdom = 0;
> > +    if ( opt_vcpu_pt_hvm == -1 )
> > +        opt_vcpu_pt_hvm = 0;
> 
> Why not preinitialise them to zero instead in the static declarations?

Hm, indeed.  I can probably make them booleans then.  I mistakenly
recalled that checking whether they had been initialized was needed
somewhere, but that doesn't seem to be the case.
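
The trade-off being discussed can be sketched as follows.  This is an
illustration only: model_opt_tristate, model_opt_bool and
model_resolve_tristate() are hypothetical names, not the variables or
helpers used in spec_ctrl.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option storage; names are illustrative, not Xen's. */
static int8_t model_opt_tristate = -1; /* -1: not set on the command line */
static bool model_opt_bool = false;    /* default baked into the initialiser */

/*
 * A tristate is only worth having when the default must be computed at
 * boot (for example from hardware features): -1 then means "use the
 * computed default", while 0 and 1 are explicit user choices.
 */
static bool model_resolve_tristate(int8_t opt, bool computed_default)
{
    assert(opt >= -1 && opt <= 1);

    return opt == -1 ? computed_default : opt == 1;
}
```

When the default is a constant (here: disabled), resolving -1 to 0 at
boot gives the same result as a bool that starts out false, which is
why the fixup block in init_speculation_mitigations() can go away.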

Thanks, Roger.



* Re: [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-09 15:08   ` Alejandro Vallejo
@ 2025-01-10 15:02     ` Roger Pau Monné
  2025-01-10 16:12       ` Alejandro Vallejo
  2025-01-10 16:19       ` Alejandro Vallejo
  0 siblings, 2 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 15:02 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Thu, Jan 09, 2025 at 03:08:15PM +0000, Alejandro Vallejo wrote:
> On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > When using a unique per-vCPU root page table the per-domain region becomes
> > per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
> > domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
> > to create per-vCPU mappings when possible.  Note the lock is also not needed
> > with using per-vCPU map caches, as the structure is no longer shared.
> >
> > This introduces some duplication in the domain and vcpu structures, as both
> > contain a mapcache field to support running with and without per-vCPU
> > page-tables.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
> >  xen/arch/x86/include/asm/domain.h | 20 ++++---
> >  2 files changed, 71 insertions(+), 39 deletions(-)
> >
> > diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> > index 1372be20224e..65900d6218f8 100644
> > --- a/xen/arch/x86/domain_page.c
> > +++ b/xen/arch/x86/domain_page.c
> > @@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
> >      struct vcpu *v;
> >      struct mapcache_domain *dcache;
> >      struct mapcache_vcpu *vcache;
> > +    struct mapcache *cache;
> >      struct vcpu_maphash_entry *hashent;
> > +    struct domain *d;
> >  
> >  #ifdef NDEBUG
> >      if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> > @@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
> >      if ( !v || !is_pv_vcpu(v) )
> >          return mfn_to_virt(mfn_x(mfn));
> >  
> > -    dcache = &v->domain->arch.pv.mapcache;
> > +    d = v->domain;
> > +    dcache = &d->arch.pv.mapcache;
> >      vcache = &v->arch.pv.mapcache;
> > -    if ( !dcache->inuse )
> > +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> > +                            : &d->arch.pv.mapcache.cache;
> > +    if ( !cache->inuse )
> >          return mfn_to_virt(mfn_x(mfn));
> >  
> >      perfc_incr(map_domain_page_count);
> > @@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
> >      if ( hashent->mfn == mfn_x(mfn) )
> >      {
> >          idx = hashent->idx;
> > -        ASSERT(idx < dcache->entries);
> > +        ASSERT(idx < cache->entries);
> >          hashent->refcnt++;
> >          ASSERT(hashent->refcnt);
> >          ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
> >          goto out;
> >      }
> >  
> > -    spin_lock(&dcache->lock);
> > +    if ( !d->arch.vcpu_pt )
> > +        spin_lock(&dcache->lock);
> 
> Hmmm. I wonder whether we might not want a nospec here...

Not sure TBH, we have other instances of conditional locking that
don't use nospec().  That said I'm not claiming those are correct.
Shouldn't people that care about this kind of speculation into
critical regions just use CONFIG_SPECULATIVE_HARDEN_LOCK?
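
For context, the shape of the conditional locking under discussion can
be sketched like this.  It is a simplified model under stated
assumptions: the model_* types and helpers are hypothetical, the lock
is a plain flag with an acquisition counter rather than a spinlock, and
the sketch says nothing about speculative execution, which is the
actual concern raised above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the mapcache structures. */
struct model_cache { unsigned int inuse; };
struct model_lock { bool held; unsigned int acquisitions; };

struct model_domain {
    bool vcpu_pt;              /* per-vCPU page-tables in use? */
    struct model_lock lock;    /* protects the shared, per-domain cache */
    struct model_cache cache;
};

struct model_vcpu {
    struct model_domain *d;
    struct model_cache cache;  /* private to this vCPU */
};

static void model_lock_acquire(struct model_lock *l)
{
    assert(!l->held);
    l->held = true;
    l->acquisitions++;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Pick the cache for this vCPU and account one mapping in it. */
static struct model_cache *model_map(struct model_vcpu *v)
{
    struct model_domain *d = v->d;
    struct model_cache *cache = d->vcpu_pt ? &v->cache : &d->cache;

    /* Only the shared cache needs serialising against other vCPUs. */
    if ( !d->vcpu_pt )
        model_lock_acquire(&d->lock);

    cache->inuse++;

    if ( !d->vcpu_pt )
        model_lock_release(&d->lock);

    return cache;
}
```

The lock is taken once for the shared cache and not at all once
vcpu_pt is set, matching the commit message's point that the structure
is no longer shared in that mode.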

Thanks, Roger.



* Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
  2025-01-10 14:44     ` Roger Pau Monné
@ 2025-01-10 15:36       ` Alejandro Vallejo
  0 siblings, 0 replies; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-10 15:36 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri Jan 10, 2025 at 2:44 PM GMT, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 02:34:05PM +0000, Alejandro Vallejo wrote:
> > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
> > > table(s) that contain such mappings being stashed in the domain structure, and
> > > thus such mappings being modified by merely updating the required L1 entries.
> > >
> > > Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, as
> > > that logic is always called while the vCPU is running on the current pCPU.
> > >
> > > For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
> > > running on the pCPU, otherwise use destroy_mappings().
> > >
> > > Note this requires keeping an array with the pages currently mapped at the LDT
> > > area, as that allows dropping the extra taken page reference when removing the
> > > mappings.
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/include/asm/domain.h   |  2 ++
> > >  xen/arch/x86/pv/descriptor-tables.c | 19 ++++++++++---------
> > >  xen/arch/x86/pv/domain.c            |  4 ++++
> > >  xen/arch/x86/pv/mm.c                |  3 ++-
> > >  4 files changed, 18 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
> > > index b79d6badd71c..b659cffc7f81 100644
> > > --- a/xen/arch/x86/include/asm/domain.h
> > > +++ b/xen/arch/x86/include/asm/domain.h
> > > @@ -523,6 +523,8 @@ struct pv_vcpu
> > >      struct trap_info *trap_ctxt;
> > >  
> > >      unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE];
> > > +    /* Max LDT entries is 8192, so 8192 * 8 = 64KiB (16 pages). */
> > > +    mfn_t ldt_frames[16];
> > >      unsigned long ldt_base;
> > >      unsigned int gdt_ents, ldt_ents;
> > >  
> > > diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c
> > > index 5a79f022ce13..95b598a4c0cf 100644
> > > --- a/xen/arch/x86/pv/descriptor-tables.c
> > > +++ b/xen/arch/x86/pv/descriptor-tables.c
> > > @@ -20,28 +20,29 @@
> > >   */
> > >  bool pv_destroy_ldt(struct vcpu *v)
> > >  {
> > > -    l1_pgentry_t *pl1e;
> > > +    const unsigned int nr_frames = ARRAY_SIZE(v->arch.pv.ldt_frames);
> > >      unsigned int i, mappings_dropped = 0;
> > > -    struct page_info *page;
> > >  
> > >      ASSERT(!in_irq());
> > >  
> > >      ASSERT(v == current || !vcpu_cpu_dirty(v));
> > >  
> > > -    pl1e = pv_ldt_ptes(v);
> > > +    destroy_perdomain_mapping(v, LDT_VIRT_START(v), nr_frames);
> > >  
> > > -    for ( i = 0; i < 16; i++ )
> > > +    for ( i = 0; i < nr_frames; i++ )
> > 
> > nit: While at this, can the "unsigned int" be moved here too?
>
> I don't mind much, but I also don't usually do such changes as I think
> it adds more noise.

Fair enough, nvm then.

>
> > >      {
> > > -        if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
> > > -            continue;
> > > +        mfn_t mfn = v->arch.pv.ldt_frames[i];
> > > +        struct page_info *page;
> > >  
> > > -        page = l1e_get_page(pl1e[i]);
> > > -        l1e_write(&pl1e[i], l1e_empty());
> > > -        mappings_dropped++;
> > > +        if ( mfn_eq(mfn, INVALID_MFN) )
> > > +            continue;
> > 
> > Can it really be disjoint? As in, why "continue" and not "break"?. Not that it
> > matters in the slightest, and I prefer this form; but I'm curious.
>
> I think so?  The PV guest LDT is populated as a result of page-faults,
> so if the guest only happens to use segment descriptors that are on
> the third page, the second page might not be mapped?
>
> The continue was there already, and I really didn't dare to change
> it, nor did I question it much.  I assumed that, because the guest
> LDT is mapped on a page-fault basis, it could indeed be disjointly
> mapped.

Ah, I see. That makes sense then. I wouldn't suggest changing it either, I
was just curious :)

>
> > >  
> > > +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> > > +        page = mfn_to_page(mfn);
> > >          ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page);
> > >          ASSERT_PAGE_IS_DOMAIN(page, v->domain);
> > >          put_page_and_type(page);
> > > +        mappings_dropped++;
> > >      }
> > >  
> > >      return mappings_dropped;
> > > diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
> > > index 7e8bffaae9a0..32d7488cc186 100644
> > > --- a/xen/arch/x86/pv/domain.c
> > > +++ b/xen/arch/x86/pv/domain.c
> > > @@ -303,6 +303,7 @@ void pv_vcpu_destroy(struct vcpu *v)
> > >  int pv_vcpu_initialise(struct vcpu *v)
> > >  {
> > >      struct domain *d = v->domain;
> > > +    unsigned int i;
> > >      int rc;
> > >  
> > >      ASSERT(!is_idle_domain(d));
> > > @@ -311,6 +312,9 @@ int pv_vcpu_initialise(struct vcpu *v)
> > >      if ( rc )
> > >          return rc;
> > >  
> > > +    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.ldt_frames); i++ )
> > > +        v->arch.pv.ldt_frames[i] = INVALID_MFN;
> > > +
> > 
> > I think it makes more sense to move this earlier so ldt_frames[] is initialised
> > even if pv_vcpu_initialise() fails. It may be benign, but it looks like an
> > accident about to happen.
>
> Right, pv_destroy_gdt_ldt_l1tab() doesn't care at all about the
> contents of ldt_frames[], but it will be safe to change the
> ordering in pv_vcpu_initialise().
>
> Thanks, Roger.

Cheers,
Alejandro



* Re: [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
  2025-01-10 14:29     ` Roger Pau Monné
@ 2025-01-10 15:50       ` Alejandro Vallejo
  0 siblings, 0 replies; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-10 15:50 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri Jan 10, 2025 at 2:29 PM GMT, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 09:55:44AM +0000, Alejandro Vallejo wrote:
> > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > The current code to update the Xen part of the GDT when running a PV guest
> > > relies on caching the direct map address of all the L1 tables used to map the
> > > GDT and LDT, so that entries can be modified.
> > >
> > > Introduce a new function that populates the per-domain region, either using the
> > > recursive linear mappings when the target vCPU is the current one, or by
> > > directly modifying the L1 table of the per-domain region.
> > >
> > > Using such function to populate per-domain addresses drops the need to keep a
> > > reference to per-domain L1 tables previously used to change the per-domain
> > > mappings.
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/domain.c                | 11 +++-
> > >  xen/arch/x86/include/asm/desc.h      |  6 +-
> > >  xen/arch/x86/include/asm/mm.h        |  2 +
> > >  xen/arch/x86/include/asm/processor.h |  5 ++
> > >  xen/arch/x86/mm.c                    | 88 ++++++++++++++++++++++++++++
> > >  xen/arch/x86/smpboot.c               |  6 +-
> > >  xen/arch/x86/traps.c                 | 10 ++--
> > >  7 files changed, 113 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> > > index 1f680bf176ee..0bd0ef7e40f4 100644
> > > --- a/xen/arch/x86/domain.c
> > > +++ b/xen/arch/x86/domain.c
> > > @@ -1953,9 +1953,14 @@ static always_inline bool need_full_gdt(const struct domain *d)
> > >  
> > >  static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu)
> > >  {
> > > -    l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE,
> > > -              !is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu)
> > > -                                   : per_cpu(compat_gdt_l1e, cpu));
> > > +    ASSERT(v != current);
> > 
> > For this assert, and others below. IIUC, curr_vcpu == current when we're
> > properly switched. When we're idling current == idle and curr_vcpu == prev_ctx.
> > 
> > Granted, calling this in the middle of a lazy idle loop would be weird, but
> > would it make sense for PT consistency to use curr_vcpu here...
>
> Hm, this function is called in a very specific context, and the assert
> intends to reflect that.  TBH I could just drop it, as
> populate_perdomain_mapping() will DTRT also when v == current. The
> expectation for the context is also that current == curr_vcpu.
>
> Note however that if v == current we would need a flush after the
> populate_perdomain_mapping() call, since populate_perdomain_mapping()
> doesn't perform any flushing of the modified entries.  The main
> purpose of the ASSERT() is to notice this.
>
> > > +
> > > +    populate_perdomain_mapping(v,
> > > +                               GDT_VIRT_START(v) +
> > > +                               (FIRST_RESERVED_GDT_PAGE << PAGE_SHIFT),
> > > +                               !is_pv_32bit_vcpu(v) ? &per_cpu(gdt_mfn, cpu)
> > > +                                                    : &per_cpu(compat_gdt_mfn,
> > > +                                                               cpu), 1);
> > >  }
> > >  
> > >  static void load_full_gdt(const struct vcpu *v, unsigned int cpu)
> > > diff --git a/xen/arch/x86/include/asm/desc.h b/xen/arch/x86/include/asm/desc.h
> > > index a1e0807d97ed..33981bfca588 100644
> > > --- a/xen/arch/x86/include/asm/desc.h
> > > +++ b/xen/arch/x86/include/asm/desc.h
> > > @@ -44,6 +44,8 @@
> > >  
> > >  #ifndef __ASSEMBLY__
> > >  
> > > +#include <xen/mm-frame.h>
> > > +
> > >  #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3)
> > >  
> > >  /* Fix up the RPL of a guest segment selector. */
> > > @@ -212,10 +214,10 @@ struct __packed desc_ptr {
> > >  
> > >  extern seg_desc_t boot_gdt[];
> > >  DECLARE_PER_CPU(seg_desc_t *, gdt);
> > > -DECLARE_PER_CPU(l1_pgentry_t, gdt_l1e);
> > > +DECLARE_PER_CPU(mfn_t, gdt_mfn);
> > >  extern seg_desc_t boot_compat_gdt[];
> > >  DECLARE_PER_CPU(seg_desc_t *, compat_gdt);
> > > -DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_l1e);
> > > +DECLARE_PER_CPU(mfn_t, compat_gdt_mfn);
> > >  DECLARE_PER_CPU(bool, full_gdt_loaded);
> > >  
> > >  static inline void lgdt(const struct desc_ptr *gdtr)
> > > diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h
> > > index 6c7e66ee21ab..b50a51327b2b 100644
> > > --- a/xen/arch/x86/include/asm/mm.h
> > > +++ b/xen/arch/x86/include/asm/mm.h
> > > @@ -603,6 +603,8 @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg);
> > >  int create_perdomain_mapping(struct domain *d, unsigned long va,
> > >                               unsigned int nr, l1_pgentry_t **pl1tab,
> > >                               struct page_info **ppg);
> > > +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> > > +                                mfn_t *mfn, unsigned long nr);
> > >  void destroy_perdomain_mapping(struct domain *d, unsigned long va,
> > >                                 unsigned int nr);
> > >  void free_perdomain_mappings(struct domain *d);
> > > diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
> > > index d247ef8dd226..82ee89f736c2 100644
> > > --- a/xen/arch/x86/include/asm/processor.h
> > > +++ b/xen/arch/x86/include/asm/processor.h
> > > @@ -243,6 +243,11 @@ static inline unsigned long cr3_pa(unsigned long cr3)
> > >      return cr3 & X86_CR3_ADDR_MASK;
> > >  }
> > >  
> > > +static inline mfn_t cr3_mfn(unsigned long cr3)
> > > +{
> > > +    return maddr_to_mfn(cr3_pa(cr3));
> > > +}
> > > +
> > >  static inline unsigned int cr3_pcid(unsigned long cr3)
> > >  {
> > >      return IS_ENABLED(CONFIG_PV) ? cr3 & X86_CR3_PCID_MASK : 0;
> > > diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> > > index 3d5dd22b6c36..0abea792486c 100644
> > > --- a/xen/arch/x86/mm.c
> > > +++ b/xen/arch/x86/mm.c
> > > @@ -6423,6 +6423,94 @@ int create_perdomain_mapping(struct domain *d, unsigned long va,
> > >      return rc;
> > >  }
> > >  
> > > +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va,
> > > +                                mfn_t *mfn, unsigned long nr)
> > > +{
> > > +    l1_pgentry_t *l1tab = NULL, *pl1e;
> > > +    const l3_pgentry_t *l3tab;
> > > +    const l2_pgentry_t *l2tab;
> > > +    struct domain *d = v->domain;
> > > +
> > > +    ASSERT(va >= PERDOMAIN_VIRT_START &&
> > > +           va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS));
> > > +    ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1)));
> > > +
> > > +    /* Use likely to force the optimization for the fast path. */
> > > +    if ( likely(v == current) )
> > 
> > ... and here? In particular I'd expect using curr_vcpu here means...
>
> I'm afraid not, this is a trap I've fallen originally when doing this
> series, as I indeed had v == curr_vcpu here (and no
> sync_local_execstate() call).
>
> However as a result of an interrupt, a call to sync_local_execstate()
> might happen, at which point the previous check of v == curr_vcpu
> becomes stale.

Wow, that's nasty! More than fair enough then. Guess the XSAVE wrappers (and
more generally all vCPU-local memory accessors) will have to take this into
account before poking into the contents of the perdomain region.

>
> > > +    {
> > > +        unsigned int i;
> > > +
> > > +        /* Ensure page-tables are from current (if current != curr_vcpu). */
> > > +        sync_local_execstate();
> > 
> > ... this should not be needed.
>
> As kind of mentioned above, this is required to ensure the page-tables
> are in-sync with the vCPU in current, and cannot change as a result of
> an interrupt triggering a call to sync_local_execstate().
>
> Otherwise the page-tables could change while or after the call to
> populate_perdomain_mapping(), and the mappings could end up being
> created on the wrong page-tables.
>
> Thanks, Roger.

Cheers,
Alejandro



* Re: [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option
  2025-01-10 14:55     ` Roger Pau Monné
@ 2025-01-10 15:51       ` Alejandro Vallejo
  0 siblings, 0 replies; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-10 15:51 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

On Fri Jan 10, 2025 at 2:55 PM GMT, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 02:58:29PM +0000, Alejandro Vallejo wrote:
> > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > No functional change, as the option is not used.
> > >
> > > Introduced now so that newly added functionality is keyed on the option being
> > > enabled, even if the feature is non-functional.
> > >
> > > When ASI is enabled for PV domains, printing the usage of XPTI might be
> > > omitted if it must be uniformly disabled given the usage of ASI.
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > ---
> > > Changes since v1:
> > >  - Improve comments and documentation about what ASI provides.
> > >  - Do not print the XPTI information if ASI is used for pv domUs and dom0 is
> > >    PVH, or if ASI is used for both domU and dom0.
> > >
> > > FWIW, I would print the state of XPTI uniformly, as otherwise I find the output
> > > might be confusing for users expecting to assert the state of XPTI.
> > > ---
> > >  docs/misc/xen-command-line.pandoc    |  19 +++++
> > >  xen/arch/x86/include/asm/domain.h    |   3 +
> > >  xen/arch/x86/include/asm/spec_ctrl.h |   2 +
> > >  xen/arch/x86/spec_ctrl.c             | 115 +++++++++++++++++++++++++--
> > >  4 files changed, 133 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> > > index 08b0053f9ced..3c1ad7b5fe7d 100644
> > > --- a/docs/misc/xen-command-line.pandoc
> > > +++ b/docs/misc/xen-command-line.pandoc
> > > @@ -202,6 +202,25 @@ to appropriate auditing by Xen.  Argo is disabled by default.
> > >      This option is disabled by default, to protect domains from a DoS by a
> > >      buggy or malicious other domain spamming the ring.
> > >  
> > > +### asi (x86)
> > > +> `= List of [ <bool>, {pv,hvm}=<bool>,
> > > +               {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]`
> > 
> > nit: While this grows later, the braces around vcpu-pt aren't strictly needed here.
>
> Since I have to modify the whole line I can indeed add the braces
> later.
>
> > > +
> > > +Offers control over whether the hypervisor will engage in Address Space
> > > +Isolation, by not having potentially sensitive information permanently mapped
> > > +in the VMM page-tables.  Using this option might avoid the need to apply
> > > +mitigations for certain speculative related attacks, at the cost of mapping
> > > +sensitive information on-demand.
> > 
> > Might be worth mentioning that this provides some defense in depth against
> > unmitigated attacks too.
>
> It's IMO a bit too vague to make such promises, but I can add:
>
> Offers control over whether the hypervisor will engage in Address Space
> Isolation, by not having potentially sensitive information permanently mapped
> in the VMM page-tables.  Using this option might avoid the need to apply
> mitigations for certain speculative related attacks, at the cost of mapping
> sensitive information on-demand.  It might also offer some protection
> against unmitigated speculation-related attacks.

SGTM

Cheers,
Alejandro



* Re: [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-10 15:02     ` Roger Pau Monné
@ 2025-01-10 16:12       ` Alejandro Vallejo
  2025-01-10 16:19       ` Alejandro Vallejo
  1 sibling, 0 replies; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-10 16:12 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri Jan 10, 2025 at 3:02 PM GMT, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 03:08:15PM +0000, Alejandro Vallejo wrote:
> > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > When using a unique per-vCPU root page table the per-domain region becomes
> > > per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
> > > domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
> > > to create per-vCPU mappings when possible.  Note the lock is also not needed
> > > when using per-vCPU map caches, as the structure is no longer shared.
> > >
> > > This introduces some duplication in the domain and vcpu structures, as both
> > > contain a mapcache field to support running with and without per-vCPU
> > > page-tables.
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
> > >  xen/arch/x86/include/asm/domain.h | 20 ++++---
> > >  2 files changed, 71 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> > > index 1372be20224e..65900d6218f8 100644
> > > --- a/xen/arch/x86/domain_page.c
> > > +++ b/xen/arch/x86/domain_page.c
> > > @@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
> > >      struct vcpu *v;
> > >      struct mapcache_domain *dcache;
> > >      struct mapcache_vcpu *vcache;
> > > +    struct mapcache *cache;
> > >      struct vcpu_maphash_entry *hashent;
> > > +    struct domain *d;
> > >  
> > >  #ifdef NDEBUG
> > >      if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> > > @@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
> > >      if ( !v || !is_pv_vcpu(v) )
> > >          return mfn_to_virt(mfn_x(mfn));
> > >  
> > > -    dcache = &v->domain->arch.pv.mapcache;
> > > +    d = v->domain;
> > > +    dcache = &d->arch.pv.mapcache;
> > >      vcache = &v->arch.pv.mapcache;
> > > -    if ( !dcache->inuse )
> > > +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> > > +                            : &d->arch.pv.mapcache.cache;
> > > +    if ( !cache->inuse )
> > >          return mfn_to_virt(mfn_x(mfn));
> > >  
> > >      perfc_incr(map_domain_page_count);
> > > @@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
> > >      if ( hashent->mfn == mfn_x(mfn) )
> > >      {
> > >          idx = hashent->idx;
> > > -        ASSERT(idx < dcache->entries);
> > > +        ASSERT(idx < cache->entries);
> > >          hashent->refcnt++;
> > >          ASSERT(hashent->refcnt);
> > >          ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
> > >          goto out;
> > >      }
> > >  
> > > -    spin_lock(&dcache->lock);
> > > +    if ( !d->arch.vcpu_pt )
> > > +        spin_lock(&dcache->lock);
> > 
> > Hmmm. I wonder whether we might not want a nospec here...
>
> Not sure TBH, we have other instances of conditional locking that
> don't use nospec().  That said I'm not claiming those are correct.
> Shouldn't people that care about this kind of speculation into
> critical regions just use CONFIG_SPECULATIVE_HARDEN_LOCK?

Do people that care have a choice though? CONFIG_SPECULATIVE_HARDEN_LOCK only
blocks speculation in the taken branch here, so the critical region isn't
hardened when the relaxed branch is followed.

I suspect nospec in the condition would be fine perf-wise because the CPU can
still do straight-line-speculation on the underlying function call when
CONFIG_SPECULATIVE_HARDEN_LOCK is not defined.
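
To make the trade-off concrete, a toy model that just counts barriers. This is
a sketch under assumptions: toy_fence(), toy_spin_lock() and toy_nospec() are
hypothetical stand-ins for an lfence, a hardened spin_lock() and a
nospec-style condition wrapper, and don't reproduce the Xen implementations.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock and barrier counter; hypothetical, for illustration only. */
struct toy_lock { bool held; };
static unsigned int fences;

/* Stand-in for an lfence-style speculation barrier. */
void toy_fence(void)
{
    fences++;
}

/* Models a hardened spin_lock(): acquire, then fence. */
void toy_spin_lock(struct toy_lock *l)
{
    assert(!l->held);
    l->held = true;
    toy_fence();
}

void toy_spin_unlock(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Models a nospec-style wrapper: fences whichever way the condition goes. */
bool toy_nospec(bool cond)
{
    toy_fence();
    return cond;
}

/* Pattern from the patch: the lockless branch issues no barrier at all. */
void enter_plain(bool shared, struct toy_lock *l)
{
    if ( shared )
        toy_spin_lock(l);
}

/* nospec in the condition: both branches fenced, the locked one twice. */
void enter_nospec(bool shared, struct toy_lock *l)
{
    if ( toy_nospec(shared) )
        toy_spin_lock(l);
}
```

In this model enter_plain() issues zero barriers on the lockless path, while
enter_nospec() covers it at the price of a second barrier whenever the lock is
actually taken.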

It's not the end of the world either way.

>
> Thanks, Roger.

Cheers,
Alejandro



* Re: [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-10 15:02     ` Roger Pau Monné
  2025-01-10 16:12       ` Alejandro Vallejo
@ 2025-01-10 16:19       ` Alejandro Vallejo
  2025-01-10 18:43         ` Roger Pau Monné
  1 sibling, 1 reply; 55+ messages in thread
From: Alejandro Vallejo @ 2025-01-10 16:19 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri Jan 10, 2025 at 3:02 PM GMT, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 03:08:15PM +0000, Alejandro Vallejo wrote:
> > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > When using a unique per-vCPU root page table the per-domain region becomes
> > > per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
> > > domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
> > > to create per-vCPU mappings when possible.  Note the lock is also not needed
> > > when using per-vCPU map caches, as the structure is no longer shared.
> > >
> > > This introduces some duplication in the domain and vcpu structures, as both
> > > contain a mapcache field to support running with and without per-vCPU
> > > page-tables.
> > >
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > ---
> > >  xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
> > >  xen/arch/x86/include/asm/domain.h | 20 ++++---
> > >  2 files changed, 71 insertions(+), 39 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> > > index 1372be20224e..65900d6218f8 100644
> > > --- a/xen/arch/x86/domain_page.c
> > > +++ b/xen/arch/x86/domain_page.c
> > > @@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
> > >      struct vcpu *v;
> > >      struct mapcache_domain *dcache;
> > >      struct mapcache_vcpu *vcache;
> > > +    struct mapcache *cache;
> > >      struct vcpu_maphash_entry *hashent;
> > > +    struct domain *d;
> > >  
> > >  #ifdef NDEBUG
> > >      if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> > > @@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
> > >      if ( !v || !is_pv_vcpu(v) )
> > >          return mfn_to_virt(mfn_x(mfn));
> > >  
> > > -    dcache = &v->domain->arch.pv.mapcache;
> > > +    d = v->domain;
> > > +    dcache = &d->arch.pv.mapcache;
> > >      vcache = &v->arch.pv.mapcache;
> > > -    if ( !dcache->inuse )
> > > +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> > > +                            : &d->arch.pv.mapcache.cache;
> > > +    if ( !cache->inuse )
> > >          return mfn_to_virt(mfn_x(mfn));
> > >  
> > >      perfc_incr(map_domain_page_count);
> > > @@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
> > >      if ( hashent->mfn == mfn_x(mfn) )
> > >      {
> > >          idx = hashent->idx;
> > > -        ASSERT(idx < dcache->entries);
> > > +        ASSERT(idx < cache->entries);
> > >          hashent->refcnt++;
> > >          ASSERT(hashent->refcnt);
> > >          ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
> > >          goto out;
> > >      }
> > >  
> > > -    spin_lock(&dcache->lock);
> > > +    if ( !d->arch.vcpu_pt )
> > > +        spin_lock(&dcache->lock);
> > 
> > Hmmm. I wonder whether we might not want a nospec here...
>
> Not sure TBH, we have other instances of conditional locking that
> don't use nospec().  That said I'm not claiming those are correct.
> Shouldn't people that care about this kind of speculation into
> critical regions just use CONFIG_SPECULATIVE_HARDEN_LOCK?
>
> Thanks, Roger.

Actually, to avoid the double lfence, I think this would work too while
avoiding the lfence unconditionally when CONFIG_SPECULATIVE_HARDEN_LOCK is not
set.

    if ( !d->arch.vcpu_pt )
        spin_lock(&dcache->lock);
    else
        block_lock_speculation();

Cheers,
Alejandro



* Re: [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI
  2025-01-10 16:19       ` Alejandro Vallejo
@ 2025-01-10 18:43         ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-10 18:43 UTC (permalink / raw)
  To: Alejandro Vallejo; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Fri, Jan 10, 2025 at 04:19:03PM +0000, Alejandro Vallejo wrote:
> On Fri Jan 10, 2025 at 3:02 PM GMT, Roger Pau Monné wrote:
> > On Thu, Jan 09, 2025 at 03:08:15PM +0000, Alejandro Vallejo wrote:
> > > On Wed Jan 8, 2025 at 2:26 PM GMT, Roger Pau Monne wrote:
> > > > When using a unique per-vCPU root page table the per-domain region becomes
> > > > per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a
> > > > domain.  Introduce per-vCPU mapcache structures, and modify map_domain_page()
> > > > to create per-vCPU mappings when possible.  Note the lock is also not needed
> > > > when using per-vCPU map caches, as the structure is no longer shared.
> > > >
> > > > This introduces some duplication in the domain and vcpu structures, as both
> > > > contain a mapcache field to support running with and without per-vCPU
> > > > page-tables.
> > > >
> > > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > > ---
> > > >  xen/arch/x86/domain_page.c        | 90 ++++++++++++++++++++-----------
> > > >  xen/arch/x86/include/asm/domain.h | 20 ++++---
> > > >  2 files changed, 71 insertions(+), 39 deletions(-)
> > > >
> > > > diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
> > > > index 1372be20224e..65900d6218f8 100644
> > > > --- a/xen/arch/x86/domain_page.c
> > > > +++ b/xen/arch/x86/domain_page.c
> > > > @@ -74,7 +74,9 @@ void *map_domain_page(mfn_t mfn)
> > > >      struct vcpu *v;
> > > >      struct mapcache_domain *dcache;
> > > >      struct mapcache_vcpu *vcache;
> > > > +    struct mapcache *cache;
> > > >      struct vcpu_maphash_entry *hashent;
> > > > +    struct domain *d;
> > > >  
> > > >  #ifdef NDEBUG
> > > >      if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) )
> > > > @@ -85,9 +87,12 @@ void *map_domain_page(mfn_t mfn)
> > > >      if ( !v || !is_pv_vcpu(v) )
> > > >          return mfn_to_virt(mfn_x(mfn));
> > > >  
> > > > -    dcache = &v->domain->arch.pv.mapcache;
> > > > +    d = v->domain;
> > > > +    dcache = &d->arch.pv.mapcache;
> > > >      vcache = &v->arch.pv.mapcache;
> > > > -    if ( !dcache->inuse )
> > > > +    cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache
> > > > +                            : &d->arch.pv.mapcache.cache;
> > > > +    if ( !cache->inuse )
> > > >          return mfn_to_virt(mfn_x(mfn));
> > > >  
> > > >      perfc_incr(map_domain_page_count);
> > > > @@ -98,17 +103,18 @@ void *map_domain_page(mfn_t mfn)
> > > >      if ( hashent->mfn == mfn_x(mfn) )
> > > >      {
> > > >          idx = hashent->idx;
> > > > -        ASSERT(idx < dcache->entries);
> > > > +        ASSERT(idx < cache->entries);
> > > >          hashent->refcnt++;
> > > >          ASSERT(hashent->refcnt);
> > > >          ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn));
> > > >          goto out;
> > > >      }
> > > >  
> > > > -    spin_lock(&dcache->lock);
> > > > +    if ( !d->arch.vcpu_pt )
> > > > +        spin_lock(&dcache->lock);
> > > 
> > > Hmmm. I wonder whether we might not want a nospec here...
> >
> > Not sure TBH, we have other instances of conditional locking that
> > > don't use nospec().  That said I'm not claiming those are correct.
> > Shouldn't people that care about this kind of speculation into
> > critical regions just use CONFIG_SPECULATIVE_HARDEN_LOCK?
> >
> > Thanks, Roger.
> 
> Actually, to avoid the double lfence, I think this would work too while
> avoiding the lfence unconditionally when CONFIG_SPECULATIVE_HARDEN_LOCK is not
> set.
> 
>     if ( !d->arch.vcpu_pt )
>         spin_lock(&dcache->lock);
>     else
>         block_lock_speculation();

We have a spin_lock_if() helper to do that.  I will use it here.
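
For reference, the shape I have in mind is roughly the following.  This is a
self-contained toy, not the actual helper: toy_spin_lock_if(),
toy_block_lock_speculation() and the harden_locks flag (standing in for
CONFIG_SPECULATIVE_HARDEN_LOCK) are hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock and barrier counter; hypothetical, for illustration only. */
struct toy_lock { bool held; };
static unsigned int fences;

/* Stands in for CONFIG_SPECULATIVE_HARDEN_LOCK being enabled or not. */
static bool harden_locks = true;

/* Barrier that is only emitted when lock hardening is enabled. */
void toy_block_lock_speculation(void)
{
    if ( harden_locks )
        fences++;
}

/*
 * Conditionally take the lock, but issue the barrier on both paths, so the
 * lockless path is covered without doubling the barrier on the locked path.
 */
void toy_spin_lock_if(bool cond, struct toy_lock *l)
{
    if ( cond )
    {
        assert(!l->held);
        l->held = true;
    }

    toy_block_lock_speculation();
}

void toy_spin_unlock(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}
```

Either branch ends up with exactly one barrier when hardening is on, and none
when it's off, which matches what the if/else above was trying to achieve.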

Thanks, Roger.



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-09 17:33     ` Roger Pau Monné
@ 2025-01-14 15:02       ` Jan Beulich
  2025-01-17 14:57         ` Roger Pau Monné
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-01-14 15:02 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 09.01.2025 18:33, Roger Pau Monné wrote:
> On Thu, Jan 09, 2025 at 09:59:58AM +0100, Jan Beulich wrote:
>> On 08.01.2025 15:26, Roger Pau Monne wrote:
>>> @@ -2048,8 +2060,6 @@ static void __context_switch(void)
>>>      if ( pd != nd )
>>>          cpumask_clear_cpu(cpu, pd->dirty_cpumask);
>>>      write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN);
>>> -
>>> -    per_cpu(curr_vcpu, cpu) = n;
>>>  }
>>>  
>>>  void context_switch(struct vcpu *prev, struct vcpu *next)
>>> @@ -2081,16 +2091,36 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>  
>>>      local_irq_disable();
>>>  
>>> -    set_current(next);
>>> -
>>>      if ( (per_cpu(curr_vcpu, cpu) == next) ||
>>>           (is_idle_domain(nextd) && cpu_online(cpu)) )
>>>      {
>>> +        /*
>>> +         * Lazy context switch to the idle vCPU, set current == idle.  Full
>>> +         * context switch happens if/when sync_local_execstate() is called.
>>> +         */
>>> +        set_current(next);
>>>          local_irq_enable();
>>
>> The comment is misleading as far as the first half of the if() condition goes:
>> No further switching is going to happen in that case, aiui.
> 
> Right, I should clarify that comment: this is either a lazy context
> switch, or the return from a lazy state to the previously running
> vCPU.
> 
>>>      }
>>>      else
>>>      {
>>> -        __context_switch();
>>> +        /*
>>> +         * curr_vcpu will always point to the currently loaded vCPU context, as
>>> +         * it's not updated when doing a lazy switch to the idle vCPU.
>>> +         */
>>> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
>>> +
>>> +        if ( prev_ctx != current )
>>> +        {
>>> +            /*
>>> +             * Doing a full context switch to a non-idle vCPU from a lazy
>>> +             * context switched state.  Adjust current to point to the
>>> +             * currently loaded vCPU context.
>>> +             */
>>> +            ASSERT(current == idle_vcpu[cpu]);
>>> +            ASSERT(!is_idle_vcpu(next));
>>> +            set_current(prev_ctx);
>>
>> This feels wrong, as in "current" then not representing what it should represent,
>> for a certain time window. I may be dense, but neither comment nor description
>> clarify to me why this might be needed. I can see that it's needed to please the
>> ASSERT() you add to __context_switch(), yet then I might ask why that assertion
>> is put there.
> 
> This is done so that when calling __context_switch() current ==
> curr_vcpu, and map_domain_page() can be used without getting into an
> infinite sync_local_execstate() recursion loop.

Yet it's the purpose of __context_switch() to bring curr_vcpu in sync
with current. IOW both matching up is supposed to be an exit condition
of the function, not an entry one.

Plus, as indicated when we were talking this through yesterday, the
set_current() here makes "current" no longer point at what - from the
scheduler's perspective - is (supposed to be) the current vCPU.

Aiui this adjustment is the reason for ...

>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
>>>  
>>>  void activate_debugregs(const struct vcpu *curr)
>>>  {
>>> -    ASSERT(curr == current);
>>> -
>>>      write_debugreg(0, curr->arch.dr[0]);
>>>      write_debugreg(1, curr->arch.dr[1]);
>>>      write_debugreg(2, curr->arch.dr[2]);
>>
>> Why would this assertion go away? If it suddenly triggers, the parameter name
>> would now end up being wrong.
> 
> Well, at the point where activate_debugregs() gets called (in
> paravirt_ctxt_switch_to()), current == previous as a result of this
> change, so the assert is no longer true on purpose on that call
> path.

... this behavior. Which, as said, feels wrong, at the latest when "curr" was
renamed to no longer suggest it actually is cached "current". At that point
it'll be dubious whose ->arch.dr[] are actually written into the CPU
registers.

Also let's not forget that there's a 2nd call here, where I very much hope
it continues to be "current" that's being passed in.

Jan



* Re: [PATCH v2.1 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
  2025-01-08 15:11   ` [PATCH v2.1 " Roger Pau Monne
  2025-01-09 10:25     ` Alejandro Vallejo
@ 2025-01-14 15:30     ` Jan Beulich
  1 sibling, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-01-14 15:30 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 16:11, Roger Pau Monne wrote:
> The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
> mappings being stashed in the domain structure, and thus such mappings being
> modified by merely updating the L1 entries.
> 
> Switch both pv_{set,destroy}_gdt() to instead use
> {populate,destroy}_perdomain_mapping().

Like for an earlier patch it doesn't really become clear why what is being done
wants / needs doing. I might guess that it's the "stashed in the domain structure"
that you ultimately want to get rid of?

> --- a/xen/arch/x86/pv/descriptor-tables.c
> +++ b/xen/arch/x86/pv/descriptor-tables.c
> @@ -49,23 +49,20 @@ bool pv_destroy_ldt(struct vcpu *v)
>  
>  void pv_destroy_gdt(struct vcpu *v)
>  {
> -    l1_pgentry_t *pl1e = pv_gdt_ptes(v);
> -    mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page));
> -    l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO);
>      unsigned int i;
>  
>      ASSERT(v == current || !vcpu_cpu_dirty(v));
>  
> -    v->arch.pv.gdt_ents = 0;

How can this validly go away?

> -    for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ )
> -    {
> -        mfn_t mfn = l1e_get_mfn(pl1e[i]);
> +    if ( v->arch.cr3 )
> +        destroy_perdomain_mapping(v, GDT_VIRT_START(v),
> +                                  ARRAY_SIZE(v->arch.pv.gdt_frames));

How is v->arch.cr3 being non-zero related to the GDT area needing
destroying?

> -        if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) &&
> -             !mfn_eq(mfn, zero_mfn) )
> -            put_page_and_type(mfn_to_page(mfn));
> +    for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++)
> +    {
> +        if ( !v->arch.pv.gdt_frames[i] )
> +            break;
>  
> -        l1e_write(&pl1e[i], zero_l1e);
> +        put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i])));
>          v->arch.pv.gdt_frames[i] = 0;
>      }
>  }
> @@ -74,8 +71,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>                 unsigned int entries)
>  {
>      struct domain *d = v->domain;
> -    l1_pgentry_t *pl1e;
>      unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512);
> +    mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)];

Having this array is kind of odd - it'll hold all the same values as
frames[], just under a different type. Considering the further copying
done ...

> @@ -90,6 +87,8 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>          if ( !mfn_valid(mfn) ||
>               !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) )
>              goto fail;
> +
> +        mfns[i] = mfn;
>      }
>  
>      /* Tear down the old GDT. */
> @@ -97,12 +96,9 @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[],
>  
>      /* Install the new GDT. */
>      v->arch.pv.gdt_ents = entries;
> -    pl1e = pv_gdt_ptes(v);
>      for ( i = 0; i < nr_frames; i++ )
> -    {
>          v->arch.pv.gdt_frames[i] = frames[i];

... here, would it perhaps be an option to change ->arch.pv.gdt_frames[]
to mfn_t[], thus allowing ...

> -        l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW));
> -    }
> +    populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames);

... that array to be passed into here?
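
To illustrate what I mean, a self-contained sketch follows.  All names in it
(toy_mfn_t, struct toy_vcpu, toy_populate(), toy_set_gdt()) are made up for
the example and only loosely modelled on the real types.

```c
#include <assert.h>
#include <stddef.h>

/* Toy typesafe frame number: a wrapper that can't be mixed with plain longs. */
typedef struct { unsigned long m; } toy_mfn_t;

static inline toy_mfn_t toy_mfn(unsigned long m)
{
    return (toy_mfn_t){ m };
}

static inline unsigned long toy_mfn_x(toy_mfn_t mfn)
{
    return mfn.m;
}

#define TOY_NR_GDT_FRAMES 4

/* Per-vCPU bookkeeping already stored in the typed form. */
struct toy_vcpu {
    toy_mfn_t gdt_frames[TOY_NR_GDT_FRAMES];
};

/* Hypothetical consumer of typed frames; sums them so the demo is checkable. */
unsigned long toy_populate(const toy_mfn_t *mfns, size_t nr)
{
    unsigned long sum = 0;
    size_t i;

    for ( i = 0; i < nr; i++ )
        sum += toy_mfn_x(mfns[i]);

    return sum;
}

/* Convert once into the per-vCPU array, then hand that same array on. */
unsigned long toy_set_gdt(struct toy_vcpu *v, const unsigned long frames[],
                          size_t nr)
{
    size_t i;

    assert(nr <= TOY_NR_GDT_FRAMES);

    for ( i = 0; i < nr; i++ )
        v->gdt_frames[i] = toy_mfn(frames[i]);

    return toy_populate(v->gdt_frames, nr);
}
```

With the stored array already typed, there's a single conversion loop and no
on-stack duplicate of frames[].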

Jan



* Re: [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries
  2025-01-08 14:26 ` [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries Roger Pau Monne
  2025-01-09 14:34   ` Alejandro Vallejo
@ 2025-01-14 15:42   ` Jan Beulich
  1 sibling, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-01-14 15:42 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
> table(s) that contain such mappings being stashed in the domain structure, and
> thus such mappings being modified by merely updating the required L1 entries.
> 
> Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, as
> that logic is always called while the vCPU is running on the current pCPU.
> 
> For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
> running on the pCPU, otherwise use destroy_mappings().
> 
> Note this requires keeping an array with the pages currently mapped at the LDT
> area, as that allows dropping the extra taken page reference when removing the
> mappings.

I'm confused by the wording of this paragraph: It reads as if you were
changing reference obtaining / dropping, yet it all looks to stay the
same. If I'm not mistaken you use the array to replace the acquiring of
the MFNs in question from the L1 page table entries. If so, I think it
would be nice if this could be described in a more direct way. Perhaps
first and foremost by replacing "allows" and getting rid of "extra".

Jan



* Re: [PATCH v2 00/18] x86: adventures in Address Space Isolation
  2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
                   ` (17 preceding siblings ...)
  2025-01-08 14:26 ` [PATCH v2 18/18] x86/mm: zero stack on context switch Roger Pau Monne
@ 2025-01-14 16:20 ` Jan Beulich
  2025-01-17 14:45   ` Roger Pau Monné
  18 siblings, 1 reply; 55+ messages in thread
From: Jan Beulich @ 2025-01-14 16:20 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Tim Deegan, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> Hello,
> 
> The aim of this series is to introduce the functionality required to
> create linear mappings visible to a single pCPU.
> 
> Doing so requires having a per-vCPU root page-table (L4), and hence
> requires shadowing the guest selected L4 on PV guests.  As follow ups
> (and partially to ensure the per-CPU mappings work fine) the CPU stacks
> are switched to use per-CPU mappings, so that remote stack contents are
> not by default mapped on all page-tables (note: for this to be true the
> directmap entries for the stack pages would need to be removed also).
> 
> There's one known shortcoming with the presented code: migration of PV
> guests using per-vCPU root page-tables is not working.  I need to
> introduce extra logic to deal with PV shadow mode when using unique root
> page-tables.  I don't think this should block the series however, such
> missing functionality can always be added as follow up work.
> paging_domctl() is adjusted to reflect this restriction.
> 
> The main differences compared to v1 are the usage of per-vCPU root page
> tables (as opposed to per-pCPU), and the usage of the existing perdomain
> family of functions to manage the mappings in the per-domain slot, that
> now becomes per-vCPU.
> 
> All patches until 17 are mostly preparatory, I think there's a nice
> cleanup and generalization of the creation and managing of per-domain
> mappings, by no longer storing references to L1 page-tables in the vCPU
> or domain struct.

Since you referred me to the cover letter, I've looked back here after
making some more progress with the series. Along with my earlier comment
towards the need or ultimate goal, ...

> Patch 13 introduces the command line option, and would need discussion
> and integration with the sparse direct map series.  IMO we should get
> consensus on how we want the command line to look ASAP, so that we can
> get basic parsing logic in place to be used by both the work here and the
> direct map removal series.
> 
> As part of this series the map_domain_page() helpers are also switched
> to create per-vCPU mappings (see patch 15), which converts an existing
> interface into creating per-vCPU mappings.  Such interface can be used
> to hide (map per-vCPU) further data that we don't want to be part of the
> direct map, or even shared between vCPUs of the same domain.  Also all
> existing users of the interface will already create per-vCPU mappings
> without needing additional changes.
> 
> Note that none of the logic introduced in the series removes entries for
> the directmap, so even when creating the per-CPU mappings the underlying
> physical addresses are fully accessible when using their direct map
> entries.
> 
> I also haven't done any benchmarking.  Doesn't seem to cripple
> performance up to the point that XenRT jobs would timeout before
> finishing, that's the only objective reference I can provide at the
> moment.
> 
> The series has been extensively tested on XenRT, but that doesn't cover
> all possible use-cases, so it's likely to still have some rough edges,
> handle with care.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (18):
>   x86/mm: purge unneeded destroy_perdomain_mapping()
>   x86/domain: limit window where curr_vcpu != current on context switch
>   x86/mm: introduce helper to detect per-domain L1 entries that need
>     freeing
>   x86/pv: introduce function to populate perdomain area and use it to
>     map Xen GDT
>   x86/mm: switch destroy_perdomain_mapping() parameter from domain to
>     vCPU
>   x86/pv: set/clear guest GDT mappings using
>     {populate,destroy}_perdomain_mapping()
>   x86/pv: update guest LDT mappings using the linear entries
>   x86/pv: remove stashing of GDT/LDT L1 page-tables
>   x86/mm: simplify create_perdomain_mapping() interface
>   x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter
>     to vCPU
>   x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
>   x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
>   x86/spec-ctrl: introduce Address Space Isolation command line option
>   x86/mm: introduce per-vCPU L3 page-table
>   x86/mm: introduce a per-vCPU mapcache when using ASI
>   x86/pv: allow using a unique per-pCPU root page table (L4)
>   x86/mm: switch to a per-CPU mapped stack when using ASI
>   x86/mm: zero stack on context switch
> 
>  docs/misc/xen-command-line.pandoc    |  24 +++
>  xen/arch/x86/cpu/mcheck/mce.c        |   4 +
>  xen/arch/x86/domain.c                | 157 +++++++++++----
>  xen/arch/x86/domain_page.c           | 105 ++++++----
>  xen/arch/x86/flushtlb.c              |  28 ++-
>  xen/arch/x86/hvm/hvm.c               |   6 -
>  xen/arch/x86/include/asm/config.h    |  16 +-
>  xen/arch/x86/include/asm/current.h   |  58 +++++-
>  xen/arch/x86/include/asm/desc.h      |   6 +-
>  xen/arch/x86/include/asm/domain.h    |  50 +++--
>  xen/arch/x86/include/asm/flushtlb.h  |   2 +-
>  xen/arch/x86/include/asm/mm.h        |  15 +-
>  xen/arch/x86/include/asm/processor.h |   5 +
>  xen/arch/x86/include/asm/pv/mm.h     |   5 +
>  xen/arch/x86/include/asm/smp.h       |  12 ++
>  xen/arch/x86/include/asm/spec_ctrl.h |   4 +
>  xen/arch/x86/mm.c                    | 291 +++++++++++++++++++++------
>  xen/arch/x86/mm/hap/hap.c            |   2 +-
>  xen/arch/x86/mm/paging.c             |   6 +
>  xen/arch/x86/mm/shadow/hvm.c         |   2 +-
>  xen/arch/x86/mm/shadow/multi.c       |   2 +-
>  xen/arch/x86/pv/descriptor-tables.c  |  47 ++---
>  xen/arch/x86/pv/dom0_build.c         |  12 +-
>  xen/arch/x86/pv/domain.c             |  57 ++++--
>  xen/arch/x86/pv/mm.c                 |  43 +++-
>  xen/arch/x86/setup.c                 |  32 ++-
>  xen/arch/x86/smp.c                   |  39 ++++
>  xen/arch/x86/smpboot.c               |  26 ++-
>  xen/arch/x86/spec_ctrl.c             | 205 ++++++++++++++++++-
>  xen/arch/x86/traps.c                 |  25 ++-
>  xen/arch/x86/x86_64/mm.c             |   7 +-
>  xen/common/smp.c                     |  10 +
>  xen/common/stop_machine.c            |  10 +
>  xen/include/xen/smp.h                |   8 +
>  34 files changed, 1052 insertions(+), 269 deletions(-)

... this diffstat (even after subtracting out the contribution of the last two
patches in the series) doesn't really look like a cleanup / simplification.
Things becoming slightly slower (because of the L1 no longer directly available
to modify) may, otoh, not be a significant issue, if we assume that GDT/LDT
manipulation isn't normally a very frequent operation.

IOW my earlier request stands: Can you please try to make more clear (in the
patch descriptions) what exactly the motivation for these changes is? Just
doing things differently with more code overall can't be it, I don't think.

Jan



* Re: [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU
  2025-01-08 14:26 ` [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU Roger Pau Monne
@ 2025-01-14 16:27   ` Jan Beulich
  0 siblings, 0 replies; 55+ messages in thread
From: Jan Beulich @ 2025-01-14 16:27 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 08.01.2025 15:26, Roger Pau Monne wrote:
> In preparation for the per-domain area being per-vCPU.  This requires moving
> some of the {create,destroy}_perdomain_mapping() calls to the domain
> initialization and tear down paths into vCPU initialization and tear down.

Am I confused or DYM "s/ to / from /"?

Jan



* Re: [PATCH v2 00/18] x86: adventures in Address Space Isolation
  2025-01-14 16:20 ` [PATCH v2 00/18] x86: adventures in Address Space Isolation Jan Beulich
@ 2025-01-17 14:45   ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-17 14:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Stefano Stabellini, Tim Deegan, xen-devel

On Tue, Jan 14, 2025 at 05:20:04PM +0100, Jan Beulich wrote:
> On 08.01.2025 15:26, Roger Pau Monne wrote:
> > Hello,
> > 
> > The aim of this series is to introduce the functionality required to
> > create linear mappings visible to a single pCPU.
> > 
> > Doing so requires having a per-vCPU root page-table (L4), and hence
> > requires shadowing the guest selected L4 on PV guests.  As follow ups
> > (and partially to ensure the per-CPU mappings work fine) the CPU stacks
> > are switched to use per-CPU mappings, so that remote stack contents are
> > not by default mapped on all page-tables (note: for this to be true the
> > directmap entries for the stack pages would need to be removed also).
> > 
> > There's one known shortcoming with the presented code: migration of PV
> > guests using per-vCPU root page-tables is not working.  I need to
> > introduce extra logic to deal with PV shadow mode when using unique root
> > page-tables.  I don't think this should block the series however, such
> > missing functionality can always be added as follow up work.
> > paging_domctl() is adjusted to reflect this restriction.
> > 
> > The main differences compared to v1 are the usage of per-vCPU root page
> > tables (as opposed to per-pCPU), and the usage of the existing perdomain
> > family of functions to manage the mappings in the per-domain slot, that
> > now becomes per-vCPU.
> > 
> > All patches until 17 are mostly preparatory, I think there's a nice
> > cleanup and generalization of the creation and managing of per-domain
> > mappings, by no longer storing references to L1 page-tables in the vCPU
> > or domain struct.
> 
> Since you referred me to the cover letter, I've looked back here after
> making some more progress with the series. Along with my earlier comment
> towards the need or ultimate goal, ...
> 
> > Patch 13 introduces the command line option, and would need discussion
> > and integration with the sparse direct map series.  IMO we should get
> > consensus on how we want the command line to look ASAP, so that we can
> > get basic parsing logic in place to be used by both the work here and the
> > direct map removal series.
> > 
> > As part of this series the map_domain_page() helpers are also switched
> > to create per-vCPU mappings (see patch 15), which converts an existing
> > interface into creating per-vCPU mappings.  Such interface can be used
> > to hide (map per-vCPU) further data that we don't want to be part of the
> > direct map, or even shared between vCPUs of the same domain.  Also all
> > existing users of the interface will already create per-vCPU mappings
> > without needing additional changes.
> > 
> > Note that none of the logic introduced in the series removes entries for
> > the directmap, so even when creating the per-CPU mappings the underlying
> > physical addresses are fully accessible when using their direct map
> > entries.
> > 
> > I also haven't done any benchmarking.  Doesn't seem to cripple
> > performance up to the point that XenRT jobs would timeout before
> > finishing, that's the only objective reference I can provide at the
> > moment.
> > 
> > The series has been extensively tested on XenRT, but that doesn't cover
> > all possible use-cases, so it's likely to still have some rough edges,
> > handle with care.
> > 
> > Thanks, Roger.
> > 
> > Roger Pau Monne (18):
> >   x86/mm: purge unneeded destroy_perdomain_mapping()
> >   x86/domain: limit window where curr_vcpu != current on context switch
> >   x86/mm: introduce helper to detect per-domain L1 entries that need
> >     freeing
> >   x86/pv: introduce function to populate perdomain area and use it to
> >     map Xen GDT
> >   x86/mm: switch destroy_perdomain_mapping() parameter from domain to
> >     vCPU
> >   x86/pv: set/clear guest GDT mappings using
> >     {populate,destroy}_perdomain_mapping()
> >   x86/pv: update guest LDT mappings using the linear entries
> >   x86/pv: remove stashing of GDT/LDT L1 page-tables
> >   x86/mm: simplify create_perdomain_mapping() interface
> >   x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter
> >     to vCPU
> >   x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
> >   x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
> >   x86/spec-ctrl: introduce Address Space Isolation command line option
> >   x86/mm: introduce per-vCPU L3 page-table
> >   x86/mm: introduce a per-vCPU mapcache when using ASI
> >   x86/pv: allow using a unique per-pCPU root page table (L4)
> >   x86/mm: switch to a per-CPU mapped stack when using ASI
> >   x86/mm: zero stack on context switch
> > 
> >  docs/misc/xen-command-line.pandoc    |  24 +++
> >  xen/arch/x86/cpu/mcheck/mce.c        |   4 +
> >  xen/arch/x86/domain.c                | 157 +++++++++++----
> >  xen/arch/x86/domain_page.c           | 105 ++++++----
> >  xen/arch/x86/flushtlb.c              |  28 ++-
> >  xen/arch/x86/hvm/hvm.c               |   6 -
> >  xen/arch/x86/include/asm/config.h    |  16 +-
> >  xen/arch/x86/include/asm/current.h   |  58 +++++-
> >  xen/arch/x86/include/asm/desc.h      |   6 +-
> >  xen/arch/x86/include/asm/domain.h    |  50 +++--
> >  xen/arch/x86/include/asm/flushtlb.h  |   2 +-
> >  xen/arch/x86/include/asm/mm.h        |  15 +-
> >  xen/arch/x86/include/asm/processor.h |   5 +
> >  xen/arch/x86/include/asm/pv/mm.h     |   5 +
> >  xen/arch/x86/include/asm/smp.h       |  12 ++
> >  xen/arch/x86/include/asm/spec_ctrl.h |   4 +
> >  xen/arch/x86/mm.c                    | 291 +++++++++++++++++++++------
> >  xen/arch/x86/mm/hap/hap.c            |   2 +-
> >  xen/arch/x86/mm/paging.c             |   6 +
> >  xen/arch/x86/mm/shadow/hvm.c         |   2 +-
> >  xen/arch/x86/mm/shadow/multi.c       |   2 +-
> >  xen/arch/x86/pv/descriptor-tables.c  |  47 ++---
> >  xen/arch/x86/pv/dom0_build.c         |  12 +-
> >  xen/arch/x86/pv/domain.c             |  57 ++++--
> >  xen/arch/x86/pv/mm.c                 |  43 +++-
> >  xen/arch/x86/setup.c                 |  32 ++-
> >  xen/arch/x86/smp.c                   |  39 ++++
> >  xen/arch/x86/smpboot.c               |  26 ++-
> >  xen/arch/x86/spec_ctrl.c             | 205 ++++++++++++++++++-
> >  xen/arch/x86/traps.c                 |  25 ++-
> >  xen/arch/x86/x86_64/mm.c             |   7 +-
> >  xen/common/smp.c                     |  10 +
> >  xen/common/stop_machine.c            |  10 +
> >  xen/include/xen/smp.h                |   8 +
> >  34 files changed, 1052 insertions(+), 269 deletions(-)
> 
> ... this diffstat (even after subtracting out the contribution of the last two
> patches in the series) doesn't really look like a cleanup / simplification.

To be fair you would need to subtract the contribution of the last 8
patches, as all those are strictly related to ASI.  The perdomain
mapping interface cleanup is just the first 10 patches.  Which leaves
a diffstat of:

 xen/arch/x86/domain.c                |  81 ++++++++++++----
 xen/arch/x86/domain_page.c           |  19 ++--
 xen/arch/x86/hvm/hvm.c               |   6 --
 xen/arch/x86/include/asm/desc.h      |   6 +-
 xen/arch/x86/include/asm/domain.h    |  13 +--
 xen/arch/x86/include/asm/mm.h        |  11 ++-
 xen/arch/x86/include/asm/processor.h |   5 +
 xen/arch/x86/mm.c                    | 175 ++++++++++++++++++++++++++---------
 xen/arch/x86/pv/descriptor-tables.c  |  47 +++++-----
 xen/arch/x86/pv/domain.c             |  24 ++---
 xen/arch/x86/pv/mm.c                 |   3 +-
 xen/arch/x86/smpboot.c               |   6 +-
 xen/arch/x86/traps.c                 |  12 +--
 xen/arch/x86/x86_64/mm.c             |   7 +-
 14 files changed, 260 insertions(+), 155 deletions(-)

That's including the context switch change and not differentiating
between lines of code vs comments.

However, I don't think cleanup / simplifications should be judged purely
by diffstat LoC.  Arguably the current create_perdomain_mapping()
parameters are not the most obvious ones:

int create_perdomain_mapping(struct domain *d, unsigned long va,
                             unsigned int nr, l1_pgentry_t **pl1tab,
                             struct page_info **ppg);

Compared to the result after the first 10 patches in the series:

int create_perdomain_mapping(struct vcpu *v, unsigned long va,
                             unsigned int nr, bool populate);

Together with the fact that callers no longer need to keep a reference
to the L1 table(s) to populate such an area.

> Things becoming slightly slower (because of the L1 no longer directly available
> to modify) may, otoh, not be a significant issue, if we assume that GDT/LDT
> manipulation isn't normally a very frequent operation.

I introduce a fast path in both {populate,destroy}_perdomain_mapping()
that uses the recursive linear slot for manipulating the L1
page-table.  There's still a slow path that relies on walking the
page-tables, but that should only be used when the vCPU is not
running, and hence the added latency shouldn't be too critical.
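
As an illustration of the fast path idea (not code from the series), here
is a minimal sketch of how a recursive L4 slot exposes L1 entries at a
computable virtual address on x86-64 with 4-level paging.  All names
(toy_canonical, toy_slot_base, toy_linear_l1e_addr) and the slot number
passed by the caller are assumptions made for the example:

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT    12  /* 4KiB pages */
#define TOY_PT_INDEX_BITS 9   /* 512 entries per page-table level */
#define TOY_VADDR_BITS    48  /* 4-level paging */
#define TOY_PTE_SIZE      8   /* bytes per page-table entry */

/* Sign-extend a 48-bit virtual address into canonical 64-bit form. */
uint64_t toy_canonical(uint64_t va)
{
    const uint64_t mask = (1ULL << TOY_VADDR_BITS) - 1;

    va &= mask;
    if ( va & (1ULL << (TOY_VADDR_BITS - 1)) )
        va |= ~mask;

    return va;
}

/* First virtual address covered by L4 slot 'slot' (512GiB per slot). */
uint64_t toy_slot_base(unsigned int slot)
{
    return toy_canonical((uint64_t)slot <<
                         (TOY_PAGE_SHIFT + 3 * TOY_PT_INDEX_BITS));
}

/*
 * Virtual address of the L1 entry mapping 'va', assuming L4[slot] points
 * back at the L4 itself.  Through that slot every L1 entry is laid out
 * as one flat array indexed by virtual frame number.
 */
uint64_t toy_linear_l1e_addr(unsigned int slot, uint64_t va)
{
    uint64_t vfn = (va & ((1ULL << TOY_VADDR_BITS) - 1)) >> TOY_PAGE_SHIFT;

    return toy_slot_base(slot) + vfn * TOY_PTE_SIZE;
}
```

With slot 258 assumed for the recursive entry, the L1 entry for address
0xffff820000000000 would sit at 0xffff814100000000.  Such an access is
only meaningful while the page-tables in question are the ones loaded on
the CPU, which is why a page-table walk remains as the slow path.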

As a side-effect of this logic the pages allocated for the per-domain
region page-tables can now uniformly be from domheap.  The usage of
xenheap pages for L1 page-tables is no longer needed once those are
not stashed in the domain structure anymore.

> IOW my earlier request stands: Can you please try to make more clear (in the
> patch descriptions) what exactly the motivation for these changes is? Just
> doing things differently with more code overall can't be it, I don't think.

The main motivation for the change is to remove stashing L1
page-tables for the per-domain area(s) in the domain struct, as with
the introduction of ASI Xen would need to stash L1 page-tables for the
per-domain area on the vcpu struct also, as a result of the per-domain
slot becoming per-vCPU.  IMO managing such references, and having
logic to deal with domain and vcpu L1 page-tables is too complex and
error prone.

Instead I propose to add an interface:
{populate,destroy}_perdomain_mapping() that can be used to manage
mappings on the per-domain area with callers being completely unaware
of whether the domain is running with per-vCPU mappings or not.

Please let me know if you are happy with the reasoning and arguments
provided.  I think the resulting perdomain mapping interface is much
better than what Xen currently does for manipulating per-domain
mapping entries, but I might be spoiled.

Thanks, Roger.



* Re: [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch
  2025-01-14 15:02       ` Jan Beulich
@ 2025-01-17 14:57         ` Roger Pau Monné
  0 siblings, 0 replies; 55+ messages in thread
From: Roger Pau Monné @ 2025-01-17 14:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jan 14, 2025 at 04:02:01PM +0100, Jan Beulich wrote:
> On 09.01.2025 18:33, Roger Pau Monné wrote:
> > On Thu, Jan 09, 2025 at 09:59:58AM +0100, Jan Beulich wrote:
> >> On 08.01.2025 15:26, Roger Pau Monne wrote:
> >>>      }
> >>>      else
> >>>      {
> >>> -        __context_switch();
> >>> +        /*
> >>> +         * curr_vcpu will always point to the currently loaded vCPU context, as
> >>> +         * it's not updated when doing a lazy switch to the idle vCPU.
> >>> +         */
> >>> +        struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu);
> >>> +
> >>> +        if ( prev_ctx != current )
> >>> +        {
> >>> +            /*
> >>> +             * Doing a full context switch to a non-idle vCPU from a lazy
> >>> +             * context switched state.  Adjust current to point to the
> >>> +             * currently loaded vCPU context.
> >>> +             */
> >>> +            ASSERT(current == idle_vcpu[cpu]);
> >>> +            ASSERT(!is_idle_vcpu(next));
> >>> +            set_current(prev_ctx);
> >>
> >> This feels wrong, as in "current" then not representing what it should represent,
> >> for a certain time window. I may be dense, but neither comment nor description
> >> clarify to me why this might be needed. I can see that it's needed to please the
> >> ASSERT() you add to __context_switch(), yet then I might ask why that assertion
> >> is put there.
> > 
> > This is done so that when calling __context_switch() current ==
> > curr_vcpu, and map_domain_page() can be used without getting into an
> > infinite sync_local_execstate() recursion loop.
> 
> Yet it's the purpose of __context_switch() to bring curr_vcpu in sync
> with current. IOW both matching up is supposed to be an exit condition
> of the function, not an entry one.
> 
> Plus, as indicated when we were talking this through yesterday, the
> set_current() here makes "current" no longer point at what - from the
> scheduler's perspective - is (supposed to be) the current vCPU.

I understand this, and I will look into alternative ways to work around
the issues I'm facing that prompted the changes proposed in this
patch.

I've been thinking about what we discussed, disabling lazy idle context
switch when ASI is enabled, and I'm afraid that won't be enough.  The
{populate,destroy}_perdomain_mapping() functions added later in the
series will be used in the context switch path regardless of whether
ASI is enabled, and those functions require map_domain_page() to be
usable.  Hence map_domain_page() needs to be usable in the context
switch path.
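
For readers following along, a toy model (not Xen code) of the lazy idle
switch referred to above may help.  It only tracks two pointers: the vCPU
the CPU is nominally running, and the vCPU whose state is actually loaded.
The struct and function names are hypothetical and the real logic has many
more cases:

```c
#include <stdbool.h>

struct toy_vcpu {
    int id;
    bool is_idle;
};

struct toy_cpu {
    struct toy_vcpu *current;    /* vCPU the CPU is nominally running */
    struct toy_vcpu *curr_vcpu;  /* vCPU whose state is really loaded */
    unsigned int full_switches;  /* number of full state switches done */
};

/*
 * Switch to 'next'.  Going to idle is lazy: only 'current' moves and the
 * previous vCPU's state stays loaded.  Going to a non-idle vCPU does a
 * full switch unless its state is still the one loaded.
 */
void toy_context_switch(struct toy_cpu *cpu, struct toy_vcpu *next)
{
    cpu->current = next;

    if ( next->is_idle )
        return;

    if ( cpu->curr_vcpu != next )
    {
        cpu->curr_vcpu = next;
        cpu->full_switches++;
    }
}

/* True when the loaded state belongs to the nominally running vCPU. */
bool toy_state_in_sync(const struct toy_cpu *cpu)
{
    return cpu->current == cpu->curr_vcpu;
}
```

In this model, after a lazy switch to idle the two pointers disagree until
a non-idle vCPU is scheduled again, and any code run in that window that
assumes they match has to force a sync first.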

I will see whether I can allow the usage of map_domain_page() at
context switch in a different way.

I understand the main concern is the window where current and the
scheduler's notion of current diverge, right?

Arguably this is already happening in context_switch(), as
set_current() gets called almost at the beginning of the function,
while the call to sched_context_switched() only happens at the tail of
the function.  So for the whole call to __context_switch() current is
not in sync with the scheduler's currently running vCPU.  And I'm not
saying this is a model to follow, but the context switch code is
already fairly special, hence I don't see the change here as that much
different from the current logic.

That said, I will still try to figure out an alternative way to deal with
the usage of map_domain_page() in the context switch path.

> Aiui this adjustment is the reason for ...
> 
> >>> --- a/xen/arch/x86/traps.c
> >>> +++ b/xen/arch/x86/traps.c
> >>> @@ -2232,8 +2232,6 @@ void __init trap_init(void)
> >>>  
> >>>  void activate_debugregs(const struct vcpu *curr)
> >>>  {
> >>> -    ASSERT(curr == current);
> >>> -
> >>>      write_debugreg(0, curr->arch.dr[0]);
> >>>      write_debugreg(1, curr->arch.dr[1]);
> >>>      write_debugreg(2, curr->arch.dr[2]);
> >>
> >> Why would this assertion go away? If it suddenly triggers, the parameter name
> >> would now end up being wrong.
> > 
> > Well, at the point where activate_debugregs() gets called (in
> > paravirt_ctxt_switch_to()), current == previous as a result of this
> > change, so the assert is no longer true on purpose on that call
> > path.
> 
> ... this behavior. Which, as said, feels wrong the latest when "curr" was
> renamed to no longer suggest it actually is cached "current". At that point
> it'll be dubious whose ->arch.dr[] are actually written into the CPU
> registers.
> 
> Also let's not forget that there's a 2nd call here, where I very much hope
> it continues to be "current" that's being passed in.

Indeed, for the other call the assert would still be valid, that
context is not changed.

Thanks, Roger.




Thread overview: 55+ messages
2025-01-08 14:26 [PATCH v2 00/18] x86: adventures in Address Space Isolation Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 01/18] x86/mm: purge unneeded destroy_perdomain_mapping() Roger Pau Monne
2025-01-08 15:59   ` Alejandro Vallejo
2025-01-08 14:26 ` [PATCH v2 02/18] x86/domain: limit window where curr_vcpu != current on context switch Roger Pau Monne
2025-01-08 16:26   ` Alejandro Vallejo
2025-01-09 17:39     ` Roger Pau Monné
2025-01-09  8:59   ` Jan Beulich
2025-01-09 17:33     ` Roger Pau Monné
2025-01-14 15:02       ` Jan Beulich
2025-01-17 14:57         ` Roger Pau Monné
2025-01-08 14:26 ` [PATCH v2 03/18] x86/mm: introduce helper to detect per-domain L1 entries that need freeing Roger Pau Monne
2025-01-09  9:03   ` Jan Beulich
2025-01-08 14:26 ` [PATCH v2 04/18] x86/pv: introduce function to populate perdomain area and use it to map Xen GDT Roger Pau Monne
2025-01-09  9:10   ` Jan Beulich
2025-01-10 14:15     ` Roger Pau Monné
2025-01-09  9:55   ` Alejandro Vallejo
2025-01-10 14:29     ` Roger Pau Monné
2025-01-10 15:50       ` Alejandro Vallejo
2025-01-08 14:26 ` [PATCH v2 05/18] x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU Roger Pau Monne
2025-01-09 10:02   ` Alejandro Vallejo
2025-01-10 14:30     ` Roger Pau Monné
2025-01-08 14:26 ` [PATCH v2 06/18] x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping() Roger Pau Monne
2025-01-08 15:11   ` [PATCH v2.1 " Roger Pau Monne
2025-01-09 10:25     ` Alejandro Vallejo
2025-01-10 14:33       ` Roger Pau Monné
2025-01-14 15:30     ` Jan Beulich
2025-01-08 14:26 ` [PATCH v2 07/18] x86/pv: update guest LDT mappings using the linear entries Roger Pau Monne
2025-01-09 14:34   ` Alejandro Vallejo
2025-01-10 14:44     ` Roger Pau Monné
2025-01-10 15:36       ` Alejandro Vallejo
2025-01-14 15:42   ` Jan Beulich
2025-01-08 14:26 ` [PATCH v2 08/18] x86/pv: remove stashing of GDT/LDT L1 page-tables Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 09/18] x86/mm: simplify create_perdomain_mapping() interface Roger Pau Monne
2025-01-09 11:01   ` Alejandro Vallejo
2025-01-10 14:45     ` Roger Pau Monné
2025-01-08 14:26 ` [PATCH v2 10/18] x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU Roger Pau Monne
2025-01-14 16:27   ` Jan Beulich
2025-01-08 14:26 ` [PATCH v2 11/18] x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 12/18] x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 13/18] x86/spec-ctrl: introduce Address Space Isolation command line option Roger Pau Monne
2025-01-09 14:58   ` Alejandro Vallejo
2025-01-10 14:55     ` Roger Pau Monné
2025-01-10 15:51       ` Alejandro Vallejo
2025-01-08 14:26 ` [PATCH v2 14/18] x86/mm: introduce per-vCPU L3 page-table Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 15/18] x86/mm: introduce a per-vCPU mapcache when using ASI Roger Pau Monne
2025-01-09 15:08   ` Alejandro Vallejo
2025-01-10 15:02     ` Roger Pau Monné
2025-01-10 16:12       ` Alejandro Vallejo
2025-01-10 16:19       ` Alejandro Vallejo
2025-01-10 18:43         ` Roger Pau Monné
2025-01-08 14:26 ` [PATCH v2 16/18] x86/pv: allow using a unique per-pCPU root page table (L4) Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 17/18] x86/mm: switch to a per-CPU mapped stack when using ASI Roger Pau Monne
2025-01-08 14:26 ` [PATCH v2 18/18] x86/mm: zero stack on context switch Roger Pau Monne
2025-01-14 16:20 ` [PATCH v2 00/18] x86: adventures in Address Space Isolation Jan Beulich
2025-01-17 14:45   ` Roger Pau Monné
