[PATCH v1 0/3] Lockless SMP function call and TLB flushing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1 0/3] Lockless SMP function call and TLB flushing
@ 2026-04-01 16:35 Ross Lagerwall
  2026-04-01 16:35 ` [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush Ross Lagerwall
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-01 16:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Ross Lagerwall, Jan Beulich, Andrew Cooper, Roger Pau Monné,
	Anthony PERARD, Michal Orzel, Julien Grall, Stefano Stabellini

Hi,

This series implements lockless SMP function call and then rewrites x86 TLB
flushing to use SMP function calls.

We have observed that the TLB flush lock can be a point of contention for
certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.

Performance numbers:

I wrote a synthetic benchmark to measure the performance. The benchmark has one
or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
the selected mask. The executed function simply delays for 500 microseconds.

The table below shows the % change in execution time of on_selected_cpus():

                  1 thread   2 threads    4 threads
1 CPU in mask     0.02       -35.23       -51.18
2 CPUs in mask    0.01       -47.20       -69.27
4 CPUs in mask    -0.02      -42.40       -66.55
8 CPUs in mask    -0.03      -47.82       -68.39
16 CPUs in mask   0.12       -41.95       -58.26
32 CPUs in mask   0.02       -25.43       -39.35
64 CPUs in mask   0.00       -24.70       -37.83

With 1 thread (i.e. no contention), there is no regression in execution time.
With multiple threads, as expected there is a significant improvement in
execution time.

As a more practical benchmark to simulate host evacuation, I measured the
memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
so without PML). The rate increased by 16% with this patch series, even
after the recent deferred TLB flush changes.

FWIW, my first attempt at this was to port the SMP call functionality from
Linux. I found it didn't scale well as the number of CPUs in the mask
increases so I've taken a different approach here.

Thanks,
Ross

Ross Lagerwall (3):
  x86/hap: Wait for remote CPUs during TLB flush
  xen/smp: Rewrite on_selected_cpus() to be lockless
  x86/smp: Rewrite TLB flush using on_selected_cpus()

 tools/xentrace/xenalyze.c              |   2 -
 xen/arch/x86/include/asm/irq-vectors.h |   1 -
 xen/arch/x86/include/asm/irq.h         |   1 -
 xen/arch/x86/mm/hap/hap.c              |   2 +-
 xen/arch/x86/smp.c                     |  30 ++++----
 xen/arch/x86/smpboot.c                 |   1 -
 xen/common/smp.c                       | 101 ++++++++++++++++---------
 7 files changed, 80 insertions(+), 58 deletions(-)

-- 
2.53.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush
  2026-04-01 16:35 [PATCH v1 0/3] Lockless SMP function call and TLB flushing Ross Lagerwall
@ 2026-04-01 16:35 ` Ross Lagerwall
  2026-04-08 15:21   ` Jan Beulich
  2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-01 16:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Ross Lagerwall, Jan Beulich, Andrew Cooper, Roger Pau Monné

A future change to on_selected_cpus() will change the semantics of the
wait parameter so that it doesn't wait for remote CPUs to "check in" if
wait == 0. Adjust the call here to retain the existing behaviour so it
continues to wait for the remote CPUs to VMExit.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 xen/arch/x86/mm/hap/hap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 5ccb80bda5d3..fb48e470bbf5 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -763,7 +763,7 @@ static bool cf_check flush_tlb(const unsigned long *vcpu_bitmap)
      * not currently running will already be flushed when scheduled because of
      * the ASID tickle done in the loop above.
      */
-    on_selected_cpus(mask, NULL, NULL, 0);
+    on_selected_cpus(mask, NULL, NULL, 1);
 
     return true;
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless
  2026-04-01 16:35 [PATCH v1 0/3] Lockless SMP function call and TLB flushing Ross Lagerwall
  2026-04-01 16:35 ` [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush Ross Lagerwall
@ 2026-04-01 16:35 ` Ross Lagerwall
  2026-04-08 16:11   ` Jan Beulich
                     ` (2 more replies)
  2026-04-01 16:35 ` [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus() Ross Lagerwall
  2026-04-02  6:09 ` [PATCH v1 0/3] Lockless SMP function call and TLB flushing Jan Beulich
  3 siblings, 3 replies; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-01 16:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Ross Lagerwall, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Roger Pau Monné,
	Stefano Stabellini

on_selected_cpus() holds a global lock even if the function is to be
called on non-overlapping CPUs. This is a scalability bottleneck so to
avoid that:

1. Remove the global lock.
2. Make call_data_struct per-CPU.
3. Track which CPUs are currently running on_selected_cpus() using a
   global CPU mask. This tells CPUs running the interrupt which per-CPU
   call_data_structs to look at.

Since the call data is now per-CPU, skip waiting for CPUs to "check in"
for async calls. Instead, delay it until the next time
on_selected_cpus() is called by which point there should be nothing to
wait for.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 xen/common/smp.c | 101 +++++++++++++++++++++++++++++++----------------
 1 file changed, 66 insertions(+), 35 deletions(-)

diff --git a/xen/common/smp.c b/xen/common/smp.c
index a011f541f1ea..e592e8453fb3 100644
--- a/xen/common/smp.c
+++ b/xen/common/smp.c
@@ -24,13 +24,15 @@
 /*
  * Structure and data for smp_call_function()/on_selected_cpus().
  */
-static DEFINE_SPINLOCK(call_lock);
-static struct call_data_struct {
+struct call_data_struct {
     void (*func) (void *info);
     void *info;
     int wait;
-    cpumask_t selected;
-} call_data;
+    cpumask_t selected __cacheline_aligned;
+};
+
+DEFINE_PER_CPU(struct call_data_struct, call_data);
+static cpumask_t tasks;
 
 void smp_call_function(
     void (*func) (void *info),
@@ -50,55 +52,84 @@ void on_selected_cpus(
     void *info,
     int wait)
 {
+    struct call_data_struct *data;
+    unsigned int cpu = smp_processor_id();
+
     ASSERT(local_irq_is_enabled());
     ASSERT(cpumask_subset(selected, &cpu_online_map));
 
-    spin_lock(&call_lock);
+    if ( cpumask_empty(selected) )
+        return;
+
+    data = &this_cpu(call_data);
 
-    cpumask_copy(&call_data.selected, selected);
+    if ( !data->wait )
+    {
+        /* Wait for any previous async call to complete */
+        while ( !cpumask_empty(&data->selected) )
+            cpu_relax();
+
+        cpumask_clear_cpu(cpu, &tasks);
+    }
 
-    if ( cpumask_empty(&call_data.selected) )
-        goto out;
+    data->func = func;
+    data->info = info;
+    data->wait = wait;
 
-    call_data.func = func;
-    call_data.info = info;
-    call_data.wait = wait;
+    smp_wmb();
 
-    smp_send_call_function_mask(&call_data.selected);
+    cpumask_copy(&data->selected, selected);
 
-    while ( !cpumask_empty(&call_data.selected) )
-        cpu_relax();
+    cpumask_set_cpu(cpu, &tasks);
 
-out:
-    spin_unlock(&call_lock);
+    smp_send_call_function_mask(&data->selected);
+
+    if ( wait )
+    {
+        while ( !cpumask_empty(&data->selected) )
+            cpu_relax();
+
+        cpumask_clear_cpu(cpu, &tasks);
+    }
 }
 
 void smp_call_function_interrupt(void)
 {
-    void (*func)(void *info) = call_data.func;
-    void *info = call_data.info;
     unsigned int cpu = smp_processor_id();
-
-    if ( !cpumask_test_cpu(cpu, &call_data.selected) )
-        return;
+    unsigned int i;
+    struct call_data_struct *data;
+    void (*func)(void *info);
+    void *info;
 
     irq_enter();
 
-    if ( unlikely(!func) )
-    {
-        cpumask_clear_cpu(cpu, &call_data.selected);
-    }
-    else if ( call_data.wait )
-    {
-        (*func)(info);
-        smp_mb();
-        cpumask_clear_cpu(cpu, &call_data.selected);
-    }
-    else
+    for_each_cpu ( i, &tasks )
     {
-        smp_mb();
-        cpumask_clear_cpu(cpu, &call_data.selected);
-        (*func)(info);
+        data = &per_cpu(call_data, i);
+
+        if ( !cpumask_test_cpu(cpu, &data->selected) )
+            continue;
+
+        smp_rmb();
+        func = data->func;
+        info = data->info;
+
+        if ( unlikely(!func) )
+        {
+            cpumask_clear_cpu(cpu, &data->selected);
+        }
+        else if ( data->wait )
+        {
+            (*func)(info);
+            smp_mb();
+            cpumask_clear_cpu(cpu, &data->selected);
+        }
+        else
+        {
+            smp_mb();
+            cpumask_clear_cpu(cpu, &data->selected);
+            (*func)(info);
+        }
     }
 
     irq_exit();
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus()
  2026-04-01 16:35 [PATCH v1 0/3] Lockless SMP function call and TLB flushing Ross Lagerwall
  2026-04-01 16:35 ` [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush Ross Lagerwall
  2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
@ 2026-04-01 16:35 ` Ross Lagerwall
  2026-04-20 14:04   ` Jan Beulich
  2026-04-02  6:09 ` [PATCH v1 0/3] Lockless SMP function call and TLB flushing Jan Beulich
  3 siblings, 1 reply; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-01 16:35 UTC (permalink / raw)
  To: xen-devel
  Cc: Ross Lagerwall, Anthony PERARD, Jan Beulich, Andrew Cooper,
	Roger Pau Monné

With on_selected_cpus() rewritten to avoid taking a global lock, also
use it for TLB flushes rather than the existing hand-rolled
functionality.

This improves performance by allowing TLB flushes on behalf of different
guests to happen at the same time.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
---
 tools/xentrace/xenalyze.c              |  2 --
 xen/arch/x86/include/asm/irq-vectors.h |  1 -
 xen/arch/x86/include/asm/irq.h         |  1 -
 xen/arch/x86/smp.c                     | 30 +++++++++++---------------
 xen/arch/x86/smpboot.c                 |  1 -
 5 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 876d59d42ca5..a32f5f378a84 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -866,7 +866,6 @@ const char * hvm_svm_exit_reason_name[HVM_SVM_EXIT_REASON_MAX] = {
 /* General hvm information */
 #define SPURIOUS_APIC_VECTOR  0xff
 #define ERROR_APIC_VECTOR     0xfe
-#define INVALIDATE_TLB_VECTOR 0xfd
 #define EVENT_CHECK_VECTOR    0xfc
 #define CALL_FUNCTION_VECTOR  0xfb
 #define THERMAL_APIC_VECTOR   0xfa
@@ -878,7 +877,6 @@ const char * hvm_svm_exit_reason_name[HVM_SVM_EXIT_REASON_MAX] = {
 const char * hvm_extint_vector_name[EXTERNAL_INTERRUPT_MAX] = {
     [SPURIOUS_APIC_VECTOR] = "SPURIOS_APIC",
     [ERROR_APIC_VECTOR] =    "ERROR_APIC",
-    [INVALIDATE_TLB_VECTOR]= "INVALIDATE_TLB",
     [EVENT_CHECK_VECTOR]=    "EVENT_CHECK",
     [CALL_FUNCTION_VECTOR]=  "CALL_FUNCTION",
     [THERMAL_APIC_VECTOR]=   "THERMAL_APIC",
diff --git a/xen/arch/x86/include/asm/irq-vectors.h b/xen/arch/x86/include/asm/irq-vectors.h
index d75d1c56716a..a1eb160d500b 100644
--- a/xen/arch/x86/include/asm/irq-vectors.h
+++ b/xen/arch/x86/include/asm/irq-vectors.h
@@ -4,7 +4,6 @@
 /* Processor-initiated interrupts are all high priority. */
 #define SPURIOUS_APIC_VECTOR	0xff
 #define ERROR_APIC_VECTOR	0xfe
-#define INVALIDATE_TLB_VECTOR	0xfd
 #define EVENT_CHECK_VECTOR	0xfc
 #define CALL_FUNCTION_VECTOR	0xfb
 #define LOCAL_TIMER_VECTOR	0xfa
diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index 7315150b66b4..57cf2f23eb8d 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -111,7 +111,6 @@ extern int opt_irq_vector_map;
 #define platform_legacy_irq(irq)	((irq) < NR_ISA_IRQS)
 
 void cf_check event_check_interrupt(void);
-void cf_check invalidate_interrupt(void);
 void cf_check call_function_interrupt(void);
 void cf_check irq_move_cleanup_interrupt(void);
 
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 7936294f5fcd..a2a9abbb2f4e 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -242,21 +242,20 @@ void cf_check send_IPI_mask_phys(const cpumask_t *mask, int vector)
     local_irq_restore(flags);
 }
 
-static DEFINE_SPINLOCK(flush_lock);
-static cpumask_t flush_cpumask;
-static const void *flush_va;
-static unsigned int flush_flags;
+struct flush_data {
+    const void *va;
+    unsigned int flags;
+};
 
-void cf_check invalidate_interrupt(void)
+static void cf_check invalidate_cb(void *info)
 {
-    unsigned int flags = flush_flags;
-    ack_APIC_irq();
-    perfc_incr(ipis);
+    struct flush_data *data = info;
+    unsigned int flags = data->flags;
+
     if ( (flags & FLUSH_VCPU_STATE) && __sync_local_execstate() )
         flags &= ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
     if ( flags & ~(FLUSH_VCPU_STATE | FLUSH_ORDER_MASK) )
-        flush_area_local(flush_va, flags);
-    cpumask_clear_cpu(smp_processor_id(), &flush_cpumask);
+        flush_area_local(data->va, flags);
 }
 
 void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
@@ -275,21 +274,18 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
     if ( (flags & ~FLUSH_ORDER_MASK) &&
          !cpumask_subset(mask, cpumask_of(cpu)) )
     {
+        cpumask_t flush_cpumask;
+        struct flush_data data = { .va = va, .flags = flags };
+
         if ( cpu_has_hypervisor &&
              !(flags & ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_VA_VALID |
                          FLUSH_ORDER_MASK)) &&
              !hypervisor_flush_tlb(mask, va, flags) )
             return;
 
-        spin_lock(&flush_lock);
         cpumask_and(&flush_cpumask, mask, &cpu_online_map);
         cpumask_clear_cpu(cpu, &flush_cpumask);
-        flush_va      = va;
-        flush_flags   = flags;
-        send_IPI_mask(&flush_cpumask, INVALIDATE_TLB_VECTOR);
-        while ( !cpumask_empty(&flush_cpumask) )
-            cpu_relax();
-        spin_unlock(&flush_lock);
+        on_selected_cpus(&flush_cpumask, invalidate_cb, &data, 1);
     }
 }
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 491cbbba33ae..51e6982a1d25 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1482,6 +1482,5 @@ void __init smp_intr_init(void)
     /* Direct IPI vectors. */
     set_direct_apic_vector(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
     set_direct_apic_vector(EVENT_CHECK_VECTOR, event_check_interrupt);
-    set_direct_apic_vector(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
     set_direct_apic_vector(CALL_FUNCTION_VECTOR, call_function_interrupt);
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-01 16:35 [PATCH v1 0/3] Lockless SMP function call and TLB flushing Ross Lagerwall
                   ` (2 preceding siblings ...)
  2026-04-01 16:35 ` [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus() Ross Lagerwall
@ 2026-04-02  6:09 ` Jan Beulich
  2026-04-02  8:40   ` Ross Lagerwall
  3 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-04-02  6:09 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 01.04.2026 18:35, Ross Lagerwall wrote:
> Hi,
> 
> This series implements lockless SMP function call and then rewrites x86 TLB
> flushing to use SMP function calls.
> 
> We have observed that the TLB flush lock can be a point of contention for
> certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.
> 
> Performance numbers:
> 
> I wrote a synthetic benchmark to measure the performance. The benchmark has one
> or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
> the selected mask. The executed function simply delays for 500 microseconds.
> 
> The table below shows the % change in execution time of on_selected_cpus():
> 
>                   1 thread   2 threads    4 threads
> 1 CPU in mask     0.02       -35.23       -51.18
> 2 CPUs in mask    0.01       -47.20       -69.27
> 4 CPUs in mask    -0.02      -42.40       -66.55
> 8 CPUs in mask    -0.03      -47.82       -68.39
> 16 CPUs in mask   0.12       -41.95       -58.26
> 32 CPUs in mask   0.02       -25.43       -39.35
> 64 CPUs in mask   0.00       -24.70       -37.83
> 
> With 1 thread (i.e. no contention), there is no regression in execution time.
> With multiple threads, as expected there is a significant improvement in
> execution time.
> 
> As a more practical benchmark to simulate host evacuation, I measured the
> memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
> so without PML). The rate increased by 16% with this patch series, even
> after the recent deferred TLB flush changes.

Is this a positive thing though? In the context of some related work something
similar was mentioned iirc, accompanied by stating that this is actually
problematic. A guest in log-dirty mode generally wants to be making progress,
but also wants to be throttled enough to limit re-dirtying, such that
subsequent iterations (in particular the final one) of page contents
migration won't have to process overly many pages a 2nd time.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-02  6:09 ` [PATCH v1 0/3] Lockless SMP function call and TLB flushing Jan Beulich
@ 2026-04-02  8:40   ` Ross Lagerwall
  2026-04-02  8:49     ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-02  8:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 4/2/26 7:09 AM, Jan Beulich wrote:
> On 01.04.2026 18:35, Ross Lagerwall wrote:
>> Hi,
>>
>> This series implements lockless SMP function call and then rewrites x86 TLB
>> flushing to use SMP function calls.
>>
>> We have observed that the TLB flush lock can be a point of contention for
>> certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.
>>
>> Performance numbers:
>>
>> I wrote a synthetic benchmark to measure the performance. The benchmark has one
>> or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
>> the selected mask. The executed function simply delays for 500 microseconds.
>>
>> The table below shows the % change in execution time of on_selected_cpus():
>>
>>                    1 thread   2 threads    4 threads
>> 1 CPU in mask     0.02       -35.23       -51.18
>> 2 CPUs in mask    0.01       -47.20       -69.27
>> 4 CPUs in mask    -0.02      -42.40       -66.55
>> 8 CPUs in mask    -0.03      -47.82       -68.39
>> 16 CPUs in mask   0.12       -41.95       -58.26
>> 32 CPUs in mask   0.02       -25.43       -39.35
>> 64 CPUs in mask   0.00       -24.70       -37.83
>>
>> With 1 thread (i.e. no contention), there is no regression in execution time.
>> With multiple threads, as expected there is a significant improvement in
>> execution time.
>>
>> As a more practical benchmark to simulate host evacuation, I measured the
>> memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
>> so without PML). The rate increased by 16% with this patch series, even
>> after the recent deferred TLB flush changes.
> 
> Is this a positive thing though? In the context of some related work something
> similar was mentioned iirc, accompanied by stating that this is actually
> problematic. A guest in log-dirty mode generally wants to be making progress,
> but also wants to be throttled enough to limit re-dirtying, such that
> subsequent iterations (in particular the final one) of page contents
> migration won't have to process overly many pages a 2nd time.
> 

In the context of a real migration, both the process copying the pages
out of the guest and the guest itself will be hitting the TLB flush lock
so reducing that bottleneck may increase throughput on both sides.
Whether or not the overall migration time increases or decreases depends
on many factors (number of migrations in parallel, the rate the guest is
dirtying memory, the line speed of the NIC, whether PML is used, ...)
which is why I measured a more controlled scenario to demonstrate the
change.

IMO throttling of a guest during a migration should be something
intentional and controlled by userspace policy rather than a side effect
of some internal global locks.

Ross


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-02  8:40   ` Ross Lagerwall
@ 2026-04-02  8:49     ` Jan Beulich
  2026-04-02 10:57       ` Ross Lagerwall
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-04-02  8:49 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 02.04.2026 10:40, Ross Lagerwall wrote:
> On 4/2/26 7:09 AM, Jan Beulich wrote:
>> On 01.04.2026 18:35, Ross Lagerwall wrote:
>>> We have observed that the TLB flush lock can be a point of contention for
>>> certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.
>>>
>>> Performance numbers:
>>>
>>> I wrote a synthetic benchmark to measure the performance. The benchmark has one
>>> or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
>>> the selected mask. The executed function simply delays for 500 microseconds.
>>>
>>> The table below shows the % change in execution time of on_selected_cpus():
>>>
>>>                    1 thread   2 threads    4 threads
>>> 1 CPU in mask     0.02       -35.23       -51.18
>>> 2 CPUs in mask    0.01       -47.20       -69.27
>>> 4 CPUs in mask    -0.02      -42.40       -66.55
>>> 8 CPUs in mask    -0.03      -47.82       -68.39
>>> 16 CPUs in mask   0.12       -41.95       -58.26
>>> 32 CPUs in mask   0.02       -25.43       -39.35
>>> 64 CPUs in mask   0.00       -24.70       -37.83
>>>
>>> With 1 thread (i.e. no contention), there is no regression in execution time.
>>> With multiple threads, as expected there is a significant improvement in
>>> execution time.
>>>
>>> As a more practical benchmark to simulate host evacuation, I measured the
>>> memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
>>> so without PML). The rate increased by 16% with this patch series, even
>>> after the recent deferred TLB flush changes.
>>
>> Is this a positive thing though? In the context of some related work something
>> similar was mentioned iirc, accompanied by stating that this is actually
>> problematic. A guest in log-dirty mode generally wants to be making progress,
>> but also wants to be throttled enough to limit re-dirtying, such that
>> subsequent iterations (in particular the final one) of page contents
>> migration won't have to process overly many pages a 2nd time.
> 
> In the context of a real migration, both the process copying the pages
> out of the guest and the guest itself will be hitting the TLB flush lock
> so reducing that bottleneck may increase throughput on both sides.
> Whether or not the overall migration time increases or decreases depends
> on many factors (number of migrations in parallel, the rate the guest is
> dirtying memory, the line speed of the NIC, whether PML is used, ...)
> which is why I measured a more controlled scenario to demonstrate the
> change.
> 
> IMO throttling of a guest during a migration should be something
> intentional and controlled by userspace policy rather than a side effect
> of some internal global locks.

I definitely agree here, but side effects going away may make it necessary to
add such explicit throttling.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-02  8:49     ` Jan Beulich
@ 2026-04-02 10:57       ` Ross Lagerwall
  2026-04-02 11:57         ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-02 10:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 4/2/26 9:49 AM, Jan Beulich wrote:
> On 02.04.2026 10:40, Ross Lagerwall wrote:
>> On 4/2/26 7:09 AM, Jan Beulich wrote:
>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
>>>> We have observed that the TLB flush lock can be a point of contention for
>>>> certain workloads, e.g. migrating 10 VMs off a host during a host evacuation.
>>>>
>>>> Performance numbers:
>>>>
>>>> I wrote a synthetic benchmark to measure the performance. The benchmark has one
>>>> or more CPUs in Xen calling on_selected_cpus() with between 1 and 64 CPUs in
>>>> the selected mask. The executed function simply delays for 500 microseconds.
>>>>
>>>> The table below shows the % change in execution time of on_selected_cpus():
>>>>
>>>>                     1 thread   2 threads    4 threads
>>>> 1 CPU in mask     0.02       -35.23       -51.18
>>>> 2 CPUs in mask    0.01       -47.20       -69.27
>>>> 4 CPUs in mask    -0.02      -42.40       -66.55
>>>> 8 CPUs in mask    -0.03      -47.82       -68.39
>>>> 16 CPUs in mask   0.12       -41.95       -58.26
>>>> 32 CPUs in mask   0.02       -25.43       -39.35
>>>> 64 CPUs in mask   0.00       -24.70       -37.83
>>>>
>>>> With 1 thread (i.e. no contention), there is no regression in execution time.
>>>> With multiple threads, as expected there is a significant improvement in
>>>> execution time.
>>>>
>>>> As a more practical benchmark to simulate host evacuation, I measured the
>>>> memory dirtying rate across 10 VMs after enabling log dirty (on an AMD system,
>>>> so without PML). The rate increased by 16% with this patch series, even
>>>> after the recent deferred TLB flush changes.
>>>
>>> Is this a positive thing though? In the context of some related work something
>>> similar was mentioned iirc, accompanied by stating that this is actually
>>> problematic. A guest in log-dirty mode generally wants to be making progress,
>>> but also wants to be throttled enough to limit re-dirtying, such that
>>> subsequent iterations (in particular the final one) of page contents
>>> migration won't have to process overly many pages a 2nd time.
>>
>> In the context of a real migration, both the process copying the pages
>> out of the guest and the guest itself will be hitting the TLB flush lock
>> so reducing that bottleneck may increase throughput on both sides.
>> Whether or not the overall migration time increases or decreases depends
>> on many factors (number of migrations in parallel, the rate the guest is
>> dirtying memory, the line speed of the NIC, whether PML is used, ...)
>> which is why I measured a more controlled scenario to demonstrate the
>> change.
>>
>> IMO throttling of a guest during a migration should be something
>> intentional and controlled by userspace policy rather than a side effect
>> of some internal global locks.
> 
> I definitely agree here, but side effects going away may make it necessary to
> add such explicit throttling.
> 

Explicit throttling is much more important for the already existing
case of Intel systems with PML. With log dirty enabled, a VM on an Intel
system can dirty memory an order of magnitude faster than an AMD system
without PML.

As an aside, for the same test an Intel machine without PML is still a
lot faster than AMD so there is probably something to improve in this
area for AMD machines.

Ross


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-02 10:57       ` Ross Lagerwall
@ 2026-04-02 11:57         ` Andrew Cooper
  2026-04-09  8:13           ` Roger Pau Monné
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2026-04-02 11:57 UTC (permalink / raw)
  To: Ross Lagerwall, Jan Beulich
  Cc: Andrew Cooper, Roger Pau Monné, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> On 4/2/26 9:49 AM, Jan Beulich wrote:
>> On 02.04.2026 10:40, Ross Lagerwall wrote:
>>> On 4/2/26 7:09 AM, Jan Beulich wrote:
>>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
>>>>> We have observed that the TLB flush lock can be a point of
>>>>> contention for
>>>>> certain workloads, e.g. migrating 10 VMs off a host during a host
>>>>> evacuation.
>>>>>
>>>>> Performance numbers:
>>>>>
>>>>> I wrote a synthetic benchmark to measure the performance. The
>>>>> benchmark has one
>>>>> or more CPUs in Xen calling on_selected_cpus() with between 1 and
>>>>> 64 CPUs in
>>>>> the selected mask. The executed function simply delays for 500
>>>>> microseconds.
>>>>>
>>>>> The table below shows the % change in execution time of
>>>>> on_selected_cpus():
>>>>>
>>>>>                     1 thread   2 threads    4 threads
>>>>> 1 CPU in mask     0.02       -35.23       -51.18
>>>>> 2 CPUs in mask    0.01       -47.20       -69.27
>>>>> 4 CPUs in mask    -0.02      -42.40       -66.55
>>>>> 8 CPUs in mask    -0.03      -47.82       -68.39
>>>>> 16 CPUs in mask   0.12       -41.95       -58.26
>>>>> 32 CPUs in mask   0.02       -25.43       -39.35
>>>>> 64 CPUs in mask   0.00       -24.70       -37.83
>>>>>
>>>>> With 1 thread (i.e. no contention), there is no regression in
>>>>> execution time.
>>>>> With multiple threads, as expected there is a significant
>>>>> improvement in
>>>>> execution time.
>>>>>
>>>>> As a more practical benchmark to simulate host evacuation, I
>>>>> measured the
>>>>> memory dirtying rate across 10 VMs after enabling log dirty (on an
>>>>> AMD system,
>>>>> so without PML). The rate increased by 16% with this patch series,
>>>>> even
>>>>> after the recent deferred TLB flush changes.
>>>>
>>>> Is this a positive thing though? In the context of some related
>>>> work something
>>>> similar was mentioned iirc, accompanied by stating that this is
>>>> actually
>>>> problematic. A guest in log-dirty mode generally wants to be making
>>>> progress,
>>>> but also wants to be throttled enough to limit re-dirtying, such that
>>>> subsequent iterations (in particular the final one) of page contents
>>>> migration won't have to process overly many pages a 2nd time.
>>>
>>> In the context of a real migration, both the process copying the pages
>>> out of the guest and the guest itself will be hitting the TLB flush
>>> lock
>>> so reducing that bottleneck may increase throughput on both sides.
>>> Whether or not the overall migration time increases or decreases
>>> depends
>>> on many factors (number of migrations in parallel, the rate the
>>> guest is
>>> dirtying memory, the line speed of the NIC, whether PML is used, ...)
>>> which is why I measured a more controlled scenario to demonstrate the
>>> change.
>>>
>>> IMO throttling of a guest during a migration should be something
>>> intentional and controlled by userspace policy rather than a side
>>> effect
>>> of some internal global locks.
>>
>> I definitely agree here, but side effects going away may make it
>> necessary to
>> add such explicit throttling.
>>
>
> Explicit throttling is much more important for the already existing
> case of Intel systems with PML. With log dirty enabled, a VM on an Intel
> system can dirty memory an order of magnitude faster than an AMD system
> without PML.
>
> As an aside, for the same test an Intel machine without PML is still a
> lot faster than AMD so there is probably something to improve in this
> area for AMD machines. 

AMD have PML on the way. 
https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB

There is a mis-step with how support for Intel's PML is done, meaning
that draining the vCPU's PML buffers is extraordinarily expensive even
when there's no action to take.  (Specifically, the remote VMCS acquire)

A better option is this:  When logdirty is active, any VMExit will drain
the PML buffer into the logdirty bitmap before processing the main exit
reason.  This way, you drain all the PML buffers by just IPI-ing the
domain dirty mask.

~Andrew


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush
  2026-04-01 16:35 ` [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush Ross Lagerwall
@ 2026-04-08 15:21   ` Jan Beulich
  2026-04-08 15:48     ` Ross Lagerwall
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2026-04-08 15:21 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 01.04.2026 18:35, Ross Lagerwall wrote:
> A future change to on_selected_cpus() will change the semantics of the
> wait parameter so that it doesn't wait for remote CPUs to "check in" if
> wait == 0. Adjust the call here to retain the existing behaviour so it
> continues to wait for the remote CPUs to VMExit.

Doesn't this go too far though? IOW wouldn't we better make the "wait"
parameter a tristate then?

As to the semantic change, peeking at patch 2 I don't see you discussing
the safety of doing so for the several other callers that also pass 0
right now.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush
  2026-04-08 15:21   ` Jan Beulich
@ 2026-04-08 15:48     ` Ross Lagerwall
  2026-04-09  6:55       ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Ross Lagerwall @ 2026-04-08 15:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 4/8/26 4:21 PM, Jan Beulich wrote:
> On 01.04.2026 18:35, Ross Lagerwall wrote:
>> A future change to on_selected_cpus() will change the semantics of the
>> wait parameter so that it doesn't wait for remote CPUs to "check in" if
>> wait == 0. Adjust the call here to retain the existing behaviour so it
>> continues to wait for the remote CPUs to VMExit.
> 
> Doesn't this go too far though? IOW wouldn't we better make the "wait"
> parameter a tristate then?

 From what I can see, the current wait == 0 behaviour only exists as a
limitation of the implementation: The local CPU needs to wait for all
the remote CPUs to have read "func" and "info" before dropping the lock
to avoid a race condition.

With that limitation removed, I don't see a valid reason to maintain
this beahviour as an option, though I could be convinced otherwise.

> 
> As to the semantic change, peeking at patch 2 I don't see you discussing
> the safety of doing so for the several other callers that also pass 0
> right now.
> 

I did audit the other callers (including via smp_call_function() and
on_each_cpu()) and I believe they are all fine as is.
I can update the commit message to say something along these lines.

Ross

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless
  2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
@ 2026-04-08 16:11   ` Jan Beulich
  2026-04-09  8:09   ` Roger Pau Monné
  2026-04-09 11:46   ` Jan Beulich
  2 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-04-08 16:11 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Roger Pau Monné, Stefano Stabellini, xen-devel

On 01.04.2026 18:35, Ross Lagerwall wrote:
> --- a/xen/common/smp.c
> +++ b/xen/common/smp.c
> @@ -24,13 +24,15 @@
>  /*
>   * Structure and data for smp_call_function()/on_selected_cpus().
>   */
> -static DEFINE_SPINLOCK(call_lock);
> -static struct call_data_struct {
> +struct call_data_struct {
>      void (*func) (void *info);
>      void *info;
>      int wait;
> -    cpumask_t selected;
> -} call_data;
> +    cpumask_t selected __cacheline_aligned;
> +};
> +
> +DEFINE_PER_CPU(struct call_data_struct, call_data);
> +static cpumask_t tasks;

Only first pass feedback for now (I still need to go over all of this more
thoroughly).

Having cpumask_t variables anywhere (not just on the stack, where they're
particularly problematic) isn't very nice. Can this become cpumask_var_t?
(We really also need to deal with the one in smp_call_function(), for
example.)

> @@ -50,55 +52,84 @@ void on_selected_cpus(
>      void *info,
>      int wait)
>  {
> +    struct call_data_struct *data;
> +    unsigned int cpu = smp_processor_id();
> +
>      ASSERT(local_irq_is_enabled());
>      ASSERT(cpumask_subset(selected, &cpu_online_map));
>  
> -    spin_lock(&call_lock);
> +    if ( cpumask_empty(selected) )
> +        return;
> +
> +    data = &this_cpu(call_data);
>  
> -    cpumask_copy(&call_data.selected, selected);
> +    if ( !data->wait )
> +    {
> +        /* Wait for any previous async call to complete */
> +        while ( !cpumask_empty(&data->selected) )
> +            cpu_relax();
> +
> +        cpumask_clear_cpu(cpu, &tasks);

Since you set this bit again almost immediately, the above can only be to
make sure that ...

> +    }
>  
> -    if ( cpumask_empty(&call_data.selected) )
> -        goto out;
> +    data->func = func;
> +    data->info = info;
> +    data->wait = wait;

... these updates and ...

> -    call_data.func = func;
> -    call_data.info = info;
> -    call_data.wait = wait;
> +    smp_wmb();
>  
> -    smp_send_call_function_mask(&call_data.selected);
> +    cpumask_copy(&data->selected, selected);

... and this copying happen with the bit clear. Don't you need another
barrier then, though (between cpumask_clear_cpu() and the writes)?

Further isn't the barrier you add coming too early? While the bit in
tasks is clear, nobody's going to look at ->selected. Doesn't the
barrier need to live here, to isolate from ...

> -    while ( !cpumask_empty(&call_data.selected) )
> -        cpu_relax();
> +    cpumask_set_cpu(cpu, &tasks);

... this?

> -out:
> -    spin_unlock(&call_lock);
> +    smp_send_call_function_mask(&data->selected);
> +
> +    if ( wait )
> +    {
> +        while ( !cpumask_empty(&data->selected) )
> +            cpu_relax();
> +
> +        cpumask_clear_cpu(cpu, &tasks);
> +    }
>  }
>  
>  void smp_call_function_interrupt(void)
>  {
> -    void (*func)(void *info) = call_data.func;
> -    void *info = call_data.info;
>      unsigned int cpu = smp_processor_id();
> -
> -    if ( !cpumask_test_cpu(cpu, &call_data.selected) )
> -        return;
> +    unsigned int i;
> +    struct call_data_struct *data;
> +    void (*func)(void *info);
> +    void *info;

Please move into the loop's scope whatever can be moved there.

>      irq_enter();
>  
> -    if ( unlikely(!func) )
> -    {
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -    }
> -    else if ( call_data.wait )
> -    {
> -        (*func)(info);
> -        smp_mb();
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -    }
> -    else
> +    for_each_cpu ( i, &tasks )
>      {
> -        smp_mb();
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -        (*func)(info);
> +        data = &per_cpu(call_data, i);
> +
> +        if ( !cpumask_test_cpu(cpu, &data->selected) )
> +            continue;
> +
> +        smp_rmb();

This barrier looks as if it also needs to move (up).

Jan

> +        func = data->func;
> +        info = data->info;
> +
> +        if ( unlikely(!func) )
> +        {
> +            cpumask_clear_cpu(cpu, &data->selected);
> +        }
> +        else if ( data->wait )
> +        {
> +            (*func)(info);
> +            smp_mb();
> +            cpumask_clear_cpu(cpu, &data->selected);
> +        }
> +        else
> +        {
> +            smp_mb();
> +            cpumask_clear_cpu(cpu, &data->selected);
> +            (*func)(info);
> +        }
>      }
>  
>      irq_exit();



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush
  2026-04-08 15:48     ` Ross Lagerwall
@ 2026-04-09  6:55       ` Jan Beulich
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-04-09  6:55 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: Andrew Cooper, Roger Pau Monné, xen-devel

On 08.04.2026 17:48, Ross Lagerwall wrote:
> On 4/8/26 4:21 PM, Jan Beulich wrote:
>> On 01.04.2026 18:35, Ross Lagerwall wrote:
>>> A future change to on_selected_cpus() will change the semantics of the
>>> wait parameter so that it doesn't wait for remote CPUs to "check in" if
>>> wait == 0. Adjust the call here to retain the existing behaviour so it
>>> continues to wait for the remote CPUs to VMExit.
>>
>> Doesn't this go too far though? IOW wouldn't we better make the "wait"
>> parameter a tristate then?
> 
>  From what I can see, the current wait == 0 behaviour only exists as a
> limitation of the implementation: The local CPU needs to wait for all
> the remote CPUs to have read "func" and "info" before dropping the lock
> to avoid a race condition.
> 
> With that limitation removed, I don't see a valid reason to maintain
> this beahviour as an option, though I could be convinced otherwise.

Well, the specific case here is where that behavior is wanted. Rather than
penalizing that case (by using "wait until complete"), I'm asking to retain
prior behavior ("wait until 'checked in'"). Even if (for now) only for this
one case.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless
  2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
  2026-04-08 16:11   ` Jan Beulich
@ 2026-04-09  8:09   ` Roger Pau Monné
  2026-04-09 11:46   ` Jan Beulich
  2 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-04-09  8:09 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: xen-devel, Andrew Cooper, Anthony PERARD, Michal Orzel,
	Jan Beulich, Julien Grall, Stefano Stabellini

On Wed, Apr 01, 2026 at 05:35:20PM +0100, Ross Lagerwall wrote:
> on_selected_cpus() holds a global lock even if the function is to be
> called on non-overlapping CPUs. This is a scalability bottleneck so to
> avoid that:
> 
> 1. Remove the global lock.
> 2. Make call_data_struct per-CPU.
> 3. Track which CPUs are currently running on_selected_cpus() using a
>    global CPU mask. This tells CPUs running the interrupt which per-CPU
>    call_data_structs to look at.
> 
> Since the call data is now per-CPU, skip waiting for CPUs to "check in"
> for async calls. Instead, delay it until the next time
> on_selected_cpus() is called by which point there should be nothing to
> wait for.

One concern I have about this approach is loosing the fairness that
the usage of the spin lock provided.  Without the spinlock, and with
the usage of for_each_cpu ( i, &tasks ) to process any pending tasks,
it means the tasks from lower CPUs will always get preference over
the tasks from higher CPUs, regardless of when they have been
queued.

Ideally we need some kind of mechanism that can do a FIFO style
processing of the pending tasks, otherwise we are introducing a
performance penalty to high CPU IDs compared to low ones, and that's
likely to get worse as systems get bigger.

Thanks, Roger.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
  2026-04-02 11:57         ` Andrew Cooper
@ 2026-04-09  8:13           ` Roger Pau Monné
  0 siblings, 0 replies; 17+ messages in thread
From: Roger Pau Monné @ 2026-04-09  8:13 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Ross Lagerwall, Jan Beulich, Anthony PERARD, Michal Orzel,
	Julien Grall, Stefano Stabellini, xen-devel

On Thu, Apr 02, 2026 at 01:57:00PM +0200, Andrew Cooper wrote:
> On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> > On 4/2/26 9:49 AM, Jan Beulich wrote:
> >> On 02.04.2026 10:40, Ross Lagerwall wrote:
> >>> On 4/2/26 7:09 AM, Jan Beulich wrote:
> >>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
> >>>>> We have observed that the TLB flush lock can be a point of
> >>>>> contention for
> >>>>> certain workloads, e.g. migrating 10 VMs off a host during a host
> >>>>> evacuation.
> >>>>>
> >>>>> Performance numbers:
> >>>>>
> >>>>> I wrote a synthetic benchmark to measure the performance. The
> >>>>> benchmark has one
> >>>>> or more CPUs in Xen calling on_selected_cpus() with between 1 and
> >>>>> 64 CPUs in
> >>>>> the selected mask. The executed function simply delays for 500
> >>>>> microseconds.
> >>>>>
> >>>>> The table below shows the % change in execution time of
> >>>>> on_selected_cpus():
> >>>>>
> >>>>>                     1 thread   2 threads    4 threads
> >>>>> 1 CPU in mask     0.02       -35.23       -51.18
> >>>>> 2 CPUs in mask    0.01       -47.20       -69.27
> >>>>> 4 CPUs in mask    -0.02      -42.40       -66.55
> >>>>> 8 CPUs in mask    -0.03      -47.82       -68.39
> >>>>> 16 CPUs in mask   0.12       -41.95       -58.26
> >>>>> 32 CPUs in mask   0.02       -25.43       -39.35
> >>>>> 64 CPUs in mask   0.00       -24.70       -37.83
> >>>>>
> >>>>> With 1 thread (i.e. no contention), there is no regression in
> >>>>> execution time.
> >>>>> With multiple threads, as expected there is a significant
> >>>>> improvement in
> >>>>> execution time.
> >>>>>
> >>>>> As a more practical benchmark to simulate host evacuation, I
> >>>>> measured the
> >>>>> memory dirtying rate across 10 VMs after enabling log dirty (on an
> >>>>> AMD system,
> >>>>> so without PML). The rate increased by 16% with this patch series,
> >>>>> even
> >>>>> after the recent deferred TLB flush changes.
> >>>>
> >>>> Is this a positive thing though? In the context of some related
> >>>> work something
> >>>> similar was mentioned iirc, accompanied by stating that this is
> >>>> actually
> >>>> problematic. A guest in log-dirty mode generally wants to be making
> >>>> progress,
> >>>> but also wants to be throttled enough to limit re-dirtying, such that
> >>>> subsequent iterations (in particular the final one) of page contents
> >>>> migration won't have to process overly many pages a 2nd time.
> >>>
> >>> In the context of a real migration, both the process copying the pages
> >>> out of the guest and the guest itself will be hitting the TLB flush
> >>> lock
> >>> so reducing that bottleneck may increase throughput on both sides.
> >>> Whether or not the overall migration time increases or decreases
> >>> depends
> >>> on many factors (number of migrations in parallel, the rate the
> >>> guest is
> >>> dirtying memory, the line speed of the NIC, whether PML is used, ...)
> >>> which is why I measured a more controlled scenario to demonstrate the
> >>> change.
> >>>
> >>> IMO throttling of a guest during a migration should be something
> >>> intentional and controlled by userspace policy rather than a side
> >>> effect
> >>> of some internal global locks.
> >>
> >> I definitely agree here, but side effects going away may make it
> >> necessary to
> >> add such explicit throttling.
> >>
> >
> > Explicit throttling is much more important for the already existing
> > case of Intel systems with PML. With log dirty enabled, a VM on an Intel
> > system can dirty memory an order of magnitude faster than an AMD system
> > without PML.
> >
> > As an aside, for the same test an Intel machine without PML is still a
> > lot faster than AMD so there is probably something to improve in this
> > area for AMD machines. 
> 
> AMD have PML on the way. 
> https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB
> 
> There is a mis-step with how support for Intel's PML is done, meaning
> that draining the vCPU's PML buffers is extraordinarily expensive even
> when there's no action to take.  (Specifically, the remote VMCS acquire)
> 
> A better option is this:  When logdirty is active, any VMExit will drain
> the PML buffer into the logdirty bitmap before processing the main exit
> reason.  This way, you drain all the PML buffers by just IPI-ing the
> domain dirty mask.

Seems like a good and easy to implement optimization.  However we are
already too fast when using PML in the sense that the toolstack cannot
keep up with the rate of dirtied memory :).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless
  2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
  2026-04-08 16:11   ` Jan Beulich
  2026-04-09  8:09   ` Roger Pau Monné
@ 2026-04-09 11:46   ` Jan Beulich
  2 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-04-09 11:46 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Andrew Cooper, Anthony PERARD, Michal Orzel, Julien Grall,
	Roger Pau Monné, Stefano Stabellini, xen-devel

On 01.04.2026 18:35, Ross Lagerwall wrote:
> --- a/xen/common/smp.c
> +++ b/xen/common/smp.c
> @@ -24,13 +24,15 @@
>  /*
>   * Structure and data for smp_call_function()/on_selected_cpus().
>   */
> -static DEFINE_SPINLOCK(call_lock);
> -static struct call_data_struct {
> +struct call_data_struct {
>      void (*func) (void *info);
>      void *info;
>      int wait;
> -    cpumask_t selected;
> -} call_data;
> +    cpumask_t selected __cacheline_aligned;

The adding of the alignment attribute isn't strictly required here, is it?
Can it, with its own justification, be split out?

However, wrt what I said in the first reply, putting a full cpumask_t in
per-CPU data is even a little worse than having one as a global. Imo this
also wants to become cpumask_var_t. Then the attribute wouldn't be quite
applicable anymore anyway.

> @@ -50,55 +52,84 @@ void on_selected_cpus(
>      void *info,
>      int wait)
>  {
> +    struct call_data_struct *data;
> +    unsigned int cpu = smp_processor_id();
> +
>      ASSERT(local_irq_is_enabled());
>      ASSERT(cpumask_subset(selected, &cpu_online_map));
>  
> -    spin_lock(&call_lock);
> +    if ( cpumask_empty(selected) )
> +        return;
> +
> +    data = &this_cpu(call_data);
>  
> -    cpumask_copy(&call_data.selected, selected);
> +    if ( !data->wait )
> +    {
> +        /* Wait for any previous async call to complete */
> +        while ( !cpumask_empty(&data->selected) )
> +            cpu_relax();
> +
> +        cpumask_clear_cpu(cpu, &tasks);

In the description you say "Track which CPUs are currently running
on_selected_cpus()", yet as per this the bit can remain set after the
function was left. Which isn't just an issue of describing things
correctly; there's also a performance concern: The IPI handler(s) will
need to carry out more work than necessary. That's not a lot of work (only
the subsequent cpumask_test_cpu() there), but also not nothing.

I also think another barrier is needed above here: We may only clear the
bit in tasks when the empty ->selected is globally visible. More generally,
since the bit in tasks is what everything derives from, barriers are
apparently needed around all of its updating / accessing. Which also meant
that ...

> +    }
>  
> -    if ( cpumask_empty(&call_data.selected) )
> -        goto out;
> +    data->func = func;
> +    data->info = info;
> +    data->wait = wait;
>  
> -    call_data.func = func;
> -    call_data.info = info;
> -    call_data.wait = wait;
> +    smp_wmb();

... besides (as already indicated) this barrier needing to move ...

> -    smp_send_call_function_mask(&call_data.selected);
> +    cpumask_copy(&data->selected, selected);

... here, I think that another one is going to be needed ...

> -    while ( !cpumask_empty(&call_data.selected) )
> -        cpu_relax();
> +    cpumask_set_cpu(cpu, &tasks);

... here, such that ...

> -out:
> -    spin_unlock(&call_lock);
> +    smp_send_call_function_mask(&data->selected);

... upon receipt of the IPI the target sees the up-to-date value.

> +    if ( wait )
> +    {
> +        while ( !cpumask_empty(&data->selected) )
> +            cpu_relax();
> +
> +        cpumask_clear_cpu(cpu, &tasks);
> +    }
>  }
>  
>  void smp_call_function_interrupt(void)
>  {
> -    void (*func)(void *info) = call_data.func;
> -    void *info = call_data.info;
>      unsigned int cpu = smp_processor_id();
> -
> -    if ( !cpumask_test_cpu(cpu, &call_data.selected) )
> -        return;
> +    unsigned int i;
> +    struct call_data_struct *data;
> +    void (*func)(void *info);
> +    void *info;
>  
>      irq_enter();
>  
> -    if ( unlikely(!func) )
> -    {
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -    }
> -    else if ( call_data.wait )
> -    {
> -        (*func)(info);
> -        smp_mb();
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -    }
> -    else
> +    for_each_cpu ( i, &tasks )
>      {
> -        smp_mb();
> -        cpumask_clear_cpu(cpu, &call_data.selected);
> -        (*func)(info);
> +        data = &per_cpu(call_data, i);
> +
> +        if ( !cpumask_test_cpu(cpu, &data->selected) )
> +            continue;
> +
> +        smp_rmb();
> +        func = data->func;
> +        info = data->info;
> +
> +        if ( unlikely(!func) )
> +        {
> +            cpumask_clear_cpu(cpu, &data->selected);
> +        }
> +        else if ( data->wait )
> +        {
> +            (*func)(info);
> +            smp_mb();
> +            cpumask_clear_cpu(cpu, &data->selected);
> +        }
> +        else
> +        {
> +            smp_mb();
> +            cpumask_clear_cpu(cpu, &data->selected);
> +            (*func)(info);

I understand you only re-indent this code, but I'm struggling with the
purpose of the barrier here. With the smp_rmb() above there are no reads
to isolate (data->func and data->info can't change while the bit in
data->selected is still set). And there are no earlier writes at all,
unless anything done outside of the interrupt handler would matter.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus()
  2026-04-01 16:35 ` [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus() Ross Lagerwall
@ 2026-04-20 14:04   ` Jan Beulich
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2026-04-20 14:04 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Anthony PERARD, Andrew Cooper, Roger Pau Monné, xen-devel

On 01.04.2026 18:35, Ross Lagerwall wrote:
> @@ -275,21 +274,18 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
>      if ( (flags & ~FLUSH_ORDER_MASK) &&
>           !cpumask_subset(mask, cpumask_of(cpu)) )
>      {
> +        cpumask_t flush_cpumask;

I think we want to avoid introduction of a new on-stack variable of this
type. For large NR_CPUS they consume excessively much stack space; it
wasn't that long ago that (iirc) in HPET code we successfully eliminated
one.

Instead a per-CPU variable of type cpumask_var_t may want using.

Jan


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-04-20 14:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-01 16:35 [PATCH v1 0/3] Lockless SMP function call and TLB flushing Ross Lagerwall
2026-04-01 16:35 ` [PATCH v1 1/3] x86/hap: Wait for remote CPUs during TLB flush Ross Lagerwall
2026-04-08 15:21   ` Jan Beulich
2026-04-08 15:48     ` Ross Lagerwall
2026-04-09  6:55       ` Jan Beulich
2026-04-01 16:35 ` [PATCH v1 2/3] xen/smp: Rewrite on_selected_cpus() to be lockless Ross Lagerwall
2026-04-08 16:11   ` Jan Beulich
2026-04-09  8:09   ` Roger Pau Monné
2026-04-09 11:46   ` Jan Beulich
2026-04-01 16:35 ` [PATCH v1 3/3] x86/smp: Rewrite TLB flush using on_selected_cpus() Ross Lagerwall
2026-04-20 14:04   ` Jan Beulich
2026-04-02  6:09 ` [PATCH v1 0/3] Lockless SMP function call and TLB flushing Jan Beulich
2026-04-02  8:40   ` Ross Lagerwall
2026-04-02  8:49     ` Jan Beulich
2026-04-02 10:57       ` Ross Lagerwall
2026-04-02 11:57         ` Andrew Cooper
2026-04-09  8:13           ` Roger Pau Monné

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.