[PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug
@ 2024-05-29  9:01 Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count Roger Pau Monne
                   ` (9 more replies)
  0 siblings, 10 replies; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini, Oleksii Kurochko

Hello,

The following series aim to fix interrupt handling when doing CPU
plug/unplug operations.  Without this series running:

cpus=`xl info max_cpu_id`
while [ 1 ]; do
    for i in `seq 1 $cpus`; do
        xen-hptool cpu-offline $i;
        xen-hptool cpu-online $i;
    done
done

Quite quickly results in interrupts getting lost and "No irq handler for
vector" messages on the Xen console.  Drivers in dom0 also start getting
interrupt timeouts and the system becomes unusable.

After applying the series running the loop over night still result in a
fully usable system, no  "No irq handler for vector" messages at all, no
interrupt loses reported by dom0.  Test with
x2apic-mode={mixed,cluster}.

I'm tagging this for 4.19 as it's IMO bugfixes, but the series has grown
quite bigger than expected, and hence we need to be careful to not
introduce breakages late in the release cycle.  I've attempted to
document all code as good as I could, interrupt handling has some
unexpected corner cases that are hard to diagnose and reason about.

I'm currently also doing some extra testing with XenRT in case I've
missed something.

Thanks, Roger.

Roger Pau Monne (9):
  x86/irq: remove offline CPUs from old CPU mask when adjusting
    move_cleanup_count
  xen/cpu: do not get the CPU map in stop_machine_run()
  xen/cpu: ensure get_cpu_maps() returns false if CPU operations are
    underway
  x86/irq: describe how the interrupt CPU movement works
  x86/irq: limit interrupt movement done by fixup_irqs()
  x86/irq: restrict CPU movement in set_desc_affinity()
  x86/irq: deal with old_cpu_mask for interrupts in movement in
    fixup_irqs()
  x86/irq: handle moving interrupts in _assign_irq_vector()
  x86/irq: forward pending interrupts to new destination in fixup_irqs()

 xen/arch/x86/apic.c             |   5 +
 xen/arch/x86/include/asm/apic.h |   3 +
 xen/arch/x86/include/asm/irq.h  |  28 +++++-
 xen/arch/x86/irq.c              | 172 +++++++++++++++++++++++++-------
 xen/common/cpu.c                |   8 +-
 xen/common/stop_machine.c       |  15 +--
 xen/include/xen/cpu.h           |   2 +
 xen/include/xen/rwlock.h        |   2 +
 8 files changed, 190 insertions(+), 45 deletions(-)

-- 
2.44.0

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29 12:40   ` Jan Beulich
  2024-05-29  9:01 ` [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run() Roger Pau Monne
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

When adjusting move_cleanup_count to account for CPUs that are offline also
adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
those again creating and create an imbalance in move_cleanup_count.

Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index c16205a9beb6..9716e00e873b 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
             desc->arch.move_cleanup_count -= cpumask_weight(affinity);
             if ( !desc->arch.move_cleanup_count )
                 release_old_vec(desc);
+            else
+                /*
+                 * Adjust old_cpu_mask to account for the offline CPUs,
+                 * otherwise further calls to fixup_irqs() could subtract those
+                 * again and possibly underflow the counter.
+                 */
+                cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
+                            &cpu_online_map);
         }
 
         if ( !desc->action || cpumask_subset(desc->affinity, mask) )
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29 13:04   ` Jan Beulich
  2024-05-29  9:01 ` [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway Roger Pau Monne
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini

The current callers of stop_machine_run() outside of init code already have the
CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
to lock again.

Replace the get_cpu_maps() call with a suitable unreachable assert.

Further changes will modify the conditions under which get_cpu_maps() returns
success and without the adjustment proposed here the usage of
stop_machine_run() in cpu_down() would then return an error.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/common/cpu.c          |  5 +++++
 xen/common/stop_machine.c | 15 ++++++++-------
 xen/include/xen/cpu.h     |  2 ++
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 8709db4d2957..6173220e771b 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -68,6 +68,11 @@ void cpu_hotplug_done(void)
     write_unlock(&cpu_add_remove_lock);
 }
 
+bool cpu_map_locked(void)
+{
+    return rw_is_locked(&cpu_add_remove_lock);
+}
+
 static NOTIFIER_HEAD(cpu_chain);
 
 void __init register_cpu_notifier(struct notifier_block *nb)
diff --git a/xen/common/stop_machine.c b/xen/common/stop_machine.c
index 398cfd507c10..7face75648e8 100644
--- a/xen/common/stop_machine.c
+++ b/xen/common/stop_machine.c
@@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
     BUG_ON(!local_irq_is_enabled());
     BUG_ON(!is_idle_vcpu(current));
 
-    /* cpu_online_map must not change. */
-    if ( !get_cpu_maps() )
+    /*
+     * cpu_online_map must not change.  The only two callers of
+     * stop_machine_run() outside of init code already have the CPU map locked.
+     */
+    if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
+    {
+        ASSERT_UNREACHABLE();
         return -EBUSY;
+    }
 
     nr_cpus = num_online_cpus();
     if ( cpu_online(this) )
@@ -92,10 +98,7 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
 
     /* Must not spin here as the holder will expect us to be descheduled. */
     if ( !spin_trylock(&stopmachine_lock) )
-    {
-        put_cpu_maps();
         return -EBUSY;
-    }
 
     stopmachine_data.fn = fn;
     stopmachine_data.fn_data = data;
@@ -136,8 +139,6 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
 
     spin_unlock(&stopmachine_lock);
 
-    put_cpu_maps();
-
     return ret;
 }
 
diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
index e1d4eb59675c..d8c8264c58b0 100644
--- a/xen/include/xen/cpu.h
+++ b/xen/include/xen/cpu.h
@@ -13,6 +13,8 @@ void put_cpu_maps(void);
 void cpu_hotplug_begin(void);
 void cpu_hotplug_done(void);
 
+bool cpu_map_locked(void);
+
 /* Receive notification of CPU hotplug events. */
 void register_cpu_notifier(struct notifier_block *nb);
 
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run() Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29 13:35   ` Jan Beulich
  2024-05-29  9:01 ` [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Andrew Cooper, George Dunlap, Jan Beulich,
	Julien Grall, Stefano Stabellini

Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
a cpu_hotplug_{begin,done}() region the function will still return success,
because a CPU taking the rwlock in read mode after having taken it in write
mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
as it should always return false when called with a CPU hot{,un}plug operation
is in progress.  Otherwise the logic in send_IPI_mask() for example is wrong,
as it could decide to use the shorthand even when a CPU operation is in
progress.

Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
already hold in write mode by the current CPU, as read_trylock() would
otherwise return true.

Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/common/cpu.c         | 3 ++-
 xen/include/xen/rwlock.h | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 6173220e771b..d76f80fe2e99 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -49,7 +49,8 @@ static DEFINE_RWLOCK(cpu_add_remove_lock);
 
 bool get_cpu_maps(void)
 {
-    return read_trylock(&cpu_add_remove_lock);
+    return !rw_is_write_locked_by_me(&cpu_add_remove_lock) &&
+           read_trylock(&cpu_add_remove_lock);
 }
 
 void put_cpu_maps(void)
diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index a2e98cad343e..4e7802821859 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -316,6 +316,8 @@ static always_inline void write_lock_irq(rwlock_t *l)
 
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
+#define rw_is_write_locked_by_me(l) \
+    lock_evaluate_nospec(_is_write_locked_by_me(atomic_read(&(l)->cnts)))
 
 
 typedef struct percpu_rwlock percpu_rwlock_t;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (2 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29 13:57   ` Jan Beulich
  2024-05-29  9:01 ` [PATCH for-4.19 5/9] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The logic to move interrupts across CPUs is complex, attempt to provide a
comment that describes the expected behavior so users of the interrupt system
have more context about the usage of the arch_irq_desc structure fields.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/include/asm/irq.h | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index 413994d2133b..80a3aa7a88b9 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -28,6 +28,32 @@ typedef struct {
 
 struct irq_desc;
 
+/*
+ * Xen logic for moving interrupts around CPUs allows manipulating interrupts
+ * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
+ * follows:
+ *
+ * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
+ * 2. New cpu_mask and vector are set, vector is setup at the new destination.
+ * 3. move_in_progress is set.
+ * 4. Interrupt source is updated to target new CPU and vector.
+ * 5. Interrupts arriving at old_cpu_mask are processed normally.
+ * 6. When an interrupt is delivered at the new destination (cpu_mask) as part
+ *    of acking the interrupt move_in_progress is cleared and move_cleanup_count
+ *    is set to the weight of online CPUs in old_cpu_mask.
+ *    IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
+ * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
+ *    vector entry and decrease the count in move_cleanup_count.  The CPU that
+ *    sets move_cleanup_count to 0 releases the vector.
+ *
+ * Note that when interrupt movement (either move_in_progress or
+ * move_cleanup_count set) is in progress it's not possible to move the
+ * interrupt to yet a different CPU.
+ *
+ * By keeping the vector in the old CPU(s) configured until the interrupt is
+ * acked on the new destination Xen allows draining any pending interrupts at
+ * the old destinations.
+ */
 struct arch_irq_desc {
         s16 vector;                  /* vector itself is only 8 bits, */
         s16 old_vector;              /* but we use -1 for unassigned  */
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 5/9] x86/irq: limit interrupt movement done by fixup_irqs()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (3 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 6/9] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such mask can have all bits set,
and hence is unlikely to be a subset of the input mask.  For example if an
interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
an all set CPU mask would cause that interrupt to be shuffled around
unconditionally.

What fixup_irqs() care about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask.  Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.

While there also adjust the comment as to the purpose of fixup_irqs().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Do not AND ->arch.cpu_mask with cpu_online_map.
 - Keep using cpumask_subset().
 - Integrate into bigger series.
---
 xen/arch/x86/include/asm/irq.h | 2 +-
 xen/arch/x86/irq.c             | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index 80a3aa7a88b9..b7dc75d0acbd 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -156,7 +156,7 @@ void free_domain_pirqs(struct domain *d);
 int map_domain_emuirq_pirq(struct domain *d, int pirq, int emuirq);
 int unmap_domain_pirq_emuirq(struct domain *d, int pirq);
 
-/* Reset irq affinities to match the given CPU mask. */
+/* Evacuate interrupts assigned to CPUs not present in the input CPU mask. */
 void fixup_irqs(const cpumask_t *mask, bool verbose);
 void fixup_eoi(void);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 9716e00e873b..1b7127090377 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2525,7 +2525,7 @@ static int __init cf_check setup_dump_irqs(void)
 }
 __initcall(setup_dump_irqs);
 
-/* Reset irq affinities to match the given CPU mask. */
+/* Evacuate interrupts assigned to CPUs not present in the input CPU mask. */
 void fixup_irqs(const cpumask_t *mask, bool verbose)
 {
     unsigned int irq;
@@ -2582,7 +2582,12 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
                             &cpu_online_map);
         }
 
-        if ( !desc->action || cpumask_subset(desc->affinity, mask) )
+        /*
+         * Avoid shuffling the interrupt around as long as current target CPUs
+         * are a subset of the input mask.  What fixup_irqs() cares about is
+         * evacuating interrupts from CPUs not in the input mask.
+         */
+        if ( !desc->action || cpumask_subset(desc->arch.cpu_mask, mask) )
         {
             spin_unlock(&desc->lock);
             continue;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 6/9] x86/irq: restrict CPU movement in set_desc_affinity()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (4 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 5/9] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 7/9] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

If external interrupts are using logical mode it's possible to have an overlap
between the current ->arch.cpu_mask and the provided mask (or TARGET_CPUS).  If
that's the case avoid assigning a new vector and just move the interrupt to a
member of ->arch.cpu_mask that overlaps with the provided mask and is online.

While there also add an extra assert to ensure the mask containing the possible
destinations is not empty before calling cpu_mask_to_apicid(), as at that point
having an empty mask is not expected.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 1b7127090377..ae8fa574d4e7 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -846,19 +846,38 @@ void cf_check irq_complete_move(struct irq_desc *desc)
 
 unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
 {
-    int ret;
-    unsigned long flags;
     cpumask_t dest_mask;
 
     if ( mask && !cpumask_intersects(mask, &cpu_online_map) )
         return BAD_APICID;
 
-    spin_lock_irqsave(&vector_lock, flags);
-    ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);
-    spin_unlock_irqrestore(&vector_lock, flags);
+    /*
+     * mask input set con contain CPUs that are not online.  To decide whether
+     * the interrupt needs to be migrated restrict the input mask to the CPUs
+     * that are online.
+     */
+    if ( mask )
+        cpumask_and(&dest_mask, mask, &cpu_online_map);
+    else
+        cpumask_copy(&dest_mask, TARGET_CPUS);
 
-    if ( ret < 0 )
-        return BAD_APICID;
+    /*
+     * Only move the interrupt if there are no CPUs left in ->arch.cpu_mask
+     * that can handle it, otherwise just shuffle it around ->arch.cpu_mask
+     * to an available destination.
+     */
+    if ( !cpumask_intersects(desc->arch.cpu_mask, &dest_mask) )
+    {
+        int ret;
+        unsigned long flags;
+
+        spin_lock_irqsave(&vector_lock, flags);
+        ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);
+        spin_unlock_irqrestore(&vector_lock, flags);
+
+        if ( ret < 0 )
+            return BAD_APICID;
+    }
 
     if ( mask )
     {
@@ -871,6 +890,7 @@ unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
         cpumask_copy(&dest_mask, desc->arch.cpu_mask);
     }
     cpumask_and(&dest_mask, &dest_mask, &cpu_online_map);
+    ASSERT(!cpumask_empty(&dest_mask));
 
     return cpu_mask_to_apicid(&dest_mask);
 }
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 7/9] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (5 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 6/9] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 8/9] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Given the current logic it's possible for ->arch.old_cpu_mask to get out of
sync: if a CPU set in old_cpu_mask is offlined and then onlined
again without old_cpu_mask having been updated the data in the mask will no
longer be accurate, as when brought back online the CPU will no longer have
old_vector configured to handle the old interrupt source.

If there's an interrupt movement in progress, and the to be offlined CPU (which
is the call context) is in the old_cpu_mask clear it and update the mask, so it
doesn't contain stale data.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index ae8fa574d4e7..8822f382bfe4 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2602,6 +2602,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
                             &cpu_online_map);
         }
 
+        if ( desc->arch.move_in_progress &&
+             !cpumask_test_cpu(cpu, &cpu_online_map) &&
+             cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
+        {
+            /*
+             * This CPU is going offline, remove it from ->arch.old_cpu_mask
+             * and possibly release the old vector if the old mask becomes
+             * empty.
+             *
+             * Note cleaning ->arch.old_cpu_mask is required if the CPU is
+             * brought offline and then online again, as when re-onlined the
+             * per-cpu vector table will no longer have ->arch.old_vector
+             * setup, and hence ->arch.old_cpu_mask would be stale.
+             */
+            cpumask_clear_cpu(cpu, desc->arch.old_cpu_mask);
+            if ( cpumask_empty(desc->arch.old_cpu_mask) )
+            {
+                desc->arch.move_in_progress = 0;
+                release_old_vec(desc);
+            }
+        }
+
         /*
          * Avoid shuffling the interrupt around as long as current target CPUs
          * are a subset of the input mask.  What fixup_irqs() cares about is
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 8/9] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (6 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 7/9] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29  9:01 ` [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
  2024-05-29  9:37 ` [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Oleksii K.
  9 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Currently there's logic in fixup_irqs() that attempts to prevent
_assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
interrupts form the CPUs not present in the input mask.  The current logic in
fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.

Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
_assign_irq_vector() cannot fail, introduce some logic in _assign_irq_vector()
to deal with interrupts that have either move_{in_progress,cleanup_count} set
and no remaining online CPUs in ->arch.cpu_mask.

If _assign_irq_vector() is requested to move an interrupt in the state
described above, first attempt to see if ->arch.old_cpu_mask contains any valid
CPUs that could be used as fallback, and if that's the case do move the
interrupt back to the previous destination.  Note this is easier because the
vector hasn't been released yet, so there's no need to allocate and setup a new
vector on the destination.

Due to the logic in fixup_irqs() that clears offline CPUs from
->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
shouldn't be possible to get into _assign_irq_vector() with
->arch.move_{in_progress,cleanup_count} set but no online CPUs in
->arch.old_cpu_mask.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 66 ++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 28 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 8822f382bfe4..33979729257e 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -553,7 +553,44 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
     }
 
     if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
-        return -EAGAIN;
+    {
+        /*
+         * If the current destination is online refuse to shuffle.  Retry after
+         * the in-progress movement has finished.
+         */
+        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
+            return -EAGAIN;
+
+        /*
+         * Fallback to the old destination if moving is in progress and the
+         * current destination is to be offlined.  This is required by
+         * fixup_irqs() to evacuate interrupts from a CPU to be offlined.
+         *
+         * Due to the logic in fixup_irqs() that clears offlined CPUs from
+         * ->arch.old_cpu_mask it shouldn't be possible to get here with
+         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
+         * ->arch.old_cpu_mask.
+         */
+        ASSERT(valid_irq_vector(desc->arch.old_vector));
+        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, mask));
+        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
+
+        /* Set the old destination as the current one. */
+        desc->arch.vector = desc->arch.old_vector;
+        cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
+
+        /* Undo any possibly done cleanup. */
+        for_each_cpu(cpu, desc->arch.cpu_mask)
+            per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
+
+        /* Cancel the pending move. */
+        desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
+        cpumask_clear(desc->arch.old_cpu_mask);
+        desc->arch.move_in_progress = 0;
+        desc->arch.move_cleanup_count = 0;
+
+        return 0;
+    }
 
     err = -ENOSPC;
 
@@ -2635,33 +2672,6 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
             continue;
         }
 
-        /*
-         * In order for the affinity adjustment below to be successful, we
-         * need _assign_irq_vector() to succeed. This in particular means
-         * clearing desc->arch.move_in_progress if this would otherwise
-         * prevent the function from succeeding. Since there's no way for the
-         * flag to get cleared anymore when there's no possible destination
-         * left (the only possibility then would be the IRQs enabled window
-         * after this loop), there's then also no race with us doing it here.
-         *
-         * Therefore the logic here and there need to remain in sync.
-         */
-        if ( desc->arch.move_in_progress &&
-             !cpumask_intersects(mask, desc->arch.cpu_mask) )
-        {
-            unsigned int cpu;
-
-            cpumask_and(affinity, desc->arch.old_cpu_mask, &cpu_online_map);
-
-            spin_lock(&vector_lock);
-            for_each_cpu(cpu, affinity)
-                per_cpu(vector_irq, cpu)[desc->arch.old_vector] = ~irq;
-            spin_unlock(&vector_lock);
-
-            release_old_vec(desc);
-            desc->arch.move_in_progress = 0;
-        }
-
         if ( !cpumask_intersects(mask, desc->affinity) )
         {
             break_affinity = true;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (7 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 8/9] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
@ 2024-05-29  9:01 ` Roger Pau Monne
  2024-05-29  9:33   ` Andrew Cooper
  2024-05-29  9:37 ` [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Oleksii K.
  9 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monne @ 2024-05-29  9:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

fixup_irqs() is used to evacuate interrupts from to be offlined CPUs.  Given
the CPU is to become offline, the normal migration logic used by Xen where the
vector in the previous target(s) is left configured until the interrupt is
received on the new destination is not suitable.

Instead attempt to do as much as possible in order to prevent loosing
interrupts.  If fixup_irqs() is called from the CPU to be offlined (as is
currently the case) attempt to forward pending vectors when interrupts that
target the current CPU are migrated to a different destination.

Additionally, for interrupts that have already been moved from the current CPU
prior to the call to fixup_irqs() but that haven't been delivered to the new
destination (iow: interrupts with move_in_progress set and the current CPU set
in ->arch.old_cpu_mask) also check whether the previous vector is pending and
forward it to the new destination.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/apic.c             |  5 +++++
 xen/arch/x86/include/asm/apic.h |  3 +++
 xen/arch/x86/irq.c              | 39 +++++++++++++++++++++++++++++++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
index 6567af685a1b..4d77fba3ed19 100644
--- a/xen/arch/x86/apic.c
+++ b/xen/arch/x86/apic.c
@@ -1543,3 +1543,8 @@ void check_for_unexpected_msi(unsigned int vector)
 {
     BUG_ON(apic_isr_read(vector));
 }
+
+bool lapic_check_pending(unsigned int vector)
+{
+    return apic_read(APIC_IRR + (vector / 32 * 0x10)) & (1U << (vector % 32));
+}
diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/apic.h
index d1cb001fb4ab..7b5a0832c05e 100644
--- a/xen/arch/x86/include/asm/apic.h
+++ b/xen/arch/x86/include/asm/apic.h
@@ -179,6 +179,9 @@ extern void record_boot_APIC_mode(void);
 extern enum apic_mode current_local_apic_mode(void);
 extern void check_for_unexpected_msi(unsigned int vector);
 
+/* Return whether vector is pending in IRR. */
+bool lapic_check_pending(unsigned int vector);
+
 uint64_t calibrate_apic_timer(void);
 uint32_t apic_tmcct_read(void);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 33979729257e..211ad3bd7cf5 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2591,8 +2591,8 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
 
     for ( irq = 0; irq < nr_irqs; irq++ )
     {
-        bool break_affinity = false, set_affinity = true;
-        unsigned int vector;
+        bool break_affinity = false, set_affinity = true, check_irr = false;
+        unsigned int vector, cpu = smp_processor_id();
         cpumask_t *affinity = this_cpu(scratch_cpumask);
 
         if ( irq == 2 )
@@ -2643,6 +2643,25 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
              !cpumask_test_cpu(cpu, &cpu_online_map) &&
              cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
         {
+            /*
+             * This to be offlined CPU was the target of an interrupt that's
+             * been moved, and the new destination target hasn't yet
+             * acknowledged any interrupt from it.
+             *
+             * We know the interrupt is configured to target the new CPU at
+             * this point, so we can check IRR for any pending vectors and
+             * forward them to the new destination.
+             *
+             * Note the difference between move_in_progress or
+             * move_cleanup_count being set.  For the later we know the new
+             * destination has already acked at least one interrupt from this
+             * source, and hence there's no need to forward any stale
+             * interrupts.
+             */
+            if ( lapic_check_pending(desc->arch.old_vector) )
+                send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
+                              desc->arch.vector);
+
             /*
              * This CPU is going offline, remove it from ->arch.old_cpu_mask
              * and possibly release the old vector if the old mask becomes
@@ -2683,11 +2702,27 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
         if ( desc->handler->disable )
             desc->handler->disable(desc);
 
+        /*
+         * If the current CPU is going offline and is (one of) the target(s) of
+         * the interrupt signal to check whether there are any pending vectors
+         * to be handled in the local APIC after the interrupt has been moved.
+         */
+        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
+            check_irr = true;
+
         if ( desc->handler->set_affinity )
             desc->handler->set_affinity(desc, affinity);
         else if ( !(warned++) )
             set_affinity = false;
 
+        if ( check_irr && lapic_check_pending(vector) )
+            /*
+             * Forward pending interrupt to the new destination, this CPU is
+             * going offline and otherwise the interrupt would be lost.
+             */
+            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
+                          desc->arch.vector);
+
         if ( desc->handler->enable )
             desc->handler->enable(desc);
 
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-05-29  9:01 ` [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
@ 2024-05-29  9:33   ` Andrew Cooper
  0 siblings, 0 replies; 33+ messages in thread
From: Andrew Cooper @ 2024-05-29  9:33 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Jan Beulich

On 29/05/2024 10:01 am, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c
> index 6567af685a1b..4d77fba3ed19 100644
> --- a/xen/arch/x86/apic.c
> +++ b/xen/arch/x86/apic.c
> @@ -1543,3 +1543,8 @@ void check_for_unexpected_msi(unsigned int vector)
>  {
>      BUG_ON(apic_isr_read(vector));
>  }
> +
> +bool lapic_check_pending(unsigned int vector)
> +{
> +    return apic_read(APIC_IRR + (vector / 32 * 0x10)) & (1U << (vector % 32));
> +}
> diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/apic.h
> index d1cb001fb4ab..7b5a0832c05e 100644
> --- a/xen/arch/x86/include/asm/apic.h
> +++ b/xen/arch/x86/include/asm/apic.h
> @@ -179,6 +179,9 @@ extern void record_boot_APIC_mode(void);
>  extern enum apic_mode current_local_apic_mode(void);
>  extern void check_for_unexpected_msi(unsigned int vector);
>  
> +/* Return whether vector is pending in IRR. */
> +bool lapic_check_pending(unsigned int vector);
> +

As a minor point, we've already got apic_isr_read() as a static inline. 
This wants to be a matching apic_irr_read() IMO.

~Andrew


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug
  2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (8 preceding siblings ...)
  2024-05-29  9:01 ` [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
@ 2024-05-29  9:37 ` Oleksii K.
  2024-06-11 13:25   ` Jan Beulich
  9 siblings, 1 reply; 33+ messages in thread
From: Oleksii K. @ 2024-05-29  9:37 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Jan Beulich, Andrew Cooper, George Dunlap, Julien Grall,
	Stefano Stabellini

On Wed, 2024-05-29 at 11:01 +0200, Roger Pau Monne wrote:
> Hello,
> 
> The following series aim to fix interrupt handling when doing CPU
> plug/unplug operations.  Without this series running:
> 
> cpus=`xl info max_cpu_id`
> while [ 1 ]; do
>     for i in `seq 1 $cpus`; do
>         xen-hptool cpu-offline $i;
>         xen-hptool cpu-online $i;
>     done
> done
> 
> Quite quickly results in interrupts getting lost and "No irq handler
> for
> vector" messages on the Xen console.  Drivers in dom0 also start
> getting
> interrupt timeouts and the system becomes unusable.
> 
> After applying the series running the loop over night still result in
> a
> fully usable system, no  "No irq handler for vector" messages at all,
> no
> interrupt loses reported by dom0.  Test with
> x2apic-mode={mixed,cluster}.
> 
> I'm tagging this for 4.19 as it's IMO bugfixes, but the series has
> grown
> quite bigger than expected, and hence we need to be careful to not
> introduce breakages late in the release cycle.  I've attempted to
> document all code as good as I could, interrupt handling has some
> unexpected corner cases that are hard to diagnose and reason about.
Despite of the fact that it can be considered as bugfixes, it seems to
me that this patch series can be risky. Let's wait for maintainers
opinion...

~ Oleksii
> 
> I'm currently also doing some extra testing with XenRT in case I've
> missed something.
> 
> Thanks, Roger.
> 
> Roger Pau Monne (9):
>   x86/irq: remove offline CPUs from old CPU mask when adjusting
>     move_cleanup_count
>   xen/cpu: do not get the CPU map in stop_machine_run()
>   xen/cpu: ensure get_cpu_maps() returns false if CPU operations are
>     underway
>   x86/irq: describe how the interrupt CPU movement works
>   x86/irq: limit interrupt movement done by fixup_irqs()
>   x86/irq: restrict CPU movement in set_desc_affinity()
>   x86/irq: deal with old_cpu_mask for interrupts in movement in
>     fixup_irqs()
>   x86/irq: handle moving interrupts in _assign_irq_vector()
>   x86/irq: forward pending interrupts to new destination in
> fixup_irqs()
> 
>  xen/arch/x86/apic.c             |   5 +
>  xen/arch/x86/include/asm/apic.h |   3 +
>  xen/arch/x86/include/asm/irq.h  |  28 +++++-
>  xen/arch/x86/irq.c              | 172 +++++++++++++++++++++++++-----
> --
>  xen/common/cpu.c                |   8 +-
>  xen/common/stop_machine.c       |  15 +--
>  xen/include/xen/cpu.h           |   2 +
>  xen/include/xen/rwlock.h        |   2 +
>  8 files changed, 190 insertions(+), 45 deletions(-)
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
  2024-05-29  9:01 ` [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count Roger Pau Monne
@ 2024-05-29 12:40   ` Jan Beulich
  2024-05-29 15:15     ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 12:40 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 29.05.2024 11:01, Roger Pau Monne wrote:
> When adjusting move_cleanup_count to account for CPUs that are offline also
> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
> those again creating and create an imbalance in move_cleanup_count.

I'm in trouble with "creating"; I can't seem to be able to guess what you may
have meant.

> Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

With the above clarified (adjustment can be done while committing)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>              desc->arch.move_cleanup_count -= cpumask_weight(affinity);
>              if ( !desc->arch.move_cleanup_count )
>                  release_old_vec(desc);
> +            else
> +                /*
> +                 * Adjust old_cpu_mask to account for the offline CPUs,
> +                 * otherwise further calls to fixup_irqs() could subtract those
> +                 * again and possibly underflow the counter.
> +                 */
> +                cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
> +                            &cpu_online_map);
>          }

While functionality-wise okay, imo it would be slightly better to use
"affinity" here as well, so that even without looking at context beyond
what's shown here there is a direct connection to the cpumask_weight()
call. I.e.

                cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
                               affinity);

Thoughts?

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()
  2024-05-29  9:01 ` [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run() Roger Pau Monne
@ 2024-05-29 13:04   ` Jan Beulich
  2024-05-29 15:20     ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 13:04 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel, Michal Orzel, Bertrand Marquis

On 29.05.2024 11:01, Roger Pau Monne wrote:
> The current callers of stop_machine_run() outside of init code already have the
> CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
> to lock again.

While purely from a description perspective this is okay, ...

> --- a/xen/common/stop_machine.c
> +++ b/xen/common/stop_machine.c
> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
>      BUG_ON(!local_irq_is_enabled());
>      BUG_ON(!is_idle_vcpu(current));
>  
> -    /* cpu_online_map must not change. */
> -    if ( !get_cpu_maps() )
> +    /*
> +     * cpu_online_map must not change.  The only two callers of
> +     * stop_machine_run() outside of init code already have the CPU map locked.
> +     */

... the "two" here is not unlikely to quickly go stale; who knows what PPC
and RISC-V will have as their code becomes more complete?

I'm also unconvinced that requiring ...

> +    if ( system_state >= SYS_STATE_active && !cpu_map_locked() )

... this for all future (post-init) uses of stop_machine_run() is a good
idea. It is quite a bit more natural, to me at least, for the function to
effect this itself, as is the case prior to your change.

In fact I'm puzzled by Arm's (init-time) use: They use it twice, for errata
workaround and feature enabling. They do that after bringing up secondary
CPUs, but still from start_xen(). How's that going to work for CPUs brought
offline and then back online later on? __cpu_up() isn't __init, so there's
no obvious sign that soft-{off,on}lining isn't possible on Arm. Cc-ing Arm
maintainers.

Jan

> +    {
> +        ASSERT_UNREACHABLE();
>          return -EBUSY;
> +    }
>  
>      nr_cpus = num_online_cpus();
>      if ( cpu_online(this) )
> @@ -92,10 +98,7 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
>  
>      /* Must not spin here as the holder will expect us to be descheduled. */
>      if ( !spin_trylock(&stopmachine_lock) )
> -    {
> -        put_cpu_maps();
>          return -EBUSY;
> -    }
>  
>      stopmachine_data.fn = fn;
>      stopmachine_data.fn_data = data;
> @@ -136,8 +139,6 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
>  
>      spin_unlock(&stopmachine_lock);
>  
> -    put_cpu_maps();
> -
>      return ret;
>  }
>  
> diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
> index e1d4eb59675c..d8c8264c58b0 100644
> --- a/xen/include/xen/cpu.h
> +++ b/xen/include/xen/cpu.h
> @@ -13,6 +13,8 @@ void put_cpu_maps(void);
>  void cpu_hotplug_begin(void);
>  void cpu_hotplug_done(void);
>  
> +bool cpu_map_locked(void);
> +
>  /* Receive notification of CPU hotplug events. */
>  void register_cpu_notifier(struct notifier_block *nb);
>  



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29  9:01 ` [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway Roger Pau Monne
@ 2024-05-29 13:35   ` Jan Beulich
  2024-05-29 15:03     ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 13:35 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 29.05.2024 11:01, Roger Pau Monne wrote:
> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> a cpu_hotplug_{begin,done}() region the function will still return success,
> because a CPU taking the rwlock in read mode after having taken it in write
> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> as it should always return false when called with a CPU hot{,un}plug operation
> is in progress.

I'm not sure I can agree with this. The CPU doing said operation ought to be
aware of what it is itself doing. And all other CPUs will get back false from
get_cpu_maps().

>  Otherwise the logic in send_IPI_mask() for example is wrong,
> as it could decide to use the shorthand even when a CPU operation is in
> progress.

It's also not becoming clear what's wrong there: As long as a CPU isn't
offline enough to not be in cpu_online_map anymore, it may well need to still
be the target of IPIs, and targeting it with a shorthand then is still fine.

In any event this would again affect only the CPU leading the CPU operation,
which should clearly know at which point(s) it is okay to send IPIs. Are we
actually sending any IPIs from within CPU-online or CPU-offline paths?
Together with the earlier paragraph the critical window would be between the
CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
then the question would be what bad, if any, would happen to that CPU if an
IPI was still targeted at it by way of using the shorthand. I'm pretty sure
it runs with IRQs off at that time, so no ordinary IRQ could be delivered.

> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> already hold in write mode by the current CPU, as read_trylock() would
> otherwise return true.
> 
> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')

I'm puzzled by this as well: Prior to that and the change referenced by its
Fixes: tag, recursive spin locks were used. For the purposes here that's the
same as permitting read locking even when the write lock is already held by
the local CPU.

Jan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works
  2024-05-29  9:01 ` [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
@ 2024-05-29 13:57   ` Jan Beulich
  2024-05-29 15:28     ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 13:57 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 29.05.2024 11:01, Roger Pau Monne wrote:
> --- a/xen/arch/x86/include/asm/irq.h
> +++ b/xen/arch/x86/include/asm/irq.h
> @@ -28,6 +28,32 @@ typedef struct {
>  
>  struct irq_desc;
>  
> +/*
> + * Xen logic for moving interrupts around CPUs allows manipulating interrupts
> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
> + * follows:
> + *
> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> + * 2. New cpu_mask and vector are set, vector is setup at the new destination.
> + * 3. move_in_progress is set.
> + * 4. Interrupt source is updated to target new CPU and vector.
> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> + * 6. When an interrupt is delivered at the new destination (cpu_mask) as part
> + *    of acking the interrupt move_in_progress is cleared and move_cleanup_count

Nit: A comma after "interrupt" may help reading.

> + *    is set to the weight of online CPUs in old_cpu_mask.
> + *    IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.

These last two steps aren't precise enough, compared to what the code does.
old_cpu_mask is first reduced to online CPUs therein. If the result is non-
empty, what you describe is done. If, however, the result is empty, the
vector is released right away (this code may be there just in case, but I
think it shouldn't be omitted here).

> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
> + *    vector entry and decrease the count in move_cleanup_count.  The CPU that
> + *    sets move_cleanup_count to 0 releases the vector.
> + *
> + * Note that when interrupt movement (either move_in_progress or
> + * move_cleanup_count set) is in progress it's not possible to move the
> + * interrupt to yet a different CPU.
> + *
> + * By keeping the vector in the old CPU(s) configured until the interrupt is
> + * acked on the new destination Xen allows draining any pending interrupts at
> + * the old destinations.
> + */
>  struct arch_irq_desc {
>          s16 vector;                  /* vector itself is only 8 bits, */
>          s16 old_vector;              /* but we use -1 for unassigned  */

I take it that it is not a goal to (also) describe under what conditions
an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
least because the 2nd from last paragraph lightly touches that area.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29 13:35   ` Jan Beulich
@ 2024-05-29 15:03     ` Roger Pau Monné
  2024-05-29 15:49       ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> > a cpu_hotplug_{begin,done}() region the function will still return success,
> > because a CPU taking the rwlock in read mode after having taken it in write
> > mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> > as it should always return false when called with a CPU hot{,un}plug operation
> > is in progress.
> 
> I'm not sure I can agree with this. The CPU doing said operation ought to be
> aware of what it is itself doing. And all other CPUs will get back false from
> get_cpu_maps().

Well, the CPU is aware in the context of cpu_{up,down}(), but not in
the interrupts that might be handled while that operation is in
progress, see below for a concrete example.

> >  Otherwise the logic in send_IPI_mask() for example is wrong,
> > as it could decide to use the shorthand even when a CPU operation is in
> > progress.
> 
> It's also not becoming clear what's wrong there: As long as a CPU isn't
> offline enough to not be in cpu_online_map anymore, it may well need to still
> be the target of IPIs, and targeting it with a shorthand then is still fine.

The issue is in the online path: there's a window where the CPU is
online (and the lapic active), but cpu_online_map hasn't been updated
yet.  A specific example would be time_calibration() being executed on
the CPU that is running cpu_up().  That could result in a shorthand
IPI being used, but the mask in r.cpu_calibration_map not containing
the CPU that's being brought up online because it's not yet added to
cpu_online_map.  Then the number of CPUs actually running
time_calibration_rendezvous_fn won't match the weight of the cpumask
in r.cpu_calibration_map.

> In any event this would again affect only the CPU leading the CPU operation,
> which should clearly know at which point(s) it is okay to send IPIs. Are we
> actually sending any IPIs from within CPU-online or CPU-offline paths?

Yes, I've seen the time rendezvous happening while in the middle of a
hotplug operation, and the CPU coordinating the rendezvous being the
one doing the CPU hotplug operation, so get_cpu_maps() returning true.

> Together with the earlier paragraph the critical window would be between the
> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> then the question would be what bad, if any, would happen to that CPU if an
> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> 
> > Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> > already hold in write mode by the current CPU, as read_trylock() would
> > otherwise return true.
> > 
> > Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
> 
> I'm puzzled by this as well: Prior to that and the change referenced by its
> Fixes: tag, recursive spin locks were used. For the purposes here that's the
> same as permitting read locking even when the write lock is already held by
> the local CPU.

I see, so the Fixes should be:

x86/smp: use APIC ALLBUT destination shorthand when possible

Instead, which is the commit that started using get_cpu_maps() in
send_IPI_mask().

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
  2024-05-29 12:40   ` Jan Beulich
@ 2024-05-29 15:15     ` Roger Pau Monné
  2024-05-29 15:27       ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > When adjusting move_cleanup_count to account for CPUs that are offline also
> > adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
> > those again creating and create an imbalance in move_cleanup_count.
> 
> I'm in trouble with "creating"; I can't seem to be able to guess what you may
> have meant.

Oh, sorry, that's a typo.

I was meaning to point out that not removing the already subtracted
CPUs from the mask can lead to further calls to fixup_irqs()
subtracting them again and move_cleanup_count possibly underflowing.

Would you prefer to write it as:

"... could subtract those again and possibly underflow move_cleanup_count."

> > Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()')
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> With the above clarified (adjustment can be done while committing)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >              desc->arch.move_cleanup_count -= cpumask_weight(affinity);
> >              if ( !desc->arch.move_cleanup_count )
> >                  release_old_vec(desc);
> > +            else
> > +                /*
> > +                 * Adjust old_cpu_mask to account for the offline CPUs,
> > +                 * otherwise further calls to fixup_irqs() could subtract those
> > +                 * again and possibly underflow the counter.
> > +                 */
> > +                cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
> > +                            &cpu_online_map);
> >          }
> 
> While functionality-wise okay, imo it would be slightly better to use
> "affinity" here as well, so that even without looking at context beyond
> what's shown here there is a direct connection to the cpumask_weight()
> call. I.e.
> 
>                 cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
>                                affinity);
> 
> Thoughts?

It was more straightforward for me to reason that removing the offline
CPUs is OK, but I can see that you might prefer to use 'affinity',
because that's the weight that's subtracted from move_cleanup_count.
Using either should lead to the same result if my understanding is
correct.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()
  2024-05-29 13:04   ` Jan Beulich
@ 2024-05-29 15:20     ` Roger Pau Monné
  2024-05-29 15:31       ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel, Michal Orzel, Bertrand Marquis

On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > The current callers of stop_machine_run() outside of init code already have the
> > CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
> > to lock again.
> 
> While purely from a description perspective this is okay, ...
> 
> > --- a/xen/common/stop_machine.c
> > +++ b/xen/common/stop_machine.c
> > @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
> >      BUG_ON(!local_irq_is_enabled());
> >      BUG_ON(!is_idle_vcpu(current));
> >  
> > -    /* cpu_online_map must not change. */
> > -    if ( !get_cpu_maps() )
> > +    /*
> > +     * cpu_online_map must not change.  The only two callers of
> > +     * stop_machine_run() outside of init code already have the CPU map locked.
> > +     */
> 
> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
> and RISC-V will have as their code becomes more complete?
> 
> I'm also unconvinced that requiring ...
> 
> > +    if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
> 
> ... this for all future (post-init) uses of stop_machine_run() is a good
> idea. It is quite a bit more natural, to me at least, for the function to
> effect this itself, as is the case prior to your change.

This is mostly a pre-req for the next change that switches
get_cpu_maps() to return false if the current CPU is holding the CPU
maps lock in write mode.

IF we don't want to go this route we need a way to signal
send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
because get_cpu_maps() enough is not suitable.

Overall I don't like the corner case where get_cpu_maps() returns true
if a CPU hot{,un}plug operation is taking place in the current CPU
context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
operations can be in progress if it returns true.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
  2024-05-29 15:15     ` Roger Pau Monné
@ 2024-05-29 15:27       ` Jan Beulich
  2024-05-29 15:34         ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 15:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 29.05.2024 17:15, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> When adjusting move_cleanup_count to account for CPUs that are offline also
>>> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
>>> those again creating and create an imbalance in move_cleanup_count.
>>
>> I'm in trouble with "creating"; I can't seem to be able to guess what you may
>> have meant.
> 
> Oh, sorry, that's a typo.
> 
> I was meaning to point out that not removing the already subtracted
> CPUs from the mask can lead to further calls to fixup_irqs()
> subtracting them again and move_cleanup_count possibly underflowing.
> 
> Would you prefer to write it as:
> 
> "... could subtract those again and possibly underflow move_cleanup_count."

Fine with me. Looks like simply deleting "creating" and keeping the rest
as it was would be okay too? Whatever you prefer in the end.

>>> --- a/xen/arch/x86/irq.c
>>> +++ b/xen/arch/x86/irq.c
>>> @@ -2572,6 +2572,14 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>>>              desc->arch.move_cleanup_count -= cpumask_weight(affinity);
>>>              if ( !desc->arch.move_cleanup_count )
>>>                  release_old_vec(desc);
>>> +            else
>>> +                /*
>>> +                 * Adjust old_cpu_mask to account for the offline CPUs,
>>> +                 * otherwise further calls to fixup_irqs() could subtract those
>>> +                 * again and possibly underflow the counter.
>>> +                 */
>>> +                cpumask_and(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
>>> +                            &cpu_online_map);
>>>          }
>>
>> While functionality-wise okay, imo it would be slightly better to use
>> "affinity" here as well, so that even without looking at context beyond
>> what's shown here there is a direct connection to the cpumask_weight()
>> call. I.e.
>>
>>                 cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask,
>>                                affinity);
>>
>> Thoughts?
> 
> It was more straightforward for me to reason that removing the offline
> CPUs is OK, but I can see that you might prefer to use 'affinity',
> because that's the weight that's subtracted from move_cleanup_count.
> Using either should lead to the same result if my understanding is
> correct.

That was the conclusion I came to, or else I wouldn't have made the
suggestion. Unless you have a strong preference for the as-is form, I'd
indeed prefer the suggested alternative.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works
  2024-05-29 13:57   ` Jan Beulich
@ 2024-05-29 15:28     ` Roger Pau Monné
  2024-05-31  7:06       ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
> On 29.05.2024 11:01, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/include/asm/irq.h
> > +++ b/xen/arch/x86/include/asm/irq.h
> > @@ -28,6 +28,32 @@ typedef struct {
> >  
> >  struct irq_desc;
> >  
> > +/*
> > + * Xen logic for moving interrupts around CPUs allows manipulating interrupts
> > + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
> > + * follows:
> > + *
> > + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> > + * 2. New cpu_mask and vector are set, vector is setup at the new destination.
> > + * 3. move_in_progress is set.
> > + * 4. Interrupt source is updated to target new CPU and vector.
> > + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> > + * 6. When an interrupt is delivered at the new destination (cpu_mask) as part
> > + *    of acking the interrupt move_in_progress is cleared and move_cleanup_count
> 
> Nit: A comma after "interrupt" may help reading.
> 
> > + *    is set to the weight of online CPUs in old_cpu_mask.
> > + *    IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
> 
> These last two steps aren't precise enough, compared to what the code does.
> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
> empty, what you describe is done. If, however, the result is empty, the
> vector is released right away (this code may be there just in case, but I
> think it shouldn't be omitted here).

I've left that out because I got the impression it made the text more
complex to follow (with the extra branch) for no real benefit, but I'm
happy to attempt to add it.

> 
> > + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
> > + *    vector entry and decrease the count in move_cleanup_count.  The CPU that
> > + *    sets move_cleanup_count to 0 releases the vector.
> > + *
> > + * Note that when interrupt movement (either move_in_progress or
> > + * move_cleanup_count set) is in progress it's not possible to move the
> > + * interrupt to yet a different CPU.
> > + *
> > + * By keeping the vector in the old CPU(s) configured until the interrupt is
> > + * acked on the new destination Xen allows draining any pending interrupts at
> > + * the old destinations.
> > + */
> >  struct arch_irq_desc {
> >          s16 vector;                  /* vector itself is only 8 bits, */
> >          s16 old_vector;              /* but we use -1 for unassigned  */
> 
> I take it that it is not a goal to (also) describe under what conditions
> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
> least because the 2nd from last paragraph lightly touches that area.

Right, I was mostly focused on moves (forcefully) initiated from
fixup_irqs(), which is different from the opportunistic affinity
changes signaled by IRQ_MOVE_PENDING.

Not sure whether I want to mention this ahead of the list in a
paragraph, or just add it as a step.  Do you have any preference?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()
  2024-05-29 15:20     ` Roger Pau Monné
@ 2024-05-29 15:31       ` Jan Beulich
  2024-05-29 15:48         ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 15:31 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel, Michal Orzel, Bertrand Marquis

On 29.05.2024 17:20, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> The current callers of stop_machine_run() outside of init code already have the
>>> CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
>>> to lock again.
>>
>> While purely from a description perspective this is okay, ...
>>
>>> --- a/xen/common/stop_machine.c
>>> +++ b/xen/common/stop_machine.c
>>> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
>>>      BUG_ON(!local_irq_is_enabled());
>>>      BUG_ON(!is_idle_vcpu(current));
>>>  
>>> -    /* cpu_online_map must not change. */
>>> -    if ( !get_cpu_maps() )
>>> +    /*
>>> +     * cpu_online_map must not change.  The only two callers of
>>> +     * stop_machine_run() outside of init code already have the CPU map locked.
>>> +     */
>>
>> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
>> and RISC-V will have as their code becomes more complete?
>>
>> I'm also unconvinced that requiring ...
>>
>>> +    if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
>>
>> ... this for all future (post-init) uses of stop_machine_run() is a good
>> idea. It is quite a bit more natural, to me at least, for the function to
>> effect this itself, as is the case prior to your change.
> 
> This is mostly a pre-req for the next change that switches
> get_cpu_maps() to return false if the current CPU is holding the CPU
> maps lock in write mode.
> 
> IF we don't want to go this route we need a way to signal
> send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
> because get_cpu_maps() enough is not suitable.
> 
> Overall I don't like the corner case where get_cpu_maps() returns true
> if a CPU hot{,un}plug operation is taking place in the current CPU
> context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
> operations can be in progress if it returns true.

I'm not convinced of looking at it this way. To me the guarantee is
merely that no CPU operation is taking place _elsewhere_. As indicated,
imo the local CPU should be well aware of what context it's actually in,
and hence what is (or is not) appropriate to do at a particular point in
time.

I guess what I'm missing is an example of a concrete code path where
things presently go wrong.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
  2024-05-29 15:27       ` Jan Beulich
@ 2024-05-29 15:34         ` Roger Pau Monné
  0 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, May 29, 2024 at 05:27:06PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:15, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 02:40:51PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> When adjusting move_cleanup_count to account for CPUs that are offline also
> >>> adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
> >>> those again creating and create an imbalance in move_cleanup_count.
> >>
> >> I'm in trouble with "creating"; I can't seem to be able to guess what you may
> >> have meant.
> > 
> > Oh, sorry, that's a typo.
> > 
> > I was meaning to point out that not removing the already subtracted
> > CPUs from the mask can lead to further calls to fixup_irqs()
> > subtracting them again and move_cleanup_count possibly underflowing.
> > 
> > Would you prefer to write it as:
> > 
> > "... could subtract those again and possibly underflow move_cleanup_count."
> 
> Fine with me. Looks like simply deleting "creating" and keeping the rest
> as it was would be okay too? Whatever you prefer in the end.

Yes, whatever you think it's clearer TBH, I don't really have a
preference.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run()
  2024-05-29 15:31       ` Jan Beulich
@ 2024-05-29 15:48         ` Roger Pau Monné
  0 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 15:48 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel, Michal Orzel, Bertrand Marquis

On Wed, May 29, 2024 at 05:31:02PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:20, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:04:13PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> The current callers of stop_machine_run() outside of init code already have the
> >>> CPU maps locked, and hence there's no reason for stop_machine_run() to attempt
> >>> to lock again.
> >>
> >> While purely from a description perspective this is okay, ...
> >>
> >>> --- a/xen/common/stop_machine.c
> >>> +++ b/xen/common/stop_machine.c
> >>> @@ -82,9 +82,15 @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu)
> >>>      BUG_ON(!local_irq_is_enabled());
> >>>      BUG_ON(!is_idle_vcpu(current));
> >>>  
> >>> -    /* cpu_online_map must not change. */
> >>> -    if ( !get_cpu_maps() )
> >>> +    /*
> >>> +     * cpu_online_map must not change.  The only two callers of
> >>> +     * stop_machine_run() outside of init code already have the CPU map locked.
> >>> +     */
> >>
> >> ... the "two" here is not unlikely to quickly go stale; who knows what PPC
> >> and RISC-V will have as their code becomes more complete?
> >>
> >> I'm also unconvinced that requiring ...
> >>
> >>> +    if ( system_state >= SYS_STATE_active && !cpu_map_locked() )
> >>
> >> ... this for all future (post-init) uses of stop_machine_run() is a good
> >> idea. It is quite a bit more natural, to me at least, for the function to
> >> effect this itself, as is the case prior to your change.
> > 
> > This is mostly a pre-req for the next change that switches
> > get_cpu_maps() to return false if the current CPU is holding the CPU
> > maps lock in write mode.
> > 
> > IF we don't want to go this route we need a way to signal
> > send_IPI_mask() when a CPU hot{,un}plug operation is taking place,
> > because get_cpu_maps() enough is not suitable.
> > 
> > Overall I don't like the corner case where get_cpu_maps() returns true
> > if a CPU hot{,un}plug operation is taking place in the current CPU
> > context.  The guarantee of get_cpu_maps() is that no CPU hot{,un}plug
> > operations can be in progress if it returns true.
> 
> I'm not convinced of looking at it this way. To me the guarantee is
> merely that no CPU operation is taking place _elsewhere_. As indicated,
> imo the local CPU should be well aware of what context it's actually in,
> and hence what is (or is not) appropriate to do at a particular point in
> time.
> 
> I guess what I'm missing is an example of a concrete code path where
> things presently go wrong.

See the specific example in patch 3/9 with time_calibration() and it's
usage of send_IPI_mask() when called from a CPU executing in cpu_up()
context.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29 15:03     ` Roger Pau Monné
@ 2024-05-29 15:49       ` Jan Beulich
  2024-05-29 16:14         ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-29 15:49 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 29.05.2024 17:03, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
>>> a cpu_hotplug_{begin,done}() region the function will still return success,
>>> because a CPU taking the rwlock in read mode after having taken it in write
>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
>>> as it should always return false when called with a CPU hot{,un}plug operation
>>> is in progress.
>>
>> I'm not sure I can agree with this. The CPU doing said operation ought to be
>> aware of what it is itself doing. And all other CPUs will get back false from
>> get_cpu_maps().
> 
> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> the interrupts that might be handled while that operation is in
> progress, see below for a concrete example.
> 
>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>> as it could decide to use the shorthand even when a CPU operation is in
>>> progress.
>>
>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>> offline enough to not be in cpu_online_map anymore, it may well need to still
>> be the target of IPIs, and targeting it with a shorthand then is still fine.
> 
> The issue is in the online path: there's a window where the CPU is
> online (and the lapic active), but cpu_online_map hasn't been updated
> yet.  A specific example would be time_calibration() being executed on
> the CPU that is running cpu_up().  That could result in a shorthand
> IPI being used, but the mask in r.cpu_calibration_map not containing
> the CPU that's being brought up online because it's not yet added to
> cpu_online_map.  Then the number of CPUs actually running
> time_calibration_rendezvous_fn won't match the weight of the cpumask
> in r.cpu_calibration_map.

I see, but maybe only partly. Prior to the CPU having its bit set in
cpu_online_map, can it really take interrupts already? Shouldn't it be
running with IRQs off until later, thus preventing it from making it
into the rendezvous function in the first place? But yes, I can see
how the IRQ (IPI) then being delivered later (once IRQs are enabled)
might cause problems, too.

Plus, with how the rendezvous function is invoked (via
on_selected_cpus() with the mask copied from cpu_online_map), the
first check in smp_call_function_interrupt() ought to prevent the
function from being called on the CPU being onlined. A problem would
arise though if the IPI arrived later and call_data was already
(partly or fully) overwritten with the next request.

>> In any event this would again affect only the CPU leading the CPU operation,
>> which should clearly know at which point(s) it is okay to send IPIs. Are we
>> actually sending any IPIs from within CPU-online or CPU-offline paths?
> 
> Yes, I've seen the time rendezvous happening while in the middle of a
> hotplug operation, and the CPU coordinating the rendezvous being the
> one doing the CPU hotplug operation, so get_cpu_maps() returning true.

Right, yet together with ...

>> Together with the earlier paragraph the critical window would be between the
>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
>> then the question would be what bad, if any, would happen to that CPU if an
>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
>>
>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
>>> already hold in write mode by the current CPU, as read_trylock() would
>>> otherwise return true.
>>>
>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
>>
>> I'm puzzled by this as well: Prior to that and the change referenced by its
>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
>> same as permitting read locking even when the write lock is already held by
>> the local CPU.
> 
> I see, so the Fixes should be:
> 
> x86/smp: use APIC ALLBUT destination shorthand when possible
> 
> Instead, which is the commit that started using get_cpu_maps() in
> send_IPI_mask().

... this I then wonder whether it's really only the condition in
send_IPI_mask() which needs further amending, rather than fiddling with
get_cpu_maps().

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29 15:49       ` Jan Beulich
@ 2024-05-29 16:14         ` Roger Pau Monné
  2024-05-31  7:02           ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-29 16:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> On 29.05.2024 17:03, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> >>> a cpu_hotplug_{begin,done}() region the function will still return success,
> >>> because a CPU taking the rwlock in read mode after having taken it in write
> >>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> >>> as it should always return false when called with a CPU hot{,un}plug operation
> >>> is in progress.
> >>
> >> I'm not sure I can agree with this. The CPU doing said operation ought to be
> >> aware of what it is itself doing. And all other CPUs will get back false from
> >> get_cpu_maps().
> > 
> > Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> > the interrupts that might be handled while that operation is in
> > progress, see below for a concrete example.
> > 
> >>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>> as it could decide to use the shorthand even when a CPU operation is in
> >>> progress.
> >>
> >> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >> offline enough to not be in cpu_online_map anymore, it may well need to still
> >> be the target of IPIs, and targeting it with a shorthand then is still fine.
> > 
> > The issue is in the online path: there's a window where the CPU is
> > online (and the lapic active), but cpu_online_map hasn't been updated
> > yet.  A specific example would be time_calibration() being executed on
> > the CPU that is running cpu_up().  That could result in a shorthand
> > IPI being used, but the mask in r.cpu_calibration_map not containing
> > the CPU that's being brought up online because it's not yet added to
> > cpu_online_map.  Then the number of CPUs actually running
> > time_calibration_rendezvous_fn won't match the weight of the cpumask
> > in r.cpu_calibration_map.
> 
> I see, but maybe only partly. Prior to the CPU having its bit set in
> cpu_online_map, can it really take interrupts already? Shouldn't it be
> running with IRQs off until later, thus preventing it from making it
> into the rendezvous function in the first place? But yes, I can see
> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> might cause problems, too.

The interrupt will get set in IRR and handled when interrupts are
enabled.

> 
> Plus, with how the rendezvous function is invoked (via
> on_selected_cpus() with the mask copied from cpu_online_map), the
> first check in smp_call_function_interrupt() ought to prevent the
> function from being called on the CPU being onlined. A problem would
> arise though if the IPI arrived later and call_data was already
> (partly or fully) overwritten with the next request.

Yeah, there's a small window where the fields in call_data are out of
sync.

> >> In any event this would again affect only the CPU leading the CPU operation,
> >> which should clearly know at which point(s) it is okay to send IPIs. Are we
> >> actually sending any IPIs from within CPU-online or CPU-offline paths?
> > 
> > Yes, I've seen the time rendezvous happening while in the middle of a
> > hotplug operation, and the CPU coordinating the rendezvous being the
> > one doing the CPU hotplug operation, so get_cpu_maps() returning true.
> 
> Right, yet together with ...
> 
> >> Together with the earlier paragraph the critical window would be between the
> >> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
> >> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> >> then the question would be what bad, if any, would happen to that CPU if an
> >> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> >> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> >>
> >>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> >>> already hold in write mode by the current CPU, as read_trylock() would
> >>> otherwise return true.
> >>>
> >>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
> >>
> >> I'm puzzled by this as well: Prior to that and the change referenced by its
> >> Fixes: tag, recursive spin locks were used. For the purposes here that's the
> >> same as permitting read locking even when the write lock is already held by
> >> the local CPU.
> > 
> > I see, so the Fixes should be:
> > 
> > x86/smp: use APIC ALLBUT destination shorthand when possible
> > 
> > Instead, which is the commit that started using get_cpu_maps() in
> > send_IPI_mask().
> 
> ... this I then wonder whether it's really only the condition in
> send_IPI_mask() which needs further amending, rather than fiddling with
> get_cpu_maps().

That the other option, but I have impression it's more fragile to
adjust the condition in send_IPI_mask() rather than fiddle with
get_cpu_maps().

However if that's the preference I can adjust.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-29 16:14         ` Roger Pau Monné
@ 2024-05-31  7:02           ` Jan Beulich
  2024-05-31  7:31             ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-31  7:02 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 29.05.2024 18:14, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
>> On 29.05.2024 17:03, Roger Pau Monné wrote:
>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
>>>>> a cpu_hotplug_{begin,done}() region the function will still return success,
>>>>> because a CPU taking the rwlock in read mode after having taken it in write
>>>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
>>>>> as it should always return false when called with a CPU hot{,un}plug operation
>>>>> is in progress.
>>>>
>>>> I'm not sure I can agree with this. The CPU doing said operation ought to be
>>>> aware of what it is itself doing. And all other CPUs will get back false from
>>>> get_cpu_maps().
>>>
>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
>>> the interrupts that might be handled while that operation is in
>>> progress, see below for a concrete example.
>>>
>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>>>> as it could decide to use the shorthand even when a CPU operation is in
>>>>> progress.
>>>>
>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>>>> offline enough to not be in cpu_online_map anymore, it may well need to still
>>>> be the target of IPIs, and targeting it with a shorthand then is still fine.
>>>
>>> The issue is in the online path: there's a window where the CPU is
>>> online (and the lapic active), but cpu_online_map hasn't been updated
>>> yet.  A specific example would be time_calibration() being executed on
>>> the CPU that is running cpu_up().  That could result in a shorthand
>>> IPI being used, but the mask in r.cpu_calibration_map not containing
>>> the CPU that's being brought up online because it's not yet added to
>>> cpu_online_map.  Then the number of CPUs actually running
>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
>>> in r.cpu_calibration_map.
>>
>> I see, but maybe only partly. Prior to the CPU having its bit set in
>> cpu_online_map, can it really take interrupts already? Shouldn't it be
>> running with IRQs off until later, thus preventing it from making it
>> into the rendezvous function in the first place? But yes, I can see
>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
>> might cause problems, too.
> 
> The interrupt will get set in IRR and handled when interrupts are
> enabled.
> 
>>
>> Plus, with how the rendezvous function is invoked (via
>> on_selected_cpus() with the mask copied from cpu_online_map), the
>> first check in smp_call_function_interrupt() ought to prevent the
>> function from being called on the CPU being onlined. A problem would
>> arise though if the IPI arrived later and call_data was already
>> (partly or fully) overwritten with the next request.
> 
> Yeah, there's a small window where the fields in call_data are out of
> sync.
> 
>>>> In any event this would again affect only the CPU leading the CPU operation,
>>>> which should clearly know at which point(s) it is okay to send IPIs. Are we
>>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
>>>
>>> Yes, I've seen the time rendezvous happening while in the middle of a
>>> hotplug operation, and the CPU coordinating the rendezvous being the
>>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
>>
>> Right, yet together with ...
>>
>>>> Together with the earlier paragraph the critical window would be between the
>>>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
>>>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
>>>> then the question would be what bad, if any, would happen to that CPU if an
>>>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
>>>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
>>>>
>>>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
>>>>> already hold in write mode by the current CPU, as read_trylock() would
>>>>> otherwise return true.
>>>>>
>>>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
>>>>
>>>> I'm puzzled by this as well: Prior to that and the change referenced by its
>>>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
>>>> same as permitting read locking even when the write lock is already held by
>>>> the local CPU.
>>>
>>> I see, so the Fixes should be:
>>>
>>> x86/smp: use APIC ALLBUT destination shorthand when possible
>>>
>>> Instead, which is the commit that started using get_cpu_maps() in
>>> send_IPI_mask().
>>
>> ... this I then wonder whether it's really only the condition in
>> send_IPI_mask() which needs further amending, rather than fiddling with
>> get_cpu_maps().
> 
> That the other option, but I have impression it's more fragile to
> adjust the condition in send_IPI_mask() rather than fiddle with
> get_cpu_maps().
> 
> However if that's the preference I can adjust.

I guess we need other REST input here then. The two of us clearly disagree on
what use of get_cpu_maps() is meant to guarantee. And I deem fiddling with
common code here more risky (and more intrusive - the other change would be
a single-line code change afaict, plus extending the related comment).

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works
  2024-05-29 15:28     ` Roger Pau Monné
@ 2024-05-31  7:06       ` Jan Beulich
  2024-05-31  7:39         ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-31  7:06 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 29.05.2024 17:28, Roger Pau Monné wrote:
> On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/include/asm/irq.h
>>> +++ b/xen/arch/x86/include/asm/irq.h
>>> @@ -28,6 +28,32 @@ typedef struct {
>>>  
>>>  struct irq_desc;
>>>  
>>> +/*
>>> + * Xen logic for moving interrupts around CPUs allows manipulating interrupts
>>> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
>>> + * follows:
>>> + *
>>> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
>>> + * 2. New cpu_mask and vector are set, vector is setup at the new destination.
>>> + * 3. move_in_progress is set.
>>> + * 4. Interrupt source is updated to target new CPU and vector.
>>> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
>>> + * 6. When an interrupt is delivered at the new destination (cpu_mask) as part
>>> + *    of acking the interrupt move_in_progress is cleared and move_cleanup_count
>>
>> Nit: A comma after "interrupt" may help reading.
>>
>>> + *    is set to the weight of online CPUs in old_cpu_mask.
>>> + *    IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
>>
>> These last two steps aren't precise enough, compared to what the code does.
>> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
>> empty, what you describe is done. If, however, the result is empty, the
>> vector is released right away (this code may be there just in case, but I
>> think it shouldn't be omitted here).
> 
> I've left that out because I got the impression it made the text more
> complex to follow (with the extra branch) for no real benefit, but I'm
> happy to attempt to add it.

Why "no real benefit"? Isn't the goal to accurately describe what code does
(in various places)? If the result isn't an accurate description in one
specific regard, how reliable would the rest be from a reader's perspective?

>>> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
>>> + *    vector entry and decrease the count in move_cleanup_count.  The CPU that
>>> + *    sets move_cleanup_count to 0 releases the vector.
>>> + *
>>> + * Note that when interrupt movement (either move_in_progress or
>>> + * move_cleanup_count set) is in progress it's not possible to move the
>>> + * interrupt to yet a different CPU.
>>> + *
>>> + * By keeping the vector in the old CPU(s) configured until the interrupt is
>>> + * acked on the new destination Xen allows draining any pending interrupts at
>>> + * the old destinations.
>>> + */
>>>  struct arch_irq_desc {
>>>          s16 vector;                  /* vector itself is only 8 bits, */
>>>          s16 old_vector;              /* but we use -1 for unassigned  */
>>
>> I take it that it is not a goal to (also) describe under what conditions
>> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
>> least because the 2nd from last paragraph lightly touches that area.
> 
> Right, I was mostly focused on moves (forcefully) initiated from
> fixup_irqs(), which is different from the opportunistic affinity
> changes signaled by IRQ_MOVE_PENDING.
> 
> Not sure whether I want to mention this ahead of the list in a
> paragraph, or just add it as a step.  Do you have any preference?

I think ahead might be better. But I also won't insist on it being added.
Just if you don't, perhaps mention in the description that leaving that
out is intentional.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-31  7:02           ` Jan Beulich
@ 2024-05-31  7:31             ` Roger Pau Monné
  2024-05-31  8:33               ` Jan Beulich
  0 siblings, 1 reply; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-31  7:31 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
> On 29.05.2024 18:14, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 17:03, Roger Pau Monné wrote:
> >>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> >>>>> a cpu_hotplug_{begin,done}() region the function will still return success,
> >>>>> because a CPU taking the rwlock in read mode after having taken it in write
> >>>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> >>>>> as it should always return false when called with a CPU hot{,un}plug operation
> >>>>> is in progress.
> >>>>
> >>>> I'm not sure I can agree with this. The CPU doing said operation ought to be
> >>>> aware of what it is itself doing. And all other CPUs will get back false from
> >>>> get_cpu_maps().
> >>>
> >>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> >>> the interrupts that might be handled while that operation is in
> >>> progress, see below for a concrete example.
> >>>
> >>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>>>> as it could decide to use the shorthand even when a CPU operation is in
> >>>>> progress.
> >>>>
> >>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >>>> offline enough to not be in cpu_online_map anymore, it may well need to still
> >>>> be the target of IPIs, and targeting it with a shorthand then is still fine.
> >>>
> >>> The issue is in the online path: there's a window where the CPU is
> >>> online (and the lapic active), but cpu_online_map hasn't been updated
> >>> yet.  A specific example would be time_calibration() being executed on
> >>> the CPU that is running cpu_up().  That could result in a shorthand
> >>> IPI being used, but the mask in r.cpu_calibration_map not containing
> >>> the CPU that's being brought up online because it's not yet added to
> >>> cpu_online_map.  Then the number of CPUs actually running
> >>> time_calibration_rendezvous_fn won't match the weight of the cpumask
> >>> in r.cpu_calibration_map.
> >>
> >> I see, but maybe only partly. Prior to the CPU having its bit set in
> >> cpu_online_map, can it really take interrupts already? Shouldn't it be
> >> running with IRQs off until later, thus preventing it from making it
> >> into the rendezvous function in the first place? But yes, I can see
> >> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> >> might cause problems, too.
> > 
> > The interrupt will get set in IRR and handled when interrupts are
> > enabled.
> > 
> >>
> >> Plus, with how the rendezvous function is invoked (via
> >> on_selected_cpus() with the mask copied from cpu_online_map), the
> >> first check in smp_call_function_interrupt() ought to prevent the
> >> function from being called on the CPU being onlined. A problem would
> >> arise though if the IPI arrived later and call_data was already
> >> (partly or fully) overwritten with the next request.
> > 
> > Yeah, there's a small window where the fields in call_data are out of
> > sync.
> > 
> >>>> In any event this would again affect only the CPU leading the CPU operation,
> >>>> which should clearly know at which point(s) it is okay to send IPIs. Are we
> >>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
> >>>
> >>> Yes, I've seen the time rendezvous happening while in the middle of a
> >>> hotplug operation, and the CPU coordinating the rendezvous being the
> >>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
> >>
> >> Right, yet together with ...
> >>
> >>>> Together with the earlier paragraph the critical window would be between the
> >>>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
> >>>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> >>>> then the question would be what bad, if any, would happen to that CPU if an
> >>>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> >>>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> >>>>
> >>>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> >>>>> already hold in write mode by the current CPU, as read_trylock() would
> >>>>> otherwise return true.
> >>>>>
> >>>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
> >>>>
> >>>> I'm puzzled by this as well: Prior to that and the change referenced by its
> >>>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
> >>>> same as permitting read locking even when the write lock is already held by
> >>>> the local CPU.
> >>>
> >>> I see, so the Fixes should be:
> >>>
> >>> x86/smp: use APIC ALLBUT destination shorthand when possible
> >>>
> >>> Instead, which is the commit that started using get_cpu_maps() in
> >>> send_IPI_mask().
> >>
> >> ... this I then wonder whether it's really only the condition in
> >> send_IPI_mask() which needs further amending, rather than fiddling with
> >> get_cpu_maps().
> > 
> > That the other option, but I have impression it's more fragile to
> > adjust the condition in send_IPI_mask() rather than fiddle with
> > get_cpu_maps().
> > 
> > However if that's the preference I can adjust.
> 
> I guess we need other REST input here then. The two of us clearly disagree on
> what use of get_cpu_maps() is meant to guarantee. And I deem fiddling with
> common code here more risky (and more intrusive - the other change would be
> a single-line code change afaict, plus extending the related comment).

How do you envision that other change to be done?  Adding an extra
variable and toggling it in cpu_hotplug_{begin,done}() to signal
whether a CPU hotplug is in progress?

If I go this route I would like to add a comment to get_cpu_maps() in
order to notice this IMO weird property of succeeding when calling
from a CPU that's performing a hotplug operation.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works
  2024-05-31  7:06       ` Jan Beulich
@ 2024-05-31  7:39         ` Roger Pau Monné
  0 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-31  7:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Fri, May 31, 2024 at 09:06:10AM +0200, Jan Beulich wrote:
> On 29.05.2024 17:28, Roger Pau Monné wrote:
> > On Wed, May 29, 2024 at 03:57:19PM +0200, Jan Beulich wrote:
> >> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/include/asm/irq.h
> >>> +++ b/xen/arch/x86/include/asm/irq.h
> >>> @@ -28,6 +28,32 @@ typedef struct {
> >>>  
> >>>  struct irq_desc;
> >>>  
> >>> +/*
> >>> + * Xen logic for moving interrupts around CPUs allows manipulating interrupts
> >>> + * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
> >>> + * follows:
> >>> + *
> >>> + * 1. cpu_mask and vector is copied to old_cpu_mask and old_vector.
> >>> + * 2. New cpu_mask and vector are set, vector is setup at the new destination.
> >>> + * 3. move_in_progress is set.
> >>> + * 4. Interrupt source is updated to target new CPU and vector.
> >>> + * 5. Interrupts arriving at old_cpu_mask are processed normally.
> >>> + * 6. When an interrupt is delivered at the new destination (cpu_mask) as part
> >>> + *    of acking the interrupt move_in_progress is cleared and move_cleanup_count
> >>
> >> Nit: A comma after "interrupt" may help reading.
> >>
> >>> + *    is set to the weight of online CPUs in old_cpu_mask.
> >>> + *    IRQ_MOVE_CLEANUP_VECTOR is sent to all CPUs in old_cpu_mask.
> >>
> >> These last two steps aren't precise enough, compared to what the code does.
> >> old_cpu_mask is first reduced to online CPUs therein. If the result is non-
> >> empty, what you describe is done. If, however, the result is empty, the
> >> vector is released right away (this code may be there just in case, but I
> >> think it shouldn't be omitted here).
> > 
> > I've left that out because I got the impression it made the text more
> > complex to follow (with the extra branch) for no real benefit, but I'm
> > happy to attempt to add it.
> 
> Why "no real benefit"? Isn't the goal to accurately describe what code does
> (in various places)? If the result isn't an accurate description in one
> specific regard, how reliable would the rest be from a reader's perspective?

FWIW, it seemed to me the reduction of old_cpu_mask was (kind of) a
shortcut to what the normal path does, by releasing the vector early
if there are no online CPUs in old_cpu_mask.

Now that you made me look into it, I think after this series the
old_cpu_mask should never contain offline CPUs, as fixup_irqs() will
take care of removing offliend CPUs from old_cpu_mask, and freeing the
vector if the set becomes empty.

I will expand the comment to mention this case, and consider adjusting
it if this series get merged.

> >>> + * 7. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
> >>> + *    vector entry and decrease the count in move_cleanup_count.  The CPU that
> >>> + *    sets move_cleanup_count to 0 releases the vector.
> >>> + *
> >>> + * Note that when interrupt movement (either move_in_progress or
> >>> + * move_cleanup_count set) is in progress it's not possible to move the
> >>> + * interrupt to yet a different CPU.
> >>> + *
> >>> + * By keeping the vector in the old CPU(s) configured until the interrupt is
> >>> + * acked on the new destination Xen allows draining any pending interrupts at
> >>> + * the old destinations.
> >>> + */
> >>>  struct arch_irq_desc {
> >>>          s16 vector;                  /* vector itself is only 8 bits, */
> >>>          s16 old_vector;              /* but we use -1 for unassigned  */
> >>
> >> I take it that it is not a goal to (also) describe under what conditions
> >> an IRQ move may actually be initiated (IRQ_MOVE_PENDING)? I ask not the
> >> least because the 2nd from last paragraph lightly touches that area.
> > 
> > Right, I was mostly focused on moves (forcefully) initiated from
> > fixup_irqs(), which is different from the opportunistic affinity
> > changes signaled by IRQ_MOVE_PENDING.
> > 
> > Not sure whether I want to mention this ahead of the list in a
> > paragraph, or just add it as a step.  Do you have any preference?
> 
> I think ahead might be better. But I also won't insist on it being added.
> Just if you don't, perhaps mention in the description that leaving that
> out is intentional.

No, I'm fine with adding it.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-31  7:31             ` Roger Pau Monné
@ 2024-05-31  8:33               ` Jan Beulich
  2024-05-31  9:15                 ` Roger Pau Monné
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Beulich @ 2024-05-31  8:33 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 31.05.2024 09:31, Roger Pau Monné wrote:
> On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
>> On 29.05.2024 18:14, Roger Pau Monné wrote:
>>> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
>>>> On 29.05.2024 17:03, Roger Pau Monné wrote:
>>>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
>>>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
>>>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
>>>>>>> a cpu_hotplug_{begin,done}() region the function will still return success,
>>>>>>> because a CPU taking the rwlock in read mode after having taken it in write
>>>>>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
>>>>>>> as it should always return false when called with a CPU hot{,un}plug operation
>>>>>>> is in progress.
>>>>>>
>>>>>> I'm not sure I can agree with this. The CPU doing said operation ought to be
>>>>>> aware of what it is itself doing. And all other CPUs will get back false from
>>>>>> get_cpu_maps().
>>>>>
>>>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
>>>>> the interrupts that might be handled while that operation is in
>>>>> progress, see below for a concrete example.
>>>>>
>>>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
>>>>>>> as it could decide to use the shorthand even when a CPU operation is in
>>>>>>> progress.
>>>>>>
>>>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
>>>>>> offline enough to not be in cpu_online_map anymore, it may well need to still
>>>>>> be the target of IPIs, and targeting it with a shorthand then is still fine.
>>>>>
>>>>> The issue is in the online path: there's a window where the CPU is
>>>>> online (and the lapic active), but cpu_online_map hasn't been updated
>>>>> yet.  A specific example would be time_calibration() being executed on
>>>>> the CPU that is running cpu_up().  That could result in a shorthand
>>>>> IPI being used, but the mask in r.cpu_calibration_map not containing
>>>>> the CPU that's being brought up online because it's not yet added to
>>>>> cpu_online_map.  Then the number of CPUs actually running
>>>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
>>>>> in r.cpu_calibration_map.
>>>>
>>>> I see, but maybe only partly. Prior to the CPU having its bit set in
>>>> cpu_online_map, can it really take interrupts already? Shouldn't it be
>>>> running with IRQs off until later, thus preventing it from making it
>>>> into the rendezvous function in the first place? But yes, I can see
>>>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
>>>> might cause problems, too.
>>>
>>> The interrupt will get set in IRR and handled when interrupts are
>>> enabled.
>>>
>>>>
>>>> Plus, with how the rendezvous function is invoked (via
>>>> on_selected_cpus() with the mask copied from cpu_online_map), the
>>>> first check in smp_call_function_interrupt() ought to prevent the
>>>> function from being called on the CPU being onlined. A problem would
>>>> arise though if the IPI arrived later and call_data was already
>>>> (partly or fully) overwritten with the next request.
>>>
>>> Yeah, there's a small window where the fields in call_data are out of
>>> sync.
>>>
>>>>>> In any event this would again affect only the CPU leading the CPU operation,
>>>>>> which should clearly know at which point(s) it is okay to send IPIs. Are we
>>>>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
>>>>>
>>>>> Yes, I've seen the time rendezvous happening while in the middle of a
>>>>> hotplug operation, and the CPU coordinating the rendezvous being the
>>>>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
>>>>
>>>> Right, yet together with ...
>>>>
>>>>>> Together with the earlier paragraph the critical window would be between the
>>>>>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
>>>>>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
>>>>>> then the question would be what bad, if any, would happen to that CPU if an
>>>>>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
>>>>>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
>>>>>>
>>>>>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
>>>>>>> already hold in write mode by the current CPU, as read_trylock() would
>>>>>>> otherwise return true.
>>>>>>>
>>>>>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
>>>>>>
>>>>>> I'm puzzled by this as well: Prior to that and the change referenced by its
>>>>>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
>>>>>> same as permitting read locking even when the write lock is already held by
>>>>>> the local CPU.
>>>>>
>>>>> I see, so the Fixes should be:
>>>>>
>>>>> x86/smp: use APIC ALLBUT destination shorthand when possible
>>>>>
>>>>> Instead, which is the commit that started using get_cpu_maps() in
>>>>> send_IPI_mask().
>>>>
>>>> ... this I then wonder whether it's really only the condition in
>>>> send_IPI_mask() which needs further amending, rather than fiddling with
>>>> get_cpu_maps().
>>>
>>> That the other option, but I have impression it's more fragile to
>>> adjust the condition in send_IPI_mask() rather than fiddle with
>>> get_cpu_maps().
>>>
>>> However if that's the preference I can adjust.
>>
>> I guess we need other REST input here then. The two of us clearly disagree on
>> what use of get_cpu_maps() is meant to guarantee. And I deem fiddling with
>> common code here more risky (and more intrusive - the other change would be
>> a single-line code change afaict, plus extending the related comment).
> 
> How do you envision that other change to be done?  Adding an extra
> variable and toggling it in cpu_hotplug_{begin,done}() to signal
> whether a CPU hotplug is in progress?

I was thinking of an is-write-locked-by-me check on cpu_add_remove_lock.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway
  2024-05-31  8:33               ` Jan Beulich
@ 2024-05-31  9:15                 ` Roger Pau Monné
  0 siblings, 0 replies; 33+ messages in thread
From: Roger Pau Monné @ 2024-05-31  9:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Fri, May 31, 2024 at 10:33:58AM +0200, Jan Beulich wrote:
> On 31.05.2024 09:31, Roger Pau Monné wrote:
> > On Fri, May 31, 2024 at 09:02:20AM +0200, Jan Beulich wrote:
> >> On 29.05.2024 18:14, Roger Pau Monné wrote:
> >>> On Wed, May 29, 2024 at 05:49:48PM +0200, Jan Beulich wrote:
> >>>> On 29.05.2024 17:03, Roger Pau Monné wrote:
> >>>>> On Wed, May 29, 2024 at 03:35:04PM +0200, Jan Beulich wrote:
> >>>>>> On 29.05.2024 11:01, Roger Pau Monne wrote:
> >>>>>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> >>>>>>> a cpu_hotplug_{begin,done}() region the function will still return success,
> >>>>>>> because a CPU taking the rwlock in read mode after having taken it in write
> >>>>>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> >>>>>>> as it should always return false when called with a CPU hot{,un}plug operation
> >>>>>>> is in progress.
> >>>>>>
> >>>>>> I'm not sure I can agree with this. The CPU doing said operation ought to be
> >>>>>> aware of what it is itself doing. And all other CPUs will get back false from
> >>>>>> get_cpu_maps().
> >>>>>
> >>>>> Well, the CPU is aware in the context of cpu_{up,down}(), but not in
> >>>>> the interrupts that might be handled while that operation is in
> >>>>> progress, see below for a concrete example.
> >>>>>
> >>>>>>>  Otherwise the logic in send_IPI_mask() for example is wrong,
> >>>>>>> as it could decide to use the shorthand even when a CPU operation is in
> >>>>>>> progress.
> >>>>>>
> >>>>>> It's also not becoming clear what's wrong there: As long as a CPU isn't
> >>>>>> offline enough to not be in cpu_online_map anymore, it may well need to still
> >>>>>> be the target of IPIs, and targeting it with a shorthand then is still fine.
> >>>>>
> >>>>> The issue is in the online path: there's a window where the CPU is
> >>>>> online (and the lapic active), but cpu_online_map hasn't been updated
> >>>>> yet.  A specific example would be time_calibration() being executed on
> >>>>> the CPU that is running cpu_up().  That could result in a shorthand
> >>>>> IPI being used, but the mask in r.cpu_calibration_map not containing
> >>>>> the CPU that's being brought up online because it's not yet added to
> >>>>> cpu_online_map.  Then the number of CPUs actually running
> >>>>> time_calibration_rendezvous_fn won't match the weight of the cpumask
> >>>>> in r.cpu_calibration_map.
> >>>>
> >>>> I see, but maybe only partly. Prior to the CPU having its bit set in
> >>>> cpu_online_map, can it really take interrupts already? Shouldn't it be
> >>>> running with IRQs off until later, thus preventing it from making it
> >>>> into the rendezvous function in the first place? But yes, I can see
> >>>> how the IRQ (IPI) then being delivered later (once IRQs are enabled)
> >>>> might cause problems, too.
> >>>
> >>> The interrupt will get set in IRR and handled when interrupts are
> >>> enabled.
> >>>
> >>>>
> >>>> Plus, with how the rendezvous function is invoked (via
> >>>> on_selected_cpus() with the mask copied from cpu_online_map), the
> >>>> first check in smp_call_function_interrupt() ought to prevent the
> >>>> function from being called on the CPU being onlined. A problem would
> >>>> arise though if the IPI arrived later and call_data was already
> >>>> (partly or fully) overwritten with the next request.
> >>>
> >>> Yeah, there's a small window where the fields in call_data are out of
> >>> sync.
> >>>
> >>>>>> In any event this would again affect only the CPU leading the CPU operation,
> >>>>>> which should clearly know at which point(s) it is okay to send IPIs. Are we
> >>>>>> actually sending any IPIs from within CPU-online or CPU-offline paths?
> >>>>>
> >>>>> Yes, I've seen the time rendezvous happening while in the middle of a
> >>>>> hotplug operation, and the CPU coordinating the rendezvous being the
> >>>>> one doing the CPU hotplug operation, so get_cpu_maps() returning true.
> >>>>
> >>>> Right, yet together with ...
> >>>>
> >>>>>> Together with the earlier paragraph the critical window would be between the
> >>>>>> CPU being taken off of cpu_online_map and the CPU actually going "dead" (i.e.
> >>>>>> on x86: its LAPIC becoming unresponsive to other than INIT/SIPI). And even
> >>>>>> then the question would be what bad, if any, would happen to that CPU if an
> >>>>>> IPI was still targeted at it by way of using the shorthand. I'm pretty sure
> >>>>>> it runs with IRQs off at that time, so no ordinary IRQ could be delivered.
> >>>>>>
> >>>>>>> Adjust the logic in get_cpu_maps() to return false when the CPUs lock is
> >>>>>>> already hold in write mode by the current CPU, as read_trylock() would
> >>>>>>> otherwise return true.
> >>>>>>>
> >>>>>>> Fixes: 868a01021c6f ('rwlock: allow recursive read locking when already locked in write mode')
> >>>>>>
> >>>>>> I'm puzzled by this as well: Prior to that and the change referenced by its
> >>>>>> Fixes: tag, recursive spin locks were used. For the purposes here that's the
> >>>>>> same as permitting read locking even when the write lock is already held by
> >>>>>> the local CPU.
> >>>>>
> >>>>> I see, so the Fixes should be:
> >>>>>
> >>>>> x86/smp: use APIC ALLBUT destination shorthand when possible
> >>>>>
> >>>>> Instead, which is the commit that started using get_cpu_maps() in
> >>>>> send_IPI_mask().
> >>>>
> >>>> ... this I then wonder whether it's really only the condition in
> >>>> send_IPI_mask() which needs further amending, rather than fiddling with
> >>>> get_cpu_maps().
> >>>
> >>> That the other option, but I have impression it's more fragile to
> >>> adjust the condition in send_IPI_mask() rather than fiddle with
> >>> get_cpu_maps().
> >>>
> >>> However if that's the preference I can adjust.
> >>
> >> I guess we need other REST input here then. The two of us clearly disagree on
> >> what use of get_cpu_maps() is meant to guarantee. And I deem fiddling with
> >> common code here more risky (and more intrusive - the other change would be
> >> a single-line code change afaict, plus extending the related comment).
> > 
> > How do you envision that other change to be done?  Adding an extra
> > variable and toggling it in cpu_hotplug_{begin,done}() to signal
> > whether a CPU hotplug is in progress?
> 
> I was thinking of an is-write-locked-by-me check on cpu_add_remove_lock.

Oh, so basically open-coding what I proposed here as get_cpu_maps() in
send_IPI_mask().  Unless anyone else expresses interest in my current
proposal I would switch to that.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug
  2024-05-29  9:37 ` [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Oleksii K.
@ 2024-06-11 13:25   ` Jan Beulich
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Beulich @ 2024-06-11 13:25 UTC (permalink / raw)
  To: Oleksii K.
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	Roger Pau Monne, xen-devel

On 29.05.2024 11:37, Oleksii K. wrote:
> On Wed, 2024-05-29 at 11:01 +0200, Roger Pau Monne wrote:
>> Hello,
>>
>> The following series aim to fix interrupt handling when doing CPU
>> plug/unplug operations.  Without this series running:
>>
>> cpus=`xl info max_cpu_id`
>> while [ 1 ]; do
>>     for i in `seq 1 $cpus`; do
>>         xen-hptool cpu-offline $i;
>>         xen-hptool cpu-online $i;
>>     done
>> done
>>
>> Quite quickly results in interrupts getting lost and "No irq handler
>> for
>> vector" messages on the Xen console.  Drivers in dom0 also start
>> getting
>> interrupt timeouts and the system becomes unusable.
>>
>> After applying the series running the loop over night still result in
>> a
>> fully usable system, no  "No irq handler for vector" messages at all,
>> no
>> interrupt loses reported by dom0.  Test with
>> x2apic-mode={mixed,cluster}.
>>
>> I'm tagging this for 4.19 as it's IMO bugfixes, but the series has
>> grown
>> quite bigger than expected, and hence we need to be careful to not
>> introduce breakages late in the release cycle.  I've attempted to
>> document all code as good as I could, interrupt handling has some
>> unexpected corner cases that are hard to diagnose and reason about.
> Despite of the fact that it can be considered as bugfixes, it seems to
> me that this patch series can be risky. Let's wait for maintainers
> opinion...

Working my way through v2 of this series, I think I'd be okay with
including stuff there up to patch 5. Patch 6, which I just finished
taking a first look at, is likely correct (and it's just me missing some
aspects to fully grok the changes done there), but at the same time
looks to be more intrusive than we would like to have it at this point
of the release cycle. That said, I'd be pretty okay to be overridden in
this regard by Roger and/or Andrew.

Jan


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2024-06-11 13:25 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-29  9:01 [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
2024-05-29  9:01 ` [PATCH for-4.19 1/9] x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count Roger Pau Monne
2024-05-29 12:40   ` Jan Beulich
2024-05-29 15:15     ` Roger Pau Monné
2024-05-29 15:27       ` Jan Beulich
2024-05-29 15:34         ` Roger Pau Monné
2024-05-29  9:01 ` [PATCH for-4.19 2/9] xen/cpu: do not get the CPU map in stop_machine_run() Roger Pau Monne
2024-05-29 13:04   ` Jan Beulich
2024-05-29 15:20     ` Roger Pau Monné
2024-05-29 15:31       ` Jan Beulich
2024-05-29 15:48         ` Roger Pau Monné
2024-05-29  9:01 ` [PATCH for-4.19 3/9] xen/cpu: ensure get_cpu_maps() returns false if CPU operations are underway Roger Pau Monne
2024-05-29 13:35   ` Jan Beulich
2024-05-29 15:03     ` Roger Pau Monné
2024-05-29 15:49       ` Jan Beulich
2024-05-29 16:14         ` Roger Pau Monné
2024-05-31  7:02           ` Jan Beulich
2024-05-31  7:31             ` Roger Pau Monné
2024-05-31  8:33               ` Jan Beulich
2024-05-31  9:15                 ` Roger Pau Monné
2024-05-29  9:01 ` [PATCH for-4.19 4/9] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
2024-05-29 13:57   ` Jan Beulich
2024-05-29 15:28     ` Roger Pau Monné
2024-05-31  7:06       ` Jan Beulich
2024-05-31  7:39         ` Roger Pau Monné
2024-05-29  9:01 ` [PATCH for-4.19 5/9] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
2024-05-29  9:01 ` [PATCH for-4.19 6/9] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
2024-05-29  9:01 ` [PATCH for-4.19 7/9] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
2024-05-29  9:01 ` [PATCH for-4.19 8/9] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
2024-05-29  9:01 ` [PATCH for-4.19 9/9] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
2024-05-29  9:33   ` Andrew Cooper
2024-05-29  9:37 ` [PATCH for-4.19 0/9] x86/irq: fixes for CPU hot{,un}plug Oleksii K.
2024-06-11 13:25   ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.