All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug
@ 2024-06-10 14:20 Roger Pau Monne
  2024-06-10 14:20 ` [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts Roger Pau Monne
                   ` (6 more replies)
  0 siblings, 7 replies; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini

Hello,

The following series aim to fix interrupt handling when doing CPU
plug/unplug operations.  Without this series running:

cpus=`xl info max_cpu_id`
while [ 1 ]; do
    for i in `seq 1 $cpus`; do
        xen-hptool cpu-offline $i;
        xen-hptool cpu-online $i;
    done
done

Quite quickly results in interrupts getting lost and "No irq handler for
vector" messages on the Xen console.  Drivers in dom0 also start getting
interrupt timeouts and the system becomes unusable.

After applying the series running the loop over night still result in a
fully usable system, no  "No irq handler for vector" messages at all, no
interrupt loses reported by dom0.  Test with x2apic-mode={mixed,cluster}.

I've attempted to document all code as good as I could, interrupt
handling has some unexpected corner cases that are hard to diagnose and
reason about.

I'm currently also doing some extra testing with XenRT in case I've
missed something.

Thanks, Roger.

Roger Pau Monne (7):
  x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug
    contexts
  x86/irq: describe how the interrupt CPU movement works
  x86/irq: limit interrupt movement done by fixup_irqs()
  x86/irq: restrict CPU movement in set_desc_affinity()
  x86/irq: deal with old_cpu_mask for interrupts in movement in
    fixup_irqs()
  x86/irq: handle moving interrupts in _assign_irq_vector()
  x86/irq: forward pending interrupts to new destination in fixup_irqs()

 xen/arch/x86/include/asm/apic.h |   5 +
 xen/arch/x86/include/asm/irq.h  |  40 ++++++-
 xen/arch/x86/irq.c              | 197 ++++++++++++++++++++++++--------
 xen/arch/x86/smp.c              |   2 +-
 xen/common/cpu.c                |   5 +
 xen/include/xen/cpu.h           |  10 ++
 xen/include/xen/rwlock.h        |   2 +
 7 files changed, 214 insertions(+), 47 deletions(-)

-- 
2.44.0



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11  7:42   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel
  Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper, George Dunlap,
	Julien Grall, Stefano Stabellini

Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
a cpu_hotplug_{begin,done}() region the function will still return success,
because a CPU taking the rwlock in read mode after having taken it in write
mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
as it should always return false when called with a CPU hot{,un}plug operation
is in progress.  Otherwise the logic in send_IPI_mask() is wrong, as it could
decide to use the shorthand even when a CPU operation is in progress.

Introduce a new helper to detect whether the current caller is between a
cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
shorthand usage.

Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Modify send_IPI_mask() to detect CPU hotplug context.
---
 xen/arch/x86/smp.c       |  2 +-
 xen/common/cpu.c         |  5 +++++
 xen/include/xen/cpu.h    | 10 ++++++++++
 xen/include/xen/rwlock.h |  2 ++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index 7443ad20335e..04c6a0572319 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -88,7 +88,7 @@ void send_IPI_mask(const cpumask_t *mask, int vector)
      * the system have been accounted for.
      */
     if ( system_state > SYS_STATE_smp_boot &&
-         !unaccounted_cpus && !disabled_cpus &&
+         !unaccounted_cpus && !disabled_cpus && !cpu_in_hotplug_context() &&
          /* NB: get_cpu_maps lock requires enabled interrupts. */
          local_irq_is_enabled() && (cpus_locked = get_cpu_maps()) &&
          (park_offline_cpus ||
diff --git a/xen/common/cpu.c b/xen/common/cpu.c
index 8709db4d2957..6e35b114c080 100644
--- a/xen/common/cpu.c
+++ b/xen/common/cpu.c
@@ -68,6 +68,11 @@ void cpu_hotplug_done(void)
     write_unlock(&cpu_add_remove_lock);
 }
 
+bool cpu_in_hotplug_context(void)
+{
+    return rw_is_write_locked_by_me(&cpu_add_remove_lock);
+}
+
 static NOTIFIER_HEAD(cpu_chain);
 
 void __init register_cpu_notifier(struct notifier_block *nb)
diff --git a/xen/include/xen/cpu.h b/xen/include/xen/cpu.h
index e1d4eb59675c..6bf578675008 100644
--- a/xen/include/xen/cpu.h
+++ b/xen/include/xen/cpu.h
@@ -13,6 +13,16 @@ void put_cpu_maps(void);
 void cpu_hotplug_begin(void);
 void cpu_hotplug_done(void);
 
+/*
+ * Returns true when the caller CPU is between a cpu_hotplug_{begin,done}()
+ * region.
+ *
+ * This is required to safely identify hotplug contexts, as get_cpu_maps()
+ * would otherwise succeed because a caller holding the lock in write mode is
+ * allowed to acquire the same lock in read mode.
+ */
+bool cpu_in_hotplug_context(void);
+
 /* Receive notification of CPU hotplug events. */
 void register_cpu_notifier(struct notifier_block *nb);
 
diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h
index a2e98cad343e..4e7802821859 100644
--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -316,6 +316,8 @@ static always_inline void write_lock_irq(rwlock_t *l)
 
 #define rw_is_locked(l)               _rw_is_locked(l)
 #define rw_is_write_locked(l)         _rw_is_write_locked(l)
+#define rw_is_write_locked_by_me(l) \
+    lock_evaluate_nospec(_is_write_locked_by_me(atomic_read(&(l)->cnts)))
 
 
 typedef struct percpu_rwlock percpu_rwlock_t;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
  2024-06-10 14:20 ` [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11  7:44   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The logic to move interrupts across CPUs is complex, attempt to provide a
comment that describes the expected behavior so users of the interrupt system
have more context about the usage of the arch_irq_desc structure fields.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Mention the logic involved in IRQ_MOVE_PENDING and the reduction of
   old_cpu_mask.
---
 xen/arch/x86/include/asm/irq.h | 38 ++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index 413994d2133b..94f634ce10a1 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -28,6 +28,44 @@ typedef struct {
 
 struct irq_desc;
 
+/*
+ * Xen logic for moving interrupts around CPUs allows manipulating interrupts
+ * that target remote CPUs.  The logic to move an interrupt from CPU(s) is as
+ * follows:
+ *
+ * 1. irq_set_affinity() is called with the new destination mask, such mask is
+ *    copied into pending_mask and IRQ_MOVE_PENDING is set in status to notice
+ *    an affinity change has been requested.
+ * 2. An interrupt acked with the IRQ_MOVE_PENDING will trigger the logic to
+ *    migrate it to a destination in pending_mask as long as the mask contains
+ *    any online CPUs.
+ * 3. cpu_mask and vector is copied to old_cpu_mask and old_vector.
+ * 4. New cpu_mask and vector are set, vector is setup at the new destination.
+ * 5. move_in_progress is set.
+ * 6. Interrupt source is updated to target new CPU and vector.
+ * 7. Interrupts arriving at old_cpu_mask are processed normally.
+ * 8. When the first interrupt is delivered at the new destination (cpu_mask) as
+ *    part of acking the interrupt the cleanup of the old destination(s) is
+ *    engaged.  move_in_progress is cleared and old_cpu_mask is
+ *    reduced to the online CPUs.  If the result is empty the old vector is
+ *    released.  Otherwise move_cleanup_count is set to the weight of online
+ *    CPUs in old_cpu_mask and IRQ_MOVE_CLEANUP_VECTOR is sent to them.
+ * 9. When receiving IRQ_MOVE_CLEANUP_VECTOR CPUs in old_cpu_mask clean the
+ *    vector entry and decrease the count in move_cleanup_count.  The CPU that
+ *    sets move_cleanup_count to 0 releases the vector.
+ *
+ * Note that when interrupt movement (either move_in_progress or
+ * move_cleanup_count set) is in progress it's not possible to move the
+ * interrupt to yet a different CPU.
+ *
+ * Interrupt movements done by fixup_irqs() skip setting IRQ_MOVE_PENDING and
+ * pending_mask as the movement must be performed right away, and so start
+ * directly from step 3.
+ *
+ * By keeping the vector in the old CPU(s) configured until the interrupt is
+ * acked on the new destination Xen allows draining any pending interrupts at
+ * the old destinations.
+ */
 struct arch_irq_desc {
         s16 vector;                  /* vector itself is only 8 bits, */
         s16 old_vector;              /* but we use -1 for unassigned  */
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs()
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
  2024-06-10 14:20 ` [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts Roger Pau Monne
  2024-06-10 14:20 ` [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11  9:59   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such mask can have all bits set,
and hence is unlikely to be a subset of the input mask.  For example if an
interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
an all set CPU mask would cause that interrupt to be shuffled around
unconditionally.

What fixup_irqs() care about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask.  Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.

Doing the affinity movement based on ->arch.cpu_mask requires removing the
special handling to ->arch.cpu_mask done for high priority vectors, otherwise
the adjustment done to cpu_mask makes them always skip the CPU interrupt
movement.

While there also adjust the comment as to the purpose of fixup_irqs().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Adjust handling of high priority vectors.

Changes since v0 (outside the series):
 - Do not AND ->arch.cpu_mask with cpu_online_map.
 - Keep using cpumask_subset().
 - Integrate into bigger series.
---
 xen/arch/x86/include/asm/irq.h |  2 +-
 xen/arch/x86/irq.c             | 21 +++++++++++----------
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/xen/arch/x86/include/asm/irq.h b/xen/arch/x86/include/asm/irq.h
index 94f634ce10a1..5f445299be61 100644
--- a/xen/arch/x86/include/asm/irq.h
+++ b/xen/arch/x86/include/asm/irq.h
@@ -168,7 +168,7 @@ void free_domain_pirqs(struct domain *d);
 int map_domain_emuirq_pirq(struct domain *d, int pirq, int emuirq);
 int unmap_domain_pirq_emuirq(struct domain *d, int pirq);
 
-/* Reset irq affinities to match the given CPU mask. */
+/* Evacuate interrupts assigned to CPUs not present in the input CPU mask. */
 void fixup_irqs(const cpumask_t *mask, bool verbose);
 void fixup_eoi(void);
 
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index d101ffeaf9f3..263e502bc0f6 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2516,7 +2516,7 @@ static int __init cf_check setup_dump_irqs(void)
 }
 __initcall(setup_dump_irqs);
 
-/* Reset irq affinities to match the given CPU mask. */
+/* Evacuate interrupts assigned to CPUs not present in the input CPU mask. */
 void fixup_irqs(const cpumask_t *mask, bool verbose)
 {
     unsigned int irq;
@@ -2540,19 +2540,15 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
 
         vector = irq_to_vector(irq);
         if ( vector >= FIRST_HIPRIORITY_VECTOR &&
-             vector <= LAST_HIPRIORITY_VECTOR )
+             vector <= LAST_HIPRIORITY_VECTOR &&
+             desc->handler == &no_irq_type )
         {
-            cpumask_and(desc->arch.cpu_mask, desc->arch.cpu_mask, mask);
-
             /*
              * This can in particular happen when parking secondary threads
              * during boot and when the serial console wants to use a PCI IRQ.
              */
-            if ( desc->handler == &no_irq_type )
-            {
-                spin_unlock(&desc->lock);
-                continue;
-            }
+            spin_unlock(&desc->lock);
+            continue;
         }
 
         if ( desc->arch.move_cleanup_count )
@@ -2573,7 +2569,12 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
                                affinity);
         }
 
-        if ( !desc->action || cpumask_subset(desc->affinity, mask) )
+        /*
+         * Avoid shuffling the interrupt around as long as current target CPUs
+         * are a subset of the input mask.  What fixup_irqs() cares about is
+         * evacuating interrupts from CPUs not in the input mask.
+         */
+        if ( !desc->action || cpumask_subset(desc->arch.cpu_mask, mask) )
         {
             spin_unlock(&desc->lock);
             continue;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity()
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (2 preceding siblings ...)
  2024-06-10 14:20 ` [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11 10:20   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

If external interrupts are using logical mode it's possible to have an overlap
between the current ->arch.cpu_mask and the provided mask (or TARGET_CPUS).  If
that's the case avoid assigning a new vector and just move the interrupt to a
member of ->arch.cpu_mask that overlaps with the provided mask and is online.

While there also add an extra assert to ensure the mask containing the possible
destinations is not empty before calling cpu_mask_to_apicid(), as at that point
having an empty mask is not expected.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 263e502bc0f6..306e7756112f 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -837,19 +837,38 @@ void cf_check irq_complete_move(struct irq_desc *desc)
 
 unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
 {
-    int ret;
-    unsigned long flags;
     cpumask_t dest_mask;
 
     if ( mask && !cpumask_intersects(mask, &cpu_online_map) )
         return BAD_APICID;
 
-    spin_lock_irqsave(&vector_lock, flags);
-    ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);
-    spin_unlock_irqrestore(&vector_lock, flags);
+    /*
+     * mask input set can contain CPUs that are not online.  To decide whether
+     * the interrupt needs to be migrated restrict the input mask to the CPUs
+     * that are online.
+     */
+    if ( mask )
+        cpumask_and(&dest_mask, mask, &cpu_online_map);
+    else
+        cpumask_copy(&dest_mask, TARGET_CPUS);
 
-    if ( ret < 0 )
-        return BAD_APICID;
+    /*
+     * Only move the interrupt if there are no CPUs left in ->arch.cpu_mask
+     * that can handle it, otherwise just shuffle it around ->arch.cpu_mask
+     * to an available destination.
+     */
+    if ( !cpumask_intersects(desc->arch.cpu_mask, &dest_mask) )
+    {
+        int ret;
+        unsigned long flags;
+
+        spin_lock_irqsave(&vector_lock, flags);
+        ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);
+        spin_unlock_irqrestore(&vector_lock, flags);
+
+        if ( ret < 0 )
+            return BAD_APICID;
+    }
 
     if ( mask )
     {
@@ -862,6 +881,7 @@ unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
         cpumask_copy(&dest_mask, desc->arch.cpu_mask);
     }
     cpumask_and(&dest_mask, &dest_mask, &cpu_online_map);
+    ASSERT(!cpumask_empty(&dest_mask));
 
     return cpu_mask_to_apicid(&dest_mask);
 }
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (3 preceding siblings ...)
  2024-06-10 14:20 ` [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11 12:45   ` Jan Beulich
  2024-06-11 13:47   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
  2024-06-10 14:20 ` [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
  6 siblings, 2 replies; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Given the current logic it's possible for ->arch.old_cpu_mask to get out of
sync: if a CPU set in old_cpu_mask is offlined and then onlined
again without old_cpu_mask having been updated the data in the mask will no
longer be accurate, as when brought back online the CPU will no longer have
old_vector configured to handle the old interrupt source.

If there's an interrupt movement in progress, and the to be offlined CPU (which
is the call context) is in the old_cpu_mask clear it and update the mask, so it
doesn't contain stale data.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/irq.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 306e7756112f..f07e09b63b53 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2546,7 +2546,7 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
     for ( irq = 0; irq < nr_irqs; irq++ )
     {
         bool break_affinity = false, set_affinity = true;
-        unsigned int vector;
+        unsigned int vector, cpu = smp_processor_id();
         cpumask_t *affinity = this_cpu(scratch_cpumask);
 
         if ( irq == 2 )
@@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
                                affinity);
         }
 
+        if ( desc->arch.move_in_progress &&
+             !cpumask_test_cpu(cpu, &cpu_online_map) &&
+             cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
+        {
+            /*
+             * This CPU is going offline, remove it from ->arch.old_cpu_mask
+             * and possibly release the old vector if the old mask becomes
+             * empty.
+             *
+             * Note cleaning ->arch.old_cpu_mask is required if the CPU is
+             * brought offline and then online again, as when re-onlined the
+             * per-cpu vector table will no longer have ->arch.old_vector
+             * setup, and hence ->arch.old_cpu_mask would be stale.
+             */
+            cpumask_clear_cpu(cpu, desc->arch.old_cpu_mask);
+            if ( cpumask_empty(desc->arch.old_cpu_mask) )
+            {
+                desc->arch.move_in_progress = 0;
+                release_old_vec(desc);
+            }
+        }
+
         /*
          * Avoid shuffling the interrupt around as long as current target CPUs
          * are a subset of the input mask.  What fixup_irqs() cares about is
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (4 preceding siblings ...)
  2024-06-10 14:20 ` [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11 13:18   ` Jan Beulich
  2024-06-10 14:20 ` [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

Currently there's logic in fixup_irqs() that attempts to prevent
_assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
interrupts from the CPUs not present in the input mask.  The current logic in
fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.

Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
_assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
to deal with interrupts that have either move_{in_progress,cleanup_count} set
and no remaining online CPUs in ->arch.cpu_mask.

If _assign_irq_vector() is requested to move an interrupt in the state
described above, first attempt to see if ->arch.old_cpu_mask contains any valid
CPUs that could be used as fallback, and if that's the case do move the
interrupt back to the previous destination.  Note this is easier because the
vector hasn't been released yet, so there's no need to allocate and setup a new
vector on the destination.

Due to the logic in fixup_irqs() that clears offline CPUs from
->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
shouldn't be possible to get into _assign_irq_vector() with
->arch.move_{in_progress,cleanup_count} set but no online CPUs in
->arch.old_cpu_mask.

However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
longer part of the affinity set, move the interrupt to a different CPU part of
the provided mask and keep the current ->arch.old_{cpu_mask,vector} for the
pending interrupt movement to be completed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Further refine the logic in _assign_irq_vector().
---
 xen/arch/x86/irq.c | 87 ++++++++++++++++++++++++++++++----------------
 1 file changed, 58 insertions(+), 29 deletions(-)

diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index f07e09b63b53..54eabd23995c 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
     }
 
     if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
-        return -EAGAIN;
+    {
+        /*
+         * If the current destination is online refuse to shuffle.  Retry after
+         * the in-progress movement has finished.
+         */
+        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
+            return -EAGAIN;
+
+        /*
+         * Due to the logic in fixup_irqs() that clears offlined CPUs from
+         * ->arch.old_cpu_mask it shouldn't be possible to get here with
+         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
+         * ->arch.old_cpu_mask.
+         */
+        ASSERT(valid_irq_vector(desc->arch.old_vector));
+        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
+
+        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
+        {
+            /*
+             * Fallback to the old destination if moving is in progress and the
+             * current destination is to be offlined.  This is only possible if
+             * the CPUs in old_cpu_mask intersect with the affinity mask passed
+             * in the 'mask' parameter.
+             */
+            desc->arch.vector = desc->arch.old_vector;
+            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
+
+            /* Undo any possibly done cleanup. */
+            for_each_cpu(cpu, desc->arch.cpu_mask)
+                per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
+
+            /* Cancel the pending move. */
+            desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
+            cpumask_clear(desc->arch.old_cpu_mask);
+            desc->arch.move_in_progress = 0;
+            desc->arch.move_cleanup_count = 0;
+
+            return 0;
+        }
+
+        /*
+         * There's an interrupt movement in progress but the destination(s) in
+         * ->arch.old_cpu_mask are not suitable given the passed 'mask', go
+         * through the full logic to find a new vector in a suitable CPU.
+         */
+    }
 
     err = -ENOSPC;
 
@@ -600,7 +646,17 @@ next:
         current_vector = vector;
         current_offset = offset;
 
-        if ( valid_irq_vector(old_vector) )
+        if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
+        {
+            ASSERT(!cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map));
+            /*
+             * Special case when evacuating an interrupt from a CPU to be
+             * offlined and the interrupt was already in the process of being
+             * moved.  Leave ->arch.old_{vector,cpu_mask} as-is and just
+             * replace ->arch.{cpu_mask,vector} with the new destination.
+             */
+        }
+        else if ( valid_irq_vector(old_vector) )
         {
             cpumask_and(desc->arch.old_cpu_mask, desc->arch.cpu_mask,
                         &cpu_online_map);
@@ -2622,33 +2678,6 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
             continue;
         }
 
-        /*
-         * In order for the affinity adjustment below to be successful, we
-         * need _assign_irq_vector() to succeed. This in particular means
-         * clearing desc->arch.move_in_progress if this would otherwise
-         * prevent the function from succeeding. Since there's no way for the
-         * flag to get cleared anymore when there's no possible destination
-         * left (the only possibility then would be the IRQs enabled window
-         * after this loop), there's then also no race with us doing it here.
-         *
-         * Therefore the logic here and there need to remain in sync.
-         */
-        if ( desc->arch.move_in_progress &&
-             !cpumask_intersects(mask, desc->arch.cpu_mask) )
-        {
-            unsigned int cpu;
-
-            cpumask_and(affinity, desc->arch.old_cpu_mask, &cpu_online_map);
-
-            spin_lock(&vector_lock);
-            for_each_cpu(cpu, affinity)
-                per_cpu(vector_irq, cpu)[desc->arch.old_vector] = ~irq;
-            spin_unlock(&vector_lock);
-
-            release_old_vec(desc);
-            desc->arch.move_in_progress = 0;
-        }
-
         if ( !cpumask_intersects(mask, desc->affinity) )
         {
             break_affinity = true;
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
                   ` (5 preceding siblings ...)
  2024-06-10 14:20 ` [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
@ 2024-06-10 14:20 ` Roger Pau Monne
  2024-06-11 13:50   ` Jan Beulich
  6 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monne @ 2024-06-10 14:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Roger Pau Monne, Jan Beulich, Andrew Cooper

fixup_irqs() is used to evacuate interrupts from to be offlined CPUs.  Given
the CPU is to become offline, the normal migration logic used by Xen where the
vector in the previous target(s) is left configured until the interrupt is
received on the new destination is not suitable.

Instead attempt to do as much as possible in order to prevent loosing
interrupts.  If fixup_irqs() is called from the CPU to be offlined (as is
currently the case) attempt to forward pending vectors when interrupts that
target the current CPU are migrated to a different destination.

Additionally, for interrupts that have already been moved from the current CPU
prior to the call to fixup_irqs() but that haven't been delivered to the new
destination (iow: interrupts with move_in_progress set and the current CPU set
in ->arch.old_cpu_mask) also check whether the previous vector is pending and
forward it to the new destination.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Rename to apic_irr_read().
---
 xen/arch/x86/include/asm/apic.h |  5 +++++
 xen/arch/x86/irq.c              | 37 ++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/apic.h
index d1cb001fb4ab..7bd66dc6e151 100644
--- a/xen/arch/x86/include/asm/apic.h
+++ b/xen/arch/x86/include/asm/apic.h
@@ -132,6 +132,11 @@ static inline bool apic_isr_read(uint8_t vector)
             (vector & 0x1f)) & 1;
 }
 
+static inline bool apic_irr_read(unsigned int vector)
+{
+    return apic_read(APIC_IRR + (vector / 32 * 0x10)) & (1U << (vector % 32));
+}
+
 static inline u32 get_apic_id(void)
 {
     u32 id = apic_read(APIC_ID);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index 54eabd23995c..ed262fb55f4a 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -2601,7 +2601,7 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
 
     for ( irq = 0; irq < nr_irqs; irq++ )
     {
-        bool break_affinity = false, set_affinity = true;
+        bool break_affinity = false, set_affinity = true, check_irr = false;
         unsigned int vector, cpu = smp_processor_id();
         cpumask_t *affinity = this_cpu(scratch_cpumask);
 
@@ -2649,6 +2649,25 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
              !cpumask_test_cpu(cpu, &cpu_online_map) &&
              cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
         {
+            /*
+             * This to be offlined CPU was the target of an interrupt that's
+             * been moved, and the new destination target hasn't yet
+             * acknowledged any interrupt from it.
+             *
+             * We know the interrupt is configured to target the new CPU at
+             * this point, so we can check IRR for any pending vectors and
+             * forward them to the new destination.
+             *
+             * Note the difference between move_in_progress or
+             * move_cleanup_count being set.  For the later we know the new
+             * destination has already acked at least one interrupt from this
+             * source, and hence there's no need to forward any stale
+             * interrupts.
+             */
+            if ( apic_irr_read(desc->arch.old_vector) )
+                send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
+                              desc->arch.vector);
+
             /*
              * This CPU is going offline, remove it from ->arch.old_cpu_mask
              * and possibly release the old vector if the old mask becomes
@@ -2689,11 +2708,27 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
         if ( desc->handler->disable )
             desc->handler->disable(desc);
 
+        /*
+         * If the current CPU is going offline and is (one of) the target(s) of
+         * the interrupt signal to check whether there are any pending vectors
+         * to be handled in the local APIC after the interrupt has been moved.
+         */
+        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
+            check_irr = true;
+
         if ( desc->handler->set_affinity )
             desc->handler->set_affinity(desc, affinity);
         else if ( !(warned++) )
             set_affinity = false;
 
+        if ( check_irr && apic_irr_read(vector) )
+            /*
+             * Forward pending interrupt to the new destination, this CPU is
+             * going offline and otherwise the interrupt would be lost.
+             */
+            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
+                          desc->arch.vector);
+
         if ( desc->handler->enable )
             desc->handler->enable(desc);
 
-- 
2.44.0



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
  2024-06-10 14:20 ` [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts Roger Pau Monne
@ 2024-06-11  7:42   ` Jan Beulich
  2024-06-12  8:09     ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11  7:42 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> a cpu_hotplug_{begin,done}() region the function will still return success,
> because a CPU taking the rwlock in read mode after having taken it in write
> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),

I'm not happy to see you still have this claim here. The behavior may (appear
to) defeat the purpose here, but as expressed previously I don't view that as
being a general pattern.

> as it should always return false when called with a CPU hot{,un}plug operation
> is in progress.  Otherwise the logic in send_IPI_mask() is wrong, as it could
> decide to use the shorthand even when a CPU operation is in progress.
> 
> Introduce a new helper to detect whether the current caller is between a
> cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
> shorthand usage.
> 
> Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
>  - Modify send_IPI_mask() to detect CPU hotplug context.
> ---
>  xen/arch/x86/smp.c       |  2 +-
>  xen/common/cpu.c         |  5 +++++
>  xen/include/xen/cpu.h    | 10 ++++++++++
>  xen/include/xen/rwlock.h |  2 ++
>  4 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
> index 7443ad20335e..04c6a0572319 100644
> --- a/xen/arch/x86/smp.c
> +++ b/xen/arch/x86/smp.c
> @@ -88,7 +88,7 @@ void send_IPI_mask(const cpumask_t *mask, int vector)
>       * the system have been accounted for.
>       */
>      if ( system_state > SYS_STATE_smp_boot &&
> -         !unaccounted_cpus && !disabled_cpus &&
> +         !unaccounted_cpus && !disabled_cpus && !cpu_in_hotplug_context() &&
>           /* NB: get_cpu_maps lock requires enabled interrupts. */
>           local_irq_is_enabled() && (cpus_locked = get_cpu_maps()) &&
>           (park_offline_cpus ||

Along with changing the condition you also want to update the comment to
reflect the code adjustment.

If we can agree on respective wording in both places, I'd be happy to make
respective adjustments while committing; the code changes themselves are
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works
  2024-06-10 14:20 ` [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
@ 2024-06-11  7:44   ` Jan Beulich
  2024-06-12 10:08     ` Oleksii K.
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11  7:44 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> The logic to move interrupts across CPUs is complex, attempt to provide a
> comment that describes the expected behavior so users of the interrupt system
> have more context about the usage of the arch_irq_desc structure fields.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs()
  2024-06-10 14:20 ` [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
@ 2024-06-11  9:59   ` Jan Beulich
  2024-06-12  8:10     ` Roger Pau Monné
  2024-06-12 10:13     ` Oleksii K.
  0 siblings, 2 replies; 37+ messages in thread
From: Jan Beulich @ 2024-06-11  9:59 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> The current check used in fixup_irqs() to decide whether to move around
> interrupts is based on the affinity mask, but such mask can have all bits set,
> and hence is unlikely to be a subset of the input mask.  For example if an
> interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
> an all set CPU mask would cause that interrupt to be shuffled around
> unconditionally.
> 
> What fixup_irqs() care about is evacuating interrupts from CPUs not set on the
> input CPU mask, and for that purpose it should check whether the interrupt is
> assigned to a CPU not present in the input mask.  Assume that ->arch.cpu_mask
> is a subset of the ->affinity mask, and keep the current logic that resets the
> ->affinity mask if the interrupt has to be shuffled around.
> 
> Doing the affinity movement based on ->arch.cpu_mask requires removing the
> special handling to ->arch.cpu_mask done for high priority vectors, otherwise
> the adjustment done to cpu_mask makes them always skip the CPU interrupt
> movement.
> 
> While there also adjust the comment as to the purpose of fixup_irqs().
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Aiui this is independent of patch 1, so could go in while we still settle on
how to word things there?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity()
  2024-06-10 14:20 ` [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
@ 2024-06-11 10:20   ` Jan Beulich
  2024-06-12  8:31     ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11 10:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> If external interrupts are using logical mode it's possible to have an overlap
> between the current ->arch.cpu_mask and the provided mask (or TARGET_CPUS).  If
> that's the case avoid assigning a new vector and just move the interrupt to a
> member of ->arch.cpu_mask that overlaps with the provided mask and is online.

What I'm kind of missing here is an explanation of why what _assign_irq_vector()
does to avoid unnecessary migration (very first conditional there) isn't
sufficient.

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -837,19 +837,38 @@ void cf_check irq_complete_move(struct irq_desc *desc)
>  
>  unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
>  {
> -    int ret;
> -    unsigned long flags;
>      cpumask_t dest_mask;
>  
>      if ( mask && !cpumask_intersects(mask, &cpu_online_map) )
>          return BAD_APICID;
>  
> -    spin_lock_irqsave(&vector_lock, flags);
> -    ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);
> -    spin_unlock_irqrestore(&vector_lock, flags);
> +    /*
> +     * mask input set can contain CPUs that are not online.  To decide whether
> +     * the interrupt needs to be migrated restrict the input mask to the CPUs
> +     * that are online.
> +     */
> +    if ( mask )
> +        cpumask_and(&dest_mask, mask, &cpu_online_map);
> +    else
> +        cpumask_copy(&dest_mask, TARGET_CPUS);

Why once &cpu_online_map and once TARGET_CPUS?

> -    if ( ret < 0 )
> -        return BAD_APICID;
> +    /*
> +     * Only move the interrupt if there are no CPUs left in ->arch.cpu_mask
> +     * that can handle it, otherwise just shuffle it around ->arch.cpu_mask
> +     * to an available destination.
> +     */

"an available destination" (singular) gives the impression that it's only
ever going to be a single CPU. Yet cpu_mask_to_apicid_flat() and
cpu_mask_to_apicid_x2apic_cluster() can produce sets of multiple CPUs.
Therefore maybe "an available destination / the (sub)set of available
destinations"? Or as that's getting longish "(an) available destination(s)"?

> +    if ( !cpumask_intersects(desc->arch.cpu_mask, &dest_mask) )
> +    {
> +        int ret;
> +        unsigned long flags;
> +
> +        spin_lock_irqsave(&vector_lock, flags);
> +        ret = _assign_irq_vector(desc, mask ?: TARGET_CPUS);

Why not pass dest_mask here, as you now calculate that up front? The
function will skip offline CPUs anyway.

> @@ -862,6 +881,7 @@ unsigned int set_desc_affinity(struct irq_desc *desc, const cpumask_t *mask)
>          cpumask_copy(&dest_mask, desc->arch.cpu_mask);
>      }
>      cpumask_and(&dest_mask, &dest_mask, &cpu_online_map);
> +    ASSERT(!cpumask_empty(&dest_mask));
>  
>      return cpu_mask_to_apicid(&dest_mask);

I wonder whether the assertion wouldn't better live in cpu_mask_to_apicid()
itself (the macro, not the backing functions).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-10 14:20 ` [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
@ 2024-06-11 12:45   ` Jan Beulich
  2024-06-12  8:47     ` Roger Pau Monné
  2024-06-11 13:47   ` Jan Beulich
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11 12:45 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> Given the current logic it's possible for ->arch.old_cpu_mask to get out of
> sync: if a CPU set in old_cpu_mask is offlined and then onlined
> again without old_cpu_mask having been updated the data in the mask will no
> longer be accurate, as when brought back online the CPU will no longer have
> old_vector configured to handle the old interrupt source.
> 
> If there's an interrupt movement in progress, and the to be offlined CPU (which
> is the call context) is in the old_cpu_mask clear it and update the mask, so it
> doesn't contain stale data.

This imo is too __cpu_disable()-centric. In the code you cover the
smp_send_stop() case afaict, where it's all _other_ CPUs which are being
offlined. As we're not meaning to bring CPUs online again in that case,
dealing with the situation likely isn't needed. Yet the description should
imo at least make clear that the case was considered.

> @@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>                                 affinity);
>          }
>  
> +        if ( desc->arch.move_in_progress &&
> +             !cpumask_test_cpu(cpu, &cpu_online_map) &&

This part of the condition is, afaict, what covers (excludes) the
smp_send_stop() case. Might be nice to have a brief comment here, thus
also clarifying ...

> +             cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
> +        {
> +            /*
> +             * This CPU is going offline, remove it from ->arch.old_cpu_mask
> +             * and possibly release the old vector if the old mask becomes
> +             * empty.
> +             *
> +             * Note cleaning ->arch.old_cpu_mask is required if the CPU is
> +             * brought offline and then online again, as when re-onlined the
> +             * per-cpu vector table will no longer have ->arch.old_vector
> +             * setup, and hence ->arch.old_cpu_mask would be stale.
> +             */
> +            cpumask_clear_cpu(cpu, desc->arch.old_cpu_mask);
> +            if ( cpumask_empty(desc->arch.old_cpu_mask) )
> +            {
> +                desc->arch.move_in_progress = 0;
> +                release_old_vec(desc);
> +            }

... that none of this is really wanted or necessary in that other case.
Assuming my understanding above is correct, the code change itself is
once again
Reviewed-by: Jan Beulich <jbeulich@suse.com>
yet here I'm uncertain whether to offer on-commit editing, as it's not
really clear to me whether there's a dependencies on patch 4.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-10 14:20 ` [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
@ 2024-06-11 13:18   ` Jan Beulich
  2024-06-12 10:39     ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11 13:18 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> Currently there's logic in fixup_irqs() that attempts to prevent
> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
> interrupts from the CPUs not present in the input mask.  The current logic in
> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
> 
> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
> to deal with interrupts that have either move_{in_progress,cleanup_count} set
> and no remaining online CPUs in ->arch.cpu_mask.
> 
> If _assign_irq_vector() is requested to move an interrupt in the state
> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
> CPUs that could be used as fallback, and if that's the case do move the
> interrupt back to the previous destination.  Note this is easier because the
> vector hasn't been released yet, so there's no need to allocate and setup a new
> vector on the destination.
> 
> Due to the logic in fixup_irqs() that clears offline CPUs from
> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
> shouldn't be possible to get into _assign_irq_vector() with
> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
> ->arch.old_cpu_mask.
> 
> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
> longer part of the affinity set,

I'm having trouble relating this (->arch.old_cpu_mask related) to ...

> move the interrupt to a different CPU part of
> the provided mask

... this (->arch.cpu_mask related).

> and keep the current ->arch.old_{cpu_mask,vector} for the
> pending interrupt movement to be completed.

Right, that's to clean up state from before the initial move. What isn't
clear to me is what's to happen with the state of the intermediate
placement. Description and code changes leave me with the impression that
it's okay to simply abandon, without any cleanup, yet I can't quite figure
why that would be an okay thing to do.

> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
>      }
>  
>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> -        return -EAGAIN;
> +    {
> +        /*
> +         * If the current destination is online refuse to shuffle.  Retry after
> +         * the in-progress movement has finished.
> +         */
> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
> +            return -EAGAIN;
> +
> +        /*
> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
> +         * ->arch.old_cpu_mask.
> +         */
> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
> +
> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
> +        {
> +            /*
> +             * Fallback to the old destination if moving is in progress and the
> +             * current destination is to be offlined.  This is only possible if
> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
> +             * in the 'mask' parameter.
> +             */
> +            desc->arch.vector = desc->arch.old_vector;
> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
> +
> +            /* Undo any possibly done cleanup. */
> +            for_each_cpu(cpu, desc->arch.cpu_mask)
> +                per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
> +
> +            /* Cancel the pending move. */
> +            desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
> +            cpumask_clear(desc->arch.old_cpu_mask);
> +            desc->arch.move_in_progress = 0;
> +            desc->arch.move_cleanup_count = 0;
> +
> +            return 0;
> +        }

In how far is this guaranteed to respect the (new) affinity that was set,
presumably having led to the movement in the first place?

> @@ -600,7 +646,17 @@ next:
>          current_vector = vector;
>          current_offset = offset;
>  
> -        if ( valid_irq_vector(old_vector) )
> +        if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> +        {
> +            ASSERT(!cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map));
> +            /*
> +             * Special case when evacuating an interrupt from a CPU to be
> +             * offlined and the interrupt was already in the process of being
> +             * moved.  Leave ->arch.old_{vector,cpu_mask} as-is and just
> +             * replace ->arch.{cpu_mask,vector} with the new destination.
> +             */

And where's the cleaning up of ->arch.old_* going to be taken care of then?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-10 14:20 ` [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
  2024-06-11 12:45   ` Jan Beulich
@ 2024-06-11 13:47   ` Jan Beulich
  2024-06-12  8:36     ` Roger Pau Monné
  1 sibling, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11 13:47 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> @@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>                                 affinity);
>          }
>  
> +        if ( desc->arch.move_in_progress &&
> +             !cpumask_test_cpu(cpu, &cpu_online_map) &&

Btw - any reason you're open-coding !cpu_online() here? I've noticed this
in the context of patch 7, where a little further down a !cpu_online() is
being added. Those likely all want to be consistent.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-06-10 14:20 ` [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
@ 2024-06-11 13:50   ` Jan Beulich
  2024-06-12 11:23     ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-11 13:50 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On 10.06.2024 16:20, Roger Pau Monne wrote:
> fixup_irqs() is used to evacuate interrupts from to be offlined CPUs.  Given
> the CPU is to become offline, the normal migration logic used by Xen where the
> vector in the previous target(s) is left configured until the interrupt is
> received on the new destination is not suitable.
> 
> Instead attempt to do as much as possible in order to prevent loosing
> interrupts.  If fixup_irqs() is called from the CPU to be offlined (as is
> currently the case) attempt to forward pending vectors when interrupts that
> target the current CPU are migrated to a different destination.
> 
> Additionally, for interrupts that have already been moved from the current CPU
> prior to the call to fixup_irqs() but that haven't been delivered to the new
> destination (iow: interrupts with move_in_progress set and the current CPU set
> in ->arch.old_cpu_mask) also check whether the previous vector is pending and
> forward it to the new destination.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Changes since v1:
>  - Rename to apic_irr_read().
> ---
>  xen/arch/x86/include/asm/apic.h |  5 +++++
>  xen/arch/x86/irq.c              | 37 ++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/apic.h
> index d1cb001fb4ab..7bd66dc6e151 100644
> --- a/xen/arch/x86/include/asm/apic.h
> +++ b/xen/arch/x86/include/asm/apic.h
> @@ -132,6 +132,11 @@ static inline bool apic_isr_read(uint8_t vector)
>              (vector & 0x1f)) & 1;
>  }
>  
> +static inline bool apic_irr_read(unsigned int vector)
> +{
> +    return apic_read(APIC_IRR + (vector / 32 * 0x10)) & (1U << (vector % 32));
> +}
> +
>  static inline u32 get_apic_id(void)
>  {
>      u32 id = apic_read(APIC_ID);
> diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> index 54eabd23995c..ed262fb55f4a 100644
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -2601,7 +2601,7 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>  
>      for ( irq = 0; irq < nr_irqs; irq++ )
>      {
> -        bool break_affinity = false, set_affinity = true;
> +        bool break_affinity = false, set_affinity = true, check_irr = false;
>          unsigned int vector, cpu = smp_processor_id();
>          cpumask_t *affinity = this_cpu(scratch_cpumask);
>  
> @@ -2649,6 +2649,25 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>               !cpumask_test_cpu(cpu, &cpu_online_map) &&
>               cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
>          {
> +            /*
> +             * This to be offlined CPU was the target of an interrupt that's
> +             * been moved, and the new destination target hasn't yet
> +             * acknowledged any interrupt from it.
> +             *
> +             * We know the interrupt is configured to target the new CPU at
> +             * this point, so we can check IRR for any pending vectors and
> +             * forward them to the new destination.
> +             *
> +             * Note the difference between move_in_progress or
> +             * move_cleanup_count being set.  For the later we know the new
> +             * destination has already acked at least one interrupt from this
> +             * source, and hence there's no need to forward any stale
> +             * interrupts.
> +             */

I'm a little confused by this last paragraph: It talks about a difference,
yet ...

> +            if ( apic_irr_read(desc->arch.old_vector) )
> +                send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
> +                              desc->arch.vector);

... in the code being commented there's no difference visible. Hmm, I guess
this is related to the enclosing if(). Maybe this could be worded a little
differently, e.g. starting with "Note that for the other case -
move_cleanup_count being non-zero - we know ..."?

> @@ -2689,11 +2708,27 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>          if ( desc->handler->disable )
>              desc->handler->disable(desc);
>  
> +        /*
> +         * If the current CPU is going offline and is (one of) the target(s) of
> +         * the interrupt signal to check whether there are any pending vectors
> +         * to be handled in the local APIC after the interrupt has been moved.
> +         */

After reading this a number of times, I think there wants to be a comma between
"interrupt" and "signal". Or am I getting wrong what is being meant?

> +        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
> +            check_irr = true;
> +
>          if ( desc->handler->set_affinity )
>              desc->handler->set_affinity(desc, affinity);
>          else if ( !(warned++) )
>              set_affinity = false;
>  
> +        if ( check_irr && apic_irr_read(vector) )
> +            /*
> +             * Forward pending interrupt to the new destination, this CPU is
> +             * going offline and otherwise the interrupt would be lost.
> +             */
> +            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
> +                          desc->arch.vector);
> +
>          if ( desc->handler->enable )
>              desc->handler->enable(desc);
>  

Down from here, after the loop, there's a 1ms window where latched but not
yet delivered interrupts can be received. How's that playing together with
the changes you're making? Aren't we then liable to get two interrupts, one
at the old and one at the new source, in unknown order?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
  2024-06-11  7:42   ` Jan Beulich
@ 2024-06-12  8:09     ` Roger Pau Monné
  2024-06-12  8:56       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12  8:09 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Tue, Jun 11, 2024 at 09:42:39AM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> > a cpu_hotplug_{begin,done}() region the function will still return success,
> > because a CPU taking the rwlock in read mode after having taken it in write
> > mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
> 
> I'm not happy to see you still have this claim here. The behavior may (appear
> to) defeat the purpose here, but as expressed previously I don't view that as
> being a general pattern.

Right.  What about replacing the paragraph with:

"Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
a cpu_hotplug_{begin,done}() region the function will still return success,
because a CPU taking the rwlock in read mode after having taken it in write
mode is allowed.  Such corner case makes using get_cpu_maps() alone
not enough to prevent using the shorthand in CPU hotplug regions."

I think the rest is of the commit message is not controversial.

> > as it should always return false when called with a CPU hot{,un}plug operation
> > is in progress.  Otherwise the logic in send_IPI_mask() is wrong, as it could
> > decide to use the shorthand even when a CPU operation is in progress.
> > 
> > Introduce a new helper to detect whether the current caller is between a
> > cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
> > shorthand usage.
> > 
> > Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible')
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Changes since v1:
> >  - Modify send_IPI_mask() to detect CPU hotplug context.
> > ---
> >  xen/arch/x86/smp.c       |  2 +-
> >  xen/common/cpu.c         |  5 +++++
> >  xen/include/xen/cpu.h    | 10 ++++++++++
> >  xen/include/xen/rwlock.h |  2 ++
> >  4 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
> > index 7443ad20335e..04c6a0572319 100644
> > --- a/xen/arch/x86/smp.c
> > +++ b/xen/arch/x86/smp.c
> > @@ -88,7 +88,7 @@ void send_IPI_mask(const cpumask_t *mask, int vector)
> >       * the system have been accounted for.
> >       */
> >      if ( system_state > SYS_STATE_smp_boot &&
> > -         !unaccounted_cpus && !disabled_cpus &&
> > +         !unaccounted_cpus && !disabled_cpus && !cpu_in_hotplug_context() &&
> >           /* NB: get_cpu_maps lock requires enabled interrupts. */
> >           local_irq_is_enabled() && (cpus_locked = get_cpu_maps()) &&
> >           (park_offline_cpus ||
> 
> Along with changing the condition you also want to update the comment to
> reflect the code adjustment.

I've assumed the wording in the commet that says: "no CPU hotplug or
unplug operations are taking place" would already cover the addition
of the !cpu_in_hotplug_context() check.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs()
  2024-06-11  9:59   ` Jan Beulich
@ 2024-06-12  8:10     ` Roger Pau Monné
  2024-06-12 10:13     ` Oleksii K.
  1 sibling, 0 replies; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12  8:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 11:59:39AM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > The current check used in fixup_irqs() to decide whether to move around
> > interrupts is based on the affinity mask, but such mask can have all bits set,
> > and hence is unlikely to be a subset of the input mask.  For example if an
> > interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
> > an all set CPU mask would cause that interrupt to be shuffled around
> > unconditionally.
> > 
> > What fixup_irqs() care about is evacuating interrupts from CPUs not set on the
> > input CPU mask, and for that purpose it should check whether the interrupt is
> > assigned to a CPU not present in the input mask.  Assume that ->arch.cpu_mask
> > is a subset of the ->affinity mask, and keep the current logic that resets the
> > ->affinity mask if the interrupt has to be shuffled around.
> > 
> > Doing the affinity movement based on ->arch.cpu_mask requires removing the
> > special handling to ->arch.cpu_mask done for high priority vectors, otherwise
> > the adjustment done to cpu_mask makes them always skip the CPU interrupt
> > movement.
> > 
> > While there also adjust the comment as to the purpose of fixup_irqs().
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> Aiui this is independent of patch 1, so could go in while we still settle on
> how to word things there?

I think so, the issue patch 1 fixes is independent from the rest of the
series.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity()
  2024-06-11 10:20   ` Jan Beulich
@ 2024-06-12  8:31     ` Roger Pau Monné
  0 siblings, 0 replies; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12  8:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 12:20:33PM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > If external interrupts are using logical mode it's possible to have an overlap
> > between the current ->arch.cpu_mask and the provided mask (or TARGET_CPUS).  If
> > that's the case avoid assigning a new vector and just move the interrupt to a
> > member of ->arch.cpu_mask that overlaps with the provided mask and is online.
> 
> What I'm kind of missing here is an explanation of why what _assign_irq_vector()
> does to avoid unnecessary migration (very first conditional there) isn't
> sufficient.

Somehow I looked at that and think it wasn't enough, but now I cannot
figure out why, so it might be just fine, and this patch is not
needed.  Let me test again and get back to you, for the time being
ignore this patch.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-11 13:47   ` Jan Beulich
@ 2024-06-12  8:36     ` Roger Pau Monné
  0 siblings, 0 replies; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12  8:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 03:47:03PM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > @@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >                                 affinity);
> >          }
> >  
> > +        if ( desc->arch.move_in_progress &&
> > +             !cpumask_test_cpu(cpu, &cpu_online_map) &&
> 
> Btw - any reason you're open-coding !cpu_online() here? I've noticed this
> in the context of patch 7, where a little further down a !cpu_online() is
> being added. Those likely all want to be consistent.

No reason really - just me not realizing we had that helper.  Can
adjust in next version.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-11 12:45   ` Jan Beulich
@ 2024-06-12  8:47     ` Roger Pau Monné
  2024-06-12  9:04       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12  8:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 02:45:09PM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > Given the current logic it's possible for ->arch.old_cpu_mask to get out of
> > sync: if a CPU set in old_cpu_mask is offlined and then onlined
> > again without old_cpu_mask having been updated the data in the mask will no
> > longer be accurate, as when brought back online the CPU will no longer have
> > old_vector configured to handle the old interrupt source.
> > 
> > If there's an interrupt movement in progress, and the to be offlined CPU (which
> > is the call context) is in the old_cpu_mask clear it and update the mask, so it
> > doesn't contain stale data.
> 
> This imo is too __cpu_disable()-centric. In the code you cover the
> smp_send_stop() case afaict, where it's all _other_ CPUs which are being
> offlined. As we're not meaning to bring CPUs online again in that case,
> dealing with the situation likely isn't needed. Yet the description should
> imo at least make clear that the case was considered.

What about adding the following paragraph:

Note that when the system is going down fixup_irqs() will be called by
smp_send_stop() from CPU 0 with a mask with only CPU 0 on it,
effectively asking to move all interrupts to the current caller (CPU
0) which is the only CPU online.  In that case we don't care to
migrate interrupts that are in the process of being moved, as it's
likely we won't be able to move all interrupts to CPU 0 due to vector
shortage anyway.

> 
> > @@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >                                 affinity);
> >          }
> >  
> > +        if ( desc->arch.move_in_progress &&
> > +             !cpumask_test_cpu(cpu, &cpu_online_map) &&
> 
> This part of the condition is, afaict, what covers (excludes) the
> smp_send_stop() case. Might be nice to have a brief comment here, thus
> also clarifying ...

Would you be fine with:

        if ( desc->arch.move_in_progress &&
             /*
	      * Only attempt to migrate if the current CPU is going
	      * offline, otherwise the whole system is going down and
	      * leaving stale interrupts is fine.
	      */
             !cpumask_test_cpu(cpu, &cpu_online_map) &&
             cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )


> > +             cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
> > +        {
> > +            /*
> > +             * This CPU is going offline, remove it from ->arch.old_cpu_mask
> > +             * and possibly release the old vector if the old mask becomes
> > +             * empty.
> > +             *
> > +             * Note cleaning ->arch.old_cpu_mask is required if the CPU is
> > +             * brought offline and then online again, as when re-onlined the
> > +             * per-cpu vector table will no longer have ->arch.old_vector
> > +             * setup, and hence ->arch.old_cpu_mask would be stale.
> > +             */
> > +            cpumask_clear_cpu(cpu, desc->arch.old_cpu_mask);
> > +            if ( cpumask_empty(desc->arch.old_cpu_mask) )
> > +            {
> > +                desc->arch.move_in_progress = 0;
> > +                release_old_vec(desc);
> > +            }
> 
> ... that none of this is really wanted or necessary in that other case.
> Assuming my understanding above is correct, the code change itself is
> once again

It is.  For the smp_send_stop() case we don't care much about leaving
stale data around, as the system is going down.  It's also likely
impossible to move all interrupts to CPU0 due to vector shortage, so
some interrupts will be left assigned to different CPUs.

> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> yet here I'm uncertain whether to offer on-commit editing, as it's not
> really clear to me whether there's a dependencies on patch 4.

No, in principle it should be fine to skip patch 4, but I would like
to do another round of testing before confirming.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
  2024-06-12  8:09     ` Roger Pau Monné
@ 2024-06-12  8:56       ` Jan Beulich
  2024-06-12 10:07         ` Oleksii K.
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-12  8:56 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On 12.06.2024 10:09, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 09:42:39AM +0200, Jan Beulich wrote:
>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>> Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
>>> a cpu_hotplug_{begin,done}() region the function will still return success,
>>> because a CPU taking the rwlock in read mode after having taken it in write
>>> mode is allowed.  Such behavior however defeats the purpose of get_cpu_maps(),
>>
>> I'm not happy to see you still have this claim here. The behavior may (appear
>> to) defeat the purpose here, but as expressed previously I don't view that as
>> being a general pattern.
> 
> Right.  What about replacing the paragraph with:
> 
> "Due to the current rwlock logic, if the CPU calling get_cpu_maps() does so from
> a cpu_hotplug_{begin,done}() region the function will still return success,
> because a CPU taking the rwlock in read mode after having taken it in write
> mode is allowed.  Such corner case makes using get_cpu_maps() alone
> not enough to prevent using the shorthand in CPU hotplug regions."

Thanks.

> I think the rest is of the commit message is not controversial.

Indeed.

>>> --- a/xen/arch/x86/smp.c
>>> +++ b/xen/arch/x86/smp.c
>>> @@ -88,7 +88,7 @@ void send_IPI_mask(const cpumask_t *mask, int vector)
>>>       * the system have been accounted for.
>>>       */
>>>      if ( system_state > SYS_STATE_smp_boot &&
>>> -         !unaccounted_cpus && !disabled_cpus &&
>>> +         !unaccounted_cpus && !disabled_cpus && !cpu_in_hotplug_context() &&
>>>           /* NB: get_cpu_maps lock requires enabled interrupts. */
>>>           local_irq_is_enabled() && (cpus_locked = get_cpu_maps()) &&
>>>           (park_offline_cpus ||
>>
>> Along with changing the condition you also want to update the comment to
>> reflect the code adjustment.
> 
> I've assumed the wording in the commet that says: "no CPU hotplug or
> unplug operations are taking place" would already cover the addition
> of the !cpu_in_hotplug_context() check.

Hmm, yes, you're right. Just needs a release-ack then to go in (with the
description adjustment).

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-12  8:47     ` Roger Pau Monné
@ 2024-06-12  9:04       ` Jan Beulich
  2024-06-12 10:41         ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-12  9:04 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 12.06.2024 10:47, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 02:45:09PM +0200, Jan Beulich wrote:
>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>> Given the current logic it's possible for ->arch.old_cpu_mask to get out of
>>> sync: if a CPU set in old_cpu_mask is offlined and then onlined
>>> again without old_cpu_mask having been updated the data in the mask will no
>>> longer be accurate, as when brought back online the CPU will no longer have
>>> old_vector configured to handle the old interrupt source.
>>>
>>> If there's an interrupt movement in progress, and the to be offlined CPU (which
>>> is the call context) is in the old_cpu_mask clear it and update the mask, so it
>>> doesn't contain stale data.
>>
>> This imo is too __cpu_disable()-centric. In the code you cover the
>> smp_send_stop() case afaict, where it's all _other_ CPUs which are being
>> offlined. As we're not meaning to bring CPUs online again in that case,
>> dealing with the situation likely isn't needed. Yet the description should
>> imo at least make clear that the case was considered.
> 
> What about adding the following paragraph:

Sounds good, just maybe one small adjustment:

> Note that when the system is going down fixup_irqs() will be called by
> smp_send_stop() from CPU 0 with a mask with only CPU 0 on it,
> effectively asking to move all interrupts to the current caller (CPU
> 0) which is the only CPU online.  In that case we don't care to

"... the only CPU to remain online."

> migrate interrupts that are in the process of being moved, as it's
> likely we won't be able to move all interrupts to CPU 0 due to vector
> shortage anyway.
> 
>>
>>> @@ -2589,6 +2589,28 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>>>                                 affinity);
>>>          }
>>>  
>>> +        if ( desc->arch.move_in_progress &&
>>> +             !cpumask_test_cpu(cpu, &cpu_online_map) &&
>>
>> This part of the condition is, afaict, what covers (excludes) the
>> smp_send_stop() case. Might be nice to have a brief comment here, thus
>> also clarifying ...
> 
> Would you be fine with:
> 
>         if ( desc->arch.move_in_progress &&
>              /*
> 	      * Only attempt to migrate if the current CPU is going
> 	      * offline, otherwise the whole system is going down and
> 	      * leaving stale interrupts is fine.
> 	      */
>              !cpumask_test_cpu(cpu, &cpu_online_map) &&
>              cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )

Sure, this is even more verbose (i.e. better) than I was after.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
  2024-06-12  8:56       ` Jan Beulich
@ 2024-06-12 10:07         ` Oleksii K.
  0 siblings, 0 replies; 37+ messages in thread
From: Oleksii K. @ 2024-06-12 10:07 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monné
  Cc: Andrew Cooper, George Dunlap, Julien Grall, Stefano Stabellini,
	xen-devel

On Wed, 2024-06-12 at 10:56 +0200, Jan Beulich wrote:
> On 12.06.2024 10:09, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 09:42:39AM +0200, Jan Beulich wrote:
> > > On 10.06.2024 16:20, Roger Pau Monne wrote:
> > > > Due to the current rwlock logic, if the CPU calling
> > > > get_cpu_maps() does so from
> > > > a cpu_hotplug_{begin,done}() region the function will still
> > > > return success,
> > > > because a CPU taking the rwlock in read mode after having taken
> > > > it in write
> > > > mode is allowed.  Such behavior however defeats the purpose of
> > > > get_cpu_maps(),
> > > 
> > > I'm not happy to see you still have this claim here. The behavior
> > > may (appear
> > > to) defeat the purpose here, but as expressed previously I don't
> > > view that as
> > > being a general pattern.
> > 
> > Right.  What about replacing the paragraph with:
> > 
> > "Due to the current rwlock logic, if the CPU calling get_cpu_maps()
> > does so from
> > a cpu_hotplug_{begin,done}() region the function will still return
> > success,
> > because a CPU taking the rwlock in read mode after having taken it
> > in write
> > mode is allowed.  Such corner case makes using get_cpu_maps() alone
> > not enough to prevent using the shorthand in CPU hotplug regions."
> 
> Thanks.
> 
> > I think the rest is of the commit message is not controversial.
> 
> Indeed.
> 
> > > > --- a/xen/arch/x86/smp.c
> > > > +++ b/xen/arch/x86/smp.c
> > > > @@ -88,7 +88,7 @@ void send_IPI_mask(const cpumask_t *mask, int
> > > > vector)
> > > >       * the system have been accounted for.
> > > >       */
> > > >      if ( system_state > SYS_STATE_smp_boot &&
> > > > -         !unaccounted_cpus && !disabled_cpus &&
> > > > +         !unaccounted_cpus && !disabled_cpus &&
> > > > !cpu_in_hotplug_context() &&
> > > >           /* NB: get_cpu_maps lock requires enabled interrupts.
> > > > */
> > > >           local_irq_is_enabled() && (cpus_locked =
> > > > get_cpu_maps()) &&
> > > >           (park_offline_cpus ||
> > > 
> > > Along with changing the condition you also want to update the
> > > comment to
> > > reflect the code adjustment.
> > 
> > I've assumed the wording in the commet that says: "no CPU hotplug
> > or
> > unplug operations are taking place" would already cover the
> > addition
> > of the !cpu_in_hotplug_context() check.
> 
> Hmm, yes, you're right. Just needs a release-ack then to go in (with
> the
> description adjustment).
Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~ Oleksii



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works
  2024-06-11  7:44   ` Jan Beulich
@ 2024-06-12 10:08     ` Oleksii K.
  0 siblings, 0 replies; 37+ messages in thread
From: Oleksii K. @ 2024-06-12 10:08 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On Tue, 2024-06-11 at 09:44 +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > The logic to move interrupts across CPUs is complex, attempt to
> > provide a
> > comment that describes the expected behavior so users of the
> > interrupt system
> > have more context about the usage of the arch_irq_desc structure
> > fields.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~ Oleksii


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs()
  2024-06-11  9:59   ` Jan Beulich
  2024-06-12  8:10     ` Roger Pau Monné
@ 2024-06-12 10:13     ` Oleksii K.
  1 sibling, 0 replies; 37+ messages in thread
From: Oleksii K. @ 2024-06-12 10:13 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: Andrew Cooper, xen-devel

On Tue, 2024-06-11 at 11:59 +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > The current check used in fixup_irqs() to decide whether to move
> > around
> > interrupts is based on the affinity mask, but such mask can have
> > all bits set,
> > and hence is unlikely to be a subset of the input mask.  For
> > example if an
> > interrupt has an affinity mask of all 1s, any input to fixup_irqs()
> > that's not
> > an all set CPU mask would cause that interrupt to be shuffled
> > around
> > unconditionally.
> > 
> > What fixup_irqs() care about is evacuating interrupts from CPUs not
> > set on the
> > input CPU mask, and for that purpose it should check whether the
> > interrupt is
> > assigned to a CPU not present in the input mask.  Assume that -
> > >arch.cpu_mask
> > is a subset of the ->affinity mask, and keep the current logic that
> > resets the
> > ->affinity mask if the interrupt has to be shuffled around.
> > 
> > Doing the affinity movement based on ->arch.cpu_mask requires
> > removing the
> > special handling to ->arch.cpu_mask done for high priority vectors,
> > otherwise
> > the adjustment done to cpu_mask makes them always skip the CPU
> > interrupt
> > movement.
> > 
> > While there also adjust the comment as to the purpose of
> > fixup_irqs().
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>

~ Oleksii
> 
> Aiui this is independent of patch 1, so could go in while we still
> settle on
> how to word things there?
> 
> Jan
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-11 13:18   ` Jan Beulich
@ 2024-06-12 10:39     ` Roger Pau Monné
  2024-06-12 13:42       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12 10:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > Currently there's logic in fixup_irqs() that attempts to prevent
> > _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
> > interrupts from the CPUs not present in the input mask.  The current logic in
> > fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
> > move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
> > 
> > Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
> > _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
> > to deal with interrupts that have either move_{in_progress,cleanup_count} set
> > and no remaining online CPUs in ->arch.cpu_mask.
> > 
> > If _assign_irq_vector() is requested to move an interrupt in the state
> > described above, first attempt to see if ->arch.old_cpu_mask contains any valid
> > CPUs that could be used as fallback, and if that's the case do move the
> > interrupt back to the previous destination.  Note this is easier because the
> > vector hasn't been released yet, so there's no need to allocate and setup a new
> > vector on the destination.
> > 
> > Due to the logic in fixup_irqs() that clears offline CPUs from
> > ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
> > shouldn't be possible to get into _assign_irq_vector() with
> > ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
> > ->arch.old_cpu_mask.
> > 
> > However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
> > also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
> > longer part of the affinity set,
> 
> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
> 
> > move the interrupt to a different CPU part of
> > the provided mask
> 
> ... this (->arch.cpu_mask related).

No, the "provided mask" here is the "mask" parameter, not
->arch.cpu_mask.

> 
> > and keep the current ->arch.old_{cpu_mask,vector} for the
> > pending interrupt movement to be completed.
> 
> Right, that's to clean up state from before the initial move. What isn't
> clear to me is what's to happen with the state of the intermediate
> placement. Description and code changes leave me with the impression that
> it's okay to simply abandon, without any cleanup, yet I can't quite figure
> why that would be an okay thing to do.

There isn't much we can do with the intermediate placement, as the CPU
is going offline.  However we can drain any pending interrupts from
IRR after the new destination has been set, since setting the
destination is done from the CPU that's the current target of the
interrupts.  So we can ensure the draining is done strictly after the
target has been switched, hence ensuring no further interrupts from
this source will be delivered to the current CPU.

> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
> >      }
> >  
> >      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> > -        return -EAGAIN;
> > +    {
> > +        /*
> > +         * If the current destination is online refuse to shuffle.  Retry after
> > +         * the in-progress movement has finished.
> > +         */
> > +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
> > +            return -EAGAIN;
> > +
> > +        /*
> > +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
> > +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
> > +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
> > +         * ->arch.old_cpu_mask.
> > +         */
> > +        ASSERT(valid_irq_vector(desc->arch.old_vector));
> > +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
> > +
> > +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
> > +        {
> > +            /*
> > +             * Fallback to the old destination if moving is in progress and the
> > +             * current destination is to be offlined.  This is only possible if
> > +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
> > +             * in the 'mask' parameter.
> > +             */
> > +            desc->arch.vector = desc->arch.old_vector;
> > +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
> > +
> > +            /* Undo any possibly done cleanup. */
> > +            for_each_cpu(cpu, desc->arch.cpu_mask)
> > +                per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
> > +
> > +            /* Cancel the pending move. */
> > +            desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
> > +            cpumask_clear(desc->arch.old_cpu_mask);
> > +            desc->arch.move_in_progress = 0;
> > +            desc->arch.move_cleanup_count = 0;
> > +
> > +            return 0;
> > +        }
> 
> In how far is this guaranteed to respect the (new) affinity that was set,
> presumably having led to the movement in the first place?

The 'mask' parameter should account for the new affinity, hence the
cpumask_intersects() check guarantees we are moving to a CPU still in
the affinity mask.

> > @@ -600,7 +646,17 @@ next:
> >          current_vector = vector;
> >          current_offset = offset;
> >  
> > -        if ( valid_irq_vector(old_vector) )
> > +        if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> > +        {
> > +            ASSERT(!cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map));
> > +            /*
> > +             * Special case when evacuating an interrupt from a CPU to be
> > +             * offlined and the interrupt was already in the process of being
> > +             * moved.  Leave ->arch.old_{vector,cpu_mask} as-is and just
> > +             * replace ->arch.{cpu_mask,vector} with the new destination.
> > +             */
> 
> And where's the cleaning up of ->arch.old_* going to be taken care of then?

Such cleaning will be handled normally by the interrupt still having
->arch.move_{in_progress,cleanup_count} set.  The CPUs in
->arch.old_cpu_mask must not all be offline, otherwise the logic in
fixup_irqs() would have already released the old vector.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
  2024-06-12  9:04       ` Jan Beulich
@ 2024-06-12 10:41         ` Roger Pau Monné
  0 siblings, 0 replies; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12 10:41 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, Jun 12, 2024 at 11:04:26AM +0200, Jan Beulich wrote:
> On 12.06.2024 10:47, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 02:45:09PM +0200, Jan Beulich wrote:
> >> On 10.06.2024 16:20, Roger Pau Monne wrote:
> >>> Given the current logic it's possible for ->arch.old_cpu_mask to get out of
> >>> sync: if a CPU set in old_cpu_mask is offlined and then onlined
> >>> again without old_cpu_mask having been updated the data in the mask will no
> >>> longer be accurate, as when brought back online the CPU will no longer have
> >>> old_vector configured to handle the old interrupt source.
> >>>
> >>> If there's an interrupt movement in progress, and the to be offlined CPU (which
> >>> is the call context) is in the old_cpu_mask clear it and update the mask, so it
> >>> doesn't contain stale data.
> >>
> >> This imo is too __cpu_disable()-centric. In the code you cover the
> >> smp_send_stop() case afaict, where it's all _other_ CPUs which are being
> >> offlined. As we're not meaning to bring CPUs online again in that case,
> >> dealing with the situation likely isn't needed. Yet the description should
> >> imo at least make clear that the case was considered.
> > 
> > What about adding the following paragraph:
> 
> Sounds good, just maybe one small adjustment:
> 
> > Note that when the system is going down fixup_irqs() will be called by
> > smp_send_stop() from CPU 0 with a mask with only CPU 0 on it,
> > effectively asking to move all interrupts to the current caller (CPU
> > 0) which is the only CPU online.  In that case we don't care to
> 
> "... the only CPU to remain online."

Right, that's better.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-06-11 13:50   ` Jan Beulich
@ 2024-06-12 11:23     ` Roger Pau Monné
  2024-06-12 13:47       ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12 11:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jun 11, 2024 at 03:50:42PM +0200, Jan Beulich wrote:
> On 10.06.2024 16:20, Roger Pau Monne wrote:
> > fixup_irqs() is used to evacuate interrupts from to be offlined CPUs.  Given
> > the CPU is to become offline, the normal migration logic used by Xen where the
> > vector in the previous target(s) is left configured until the interrupt is
> > received on the new destination is not suitable.
> > 
> > Instead attempt to do as much as possible in order to prevent loosing
> > interrupts.  If fixup_irqs() is called from the CPU to be offlined (as is
> > currently the case) attempt to forward pending vectors when interrupts that
> > target the current CPU are migrated to a different destination.
> > 
> > Additionally, for interrupts that have already been moved from the current CPU
> > prior to the call to fixup_irqs() but that haven't been delivered to the new
> > destination (iow: interrupts with move_in_progress set and the current CPU set
> > in ->arch.old_cpu_mask) also check whether the previous vector is pending and
> > forward it to the new destination.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Changes since v1:
> >  - Rename to apic_irr_read().
> > ---
> >  xen/arch/x86/include/asm/apic.h |  5 +++++
> >  xen/arch/x86/irq.c              | 37 ++++++++++++++++++++++++++++++++-
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/apic.h
> > index d1cb001fb4ab..7bd66dc6e151 100644
> > --- a/xen/arch/x86/include/asm/apic.h
> > +++ b/xen/arch/x86/include/asm/apic.h
> > @@ -132,6 +132,11 @@ static inline bool apic_isr_read(uint8_t vector)
> >              (vector & 0x1f)) & 1;
> >  }
> >  
> > +static inline bool apic_irr_read(unsigned int vector)
> > +{
> > +    return apic_read(APIC_IRR + (vector / 32 * 0x10)) & (1U << (vector % 32));
> > +}
> > +
> >  static inline u32 get_apic_id(void)
> >  {
> >      u32 id = apic_read(APIC_ID);
> > diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
> > index 54eabd23995c..ed262fb55f4a 100644
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -2601,7 +2601,7 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >  
> >      for ( irq = 0; irq < nr_irqs; irq++ )
> >      {
> > -        bool break_affinity = false, set_affinity = true;
> > +        bool break_affinity = false, set_affinity = true, check_irr = false;
> >          unsigned int vector, cpu = smp_processor_id();
> >          cpumask_t *affinity = this_cpu(scratch_cpumask);
> >  
> > @@ -2649,6 +2649,25 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >               !cpumask_test_cpu(cpu, &cpu_online_map) &&
> >               cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
> >          {
> > +            /*
> > +             * This to be offlined CPU was the target of an interrupt that's
> > +             * been moved, and the new destination target hasn't yet
> > +             * acknowledged any interrupt from it.
> > +             *
> > +             * We know the interrupt is configured to target the new CPU at
> > +             * this point, so we can check IRR for any pending vectors and
> > +             * forward them to the new destination.
> > +             *
> > +             * Note the difference between move_in_progress or
> > +             * move_cleanup_count being set.  For the later we know the new
> > +             * destination has already acked at least one interrupt from this
> > +             * source, and hence there's no need to forward any stale
> > +             * interrupts.
> > +             */
> 
> I'm a little confused by this last paragraph: It talks about a difference,
> yet ...
> 
> > +            if ( apic_irr_read(desc->arch.old_vector) )
> > +                send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
> > +                              desc->arch.vector);
> 
> ... in the code being commented there's no difference visible. Hmm, I guess
> this is related to the enclosing if(). Maybe this could be worded a little
> differently, e.g. starting with "Note that for the other case -
> move_cleanup_count being non-zero - we know ..."?

Hm, I see.  Yes, the difference is that for interrupts that have
move_cleanup_count set we don't forward pending interrupts in IRR on
this CPU.  I put this here because I think it's more naturally
arranged with the rest of the comment.  I can pull the whole comment
ahead if the if() if that's better.

> > @@ -2689,11 +2708,27 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
> >          if ( desc->handler->disable )
> >              desc->handler->disable(desc);
> >  
> > +        /*
> > +         * If the current CPU is going offline and is (one of) the target(s) of
> > +         * the interrupt signal to check whether there are any pending vectors
> > +         * to be handled in the local APIC after the interrupt has been moved.
> > +         */
> 
> After reading this a number of times, I think there wants to be a comma between
> "interrupt" and "signal". Or am I getting wrong what is being meant?

Indeed.

> > +        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
> > +            check_irr = true;
> > +
> >          if ( desc->handler->set_affinity )
> >              desc->handler->set_affinity(desc, affinity);
> >          else if ( !(warned++) )
> >              set_affinity = false;
> >  
> > +        if ( check_irr && apic_irr_read(vector) )
> > +            /*
> > +             * Forward pending interrupt to the new destination, this CPU is
> > +             * going offline and otherwise the interrupt would be lost.
> > +             */
> > +            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
> > +                          desc->arch.vector);
> > +
> >          if ( desc->handler->enable )
> >              desc->handler->enable(desc);
> >  
> 
> Down from here, after the loop, there's a 1ms window where latched but not
> yet delivered interrupts can be received. How's that playing together with
> the changes you're making? Aren't we then liable to get two interrupts, one
> at the old and one at the new source, in unknown order?

I was mistakenly thinking that clear_local_APIC() would block
interrupt delivery, but that's not the case, so yes, interrupts should
still be delivered in the window below.

Let me test without this last patch.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-12 10:39     ` Roger Pau Monné
@ 2024-06-12 13:42       ` Jan Beulich
  2024-06-12 15:36         ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-12 13:42 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 12.06.2024 12:39, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>> Currently there's logic in fixup_irqs() that attempts to prevent
>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
>>> interrupts from the CPUs not present in the input mask.  The current logic in
>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
>>>
>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
>>> and no remaining online CPUs in ->arch.cpu_mask.
>>>
>>> If _assign_irq_vector() is requested to move an interrupt in the state
>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
>>> CPUs that could be used as fallback, and if that's the case do move the
>>> interrupt back to the previous destination.  Note this is easier because the
>>> vector hasn't been released yet, so there's no need to allocate and setup a new
>>> vector on the destination.
>>>
>>> Due to the logic in fixup_irqs() that clears offline CPUs from
>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
>>> shouldn't be possible to get into _assign_irq_vector() with
>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
>>> ->arch.old_cpu_mask.
>>>
>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
>>> longer part of the affinity set,
>>
>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
>>
>>> move the interrupt to a different CPU part of
>>> the provided mask
>>
>> ... this (->arch.cpu_mask related).
> 
> No, the "provided mask" here is the "mask" parameter, not
> ->arch.cpu_mask.

Oh, so this describes the case of "hitting" the comment at the very bottom of
the first hunk then? (I probably was misreading this because I was expecting
it to describe a code change, rather than the case where original behavior
needs retaining. IOW - all fine here then.)

>>> and keep the current ->arch.old_{cpu_mask,vector} for the
>>> pending interrupt movement to be completed.
>>
>> Right, that's to clean up state from before the initial move. What isn't
>> clear to me is what's to happen with the state of the intermediate
>> placement. Description and code changes leave me with the impression that
>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
>> why that would be an okay thing to do.
> 
> There isn't much we can do with the intermediate placement, as the CPU
> is going offline.  However we can drain any pending interrupts from
> IRR after the new destination has been set, since setting the
> destination is done from the CPU that's the current target of the
> interrupts.  So we can ensure the draining is done strictly after the
> target has been switched, hence ensuring no further interrupts from
> this source will be delivered to the current CPU.

Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
the ...

>>> --- a/xen/arch/x86/irq.c
>>> +++ b/xen/arch/x86/irq.c
>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
>>>      }
>>>  
>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
>>> -        return -EAGAIN;
>>> +    {
>>> +        /*
>>> +         * If the current destination is online refuse to shuffle.  Retry after
>>> +         * the in-progress movement has finished.
>>> +         */
>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
>>> +            return -EAGAIN;
>>> +
>>> +        /*
>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
>>> +         * ->arch.old_cpu_mask.
>>> +         */
>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
>>> +
>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
>>> +        {
>>> +            /*
>>> +             * Fallback to the old destination if moving is in progress and the
>>> +             * current destination is to be offlined.  This is only possible if
>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
>>> +             * in the 'mask' parameter.
>>> +             */
>>> +            desc->arch.vector = desc->arch.old_vector;
>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);

... replacing of vector (and associated mask), without any further accounting.

>>> +            /* Undo any possibly done cleanup. */
>>> +            for_each_cpu(cpu, desc->arch.cpu_mask)
>>> +                per_cpu(vector_irq, cpu)[desc->arch.vector] = irq;
>>> +
>>> +            /* Cancel the pending move. */
>>> +            desc->arch.old_vector = IRQ_VECTOR_UNASSIGNED;
>>> +            cpumask_clear(desc->arch.old_cpu_mask);
>>> +            desc->arch.move_in_progress = 0;
>>> +            desc->arch.move_cleanup_count = 0;
>>> +
>>> +            return 0;
>>> +        }
>>
>> In how far is this guaranteed to respect the (new) affinity that was set,
>> presumably having led to the movement in the first place?
> 
> The 'mask' parameter should account for the new affinity, hence the
> cpumask_intersects() check guarantees we are moving to a CPU still in
> the affinity mask.

Ah, right, I must have been confused.

>>> @@ -600,7 +646,17 @@ next:
>>>          current_vector = vector;
>>>          current_offset = offset;
>>>  
>>> -        if ( valid_irq_vector(old_vector) )
>>> +        if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
>>> +        {
>>> +            ASSERT(!cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map));
>>> +            /*
>>> +             * Special case when evacuating an interrupt from a CPU to be
>>> +             * offlined and the interrupt was already in the process of being
>>> +             * moved.  Leave ->arch.old_{vector,cpu_mask} as-is and just
>>> +             * replace ->arch.{cpu_mask,vector} with the new destination.
>>> +             */
>>
>> And where's the cleaning up of ->arch.old_* going to be taken care of then?
> 
> Such cleaning will be handled normally by the interrupt still having
> ->arch.move_{in_progress,cleanup_count} set.  The CPUs in
> ->arch.old_cpu_mask must not all be offline, otherwise the logic in
> fixup_irqs() would have already released the old vector.

Maybe add "Cleanup will be done normally" to the comment?

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs()
  2024-06-12 11:23     ` Roger Pau Monné
@ 2024-06-12 13:47       ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2024-06-12 13:47 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 12.06.2024 13:23, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 03:50:42PM +0200, Jan Beulich wrote:
>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>> @@ -2649,6 +2649,25 @@ void fixup_irqs(const cpumask_t *mask, bool verbose)
>>>               !cpumask_test_cpu(cpu, &cpu_online_map) &&
>>>               cpumask_test_cpu(cpu, desc->arch.old_cpu_mask) )
>>>          {
>>> +            /*
>>> +             * This to be offlined CPU was the target of an interrupt that's
>>> +             * been moved, and the new destination target hasn't yet
>>> +             * acknowledged any interrupt from it.
>>> +             *
>>> +             * We know the interrupt is configured to target the new CPU at
>>> +             * this point, so we can check IRR for any pending vectors and
>>> +             * forward them to the new destination.
>>> +             *
>>> +             * Note the difference between move_in_progress or
>>> +             * move_cleanup_count being set.  For the later we know the new
>>> +             * destination has already acked at least one interrupt from this
>>> +             * source, and hence there's no need to forward any stale
>>> +             * interrupts.
>>> +             */
>>
>> I'm a little confused by this last paragraph: It talks about a difference,
>> yet ...
>>
>>> +            if ( apic_irr_read(desc->arch.old_vector) )
>>> +                send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
>>> +                              desc->arch.vector);
>>
>> ... in the code being commented there's no difference visible. Hmm, I guess
>> this is related to the enclosing if(). Maybe this could be worded a little
>> differently, e.g. starting with "Note that for the other case -
>> move_cleanup_count being non-zero - we know ..."?
> 
> Hm, I see.  Yes, the difference is that for interrupts that have
> move_cleanup_count set we don't forward pending interrupts in IRR on
> this CPU.  I put this here because I think it's more naturally
> arranged with the rest of the comment.  I can pull the whole comment
> ahead if the if() if that's better.

I actually agree with you that the placement right now is "more natural".
I'm really just after making more clear what difference it is that is
being talked about. Assuming of course ...

>>> +        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, desc->arch.cpu_mask) )
>>> +            check_irr = true;
>>> +
>>>          if ( desc->handler->set_affinity )
>>>              desc->handler->set_affinity(desc, affinity);
>>>          else if ( !(warned++) )
>>>              set_affinity = false;
>>>  
>>> +        if ( check_irr && apic_irr_read(vector) )
>>> +            /*
>>> +             * Forward pending interrupt to the new destination, this CPU is
>>> +             * going offline and otherwise the interrupt would be lost.
>>> +             */
>>> +            send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
>>> +                          desc->arch.vector);
>>> +
>>>          if ( desc->handler->enable )
>>>              desc->handler->enable(desc);
>>>  
>>
>> Down from here, after the loop, there's a 1ms window where latched but not
>> yet delivered interrupts can be received. How's that playing together with
>> the changes you're making? Aren't we then liable to get two interrupts, one
>> at the old and one at the new source, in unknown order?
> 
> I was mistakenly thinking that clear_local_APIC() would block
> interrupt delivery, but that's not the case, so yes, interrupts should
> still be delivered in the window below.
> 
> Let me test without this last patch.

... the patch wants / needs retaining in the first place.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-12 13:42       ` Jan Beulich
@ 2024-06-12 15:36         ` Roger Pau Monné
  2024-06-13  8:38           ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-12 15:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
> On 12.06.2024 12:39, Roger Pau Monné wrote:
> > On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
> >> On 10.06.2024 16:20, Roger Pau Monne wrote:
> >>> Currently there's logic in fixup_irqs() that attempts to prevent
> >>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
> >>> interrupts from the CPUs not present in the input mask.  The current logic in
> >>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
> >>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
> >>>
> >>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
> >>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
> >>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
> >>> and no remaining online CPUs in ->arch.cpu_mask.
> >>>
> >>> If _assign_irq_vector() is requested to move an interrupt in the state
> >>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
> >>> CPUs that could be used as fallback, and if that's the case do move the
> >>> interrupt back to the previous destination.  Note this is easier because the
> >>> vector hasn't been released yet, so there's no need to allocate and setup a new
> >>> vector on the destination.
> >>>
> >>> Due to the logic in fixup_irqs() that clears offline CPUs from
> >>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
> >>> shouldn't be possible to get into _assign_irq_vector() with
> >>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
> >>> ->arch.old_cpu_mask.
> >>>
> >>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
> >>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
> >>> longer part of the affinity set,
> >>
> >> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
> >>
> >>> move the interrupt to a different CPU part of
> >>> the provided mask
> >>
> >> ... this (->arch.cpu_mask related).
> > 
> > No, the "provided mask" here is the "mask" parameter, not
> > ->arch.cpu_mask.
> 
> Oh, so this describes the case of "hitting" the comment at the very bottom of
> the first hunk then? (I probably was misreading this because I was expecting
> it to describe a code change, rather than the case where original behavior
> needs retaining. IOW - all fine here then.)
> 
> >>> and keep the current ->arch.old_{cpu_mask,vector} for the
> >>> pending interrupt movement to be completed.
> >>
> >> Right, that's to clean up state from before the initial move. What isn't
> >> clear to me is what's to happen with the state of the intermediate
> >> placement. Description and code changes leave me with the impression that
> >> it's okay to simply abandon, without any cleanup, yet I can't quite figure
> >> why that would be an okay thing to do.
> > 
> > There isn't much we can do with the intermediate placement, as the CPU
> > is going offline.  However we can drain any pending interrupts from
> > IRR after the new destination has been set, since setting the
> > destination is done from the CPU that's the current target of the
> > interrupts.  So we can ensure the draining is done strictly after the
> > target has been switched, hence ensuring no further interrupts from
> > this source will be delivered to the current CPU.
> 
> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
> the ...
> 
> >>> --- a/xen/arch/x86/irq.c
> >>> +++ b/xen/arch/x86/irq.c
> >>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
> >>>      }
> >>>  
> >>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> >>> -        return -EAGAIN;
> >>> +    {
> >>> +        /*
> >>> +         * If the current destination is online refuse to shuffle.  Retry after
> >>> +         * the in-progress movement has finished.
> >>> +         */
> >>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
> >>> +            return -EAGAIN;
> >>> +
> >>> +        /*
> >>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
> >>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
> >>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
> >>> +         * ->arch.old_cpu_mask.
> >>> +         */
> >>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
> >>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
> >>> +
> >>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
> >>> +        {
> >>> +            /*
> >>> +             * Fallback to the old destination if moving is in progress and the
> >>> +             * current destination is to be offlined.  This is only possible if
> >>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
> >>> +             * in the 'mask' parameter.
> >>> +             */
> >>> +            desc->arch.vector = desc->arch.old_vector;
> >>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
> 
> ... replacing of vector (and associated mask), without any further accounting.

It's quite likely I'm missing something here, but what further
accounting you would like to do?

The current target of the interrupt (->arch.cpu_mask previous to
cpumask_and()) is all going offline, so any attempt to set it in
->arch.old_cpu_mask would just result in a stale (offline) CPU getting
set in ->arch.old_cpu_mask, which previous patches attempted to
solve.

Maybe by "further accounting" you meant something else not related to
->arch.old_{cpu_mask,vector}?

> >>> @@ -600,7 +646,17 @@ next:
> >>>          current_vector = vector;
> >>>          current_offset = offset;
> >>>  
> >>> -        if ( valid_irq_vector(old_vector) )
> >>> +        if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> >>> +        {
> >>> +            ASSERT(!cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map));
> >>> +            /*
> >>> +             * Special case when evacuating an interrupt from a CPU to be
> >>> +             * offlined and the interrupt was already in the process of being
> >>> +             * moved.  Leave ->arch.old_{vector,cpu_mask} as-is and just
> >>> +             * replace ->arch.{cpu_mask,vector} with the new destination.
> >>> +             */
> >>
> >> And where's the cleaning up of ->arch.old_* going to be taken care of then?
> > 
> > Such cleaning will be handled normally by the interrupt still having
> > ->arch.move_{in_progress,cleanup_count} set.  The CPUs in
> > ->arch.old_cpu_mask must not all be offline, otherwise the logic in
> > fixup_irqs() would have already released the old vector.
> 
> Maybe add "Cleanup will be done normally" to the comment?


Can do.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-12 15:36         ` Roger Pau Monné
@ 2024-06-13  8:38           ` Jan Beulich
  2024-06-13 11:31             ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-13  8:38 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 12.06.2024 17:36, Roger Pau Monné wrote:
> On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
>> On 12.06.2024 12:39, Roger Pau Monné wrote:
>>> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
>>>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>>>> Currently there's logic in fixup_irqs() that attempts to prevent
>>>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
>>>>> interrupts from the CPUs not present in the input mask.  The current logic in
>>>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
>>>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
>>>>>
>>>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
>>>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
>>>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
>>>>> and no remaining online CPUs in ->arch.cpu_mask.
>>>>>
>>>>> If _assign_irq_vector() is requested to move an interrupt in the state
>>>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
>>>>> CPUs that could be used as fallback, and if that's the case do move the
>>>>> interrupt back to the previous destination.  Note this is easier because the
>>>>> vector hasn't been released yet, so there's no need to allocate and setup a new
>>>>> vector on the destination.
>>>>>
>>>>> Due to the logic in fixup_irqs() that clears offline CPUs from
>>>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
>>>>> shouldn't be possible to get into _assign_irq_vector() with
>>>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
>>>>> ->arch.old_cpu_mask.
>>>>>
>>>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
>>>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
>>>>> longer part of the affinity set,
>>>>
>>>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
>>>>
>>>>> move the interrupt to a different CPU part of
>>>>> the provided mask
>>>>
>>>> ... this (->arch.cpu_mask related).
>>>
>>> No, the "provided mask" here is the "mask" parameter, not
>>> ->arch.cpu_mask.
>>
>> Oh, so this describes the case of "hitting" the comment at the very bottom of
>> the first hunk then? (I probably was misreading this because I was expecting
>> it to describe a code change, rather than the case where original behavior
>> needs retaining. IOW - all fine here then.)
>>
>>>>> and keep the current ->arch.old_{cpu_mask,vector} for the
>>>>> pending interrupt movement to be completed.
>>>>
>>>> Right, that's to clean up state from before the initial move. What isn't
>>>> clear to me is what's to happen with the state of the intermediate
>>>> placement. Description and code changes leave me with the impression that
>>>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
>>>> why that would be an okay thing to do.
>>>
>>> There isn't much we can do with the intermediate placement, as the CPU
>>> is going offline.  However we can drain any pending interrupts from
>>> IRR after the new destination has been set, since setting the
>>> destination is done from the CPU that's the current target of the
>>> interrupts.  So we can ensure the draining is done strictly after the
>>> target has been switched, hence ensuring no further interrupts from
>>> this source will be delivered to the current CPU.
>>
>> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
>> the ...
>>
>>>>> --- a/xen/arch/x86/irq.c
>>>>> +++ b/xen/arch/x86/irq.c
>>>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
>>>>>      }
>>>>>  
>>>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
>>>>> -        return -EAGAIN;
>>>>> +    {
>>>>> +        /*
>>>>> +         * If the current destination is online refuse to shuffle.  Retry after
>>>>> +         * the in-progress movement has finished.
>>>>> +         */
>>>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
>>>>> +            return -EAGAIN;
>>>>> +
>>>>> +        /*
>>>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
>>>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
>>>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
>>>>> +         * ->arch.old_cpu_mask.
>>>>> +         */
>>>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
>>>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
>>>>> +
>>>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
>>>>> +        {
>>>>> +            /*
>>>>> +             * Fallback to the old destination if moving is in progress and the
>>>>> +             * current destination is to be offlined.  This is only possible if
>>>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
>>>>> +             * in the 'mask' parameter.
>>>>> +             */
>>>>> +            desc->arch.vector = desc->arch.old_vector;
>>>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
>>
>> ... replacing of vector (and associated mask), without any further accounting.
> 
> It's quite likely I'm missing something here, but what further
> accounting you would like to do?
> 
> The current target of the interrupt (->arch.cpu_mask previous to
> cpumask_and()) is all going offline, so any attempt to set it in
> ->arch.old_cpu_mask would just result in a stale (offline) CPU getting
> set in ->arch.old_cpu_mask, which previous patches attempted to
> solve.
> 
> Maybe by "further accounting" you meant something else not related to
> ->arch.old_{cpu_mask,vector}?

Indeed. What I'm thinking of is what normally release_old_vec() would
do (of which only desc->arch.used_vectors updating would appear to be
relevant, seeing the CPU's going offline). The other one I was thinking
of, updating vector_irq[], likely is also unnecessary, again because
that's per-CPU data of a CPU going down.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-13  8:38           ` Jan Beulich
@ 2024-06-13 11:31             ` Roger Pau Monné
  2024-06-13 11:36               ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-13 11:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Thu, Jun 13, 2024 at 10:38:35AM +0200, Jan Beulich wrote:
> On 12.06.2024 17:36, Roger Pau Monné wrote:
> > On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
> >> On 12.06.2024 12:39, Roger Pau Monné wrote:
> >>> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
> >>>> On 10.06.2024 16:20, Roger Pau Monne wrote:
> >>>>> Currently there's logic in fixup_irqs() that attempts to prevent
> >>>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
> >>>>> interrupts from the CPUs not present in the input mask.  The current logic in
> >>>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
> >>>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
> >>>>>
> >>>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
> >>>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
> >>>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
> >>>>> and no remaining online CPUs in ->arch.cpu_mask.
> >>>>>
> >>>>> If _assign_irq_vector() is requested to move an interrupt in the state
> >>>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
> >>>>> CPUs that could be used as fallback, and if that's the case do move the
> >>>>> interrupt back to the previous destination.  Note this is easier because the
> >>>>> vector hasn't been released yet, so there's no need to allocate and setup a new
> >>>>> vector on the destination.
> >>>>>
> >>>>> Due to the logic in fixup_irqs() that clears offline CPUs from
> >>>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
> >>>>> shouldn't be possible to get into _assign_irq_vector() with
> >>>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
> >>>>> ->arch.old_cpu_mask.
> >>>>>
> >>>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
> >>>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
> >>>>> longer part of the affinity set,
> >>>>
> >>>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
> >>>>
> >>>>> move the interrupt to a different CPU part of
> >>>>> the provided mask
> >>>>
> >>>> ... this (->arch.cpu_mask related).
> >>>
> >>> No, the "provided mask" here is the "mask" parameter, not
> >>> ->arch.cpu_mask.
> >>
> >> Oh, so this describes the case of "hitting" the comment at the very bottom of
> >> the first hunk then? (I probably was misreading this because I was expecting
> >> it to describe a code change, rather than the case where original behavior
> >> needs retaining. IOW - all fine here then.)
> >>
> >>>>> and keep the current ->arch.old_{cpu_mask,vector} for the
> >>>>> pending interrupt movement to be completed.
> >>>>
> >>>> Right, that's to clean up state from before the initial move. What isn't
> >>>> clear to me is what's to happen with the state of the intermediate
> >>>> placement. Description and code changes leave me with the impression that
> >>>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
> >>>> why that would be an okay thing to do.
> >>>
> >>> There isn't much we can do with the intermediate placement, as the CPU
> >>> is going offline.  However we can drain any pending interrupts from
> >>> IRR after the new destination has been set, since setting the
> >>> destination is done from the CPU that's the current target of the
> >>> interrupts.  So we can ensure the draining is done strictly after the
> >>> target has been switched, hence ensuring no further interrupts from
> >>> this source will be delivered to the current CPU.
> >>
> >> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
> >> the ...
> >>
> >>>>> --- a/xen/arch/x86/irq.c
> >>>>> +++ b/xen/arch/x86/irq.c
> >>>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
> >>>>>      }
> >>>>>  
> >>>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> >>>>> -        return -EAGAIN;
> >>>>> +    {
> >>>>> +        /*
> >>>>> +         * If the current destination is online refuse to shuffle.  Retry after
> >>>>> +         * the in-progress movement has finished.
> >>>>> +         */
> >>>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
> >>>>> +            return -EAGAIN;
> >>>>> +
> >>>>> +        /*
> >>>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
> >>>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
> >>>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
> >>>>> +         * ->arch.old_cpu_mask.
> >>>>> +         */
> >>>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
> >>>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
> >>>>> +
> >>>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
> >>>>> +        {
> >>>>> +            /*
> >>>>> +             * Fallback to the old destination if moving is in progress and the
> >>>>> +             * current destination is to be offlined.  This is only possible if
> >>>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
> >>>>> +             * in the 'mask' parameter.
> >>>>> +             */
> >>>>> +            desc->arch.vector = desc->arch.old_vector;
> >>>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
> >>
> >> ... replacing of vector (and associated mask), without any further accounting.
> > 
> > It's quite likely I'm missing something here, but what further
> > accounting you would like to do?
> > 
> > The current target of the interrupt (->arch.cpu_mask previous to
> > cpumask_and()) is all going offline, so any attempt to set it in
> > ->arch.old_cpu_mask would just result in a stale (offline) CPU getting
> > set in ->arch.old_cpu_mask, which previous patches attempted to
> > solve.
> > 
> > Maybe by "further accounting" you meant something else not related to
> > ->arch.old_{cpu_mask,vector}?
> 
> Indeed. What I'm thinking of is what normally release_old_vec() would
> do (of which only desc->arch.used_vectors updating would appear to be
> relevant, seeing the CPU's going offline). The other one I was thinking
> of, updating vector_irq[], likely is also unnecessary, again because
> that's per-CPU data of a CPU going down.

I think updating vector_irq[] should be explicitly avoided, as doing
so would prevent us from correctly draining any pending interrupts
because the vector -> irq mapping would be broken when the interrupt
enable window at the bottom of fixup_irqs() is reached.

For used_vectors: we might clean it, I'm a bit worried however that at
some point we insert a check in do_IRQ() path that ensures the
vector_irq[] is inline with desc->arch.used_vectors, which would fail
for interrupts drained at the bottom of fixup_irqs().  Let me attempt
to clean the currently used vector from ->arch.used_vectors.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-13 11:31             ` Roger Pau Monné
@ 2024-06-13 11:36               ` Jan Beulich
  2024-06-13 12:55                 ` Roger Pau Monné
  0 siblings, 1 reply; 37+ messages in thread
From: Jan Beulich @ 2024-06-13 11:36 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 13.06.2024 13:31, Roger Pau Monné wrote:
> On Thu, Jun 13, 2024 at 10:38:35AM +0200, Jan Beulich wrote:
>> On 12.06.2024 17:36, Roger Pau Monné wrote:
>>> On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
>>>> On 12.06.2024 12:39, Roger Pau Monné wrote:
>>>>> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
>>>>>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>>>>>> Currently there's logic in fixup_irqs() that attempts to prevent
>>>>>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
>>>>>>> interrupts from the CPUs not present in the input mask.  The current logic in
>>>>>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
>>>>>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
>>>>>>>
>>>>>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
>>>>>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
>>>>>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
>>>>>>> and no remaining online CPUs in ->arch.cpu_mask.
>>>>>>>
>>>>>>> If _assign_irq_vector() is requested to move an interrupt in the state
>>>>>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
>>>>>>> CPUs that could be used as fallback, and if that's the case do move the
>>>>>>> interrupt back to the previous destination.  Note this is easier because the
>>>>>>> vector hasn't been released yet, so there's no need to allocate and setup a new
>>>>>>> vector on the destination.
>>>>>>>
>>>>>>> Due to the logic in fixup_irqs() that clears offline CPUs from
>>>>>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
>>>>>>> shouldn't be possible to get into _assign_irq_vector() with
>>>>>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
>>>>>>> ->arch.old_cpu_mask.
>>>>>>>
>>>>>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
>>>>>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
>>>>>>> longer part of the affinity set,
>>>>>>
>>>>>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
>>>>>>
>>>>>>> move the interrupt to a different CPU part of
>>>>>>> the provided mask
>>>>>>
>>>>>> ... this (->arch.cpu_mask related).
>>>>>
>>>>> No, the "provided mask" here is the "mask" parameter, not
>>>>> ->arch.cpu_mask.
>>>>
>>>> Oh, so this describes the case of "hitting" the comment at the very bottom of
>>>> the first hunk then? (I probably was misreading this because I was expecting
>>>> it to describe a code change, rather than the case where original behavior
>>>> needs retaining. IOW - all fine here then.)
>>>>
>>>>>>> and keep the current ->arch.old_{cpu_mask,vector} for the
>>>>>>> pending interrupt movement to be completed.
>>>>>>
>>>>>> Right, that's to clean up state from before the initial move. What isn't
>>>>>> clear to me is what's to happen with the state of the intermediate
>>>>>> placement. Description and code changes leave me with the impression that
>>>>>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
>>>>>> why that would be an okay thing to do.
>>>>>
>>>>> There isn't much we can do with the intermediate placement, as the CPU
>>>>> is going offline.  However we can drain any pending interrupts from
>>>>> IRR after the new destination has been set, since setting the
>>>>> destination is done from the CPU that's the current target of the
>>>>> interrupts.  So we can ensure the draining is done strictly after the
>>>>> target has been switched, hence ensuring no further interrupts from
>>>>> this source will be delivered to the current CPU.
>>>>
>>>> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
>>>> the ...
>>>>
>>>>>>> --- a/xen/arch/x86/irq.c
>>>>>>> +++ b/xen/arch/x86/irq.c
>>>>>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
>>>>>>>      }
>>>>>>>  
>>>>>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
>>>>>>> -        return -EAGAIN;
>>>>>>> +    {
>>>>>>> +        /*
>>>>>>> +         * If the current destination is online refuse to shuffle.  Retry after
>>>>>>> +         * the in-progress movement has finished.
>>>>>>> +         */
>>>>>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
>>>>>>> +            return -EAGAIN;
>>>>>>> +
>>>>>>> +        /*
>>>>>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
>>>>>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
>>>>>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
>>>>>>> +         * ->arch.old_cpu_mask.
>>>>>>> +         */
>>>>>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
>>>>>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
>>>>>>> +
>>>>>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
>>>>>>> +        {
>>>>>>> +            /*
>>>>>>> +             * Fallback to the old destination if moving is in progress and the
>>>>>>> +             * current destination is to be offlined.  This is only possible if
>>>>>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
>>>>>>> +             * in the 'mask' parameter.
>>>>>>> +             */
>>>>>>> +            desc->arch.vector = desc->arch.old_vector;
>>>>>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
>>>>
>>>> ... replacing of vector (and associated mask), without any further accounting.
>>>
>>> It's quite likely I'm missing something here, but what further
>>> accounting you would like to do?
>>>
>>> The current target of the interrupt (->arch.cpu_mask previous to
>>> cpumask_and()) is all going offline, so any attempt to set it in
>>> ->arch.old_cpu_mask would just result in a stale (offline) CPU getting
>>> set in ->arch.old_cpu_mask, which previous patches attempted to
>>> solve.
>>>
>>> Maybe by "further accounting" you meant something else not related to
>>> ->arch.old_{cpu_mask,vector}?
>>
>> Indeed. What I'm thinking of is what normally release_old_vec() would
>> do (of which only desc->arch.used_vectors updating would appear to be
>> relevant, seeing the CPU's going offline). The other one I was thinking
>> of, updating vector_irq[], likely is also unnecessary, again because
>> that's per-CPU data of a CPU going down.
> 
> I think updating vector_irq[] should be explicitly avoided, as doing
> so would prevent us from correctly draining any pending interrupts
> because the vector -> irq mapping would be broken when the interrupt
> enable window at the bottom of fixup_irqs() is reached.
> 
> For used_vectors: we might clean it, I'm a bit worried however that at
> some point we insert a check in do_IRQ() path that ensures the
> vector_irq[] is inline with desc->arch.used_vectors, which would fail
> for interrupts drained at the bottom of fixup_irqs().  Let me attempt
> to clean the currently used vector from ->arch.used_vectors.

Just to clarify: It may well be that for draining the bit can't be cleared
right here. But it then still needs clearing _somewhere_, or else we
chance ending up with inconsistent state (triggering e.g. an assertion
later on) or the leaking of vectors. My problem here was that I also
couldn't locate any such "somewhere", and commentary also didn't point me
anywhere.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-13 11:36               ` Jan Beulich
@ 2024-06-13 12:55                 ` Roger Pau Monné
  2024-06-13 13:07                   ` Jan Beulich
  0 siblings, 1 reply; 37+ messages in thread
From: Roger Pau Monné @ 2024-06-13 12:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Thu, Jun 13, 2024 at 01:36:55PM +0200, Jan Beulich wrote:
> On 13.06.2024 13:31, Roger Pau Monné wrote:
> > On Thu, Jun 13, 2024 at 10:38:35AM +0200, Jan Beulich wrote:
> >> On 12.06.2024 17:36, Roger Pau Monné wrote:
> >>> On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
> >>>> On 12.06.2024 12:39, Roger Pau Monné wrote:
> >>>>> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
> >>>>>> On 10.06.2024 16:20, Roger Pau Monne wrote:
> >>>>>>> Currently there's logic in fixup_irqs() that attempts to prevent
> >>>>>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
> >>>>>>> interrupts from the CPUs not present in the input mask.  The current logic in
> >>>>>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
> >>>>>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
> >>>>>>>
> >>>>>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
> >>>>>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
> >>>>>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
> >>>>>>> and no remaining online CPUs in ->arch.cpu_mask.
> >>>>>>>
> >>>>>>> If _assign_irq_vector() is requested to move an interrupt in the state
> >>>>>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
> >>>>>>> CPUs that could be used as fallback, and if that's the case do move the
> >>>>>>> interrupt back to the previous destination.  Note this is easier because the
> >>>>>>> vector hasn't been released yet, so there's no need to allocate and setup a new
> >>>>>>> vector on the destination.
> >>>>>>>
> >>>>>>> Due to the logic in fixup_irqs() that clears offline CPUs from
> >>>>>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
> >>>>>>> shouldn't be possible to get into _assign_irq_vector() with
> >>>>>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
> >>>>>>> ->arch.old_cpu_mask.
> >>>>>>>
> >>>>>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
> >>>>>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
> >>>>>>> longer part of the affinity set,
> >>>>>>
> >>>>>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
> >>>>>>
> >>>>>>> move the interrupt to a different CPU part of
> >>>>>>> the provided mask
> >>>>>>
> >>>>>> ... this (->arch.cpu_mask related).
> >>>>>
> >>>>> No, the "provided mask" here is the "mask" parameter, not
> >>>>> ->arch.cpu_mask.
> >>>>
> >>>> Oh, so this describes the case of "hitting" the comment at the very bottom of
> >>>> the first hunk then? (I probably was misreading this because I was expecting
> >>>> it to describe a code change, rather than the case where original behavior
> >>>> needs retaining. IOW - all fine here then.)
> >>>>
> >>>>>>> and keep the current ->arch.old_{cpu_mask,vector} for the
> >>>>>>> pending interrupt movement to be completed.
> >>>>>>
> >>>>>> Right, that's to clean up state from before the initial move. What isn't
> >>>>>> clear to me is what's to happen with the state of the intermediate
> >>>>>> placement. Description and code changes leave me with the impression that
> >>>>>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
> >>>>>> why that would be an okay thing to do.
> >>>>>
> >>>>> There isn't much we can do with the intermediate placement, as the CPU
> >>>>> is going offline.  However we can drain any pending interrupts from
> >>>>> IRR after the new destination has been set, since setting the
> >>>>> destination is done from the CPU that's the current target of the
> >>>>> interrupts.  So we can ensure the draining is done strictly after the
> >>>>> target has been switched, hence ensuring no further interrupts from
> >>>>> this source will be delivered to the current CPU.
> >>>>
> >>>> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
> >>>> the ...
> >>>>
> >>>>>>> --- a/xen/arch/x86/irq.c
> >>>>>>> +++ b/xen/arch/x86/irq.c
> >>>>>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
> >>>>>>>      }
> >>>>>>>  
> >>>>>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
> >>>>>>> -        return -EAGAIN;
> >>>>>>> +    {
> >>>>>>> +        /*
> >>>>>>> +         * If the current destination is online refuse to shuffle.  Retry after
> >>>>>>> +         * the in-progress movement has finished.
> >>>>>>> +         */
> >>>>>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
> >>>>>>> +            return -EAGAIN;
> >>>>>>> +
> >>>>>>> +        /*
> >>>>>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
> >>>>>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
> >>>>>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
> >>>>>>> +         * ->arch.old_cpu_mask.
> >>>>>>> +         */
> >>>>>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
> >>>>>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
> >>>>>>> +
> >>>>>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
> >>>>>>> +        {
> >>>>>>> +            /*
> >>>>>>> +             * Fallback to the old destination if moving is in progress and the
> >>>>>>> +             * current destination is to be offlined.  This is only possible if
> >>>>>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
> >>>>>>> +             * in the 'mask' parameter.
> >>>>>>> +             */
> >>>>>>> +            desc->arch.vector = desc->arch.old_vector;
> >>>>>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
> >>>>
> >>>> ... replacing of vector (and associated mask), without any further accounting.
> >>>
> >>> It's quite likely I'm missing something here, but what further
> >>> accounting you would like to do?
> >>>
> >>> The current target of the interrupt (->arch.cpu_mask previous to
> >>> cpumask_and()) is all going offline, so any attempt to set it in
> >>> ->arch.old_cpu_mask would just result in a stale (offline) CPU getting
> >>> set in ->arch.old_cpu_mask, which previous patches attempted to
> >>> solve.
> >>>
> >>> Maybe by "further accounting" you meant something else not related to
> >>> ->arch.old_{cpu_mask,vector}?
> >>
> >> Indeed. What I'm thinking of is what normally release_old_vec() would
> >> do (of which only desc->arch.used_vectors updating would appear to be
> >> relevant, seeing the CPU's going offline). The other one I was thinking
> >> of, updating vector_irq[], likely is also unnecessary, again because
> >> that's per-CPU data of a CPU going down.
> > 
> > I think updating vector_irq[] should be explicitly avoided, as doing
> > so would prevent us from correctly draining any pending interrupts
> > because the vector -> irq mapping would be broken when the interrupt
> > enable window at the bottom of fixup_irqs() is reached.
> > 
> > For used_vectors: we might clean it, I'm a bit worried however that at
> > some point we insert a check in do_IRQ() path that ensures the
> > vector_irq[] is inline with desc->arch.used_vectors, which would fail
> > for interrupts drained at the bottom of fixup_irqs().  Let me attempt
> > to clean the currently used vector from ->arch.used_vectors.
> 
> Just to clarify: It may well be that for draining the bit can't be cleared
> right here. But it then still needs clearing _somewhere_, or else we
> chance ending up with inconsistent state (triggering e.g. an assertion
> later on) or the leaking of vectors. My problem here was that I also
> couldn't locate any such "somewhere", and commentary also didn't point me
> anywhere.

You are correct, there's no such place where the cleanup would happen.

I'm afraid the only option I see to correctly deal with this is to do
the cleanup of the old destination in _assign_irq_vector(), and then
do the pending interrupt draining from IRR like I had proposed in
patch 7/7, thus removing the interrupt enable window at the bottom of
fixup_irqs().

Let me know if that seems sensible.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector()
  2024-06-13 12:55                 ` Roger Pau Monné
@ 2024-06-13 13:07                   ` Jan Beulich
  0 siblings, 0 replies; 37+ messages in thread
From: Jan Beulich @ 2024-06-13 13:07 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: Andrew Cooper, xen-devel

On 13.06.2024 14:55, Roger Pau Monné wrote:
> On Thu, Jun 13, 2024 at 01:36:55PM +0200, Jan Beulich wrote:
>> On 13.06.2024 13:31, Roger Pau Monné wrote:
>>> On Thu, Jun 13, 2024 at 10:38:35AM +0200, Jan Beulich wrote:
>>>> On 12.06.2024 17:36, Roger Pau Monné wrote:
>>>>> On Wed, Jun 12, 2024 at 03:42:58PM +0200, Jan Beulich wrote:
>>>>>> On 12.06.2024 12:39, Roger Pau Monné wrote:
>>>>>>> On Tue, Jun 11, 2024 at 03:18:32PM +0200, Jan Beulich wrote:
>>>>>>>> On 10.06.2024 16:20, Roger Pau Monne wrote:
>>>>>>>>> Currently there's logic in fixup_irqs() that attempts to prevent
>>>>>>>>> _assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
>>>>>>>>> interrupts from the CPUs not present in the input mask.  The current logic in
>>>>>>>>> fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
>>>>>>>>> move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
>>>>>>>>>
>>>>>>>>> Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
>>>>>>>>> _assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
>>>>>>>>> to deal with interrupts that have either move_{in_progress,cleanup_count} set
>>>>>>>>> and no remaining online CPUs in ->arch.cpu_mask.
>>>>>>>>>
>>>>>>>>> If _assign_irq_vector() is requested to move an interrupt in the state
>>>>>>>>> described above, first attempt to see if ->arch.old_cpu_mask contains any valid
>>>>>>>>> CPUs that could be used as fallback, and if that's the case do move the
>>>>>>>>> interrupt back to the previous destination.  Note this is easier because the
>>>>>>>>> vector hasn't been released yet, so there's no need to allocate and setup a new
>>>>>>>>> vector on the destination.
>>>>>>>>>
>>>>>>>>> Due to the logic in fixup_irqs() that clears offline CPUs from
>>>>>>>>> ->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
>>>>>>>>> shouldn't be possible to get into _assign_irq_vector() with
>>>>>>>>> ->arch.move_{in_progress,cleanup_count} set but no online CPUs in
>>>>>>>>> ->arch.old_cpu_mask.
>>>>>>>>>
>>>>>>>>> However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
>>>>>>>>> also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
>>>>>>>>> longer part of the affinity set,
>>>>>>>>
>>>>>>>> I'm having trouble relating this (->arch.old_cpu_mask related) to ...
>>>>>>>>
>>>>>>>>> move the interrupt to a different CPU part of
>>>>>>>>> the provided mask
>>>>>>>>
>>>>>>>> ... this (->arch.cpu_mask related).
>>>>>>>
>>>>>>> No, the "provided mask" here is the "mask" parameter, not
>>>>>>> ->arch.cpu_mask.
>>>>>>
>>>>>> Oh, so this describes the case of "hitting" the comment at the very bottom of
>>>>>> the first hunk then? (I probably was misreading this because I was expecting
>>>>>> it to describe a code change, rather than the case where original behavior
>>>>>> needs retaining. IOW - all fine here then.)
>>>>>>
>>>>>>>>> and keep the current ->arch.old_{cpu_mask,vector} for the
>>>>>>>>> pending interrupt movement to be completed.
>>>>>>>>
>>>>>>>> Right, that's to clean up state from before the initial move. What isn't
>>>>>>>> clear to me is what's to happen with the state of the intermediate
>>>>>>>> placement. Description and code changes leave me with the impression that
>>>>>>>> it's okay to simply abandon, without any cleanup, yet I can't quite figure
>>>>>>>> why that would be an okay thing to do.
>>>>>>>
>>>>>>> There isn't much we can do with the intermediate placement, as the CPU
>>>>>>> is going offline.  However we can drain any pending interrupts from
>>>>>>> IRR after the new destination has been set, since setting the
>>>>>>> destination is done from the CPU that's the current target of the
>>>>>>> interrupts.  So we can ensure the draining is done strictly after the
>>>>>>> target has been switched, hence ensuring no further interrupts from
>>>>>>> this source will be delivered to the current CPU.
>>>>>>
>>>>>> Hmm, I'm afraid I still don't follow: I'm specifically in trouble with
>>>>>> the ...
>>>>>>
>>>>>>>>> --- a/xen/arch/x86/irq.c
>>>>>>>>> +++ b/xen/arch/x86/irq.c
>>>>>>>>> @@ -544,7 +544,53 @@ static int _assign_irq_vector(struct irq_desc *desc, const cpumask_t *mask)
>>>>>>>>>      }
>>>>>>>>>  
>>>>>>>>>      if ( desc->arch.move_in_progress || desc->arch.move_cleanup_count )
>>>>>>>>> -        return -EAGAIN;
>>>>>>>>> +    {
>>>>>>>>> +        /*
>>>>>>>>> +         * If the current destination is online refuse to shuffle.  Retry after
>>>>>>>>> +         * the in-progress movement has finished.
>>>>>>>>> +         */
>>>>>>>>> +        if ( cpumask_intersects(desc->arch.cpu_mask, &cpu_online_map) )
>>>>>>>>> +            return -EAGAIN;
>>>>>>>>> +
>>>>>>>>> +        /*
>>>>>>>>> +         * Due to the logic in fixup_irqs() that clears offlined CPUs from
>>>>>>>>> +         * ->arch.old_cpu_mask it shouldn't be possible to get here with
>>>>>>>>> +         * ->arch.move_{in_progress,cleanup_count} set and no online CPUs in
>>>>>>>>> +         * ->arch.old_cpu_mask.
>>>>>>>>> +         */
>>>>>>>>> +        ASSERT(valid_irq_vector(desc->arch.old_vector));
>>>>>>>>> +        ASSERT(cpumask_intersects(desc->arch.old_cpu_mask, &cpu_online_map));
>>>>>>>>> +
>>>>>>>>> +        if ( cpumask_intersects(desc->arch.old_cpu_mask, mask) )
>>>>>>>>> +        {
>>>>>>>>> +            /*
>>>>>>>>> +             * Fallback to the old destination if moving is in progress and the
>>>>>>>>> +             * current destination is to be offlined.  This is only possible if
>>>>>>>>> +             * the CPUs in old_cpu_mask intersect with the affinity mask passed
>>>>>>>>> +             * in the 'mask' parameter.
>>>>>>>>> +             */
>>>>>>>>> +            desc->arch.vector = desc->arch.old_vector;
>>>>>>>>> +            cpumask_and(desc->arch.cpu_mask, desc->arch.old_cpu_mask, mask);
>>>>>>
>>>>>> ... replacing of vector (and associated mask), without any further accounting.
>>>>>
>>>>> It's quite likely I'm missing something here, but what further
>>>>> accounting you would like to do?
>>>>>
>>>>> The current target of the interrupt (->arch.cpu_mask previous to
>>>>> cpumask_and()) is all going offline, so any attempt to set it in
>>>>> ->arch.old_cpu_mask would just result in a stale (offline) CPU getting
>>>>> set in ->arch.old_cpu_mask, which previous patches attempted to
>>>>> solve.
>>>>>
>>>>> Maybe by "further accounting" you meant something else not related to
>>>>> ->arch.old_{cpu_mask,vector}?
>>>>
>>>> Indeed. What I'm thinking of is what normally release_old_vec() would
>>>> do (of which only desc->arch.used_vectors updating would appear to be
>>>> relevant, seeing the CPU's going offline). The other one I was thinking
>>>> of, updating vector_irq[], likely is also unnecessary, again because
>>>> that's per-CPU data of a CPU going down.
>>>
>>> I think updating vector_irq[] should be explicitly avoided, as doing
>>> so would prevent us from correctly draining any pending interrupts
>>> because the vector -> irq mapping would be broken when the interrupt
>>> enable window at the bottom of fixup_irqs() is reached.
>>>
>>> For used_vectors: we might clean it, I'm a bit worried however that at
>>> some point we insert a check in do_IRQ() path that ensures the
>>> vector_irq[] is inline with desc->arch.used_vectors, which would fail
>>> for interrupts drained at the bottom of fixup_irqs().  Let me attempt
>>> to clean the currently used vector from ->arch.used_vectors.
>>
>> Just to clarify: It may well be that for draining the bit can't be cleared
>> right here. But it then still needs clearing _somewhere_, or else we
>> chance ending up with inconsistent state (triggering e.g. an assertion
>> later on) or the leaking of vectors. My problem here was that I also
>> couldn't locate any such "somewhere", and commentary also didn't point me
>> anywhere.
> 
> You are correct, there's no such place where the cleanup would happen.
> 
> I'm afraid the only option I see to correctly deal with this is to do
> the cleanup of the old destination in _assign_irq_vector(), and then
> do the pending interrupt draining from IRR like I had proposed in
> patch 7/7, thus removing the interrupt enable window at the bottom of
> fixup_irqs().
> 
> Let me know if that seems sensible.

I think it does; doing away with that entirely heuristic window would be
pretty nice anyway.

Jan


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2024-06-13 13:08 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-10 14:20 [PATCH v2 0/7] x86/irq: fixes for CPU hot{,un}plug Roger Pau Monne
2024-06-10 14:20 ` [PATCH v2 1/7] x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts Roger Pau Monne
2024-06-11  7:42   ` Jan Beulich
2024-06-12  8:09     ` Roger Pau Monné
2024-06-12  8:56       ` Jan Beulich
2024-06-12 10:07         ` Oleksii K.
2024-06-10 14:20 ` [PATCH v2 2/7] x86/irq: describe how the interrupt CPU movement works Roger Pau Monne
2024-06-11  7:44   ` Jan Beulich
2024-06-12 10:08     ` Oleksii K.
2024-06-10 14:20 ` [PATCH v2 3/7] x86/irq: limit interrupt movement done by fixup_irqs() Roger Pau Monne
2024-06-11  9:59   ` Jan Beulich
2024-06-12  8:10     ` Roger Pau Monné
2024-06-12 10:13     ` Oleksii K.
2024-06-10 14:20 ` [PATCH v2 4/7] x86/irq: restrict CPU movement in set_desc_affinity() Roger Pau Monne
2024-06-11 10:20   ` Jan Beulich
2024-06-12  8:31     ` Roger Pau Monné
2024-06-10 14:20 ` [PATCH v2 5/7] x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs() Roger Pau Monne
2024-06-11 12:45   ` Jan Beulich
2024-06-12  8:47     ` Roger Pau Monné
2024-06-12  9:04       ` Jan Beulich
2024-06-12 10:41         ` Roger Pau Monné
2024-06-11 13:47   ` Jan Beulich
2024-06-12  8:36     ` Roger Pau Monné
2024-06-10 14:20 ` [PATCH v2 6/7] x86/irq: handle moving interrupts in _assign_irq_vector() Roger Pau Monne
2024-06-11 13:18   ` Jan Beulich
2024-06-12 10:39     ` Roger Pau Monné
2024-06-12 13:42       ` Jan Beulich
2024-06-12 15:36         ` Roger Pau Monné
2024-06-13  8:38           ` Jan Beulich
2024-06-13 11:31             ` Roger Pau Monné
2024-06-13 11:36               ` Jan Beulich
2024-06-13 12:55                 ` Roger Pau Monné
2024-06-13 13:07                   ` Jan Beulich
2024-06-10 14:20 ` [PATCH v2 7/7] x86/irq: forward pending interrupts to new destination in fixup_irqs() Roger Pau Monne
2024-06-11 13:50   ` Jan Beulich
2024-06-12 11:23     ` Roger Pau Monné
2024-06-12 13:47       ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.