[PATCH] x86/irq: Cure live lock in irq_force_complete

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
@ 2016-03-11 20:19 Thomas Gleixner
  2016-03-11 22:33 ` Luck, Tony
  2016-03-12 15:23 ` Ingo Molnar
  0 siblings, 2 replies; 6+ messages in thread
From: Thomas Gleixner @ 2016-03-11 20:19 UTC (permalink / raw)
  To: LKML
  Cc: Harry Junior, Tony Luck, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov

Harry reported, that he's able to trigger a system freeze with cpu hot
unplug. The freeze turned out to be a live lock caused by recent changes in
irq_force_complete_move().

When fixup_irqs() and from there irq_force_complete_move() is called on the
dying cpu, then all other cpus are in stop machine and wait for the dying cpu
to complete the teardown. If there is a move of an interrupt pending then
irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
mask and waits for them to clear the mask. That's obviously impossible as
those cpus are firmly stuck in stop machine with interrupts disabled.

I should have known that, but I completely overlooked it being concentrated on
the locking issues around the vectors. And the existance of the call to
__irq_complete_move() in the code, which actually sends the cleanup IPI made
it look completely logical that waiting for that cleanup to complete is the
right thing to do. That call was bogus even before the recent changes, but it
was the pointless distraction which tricked me not to see the real issue. :(

So looking deeper into the issue I discovered that the cleanup of the vectors
is actually pretty simple. We have to look at three cases:

1) The move_in_progress flag of the interrupt is set

   A) The interrupt must be moved in interrupt context, i.e. the affinity
      change takes place when the next interrupt happens.

      In that case the io_apic is not yet updated to the new vector, so we can
      simply restore the target domain mask to the previous state,
      i.e. old_domain, and restore the old vector in the configuration data.

      Further we need to check whether the affinity update actually changed
      the vector or merily reduced the target mask. If it's a new vector, then
      we need to clear the vector entries of the new vector.

      This undoes the pending affinity change to the old target, but with the
      outgoing cpu cleared in the target domain mask.

   B) The interrupt can be moved in any context, i.e. the io_apic has been
      updated with the new vector already, but no interrupt was delivered
      after that update, so we know for sure, that the next interrupt will be
      delivered to the new vector.

      So it's the same as case #2 where the cleanup IPI has been issued
      already and the domain cpu mask is not yet empty. See below.

2) The move_in_progress flag is not set and the old_domain cpu mask is not
   empty.

   That means, that an interrupt was delivered after the change and the
   cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
   responded to it yet.

   It does not matter in which context the io_apic update happened, the
   io_apic contains the new vector already. See also case 1B)

   So we know at this point that the next interrupt will arrive on the new
   vector, so we can safely cleanup the old vectors on the cpus in the
   old_domain cpu mask.

Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
Reported-by: Harry Junior <harryjr@outlook.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 arch/x86/include/asm/hw_irq.h |    1 
 arch/x86/kernel/apic/vector.c |   94 +++++++++++++++++++++++++++++++++---------
 2 files changed, 77 insertions(+), 18 deletions(-)

--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -141,6 +141,7 @@ struct irq_alloc_info {
 struct irq_cfg {
 	unsigned int		dest_apicid;
 	u8			vector;
+	u8			old_vector;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -213,6 +213,7 @@ static int __assign_irq_vector(int irq,
 	 */
 	cpumask_and(d->old_domain, d->old_domain, cpu_online_mask);
 	d->move_in_progress = !cpumask_empty(d->old_domain);
+	d->cfg.old_vector = d->move_in_progress ? d->cfg.vector : 0;
 	d->cfg.vector = vector;
 	cpumask_copy(d->domain, vector_cpumask);
 success:
@@ -655,46 +656,103 @@ void irq_complete_move(struct irq_cfg *c
 }
 
 /*
- * Called with @desc->lock held and interrupts disabled.
+ * Called from fixup_irqs() with @desc->lock held and interrupts disabled.
  */
 void irq_force_complete_move(struct irq_desc *desc)
 {
 	struct irq_data *irqdata = irq_desc_get_irq_data(desc);
 	struct apic_chip_data *data = apic_chip_data(irqdata);
 	struct irq_cfg *cfg = data ? &data->cfg : NULL;
+	unsigned int cpu, v;
 
 	if (!cfg)
 		return;
 
-	__irq_complete_move(cfg, cfg->vector);
-
 	/*
 	 * This is tricky. If the cleanup of @data->old_domain has not been
 	 * done yet, then the following setaffinity call will fail with
 	 * -EBUSY. This can leave the interrupt in a stale state.
 	 *
-	 * The cleanup cannot make progress because we hold @desc->lock. So in
-	 * case @data->old_domain is not yet cleaned up, we need to drop the
-	 * lock and acquire it again. @desc cannot go away, because the
-	 * hotplug code holds the sparse irq lock.
+	 * All CPUs are stuck in stop machine with interrupts disabled so
+	 * calling __irq_complete_move() would be completely pointless.
 	 */
 	raw_spin_lock(&vector_lock);
-	/* Clean out all offline cpus (including ourself) first. */
+
+	/*
+	 * Clean out all offline cpus (including the outgoing one) from the
+	 * old_domain mask.
+	 */
 	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
-	while (!cpumask_empty(data->old_domain)) {
+
+	/*
+	 * If move_in_progress is cleared and the old_domain mask is empty,
+	 * then there is nothing to cleanup. fixup_irqs() will take care of
+	 * the stale vectors on the outgoing cpu.
+	 */
+	if (!data->move_in_progress && cpumask_empty(data->old_domain)) {
 		raw_spin_unlock(&vector_lock);
-		raw_spin_unlock(&desc->lock);
-		cpu_relax();
-		raw_spin_lock(&desc->lock);
+		return;
+	}
+
+	/*
+	 * We have to distinguish three cases:
+	 *
+	 * 1) The interrupt is in move_in_progress state and the interrupt is
+	 *    not marked with IRQ_MOVE_PCNTXT. That means the io_apic still
+	 *    points to the old vector.
+	 *
+	 * 2) The interrupt is in move_in_progress state and the interrupt is
+	 *    marked with IRQ_MOVE_PCNTXT. That means the io_apic already has
+	 *    the new vector.
+	 *
+	 * 3) The interrupt has been moved, the io_apic has already the new
+	 *    vector, but the cleanup IPIs have not been processed yet.
+	 *
+	 * #2 and #3 can be handled in the same way as the old vector is not
+	 * longer in use and the vector entries of the cpus in old_domain mask
+	 * can be cleaned up safely now.
+	 */
+	if (!irqd_can_move_in_process_context(irqdata) &&
+	    data->move_in_progress) {
 		/*
-		 * Reevaluate apic_chip_data. It might have been cleared after
-		 * we dropped @desc->lock.
+		 * We restore old_domain (the offline cpus have been masked
+		 * out), clear old domain and move_in_progress.
+		 *
+		 * If the old vector is the same as the new vector, then there
+		 * is nothing to do here because the move only reduced the
+		 * domain mask, but did not change the vector.
 		 */
-		data = apic_chip_data(irqdata);
-		if (!data)
-			return;
-		raw_spin_lock(&vector_lock);
+		if (cfg->vector != cfg->old_vector) {
+			/*
+			 * A move to a new vector is pending, but the new
+			 * vector is not yet in use because the ioapic still
+			 * has the old vector. Clean up the new vector.
+			 */
+			v = cfg->vector;
+			for_each_cpu(cpu, data->domain)
+				per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
+		}
+		/*
+		 * Restore the old vector in the configuration data and
+		 * restore the old_domain mask.
+		 */
+		cfg->vector = cfg->old_vector;
+		cpumask_copy(data->domain, data->old_domain);
+	} else {
+		/*
+		 * If old_domain is not empty, then other cpus still have the
+		 * irq descriptor set in their vector array. Clean it up, it's
+		 * not longer possible that the interrupt happens on that
+		 * vector.
+		 */
+		v = cfg->old_vector;
+		for_each_cpu(cpu, data->old_domain)
+			per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
 	}
+	/* Cleanup the left overs of the (half finished) move */
+	cpumask_clear(data->old_domain);
+	data->move_in_progress = 0;
+	cfg->old_vector = 0;
 	raw_spin_unlock(&vector_lock);
 }
 #endif

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
  2016-03-11 20:19 [PATCH] x86/irq: Cure live lock in irq_force_complete_move() Thomas Gleixner
@ 2016-03-11 22:33 ` Luck, Tony
  2016-03-12  8:27   ` Thomas Gleixner
  2016-03-12 10:58   ` Thomas Gleixner
  2016-03-12 15:23 ` Ingo Molnar
  1 sibling, 2 replies; 6+ messages in thread
From: Luck, Tony @ 2016-03-11 22:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Harry Junior, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov

With this patch applied my system survives me doing
several rounds of:

# echo 0 | tee /sys/devices/system/cpu/cpu*/online
# echo 1 | tee /sys/devices/system/cpu/cpu*/online

whereas without the patch the first of those went to

[152455.129604] NMI watchdog: Watchdog detected hard LOCKUP on cpu 96
[152455.136943] Kernel panic - not syncing: Hard LOCKUP

I'm not sure we care to optimize the cpu offline path, but I'll note here
that taking all (but one) cpus offline took 52 seconds (for a hundred
and mumble logical cpus).  Bringing them all back is just 4 seconds.

Tested-by: Tony Luck <tony.luck@intel.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
  2016-03-11 22:33 ` Luck, Tony
@ 2016-03-12  8:27   ` Thomas Gleixner
  2016-03-12 10:58   ` Thomas Gleixner
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2016-03-12  8:27 UTC (permalink / raw)
  To: Luck, Tony
  Cc: LKML, Harry Junior, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov

On Fri, 11 Mar 2016, Luck, Tony wrote:

> With this patch applied my system survives me doing
> several rounds of:
> 
> # echo 0 | tee /sys/devices/system/cpu/cpu*/online
> # echo 1 | tee /sys/devices/system/cpu/cpu*/online
> 
> whereas without the patch the first of those went to
> 
> [152455.129604] NMI watchdog: Watchdog detected hard LOCKUP on cpu 96
> [152455.136943] Kernel panic - not syncing: Hard LOCKUP
> 
> I'm not sure we care to optimize the cpu offline path, but I'll note here
> that taking all (but one) cpus offline took 52 seconds (for a hundred
> and mumble logical cpus).  Bringing them all back is just 4 seconds.

Yes. I noticed that as well. No idea where we waste all that time. I've put
that on the todo list of our ongoing hotplug refactoring work.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
  2016-03-11 22:33 ` Luck, Tony
  2016-03-12  8:27   ` Thomas Gleixner
@ 2016-03-12 10:58   ` Thomas Gleixner
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2016-03-12 10:58 UTC (permalink / raw)
  To: Luck, Tony
  Cc: LKML, Harry Junior, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov

On Fri, 11 Mar 2016, Luck, Tony wrote:
> With this patch applied my system survives me doing
> several rounds of:
> 
> # echo 0 | tee /sys/devices/system/cpu/cpu*/online
> # echo 1 | tee /sys/devices/system/cpu/cpu*/online
> 
> whereas without the patch the first of those went to
> 
> [152455.129604] NMI watchdog: Watchdog detected hard LOCKUP on cpu 96
> [152455.136943] Kernel panic - not syncing: Hard LOCKUP
> 
> I'm not sure we care to optimize the cpu offline path, but I'll note here
> that taking all (but one) cpus offline took 52 seconds (for a hundred
> and mumble logical cpus).  Bringing them all back is just 4 seconds.
> 
> Tested-by: Tony Luck <tony.luck@intel.com>

Unfortunately while it cures the livelock it's still not correct.

/me goes back to the drawing board

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
  2016-03-11 20:19 [PATCH] x86/irq: Cure live lock in irq_force_complete_move() Thomas Gleixner
  2016-03-11 22:33 ` Luck, Tony
@ 2016-03-12 15:23 ` Ingo Molnar
  2016-03-12 19:56   ` Thomas Gleixner
  1 sibling, 1 reply; 6+ messages in thread
From: Ingo Molnar @ 2016-03-12 15:23 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Harry Junior, Tony Luck, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov


* Thomas Gleixner <tglx@linutronix.de> wrote:

> Harry reported, that he's able to trigger a system freeze with cpu hot
> unplug. The freeze turned out to be a live lock caused by recent changes in
> irq_force_complete_move().
> 
> When fixup_irqs() and from there irq_force_complete_move() is called on the
> dying cpu, then all other cpus are in stop machine and wait for the dying cpu
> to complete the teardown. If there is a move of an interrupt pending then
> irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
> mask and waits for them to clear the mask. That's obviously impossible as
> those cpus are firmly stuck in stop machine with interrupts disabled.
> 
> I should have known that, but I completely overlooked it being concentrated on
> the locking issues around the vectors. And the existance of the call to

s/existence

> __irq_complete_move() in the code, which actually sends the cleanup IPI made
> it look completely logical that waiting for that cleanup to complete is the
> right thing to do. That call was bogus even before the recent changes, but it
> was the pointless distraction which tricked me not to see the real issue. :(
> 
> So looking deeper into the issue I discovered that the cleanup of the vectors
> is actually pretty simple. We have to look at three cases:
> 
> 1) The move_in_progress flag of the interrupt is set
> 
>    A) The interrupt must be moved in interrupt context, i.e. the affinity
>       change takes place when the next interrupt happens.
> 
>       In that case the io_apic is not yet updated to the new vector, so we can
>       simply restore the target domain mask to the previous state,
>       i.e. old_domain, and restore the old vector in the configuration data.
> 
>       Further we need to check whether the affinity update actually changed
>       the vector or merily reduced the target mask. If it's a new vector, then
>       we need to clear the vector entries of the new vector.
> 
>       This undoes the pending affinity change to the old target, but with the
>       outgoing cpu cleared in the target domain mask.
> 
>    B) The interrupt can be moved in any context, i.e. the io_apic has been
>       updated with the new vector already, but no interrupt was delivered
>       after that update, so we know for sure, that the next interrupt will be
>       delivered to the new vector.
> 
>       So it's the same as case #2 where the cleanup IPI has been issued
>       already and the domain cpu mask is not yet empty. See below.
> 
> 2) The move_in_progress flag is not set and the old_domain cpu mask is not
>    empty.
> 
>    That means, that an interrupt was delivered after the change and the
>    cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
>    responded to it yet.
> 
>    It does not matter in which context the io_apic update happened, the
>    io_apic contains the new vector already. See also case 1B)
> 
>    So we know at this point that the next interrupt will arrive on the new
>    vector, so we can safely cleanup the old vectors on the cpus in the
>    old_domain cpu mask.
> 
> Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
> Reported-by: Harry Junior <harryjr@outlook.fr>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  arch/x86/include/asm/hw_irq.h |    1 
>  arch/x86/kernel/apic/vector.c |   94 +++++++++++++++++++++++++++++++++---------
>  2 files changed, 77 insertions(+), 18 deletions(-)

Cool fix!! :-)

How much time did it take for you to figure out this one?? ...

Some minor spelchecking nits:

> +	 * All CPUs are stuck in stop machine with interrupts disabled so
> +	 * calling __irq_complete_move() would be completely pointless.
>  	 */
>  	raw_spin_lock(&vector_lock);
> +
> +	/*
> +	 * Clean out all offline cpus (including the outgoing one) from the
> +	 * old_domain mask.

s/cpus/CPUs

> +	 */
>  	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
> -	while (!cpumask_empty(data->old_domain)) {
> +
> +	/*
> +	 * If move_in_progress is cleared and the old_domain mask is empty,
> +	 * then there is nothing to cleanup. fixup_irqs() will take care of

s/cleanup/clean up

> +	 * the stale vectors on the outgoing cpu.

s/cpu/CPU

> +	 */
> +	if (!data->move_in_progress && cpumask_empty(data->old_domain)) {
>  		raw_spin_unlock(&vector_lock);
> -		raw_spin_unlock(&desc->lock);
> -		cpu_relax();
> -		raw_spin_lock(&desc->lock);
> +		return;
> +	}
> +
> +	/*
> +	 * We have to distinguish three cases:
> +	 *
> +	 * 1) The interrupt is in move_in_progress state and the interrupt is
> +	 *    not marked with IRQ_MOVE_PCNTXT. That means the io_apic still
> +	 *    points to the old vector.
> +	 *
> +	 * 2) The interrupt is in move_in_progress state and the interrupt is
> +	 *    marked with IRQ_MOVE_PCNTXT. That means the io_apic already has
> +	 *    the new vector.
> +	 *
> +	 * 3) The interrupt has been moved, the io_apic has already the new
> +	 *    vector, but the cleanup IPIs have not been processed yet.
> +	 *
> +	 * #2 and #3 can be handled in the same way as the old vector is not
> +	 * longer in use and the vector entries of the cpus in old_domain mask

s/not longer in use/no longer in use
s/cpus/CPUs

> +	 * can be cleaned up safely now.
> +	 */
> +	if (!irqd_can_move_in_process_context(irqdata) &&
> +	    data->move_in_progress) {
>  		/*
> +		 * We restore old_domain (the offline cpus have been masked

s/cpus/CPUs

> +		/*
> +		 * If old_domain is not empty, then other cpus still have the

s/CPUs

> +		 * irq descriptor set in their vector array. Clean it up, it's
> +		 * not longer possible that the interrupt happens on that
> +		 * vector.

s/it's not longer possible/it's no longer possible

> +		 */
> +		v = cfg->old_vector;
> +		for_each_cpu(cpu, data->old_domain)
> +			per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
>  	}
> +	/* Cleanup the left overs of the (half finished) move */

s/Cleanup/clean up

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
  2016-03-12 15:23 ` Ingo Molnar
@ 2016-03-12 19:56   ` Thomas Gleixner
  0 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2016-03-12 19:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Harry Junior, Tony Luck, x86, Peter Zijlstra, Joe Lawrence,
	Borislav Petkov

On Sat, 12 Mar 2016, Ingo Molnar wrote:
> Cool fix!! :-)

Not so cool, as it's not correct. Working on it ...
 
> How much time did it take for you to figure out this one?? ...

Harry provided backtraces of all cpus, so figuring out what was wrong was not
the problem, but wrapping the head around a fix still is one :)

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-03-12 19:57 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-03-11 20:19 [PATCH] x86/irq: Cure live lock in irq_force_complete_move() Thomas Gleixner
2016-03-11 22:33 ` Luck, Tony
2016-03-12  8:27   ` Thomas Gleixner
2016-03-12 10:58   ` Thomas Gleixner
2016-03-12 15:23 ` Ingo Molnar
2016-03-12 19:56   ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox