From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753420AbcCLPX5 (ORCPT <rfc822;w@1wt.eu>);
	Sat, 12 Mar 2016 10:23:57 -0500
Received: from mail-wm0-f66.google.com ([74.125.82.66]:36396 "EHLO
	mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752103AbcCLPX4 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 12 Mar 2016 10:23:56 -0500
Date: Sat, 12 Mar 2016 16:23:51 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>, Harry Junior <harryjr@outlook.fr>,
        Tony Luck <tony.luck@intel.com>, x86@kernel.org,
        Peter Zijlstra <peterz@infradead.org>,
        Joe Lawrence <joe.lawrence@stratus.com>,
        Borislav Petkov <bp@alien8.de>
Subject: Re: [PATCH] x86/irq: Cure live lock in irq_force_complete_move()
Message-ID: <20160312152351.GC7015@gmail.com>
References: <alpine.DEB.2.11.1603112048030.3657@nanos>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.11.1603112048030.3657@nanos>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


* Thomas Gleixner <tglx@linutronix.de> wrote:

> Harry reported, that he's able to trigger a system freeze with cpu hot
> unplug. The freeze turned out to be a live lock caused by recent changes in
> irq_force_complete_move().
> 
> When fixup_irqs() and from there irq_force_complete_move() is called on the
> dying cpu, then all other cpus are in stop machine and wait for the dying cpu
> to complete the teardown. If there is a move of an interrupt pending then
> irq_force_complete_move() sends the cleanup IPI to the cpus in the old_domain
> mask and waits for them to clear the mask. That's obviously impossible as
> those cpus are firmly stuck in stop machine with interrupts disabled.
> 
> I should have known that, but I completely overlooked it being concentrated on
> the locking issues around the vectors. And the existance of the call to

s/existence

> __irq_complete_move() in the code, which actually sends the cleanup IPI made
> it look completely logical that waiting for that cleanup to complete is the
> right thing to do. That call was bogus even before the recent changes, but it
> was the pointless distraction which tricked me not to see the real issue. :(
> 
> So looking deeper into the issue I discovered that the cleanup of the vectors
> is actually pretty simple. We have to look at three cases:
> 
> 1) The move_in_progress flag of the interrupt is set
> 
>    A) The interrupt must be moved in interrupt context, i.e. the affinity
>       change takes place when the next interrupt happens.
> 
>       In that case the io_apic is not yet updated to the new vector, so we can
>       simply restore the target domain mask to the previous state,
>       i.e. old_domain, and restore the old vector in the configuration data.
> 
>       Further we need to check whether the affinity update actually changed
>       the vector or merily reduced the target mask. If it's a new vector, then
>       we need to clear the vector entries of the new vector.
> 
>       This undoes the pending affinity change to the old target, but with the
>       outgoing cpu cleared in the target domain mask.
> 
>    B) The interrupt can be moved in any context, i.e. the io_apic has been
>       updated with the new vector already, but no interrupt was delivered
>       after that update, so we know for sure, that the next interrupt will be
>       delivered to the new vector.
> 
>       So it's the same as case #2 where the cleanup IPI has been issued
>       already and the domain cpu mask is not yet empty. See below.
> 
> 2) The move_in_progress flag is not set and the old_domain cpu mask is not
>    empty.
> 
>    That means, that an interrupt was delivered after the change and the
>    cleanup IPI has been sent to the cpus in old_domain, but not all CPUs have
>    responded to it yet.
> 
>    It does not matter in which context the io_apic update happened, the
>    io_apic contains the new vector already. See also case 1B)
> 
>    So we know at this point that the next interrupt will arrive on the new
>    vector, so we can safely cleanup the old vectors on the cpus in the
>    old_domain cpu mask.
> 
> Fixes: 98229aa36caa "x86/irq: Plug vector cleanup race"
> Reported-by: Harry Junior <harryjr@outlook.fr>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org
> ---
>  arch/x86/include/asm/hw_irq.h |    1 
>  arch/x86/kernel/apic/vector.c |   94 +++++++++++++++++++++++++++++++++---------
>  2 files changed, 77 insertions(+), 18 deletions(-)

Cool fix!! :-)

How much time did it take for you to figure out this one?? ...

Some minor spelchecking nits:

> +	 * All CPUs are stuck in stop machine with interrupts disabled so
> +	 * calling __irq_complete_move() would be completely pointless.
>  	 */
>  	raw_spin_lock(&vector_lock);
> +
> +	/*
> +	 * Clean out all offline cpus (including the outgoing one) from the
> +	 * old_domain mask.

s/cpus/CPUs

> +	 */
>  	cpumask_and(data->old_domain, data->old_domain, cpu_online_mask);
> -	while (!cpumask_empty(data->old_domain)) {
> +
> +	/*
> +	 * If move_in_progress is cleared and the old_domain mask is empty,
> +	 * then there is nothing to cleanup. fixup_irqs() will take care of

s/cleanup/clean up

> +	 * the stale vectors on the outgoing cpu.

s/cpu/CPU

> +	 */
> +	if (!data->move_in_progress && cpumask_empty(data->old_domain)) {
>  		raw_spin_unlock(&vector_lock);
> -		raw_spin_unlock(&desc->lock);
> -		cpu_relax();
> -		raw_spin_lock(&desc->lock);
> +		return;
> +	}
> +
> +	/*
> +	 * We have to distinguish three cases:
> +	 *
> +	 * 1) The interrupt is in move_in_progress state and the interrupt is
> +	 *    not marked with IRQ_MOVE_PCNTXT. That means the io_apic still
> +	 *    points to the old vector.
> +	 *
> +	 * 2) The interrupt is in move_in_progress state and the interrupt is
> +	 *    marked with IRQ_MOVE_PCNTXT. That means the io_apic already has
> +	 *    the new vector.
> +	 *
> +	 * 3) The interrupt has been moved, the io_apic has already the new
> +	 *    vector, but the cleanup IPIs have not been processed yet.
> +	 *
> +	 * #2 and #3 can be handled in the same way as the old vector is not
> +	 * longer in use and the vector entries of the cpus in old_domain mask

s/not longer in use/no longer in use
s/cpus/CPUs

> +	 * can be cleaned up safely now.
> +	 */
> +	if (!irqd_can_move_in_process_context(irqdata) &&
> +	    data->move_in_progress) {
>  		/*
> +		 * We restore old_domain (the offline cpus have been masked

s/cpus/CPUs

> +		/*
> +		 * If old_domain is not empty, then other cpus still have the

s/CPUs

> +		 * irq descriptor set in their vector array. Clean it up, it's
> +		 * not longer possible that the interrupt happens on that
> +		 * vector.

s/it's not longer possible/it's no longer possible

> +		 */
> +		v = cfg->old_vector;
> +		for_each_cpu(cpu, data->old_domain)
> +			per_cpu(vector_irq, cpu)[v] = VECTOR_UNUSED;
>  	}
> +	/* Cleanup the left overs of the (half finished) move */

s/Cleanup/clean up

Thanks,

	Ingo