Date: Fri, 12 Apr 2024 14:55:13 +0100
From: Will Deacon <will@kernel.org>
To: Douglas Anderson <dianders@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Mark Rutland <mark.rutland@arm.com>, Marc Zyngier <maz@kernel.org>,
	Misono Tomohiro <misono.tomohiro@fujitsu.com>,
	Chen-Yu Tsai <wens@csie.org>, Stephen Boyd <swboyd@chromium.org>,
	Daniel Thompson <daniel.thompson@linaro.org>,
	Sumit Garg <sumit.garg@linaro.org>,
	Frederic Weisbecker <frederic@kernel.org>,
	"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
	Josh Poimboeuf <jpoimboe@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Tony Luck <tony.luck@intel.com>,
	Valentin Schneider <vschneid@redhat.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-hardening@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] arm64: smp: smp_send_stop() and crash_smp_send_stop()
 should try non-NMI first
Message-ID: <20240412135513.GA28004@willie-the-truck>
References: <20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid>

Hi Doug,

I'm doing some inbox Spring cleaning!

On Thu, Dec 07, 2023 at 05:02:56PM -0800, Douglas Anderson wrote:
> When testing hard lockup handling on my sc7180-trogdor-lazor device
> with pseudo-NMI enabled, with serial console enabled and with kgdb
> disabled, I found that the stack crawls printed to the serial console
> ended up as a jumbled mess. After rebooting, the pstore-based console
> looked fine though. Also, enabling kgdb to trap the panic made the
> console look fine and avoided the mess.
>
> After a bit of tracking down, I came to the conclusion that this was
> what was happening:
> 1. The panic path was stopping all other CPUs with
>    panic_other_cpus_shutdown().
> 2. At least one of those other CPUs was in the middle of printing to
>    the serial console and holding the console port's lock, which is
>    grabbed with "irqsave". ...but since we were stopping with an NMI
>    we didn't care about the "irqsave" and interrupted anyway.
> 3. Since we stopped the CPU while it was holding the lock it would
>    never release it.
> 4. All future calls to output to the console would end up failing to
>    get the lock in qcom_geni_serial_console_write(). This isn't
>    _totally_ unexpected at panic time but it's a code path that's not
>    well tested, hard to get right, and apparently doesn't work
>    terribly well on the Qualcomm geni serial driver.
> 
> It would probably be a reasonable idea to try to make the Qualcomm
> geni serial driver work better, but also it's nice not to get into
> this situation in the first place.
> 
> Taking a page from what x86 appears to do in native_stop_other_cpus(),
> let's do this:
> 1. First, we'll try to stop other CPUs with a normal IPI and wait a
>    second. This gives them a chance to leave critical sections.
> 2. If CPUs fail to stop then we'll retry with an NMI, but give a much
>    lower timeout since there's no good reason for a CPU not to react
>    quickly to an NMI.
> 
> This works well and avoids the corrupted console and (presumably)
> could help avoid other similar issues.
> 
> In order to do this, we need to do a little re-organization of our
> IPIs since we don't have any more free IDs. We'll do what was
> suggested in previous conversations and combine "stop" and "crash
> stop". That frees up an IPI so now we can have a "stop" and "stop
> NMI".
> 
> In order to do this we also need a slight change in the way we keep
> track of which CPUs still need to be stopped. We need to know
> specifically which CPUs haven't stopped yet when we fall back to NMI
> but in the "crash stop" case the "cpu_online_mask" isn't updated as
> CPUs go down. This is why that code path had an atomic of the number
> of CPUs left. We'll solve this by making the cpumask into a
> global. This has a potential memory implication--with NR_CPUS = 4096
> this is 4096/8 = 512 bytes of globals. On the upside in that same case
> we take 512 bytes off the stack which could potentially have made the
> stop code less reliable. It can be noted that the NMI backtrace code
> (lib/nmi_backtrace.c) uses the same approach and that use also
> confirms that updating the mask is safe from NMI.
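
For reference, the two-phase scheme described above boils down to
something like the following. This is only a single-threaded userspace
simulation, not the patch: sim_send_stop(), sim_deliver_stop(),
irqs_masked[] and NR_SIM_CPUS are all made-up names, the mask is a plain
word rather than a cpumask, and the timeouts are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SIM_CPUS 8	/* hypothetical; small enough to fit one word */

/* Hypothetical per-CPU state: true if that CPU has normal IRQs masked. */
static bool irqs_masked[NR_SIM_CPUS];

/* Global mask of CPUs that still have to stop, one bit per CPU. */
static uint32_t stop_mask;

/*
 * Deliver a "stop" to every CPU still in stop_mask. A normal IPI is only
 * taken by CPUs with IRQs unmasked; an NMI is always taken. A CPU that
 * takes the stop clears its own bit in the global mask.
 */
static void sim_deliver_stop(bool nmi)
{
	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++) {
		uint32_t bit = UINT32_C(1) << cpu;

		if (!(stop_mask & bit))
			continue;
		if (nmi || !irqs_masked[cpu])
			stop_mask &= ~bit;
	}
}

/* Returns how many CPUs ignored the normal IPI and needed the NMI. */
static int sim_send_stop(int this_cpu, uint32_t online_mask)
{
	int needed_nmi = 0;

	assert(this_cpu >= 0 && this_cpu < NR_SIM_CPUS);

	/* Every online CPU except the caller has to stop. */
	stop_mask = online_mask & ~(UINT32_C(1) << this_cpu);
	if (!stop_mask)
		return 0;

	/* Phase 1: normal IPI (real code would wait with a timeout here). */
	sim_deliver_stop(false);

	for (int cpu = 0; cpu < NR_SIM_CPUS; cpu++)
		if (stop_mask & (UINT32_C(1) << cpu))
			needed_nmi++;

	/* Phase 2: NMI, aimed only at the CPUs still left in the mask. */
	if (needed_nmi)
		sim_deliver_stop(true);

	return needed_nmi;
}
```

With irqs_masked[3] set and all eight CPUs online, sim_send_stop(0, 0xff)
stops six CPUs with the IPI and falls back to the NMI for CPU 3 only. The
point to notice is that stop_mask is written by the caller and by every
CPU that stops.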

Updating the global masks without any synchronisation feels broken though:

> @@ -1085,77 +1080,75 @@ void smp_send_stop(void)
>  {
>  	unsigned long timeout;
>  
> -	if (num_other_online_cpus()) {
> -		cpumask_t mask;
> +	/*
> +	 * If this cpu is the only one alive at this point in time, online or
> +	 * not, there are no stop messages to be sent around, so just back out.
> +	 */
> +	if (num_other_online_cpus() == 0)
> +		goto skip_ipi;
>  
> -		cpumask_copy(&mask, cpu_online_mask);
> -		cpumask_clear_cpu(smp_processor_id(), &mask);
> +	cpumask_copy(to_cpumask(stop_mask), cpu_online_mask);
> +	cpumask_clear_cpu(smp_processor_id(), to_cpumask(stop_mask));

I don't see what prevents multiple CPUs getting in here concurrently and
tripping over the masks. x86 seems to avoid that with an atomic
'stopping_cpu' variable in native_stop_other_cpus(). Do we need something
similar?
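
As a rough sketch of the kind of guard meant here (userspace C11 atomics,
completely untested in the kernel, and not a copy of the x86 code;
claim_stop() and the -1 "nobody yet" sentinel are made up for
illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* -1 means no CPU has claimed the stop sequence yet. */
static atomic_int stopping_cpu = -1;

/*
 * Returns true if 'cpu' won the race and may go on to write the shared
 * stop mask. Any CPU arriving later sees a value other than -1, fails
 * the compare-and-exchange and must leave the mask alone.
 */
static bool claim_stop(int cpu)
{
	int expected = -1;

	assert(cpu >= 0);
	return atomic_compare_exchange_strong(&stopping_cpu, &expected, cpu);
}
```

The first caller gets true and every later caller gets false, so only one
CPU ever copies cpu_online_mask into the global mask. What the losers
should do (return, or spin waiting to be stopped) is a separate question
that this sketch does not answer.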

Apart from that, I'm fine with the gist of the patch.

Will
