From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BE4ED364A5; Fri, 24 Nov 2023 18:47:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b="pw1wzg69" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 49731C433C7; Fri, 24 Nov 2023 18:47:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1700851625; bh=VTfQGVhC1vftmdrdj4lFrGuOUWWYnscOLeKp2EAqxac=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=pw1wzg69YegrlPmTSLD8M/u8tn8KvsU72+tc1cSLTh2ya7jbpK4q1gqeS15J8rPW+ 1vPX/brI85ZZF18HLKB9qBOE+O3s6E17U+FmF7LT+nlPriLb8I+AWY6iACPigi4X9k kN1miNQ8FsR0fiursHF8r3KwV3QVnEWHGyQWJu28= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, Rik van Riel , "Paul E. McKenney" , Imran Khan , Leonardo Bras , Peter Zijlstra , Valentin Schneider , Juergen Gross , Jonathan Corbet , Randy Dunlap , Sasha Levin Subject: [PATCH 6.1 009/372] smp,csd: Throw an error if a CSD lock is stuck for too long Date: Fri, 24 Nov 2023 17:46:36 +0000 Message-ID: <20231124172010.744731659@linuxfoundation.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20231124172010.413667921@linuxfoundation.org> References: <20231124172010.413667921@linuxfoundation.org> User-Agent: quilt/0.67 X-stable: review X-Patchwork-Hint: ignore Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit 6.1-stable review patch. If anyone has any objections, please let me know. ------------------ From: Rik van Riel [ Upstream commit 94b3f0b5af2c7af69e3d6e0cdd9b0ea535f22186 ] The CSD lock seems to get stuck in 2 "modes". When it gets stuck temporarily, it usually gets released in a few seconds, and sometimes up to one or two minutes. If the CSD lock stays stuck for more than several minutes, it never seems to get unstuck, and gradually more and more things in the system end up also getting stuck. In the latter case, we should just give up, so the system can dump out a little more information about what went wrong, and, with panic_on_oops and a kdump kernel loaded, dump a whole bunch more information about what might have gone wrong. In addition, there is an smp.panic_on_ipistall kernel boot parameter that by default retains the old behavior, but when set enables the panic after the CSD lock has been stuck for more than the specified number of milliseconds, as in 300,000 for five minutes. [ paulmck: Apply Imran Khan feedback. ] [ paulmck: Apply Leonardo Bras feedback. ] Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/ Signed-off-by: Rik van Riel Signed-off-by: Paul E. McKenney Reviewed-by: Imran Khan Reviewed-by: Leonardo Bras Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juergen Gross Cc: Jonathan Corbet Cc: Randy Dunlap Signed-off-by: Sasha Levin --- Documentation/admin-guide/kernel-parameters.txt | 7 +++++++ kernel/smp.c | 13 ++++++++++++- 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 31af352b4762d..4ad60e127e048 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5671,6 +5671,13 @@ This feature may be more efficiently disabled using the csdlock_debug- kernel parameter. + smp.panic_on_ipistall= [KNL] + If a csd_lock_timeout extends for more than + the specified number of milliseconds, panic the + system. By default, let CSD-lock acquisition + take as long as they take. Specifying 300,000 + for this value provides a 5-minute timeout. + smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port smsc-ircc2.ircc_sir= [HW] SIR base I/O port diff --git a/kernel/smp.c b/kernel/smp.c index 06a413987a14a..63e466bb6b03a 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -185,6 +185,8 @@ static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local); static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */ module_param(csd_lock_timeout, ulong, 0444); +static int panic_on_ipistall; /* CSD panic timeout in milliseconds, 300000 for five minutes. */ +module_param(panic_on_ipistall, int, 0444); static atomic_t csd_bug_count = ATOMIC_INIT(0); static u64 cfd_seq; @@ -343,6 +345,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * } ts2 = sched_clock(); + /* How long since we last checked for a stuck CSD lock.*/ ts_delta = ts2 - *ts1; if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0)) return false; @@ -356,9 +359,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 * else cpux = cpu; cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */ + /* How long since this CSD lock was stuck. */ + ts_delta = ts2 - ts0; pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n", - firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0, + firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta, cpu, csd->func, csd->info); + /* + * If the CSD lock is still stuck after 5 minutes, it is unlikely + * to become unstuck. Use a signed comparison to avoid triggering + * on underflows when the TSC is out of sync between sockets. + */ + BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC)); if (cpu_cur_csd && csd != cpu_cur_csd) { pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n", *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)), -- 2.42.0