public inbox for linux-kernel@vger.kernel.org
* [PATCH,RFC] smp,csd: throw an error if a CSD lock is stuck for too long
@ 2023-08-21 20:04 Rik van Riel
  2023-08-21 20:29 ` Paul E. McKenney
  2023-09-13 13:22 ` Peter Zijlstra
  0 siblings, 2 replies; 6+ messages in thread
From: Rik van Riel @ 2023-08-21 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: kernel-team, Peter Zijlstra, Paul E. McKenney, Valentin Schneider,
	Juergen Gross

The CSD lock seems to get stuck in two "modes". When it gets stuck
temporarily, it is usually released within a few seconds, and sometimes
only after one or two minutes.

If the CSD lock stays stuck for more than several minutes, it never
seems to get unstuck, and gradually more and more things in the system
end up also getting stuck.

In the latter case, we should just give up, so the system can dump out
a little more information about what went wrong and, with panic_on_oops
set and a kdump kernel loaded, capture far more information about what
might have gone wrong.

Question: should this have its own panic_on_ipistall switch in
/proc/sys/kernel, or maybe piggyback on panic_on_oops in a different
way than via BUG_ON?

Signed-off-by: Rik van Riel <riel@surriel.com>
---
 kernel/smp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360..8b808bff15e6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -228,6 +228,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock. */
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -241,9 +242,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long this CSD lock has been stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON((s64)ts_delta > 300000000000LL);
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),
-- 
2.41.0
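
The signed comparison used in the BUG_ON above can be illustrated with a
small userspace sketch. This is not kernel code: csd_stuck_too_long() and
CSD_STUCK_LIMIT_NS are hypothetical names chosen for illustration, and the
sketch assumes the usual two's complement conversion from uint64_t to
int64_t. It shows why a timestamp that appears to run backwards (for
example, clocks out of sync between sockets) does not trip the five minute
limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Five minutes in nanoseconds, the same constant the patch compares against. */
#define CSD_STUCK_LIMIT_NS 300000000000LL

/*
 * Hypothetical helper, not part of kernel/smp.c: returns true when the
 * wait that started at ts0 has lasted longer than the limit as of ts2.
 */
static bool csd_stuck_too_long(uint64_t ts0, uint64_t ts2)
{
	/* Unsigned subtraction wraps to a huge value if ts2 < ts0. */
	uint64_t ts_delta = ts2 - ts0;

	/*
	 * Interpreted as signed, a wrapped delta is negative, so it
	 * compares below the limit instead of far above it.
	 */
	return (int64_t)ts_delta > CSD_STUCK_LIMIT_NS;
}
```

With an unsigned comparison, csd_stuck_too_long(2000, 1000) would see a
delta close to 2^64 and report a stall; with the signed cast it sees -1000
and does not.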




Thread overview: 6+ messages
2023-08-21 20:04 [PATCH,RFC] smp,csd: throw an error if a CSD lock is stuck for too long Rik van Riel
2023-08-21 20:29 ` Paul E. McKenney
2023-09-13 13:22 ` Peter Zijlstra
2023-09-13 14:33   ` Rik van Riel
2023-09-13 16:17     ` Peter Zijlstra
2023-09-13 20:17       ` Rik van Riel
